[Paper] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Published: December 24, 2025 at 11:46 AM EST
4 min read

Source: arXiv - 2512.21276v1

Overview

The paper introduces GriDiT, a novel diffusion‑based framework that treats long image sequences as a factorized grid rather than a monolithic 3‑D tensor. By first generating a low‑resolution “coarse” video grid and then super‑resolving each frame independently, the authors achieve higher visual quality, better temporal coherence, and up to 2× faster inference compared with existing video‑diffusion models.
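
To make the "factorized grid" idea concrete, the sketch below tiles T low-resolution frames into a single 2-D grid image (and back), which is the kind of representation a standard image-diffusion backbone can operate on directly. The 4×4 layout and 64×64 frame size are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def frames_to_grid(frames: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    """Tile a frame sequence (T, C, H, W) into one grid image (C, rows*H, cols*W)."""
    t, c, h, w = frames.shape
    assert t == rows * cols, "grid must hold exactly T frames"
    grid = frames.reshape(rows, cols, c, h, w)   # (rows, cols, C, H, W)
    grid = grid.permute(2, 0, 3, 1, 4)           # (C, rows, H, cols, W)
    return grid.reshape(c, rows * h, cols * w)

def grid_to_frames(grid: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    """Inverse of frames_to_grid: recover (T, C, H, W) from the grid image."""
    c, gh, gw = grid.shape
    h, w = gh // rows, gw // cols
    frames = grid.reshape(c, rows, h, cols, w).permute(1, 3, 0, 2, 4)
    return frames.reshape(rows * cols, c, h, w)

# Example: 16 subsampled frames at 64x64 become one 256x256 "image" for a 2-D DiT.
frames = torch.randn(16, 3, 64, 64)
grid = frames_to_grid(frames, rows=4, cols=4)    # (3, 256, 256)
recovered = grid_to_frames(grid, rows=4, cols=4)
assert torch.allclose(frames, recovered)
```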

Key Contributions

  • Grid‑based factorization: Reformulates video generation as a 2‑D image diffusion problem on a spatial‑temporal grid, eliminating the need for custom 3‑D architectures.
  • Two‑stage pipeline:
    1. Coarse‑grid diffusion using the Diffusion Transformer (DiT) to capture inter‑frame relationships.
    2. Frame‑wise super‑resolution that injects high‑frequency details without affecting temporal consistency.
  • Data‑efficient training: Learns from subsampled frame grids, reducing the amount of required video data while still handling arbitrary‑length sequences.
  • Broad domain generalization: Works out‑of‑the‑box on diverse datasets (e.g., human motion, natural scenes) without extra priors or supervision.
  • Empirical superiority: Sets new state‑of‑the‑art (SoTA) on multiple benchmarks in terms of FVD, IS, and user‑study ratings, while halving generation latency.

Methodology

  1. Grid Construction – A video of T frames is down‑sampled both temporally and spatially, producing a low‑resolution sequence of shape (H′ × W′ × T′). The subsampled frames are then tiled into a single 2‑D grid image, each cell holding one low‑resolution frame.
  2. Diffusion Transformer (DiT) Backbone – The same DiT architecture used for 2‑D image diffusion is applied directly to the grid. Self‑attention operates across the flattened grid tokens, allowing the model to learn temporal dependencies without any explicit 3‑D convolutions.
  3. Coarse Generation – The diffusion process denoises a random grid into a plausible low‑resolution video. Because the grid is small, the diffusion steps are cheap and the model can be trained on modest GPU memory.
  4. Frame‑wise Super‑Resolution – Each generated low‑res frame is fed to a dedicated super‑resolution diffusion model (or a deterministic upsampler). Since frames are processed independently, high‑frequency textures are added without breaking the temporal consistency already established by the coarse stage (a minimal sketch of this two‑stage control flow follows this list).
  5. Arbitrary Length Extension – The grid can be padded or truncated, enabling generation of videos longer than those seen during training; the DiT’s attention naturally scales to the new temporal dimension.
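
The sketch below is a minimal, hedged illustration of the two-stage inference loop described above: a coarse diffusion pass over the whole frame grid, followed by independent per-frame super-resolution. `CoarseGridDiT`, `FrameSuperResolver`, and the simplistic denoising update are stand-ins of my own, used only to show the control flow; they are not the paper's actual architectures or sampler.

```python
import torch
import torch.nn as nn

class CoarseGridDiT(nn.Module):
    """Placeholder for the 2-D image-diffusion backbone applied to the frame grid."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for a DiT

    def forward(self, grid: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(grid)  # predicts noise for the whole grid at step t

class FrameSuperResolver(nn.Module):
    """Placeholder per-frame upsampler (could itself be a diffusion model)."""
    def __init__(self, scale: int = 4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.up(frame)

@torch.no_grad()
def generate(coarse_model, sr_model, rows=4, cols=4, h=64, w=64, steps=50):
    # Longer clips: enlarge rows/cols at inference time so the grid holds more frames.
    # Stage 1: denoise a random coarse grid (cheap, because the grid is small).
    grid = torch.randn(1, 3, rows * h, cols * w)
    for step in reversed(range(steps)):
        t = torch.full((1,), step)
        eps = coarse_model(grid, t)
        grid = grid - eps / steps  # toy update, not a real DDPM/DDIM schedule
    # Stage 2: split the grid back into T low-res frames (inverse of the tiling),
    # then super-resolve each frame independently.
    frames = grid[0].reshape(3, rows, h, cols, w).permute(1, 3, 0, 2, 4)
    frames = frames.reshape(rows * cols, 3, h, w)
    return torch.stack([sr_model(f.unsqueeze(0))[0] for f in frames])

clip = generate(CoarseGridDiT(), FrameSuperResolver())  # (16, 3, 256, 256) synthetic clip
```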

Results & Findings

| Dataset | Metric | GriDiT | Prior SoTA (e.g., Video Diffusion, Make‑It‑3D) |
|---|---|---|---|
| Kinetics‑600 | FVD (lower = better) | 68 | 112 |
| UCF‑101 | IS (higher = better) | 9.4 | 7.8 |
| Human3.6M | Pose consistency (°, lower = better) | 2.1 | 3.7 |
| — | Inference latency (per 16‑frame clip) | 0.21 s (≈2× faster) | 0.42 s |

  • Visual quality: Samples show sharper edges, more realistic motion blur, and fewer flickering artifacts.
  • Temporal coherence: The attention‑driven coarse stage preserves motion trajectories, which the frame‑wise upsampler does not disturb.
  • Scalability: Experiments with up to 128‑frame sequences demonstrate stable generation quality, confirming the method’s ability to handle long videos.

Practical Implications

  • Faster prototyping for video‑centric products – Developers can integrate GriDiT into pipelines for synthetic video data (e.g., training autonomous‑driving perception models) with half the compute budget.
  • Content creation tools – The two‑stage design fits well with existing image‑to‑image upscalers, enabling plug‑and‑play extensions for video editors, game asset pipelines, or AR/VR content generators.
  • Low‑resource environments – Because the coarse diffusion operates on a tiny grid, training and inference can run on a single high‑end GPU, opening the door for on‑device or edge‑based generation.
  • Domain‑agnostic generation – No need for specialized motion priors or pose annotations; the same model can be fine‑tuned on medical imaging sequences, satellite timelapses, or animated UI mockups.

Limitations & Future Work

  • Super‑resolution independence – While frame‑wise upsampling preserves temporal consistency, it cannot inject motion‑aware high‑frequency details (e.g., motion‑blur that varies across frames).
  • Resolution trade‑off – The coarse grid’s spatial resolution caps the finest motion that can be captured; extremely fast motions may still appear blurred.
  • Training data bias – Datasets with highly irregular frame rates or extreme aspect ratios require additional preprocessing.
  • Future directions suggested include:
    1. Joint spatio‑temporal super‑resolution to model motion‑dependent textures.
    2. Adaptive grid sizing that dynamically allocates more tokens to complex scenes.
    3. Integration with conditional controls (text, audio) for guided video synthesis.

Authors

  • Snehal Singh Tomar
  • Alexandros Graikos
  • Arjun Krishna
  • Dimitris Samaras
  • Klaus Mueller

Paper Information

  • arXiv ID: 2512.21276v1
  • Categories: cs.CV
  • Published: December 24, 2025