[Paper] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Published: December 24, 2025 at 11:46 AM EST
4 min read

Source: arXiv - 2512.21276v1

Overview

The paper introduces GriDiT, a novel diffusion‑based framework that treats long image sequences as a factorized grid rather than a monolithic 3‑D tensor. By first generating a low‑resolution “coarse” video grid and then super‑resolving each frame independently, the authors achieve higher visual quality, better temporal coherence, and up to 2× faster inference compared with existing video‑diffusion models.
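
To make the "factorized grid" idea concrete, the sketch below tiles T low-resolution frames into a single 2-D grid image (and back), which is the kind of representation a standard image-diffusion backbone can operate on directly. The 4×4 layout and 64×64 frame size are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def frames_to_grid(frames: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    """Tile a frame sequence (T, C, H, W) into one grid image (C, rows*H, cols*W)."""
    t, c, h, w = frames.shape
    assert t == rows * cols, "grid must hold exactly T frames"
    grid = frames.reshape(rows, cols, c, h, w)   # (rows, cols, C, H, W)
    grid = grid.permute(2, 0, 3, 1, 4)           # (C, rows, H, cols, W)
    return grid.reshape(c, rows * h, cols * w)

def grid_to_frames(grid: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    """Inverse of frames_to_grid: recover (T, C, H, W) from the grid image."""
    c, gh, gw = grid.shape
    h, w = gh // rows, gw // cols
    frames = grid.reshape(c, rows, h, cols, w).permute(1, 3, 0, 2, 4)
    return frames.reshape(rows * cols, c, h, w)

# Example: 16 subsampled frames at 64x64 become one 256x256 "image" for a 2-D DiT.
frames = torch.randn(16, 3, 64, 64)
grid = frames_to_grid(frames, rows=4, cols=4)    # (3, 256, 256)
recovered = grid_to_frames(grid, rows=4, cols=4)
assert torch.allclose(frames, recovered)
```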

Key Contributions

  • Grid‑based factorization: Reformulates video generation as a 2‑D image diffusion problem on a spatial‑temporal grid, eliminating the need for custom 3‑D architectures.
  • Two‑stage pipeline:
    1. Coarse‑grid diffusion using the Diffusion Transformer (DiT) to capture inter‑frame relationships.
    2. Frame‑wise super‑resolution that injects high‑frequency details without affecting temporal consistency.
  • Data‑efficient training: Learns from subsampled frame grids, reducing the amount of required video data while still handling arbitrary‑length sequences.
  • Broad domain generalization: Works out‑of‑the‑box on diverse datasets (e.g., human motion, natural scenes) without extra priors or supervision.
  • Empirical superiority: Sets new state‑of‑the‑art (SoTA) on multiple benchmarks in terms of FVD, IS, and user‑study ratings, while halving generation latency.

Methodology

  1. Grid Construction – A video of T frames is down‑sampled both temporally and spatially, producing a low‑resolution sequence of shape (H′ × W′ × T′). The subsampled frames are then tiled into a single 2‑D grid image, each cell holding one low‑resolution frame.
  2. Diffusion Transformer (DiT) Backbone – The same DiT architecture used for 2‑D image diffusion is applied directly to the grid. Self‑attention operates across the flattened grid tokens, allowing the model to learn temporal dependencies without any explicit 3‑D convolutions.
  3. Coarse Generation – The diffusion process denoises a random grid into a plausible low‑resolution video. Because the grid is small, the diffusion steps are cheap and the model can be trained on modest GPU memory.
  4. Frame‑wise Super‑Resolution – Each generated low‑res frame is fed to a dedicated super‑resolution diffusion model (or a deterministic upsampler). Since frames are processed independently, high‑frequency textures are added without breaking the temporal consistency already established by the coarse stage (a minimal sketch of this two‑stage control flow follows this list).
  5. Arbitrary Length Extension – The grid can be padded or truncated, enabling generation of videos longer than those seen during training; the DiT’s attention naturally scales to the new temporal dimension.
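
The sketch below is a minimal, hedged illustration of the two-stage inference loop described above: a coarse diffusion pass over the whole frame grid, followed by independent per-frame super-resolution. `CoarseGridDiT`, `FrameSuperResolver`, and the simplistic denoising update are stand-ins of my own, used only to show the control flow; they are not the paper's actual architectures or sampler.

```python
import torch
import torch.nn as nn

class CoarseGridDiT(nn.Module):
    """Placeholder for the 2-D image-diffusion backbone applied to the frame grid."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for a DiT

    def forward(self, grid: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(grid)  # predicts noise for the whole grid at step t

class FrameSuperResolver(nn.Module):
    """Placeholder per-frame upsampler (could itself be a diffusion model)."""
    def __init__(self, scale: int = 4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.up(frame)

@torch.no_grad()
def generate(coarse_model, sr_model, rows=4, cols=4, h=64, w=64, steps=50):
    # Longer clips: enlarge rows/cols at inference time so the grid holds more frames.
    # Stage 1: denoise a random coarse grid (cheap, because the grid is small).
    grid = torch.randn(1, 3, rows * h, cols * w)
    for step in reversed(range(steps)):
        t = torch.full((1,), step)
        eps = coarse_model(grid, t)
        grid = grid - eps / steps  # toy update, not a real DDPM/DDIM schedule
    # Stage 2: split the grid back into T low-res frames (inverse of the tiling),
    # then super-resolve each frame independently.
    frames = grid[0].reshape(3, rows, h, cols, w).permute(1, 3, 0, 2, 4)
    frames = frames.reshape(rows * cols, 3, h, w)
    return torch.stack([sr_model(f.unsqueeze(0))[0] for f in frames])

clip = generate(CoarseGridDiT(), FrameSuperResolver())  # (16, 3, 256, 256) synthetic clip
```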

Results & Findings

| Dataset | Metric | GriDiT | Prior SoTA (e.g., Video Diffusion, Make‑It‑3D) |
|---|---|---|---|
| Kinetics‑600 | FVD (lower = better) | 68 | 112 |
| UCF‑101 | IS (higher = better) | 9.4 | 7.8 |
| Human3.6M | Pose consistency (°, lower = better) | 2.1 | 3.7 |
| — | Inference latency (per 16‑frame clip) | 0.21 s (≈2× faster) | 0.42 s |

  • Visual quality: Samples show sharper edges, more realistic motion blur, and fewer flickering artifacts.
  • Temporal coherence: The attention‑driven coarse stage preserves motion trajectories, which the frame‑wise upsampler does not disturb.
  • Scalability: Experiments with up to 128‑frame sequences demonstrate stable generation quality, confirming the method’s ability to handle long videos.

Practical Implications

  • Faster prototyping for video‑centric products – Developers can integrate GriDiT into pipelines for synthetic video data (e.g., training autonomous‑driving perception models) with half the compute budget.
  • Content creation tools – The two‑stage design fits well with existing image‑to‑image upscalers, enabling plug‑and‑play extensions for video editors, game asset pipelines, or AR/VR content generators.
  • Low‑resource environments – Because the coarse diffusion operates on a tiny grid, training and inference can run on a single high‑end GPU, opening the door for on‑device or edge‑based generation.
  • Domain‑agnostic generation – No need for specialized motion priors or pose annotations; the same model can be fine‑tuned on medical imaging sequences, satellite timelapses, or animated UI mockups.

Limitations & Future Work

  • Super‑resolution independence – While frame‑wise upsampling preserves temporal consistency, it cannot inject motion‑aware high‑frequency details (e.g., motion‑blur that varies across frames).
  • Resolution trade‑off – The coarse grid’s spatial resolution caps the finest motion that can be captured; extremely fast motions may still appear blurred.
  • Training data bias – Datasets with highly irregular frame rates or extreme aspect ratios require additional preprocessing.
  • Future directions suggested include:
    1. Joint spatio‑temporal super‑resolution to model motion‑dependent textures.
    2. Adaptive grid sizing that dynamically allocates more tokens to complex scenes.
    3. Integration with conditional controls (text, audio) for guided video synthesis.

Authors

  • Snehal Singh Tomar
  • Alexandros Graikos
  • Arjun Krishna
  • Dimitris Samaras
  • Klaus Mueller

Paper Information

  • arXiv ID: 2512.21276v1
  • Categories: cs.CV
  • Published: December 24, 2025