[Paper] Causality in Video Diffusers is Separable from Denoising
Source: arXiv - 2602.10095v1
Overview
The paper “Causality in Video Diffusers is Separable from Denoising” shows that the temporal‑causal reasoning required for video generation can be cleanly split from the heavy, multi‑step denoising loop that powers diffusion models. By decoupling these two concerns, the authors build a faster, lower‑latency video diffusion system that still matches—or even beats—the visual quality of existing causal video generators.
Key Contributions
- Empirical discovery of separability: Demonstrates that early diffusion layers produce almost identical features across denoising steps, while deeper layers focus on intra‑frame rendering with only sparse cross‑frame attention.
- Separable Causal Diffusion (SCD) architecture: Introduces a two‑stage design—(1) a causal transformer encoder that performs once‑per‑frame temporal reasoning, and (2) a lightweight diffusion decoder that handles the multi‑step frame‑wise rendering.
- Efficiency gains: Achieves up to ~2‑3× higher throughput and significantly lower per‑frame latency compared to state‑of‑the‑art causal video diffusion baselines.
- Broad evaluation: Validates SCD both in pre‑training on large synthetic video corpora and in downstream post‑training on real‑world video benchmarks, showing equal or superior generation quality (FID, CLIP‑Score, etc.).
- Open‑source tooling: Provides code and pretrained checkpoints that can be dropped into existing video generation pipelines.
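The separability finding in the first bullet can be probed with a simple measurement: compare each layer's features at consecutive denoising steps and see how close they stay. A minimal plain‑Python sketch (the feature layout and dictionaries are hypothetical, not the paper's actual instrumentation):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def layer_redundancy(features_by_step):
    """features_by_step: {timestep: [per-layer feature vectors]}.
    For each layer, returns the mean cosine similarity of its features
    between consecutive denoising steps. A value near 1.0 means the
    layer recomputes nearly identical features at every step."""
    steps = sorted(features_by_step)
    n_layers = len(features_by_step[steps[0]])
    sims = []
    for layer in range(n_layers):
        pair_sims = [
            cosine_sim(features_by_step[t1][layer], features_by_step[t2][layer])
            for t1, t2 in zip(steps, steps[1:])
        ]
        sims.append(sum(pair_sims) / len(pair_sims))
    return sims
```

Layers whose score sits near 1.0 are candidates for hoisting out of the denoising loop, which is the paper's empirical starting point.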
Methodology
- Probing existing autoregressive video diffusers – The authors instrumented popular causal diffusion models and measured feature similarity across denoising timesteps. They observed that the first few layers change very little, indicating redundant computation.
- Analyzing attention patterns – By visualizing attention maps, they found that deeper layers attend sparsely across frames, mainly concentrating on the current frame’s pixels.
- Designing SCD –
- Causal Transformer Encoder: Takes the full video sequence (or a sliding window) as input and computes a single set of temporally‑aware embeddings per frame, respecting the uni‑directional cause‑effect constraint.
- Diffusion Decoder: A shallow, frame‑wise UNet‑style network that runs the standard diffusion denoising steps independently for each frame, using the encoder’s embeddings as conditioning. Because the encoder runs only once, the expensive temporal reasoning is not repeated at every denoising step.
- Training regime – The encoder and decoder are trained jointly with a standard diffusion loss, but the encoder’s parameters are frozen after a short “causal pre‑training” phase to stabilize temporal representations.
- Evaluation – The model is benchmarked on synthetic datasets (e.g., Moving MNIST, CLEVR‑Video) and real video corpora (e.g., Kinetics‑600, UCF‑101) using both quantitative metrics and human preference studies.
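The two‑stage split described above can be sketched end to end. `encoder` and `denoise_step` below are hypothetical stand‑ins for the paper's modules; the point is the control flow: temporal reasoning once per sequence, denoising many times per frame.

```python
def generate_video(noisy_frames, encoder, denoise_step, num_steps):
    """Separable pipeline sketch: the causal encoder makes a single
    pass over the sequence; the lightweight decoder then denoises
    each frame independently, conditioned on its cached embedding."""
    # Stage 1: once-per-sequence temporal reasoning.
    embeddings = encoder(noisy_frames)  # one embedding per frame

    # Stage 2: per-frame multi-step denoising; no cross-frame
    # attention is recomputed inside this loop.
    video = []
    for frame, emb in zip(noisy_frames, embeddings):
        x = frame
        for t in range(num_steps, 0, -1):  # t = num_steps .. 1
            x = denoise_step(x, emb, t)
        video.append(x)
    return video
```

In the baseline design, the equivalent of the `encoder` call sits inside the inner loop and runs `num_steps` times per frame; hoisting it out is the source of the reported speedup.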
Results & Findings
| Metric | Baseline Causal Diffuser | SCD (ours) |
|---|---|---|
| Throughput (frames/s, higher is better) | 4.2 | 9.8 (+133 %) |
| Per‑frame latency (lower is better) | 240 ms | 92 ms (‑62 %) |
| FID (lower is better) | 28.4 | 27.9 |
| CLIP‑Score (higher is better) | 0.71 | 0.73 |
| Human preference | 48 % | 52 % |
- Quality: SCD matches or slightly exceeds the visual fidelity of the strongest causal diffusion baselines across all datasets.
- Speed: Because temporal reasoning is performed once per frame, the overall generation pipeline is more than twice as fast, with latency dropping below the 100 ms threshold that many interactive applications target.
- Scalability: Experiments scaling the number of frames (up to 64) show that SCD’s runtime grows linearly with frame count, whereas the baseline’s cost grows super‑linearly due to repeated cross‑frame attention.
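The scaling contrast in the last bullet can be illustrated with a toy cost model. The window size and unit costs below are invented for illustration, not measured values from the paper:

```python
def baseline_cost(num_frames, num_steps):
    """Baseline causal diffuser: full cross-frame attention
    (quadratic in frame count) re-runs at every denoising step."""
    return num_steps * num_frames ** 2

def scd_cost(num_frames, num_steps, window=8):
    """SCD: a windowed causal encoder runs once (linear in frame
    count for a fixed window), plus per-frame decoder passes."""
    return num_frames * window + num_steps * num_frames
```

Under this model, doubling the frame count exactly doubles the SCD figure but quadruples the baseline's dominant term, matching the linear‑vs‑super‑linear trend reported in the experiments.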
Practical Implications
- Real‑time video synthesis: The reduced latency makes SCD viable for interactive tools such as AI‑assisted video editing, live‑stream overlays, or game asset generation where sub‑100 ms response times are critical.
- Edge deployment: The lightweight decoder can run on consumer GPUs or even high‑end mobile chips, while the encoder can be offloaded to a server or executed once and cached for repeated renders.
- Modular pipelines: Since the temporal encoder is decoupled, developers can swap in alternative causal transformers (e.g., with larger context windows or domain‑specific pre‑training) without retraining the diffusion decoder.
- Cost savings: Faster throughput directly translates to lower cloud‑compute bills for large‑scale video generation services (e.g., synthetic data creation for training autonomous‑driving models).
- Research reuse: The clear separation of concerns provides a clean test‑bed for studying causal reasoning in other generative domains like audio or text‑to‑video models.
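The encoder‑caching idea in the edge‑deployment bullet amounts to memoizing the expensive temporal pass per clip. A minimal sketch, where `encode_fn` and the clip‑id keying are assumptions rather than the released API:

```python
class CachedEncoder:
    """Runs the heavy temporal encoder once per clip and caches the
    per-frame embeddings so repeated decoder renders reuse them."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self._cache = {}
        self.calls = 0  # how many times the real encoder actually ran

    def __call__(self, clip_id, frames):
        if clip_id not in self._cache:
            self.calls += 1
            self._cache[clip_id] = self.encode_fn(frames)
        return self._cache[clip_id]
```

A server could hold the cache while a lightweight client re-runs only the frame‑wise decoder, e.g. when re-rendering the same clip with different decoder settings.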
Limitations & Future Work
- Encoder freezing: The current training recipe freezes the encoder after a short pre‑training phase, which may limit the model’s ability to adapt temporal representations for highly diverse video domains.
- Long‑range dependencies: While SCD handles moderate sequence lengths efficiently, extremely long videos (> 200 frames) still suffer from memory constraints in the transformer encoder.
- Domain generalization: The paper focuses on relatively clean benchmarks; performance on highly noisy, real‑world footage (e.g., handheld camera shake) remains to be explored.
- Future directions: The authors suggest (1) integrating memory‑efficient attention variants to push sequence length limits, (2) jointly fine‑tuning encoder and decoder with curriculum learning, and (3) extending the separable paradigm to multimodal diffusion (e.g., video‑plus‑audio generation).
Authors
- Xingjian Bai
- Guande He
- Zhengqi Li
- Eli Shechtman
- Xun Huang
- Zongze Wu
Paper Information
- arXiv ID: 2602.10095v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: February 10, 2026