[Paper] Causality in Video Diffusers is Separable from Denoising
Source: arXiv - 2602.10095v1
Overview
The paper “Causality in Video Diffusers is Separable from Denoising” shows that the temporal‑causal reasoning required for video generation can be cleanly split from the heavy, multi‑step denoising loop that powers diffusion models. By decoupling these two concerns, the authors build a faster, lower‑latency video diffusion system that still matches—or even beats—the visual quality of existing causal video generators.
Key Contributions
- Empirical discovery of separability: Demonstrates that early diffusion layers produce almost identical features across denoising steps, while deeper layers focus on intra‑frame rendering with only sparse cross‑frame attention.
- Separable Causal Diffusion (SCD) architecture: Introduces a two‑stage design—(1) a causal transformer encoder that performs once‑per‑frame temporal reasoning, and (2) a lightweight diffusion decoder that handles the multi‑step frame‑wise rendering.
- Efficiency gains: Achieves up to ~2‑3× higher throughput and significantly lower per‑frame latency compared to state‑of‑the‑art causal video diffusion baselines.
- Broad evaluation: Validates SCD both in pre‑training on large synthetic video corpora and in downstream post‑training on real‑world video benchmarks, showing equal or superior generation quality (FID, CLIP‑Score, etc.).
- Open‑source tooling: Provides code and pretrained checkpoints that can be dropped into existing video generation pipelines.
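The separability finding in the first bullet can be probed with a simple measurement: compare each layer's features at consecutive denoising steps and see how close they stay. A minimal plain‑Python sketch (the feature layout and dictionaries are hypothetical, not the paper's actual instrumentation):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def layer_redundancy(features_by_step):
    """features_by_step: {timestep: [per-layer feature vectors]}.
    For each layer, returns the mean cosine similarity of its features
    between consecutive denoising steps. A value near 1.0 means the
    layer recomputes nearly identical features at every step."""
    steps = sorted(features_by_step)
    n_layers = len(features_by_step[steps[0]])
    sims = []
    for layer in range(n_layers):
        pair_sims = [
            cosine_sim(features_by_step[t1][layer], features_by_step[t2][layer])
            for t1, t2 in zip(steps, steps[1:])
        ]
        sims.append(sum(pair_sims) / len(pair_sims))
    return sims
```

Layers whose score sits near 1.0 are candidates for hoisting out of the denoising loop, which is the paper's empirical starting point.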
Methodology
- Probing existing autoregressive video diffusers – The authors instrumented popular causal diffusion models and measured feature similarity across denoising timesteps. They observed that the first few layers change very little, indicating redundant computation.
- Analyzing attention patterns – By visualizing attention maps, they found that deeper layers attend sparsely across frames, mainly concentrating on the current frame’s pixels.
- Designing SCD –
- Causal Transformer Encoder: Takes the full video sequence (or a sliding window) as input and computes a single set of temporally‑aware embeddings per frame, respecting the uni‑directional cause‑effect constraint.
- Diffusion Decoder: A shallow, frame‑wise UNet‑style network that runs the standard diffusion denoising steps independently for each frame, using the encoder’s embeddings as conditioning. Because the encoder runs only once, the expensive temporal reasoning is not repeated at every denoising step.
- Training regime – The encoder and decoder are trained jointly with a standard diffusion loss, but the encoder’s parameters are frozen after a short “causal pre‑training” phase to stabilize temporal representations.
- Evaluation – The model is benchmarked on synthetic datasets (e.g., Moving MNIST, CLEVR‑Video) and real video corpora (e.g., Kinetics‑600, UCF‑101) using both quantitative metrics and human preference studies.
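The two‑stage split described above can be sketched end to end. `encoder` and `denoise_step` below are hypothetical stand‑ins for the paper's modules; the point is the control flow: temporal reasoning once per sequence, denoising many times per frame.

```python
def generate_video(noisy_frames, encoder, denoise_step, num_steps):
    """Separable pipeline sketch: the causal encoder makes a single
    pass over the sequence; the lightweight decoder then denoises
    each frame independently, conditioned on its cached embedding."""
    # Stage 1: once-per-sequence temporal reasoning.
    embeddings = encoder(noisy_frames)  # one embedding per frame

    # Stage 2: per-frame multi-step denoising; no cross-frame
    # attention is recomputed inside this loop.
    video = []
    for frame, emb in zip(noisy_frames, embeddings):
        x = frame
        for t in range(num_steps, 0, -1):  # t = num_steps .. 1
            x = denoise_step(x, emb, t)
        video.append(x)
    return video
```

In the baseline design, the equivalent of the `encoder` call sits inside the inner loop and runs `num_steps` times per frame; hoisting it out is the source of the reported speedup.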
Results & Findings
| Metric | Baseline Causal Diffuser | SCD (ours) |
|---|---|---|
| Throughput (frames/s, higher is better) | 4.2 | 9.8 (+133 %) |
| Per‑frame latency (lower is better) | 240 ms | 92 ms (‑62 %) |
| FID (lower is better) | 28.4 | 27.9 |
| CLIP‑Score (higher is better) | 0.71 | 0.73 |
| Human preference | 48 % | 52 % |
- Quality: SCD matches or slightly exceeds the visual fidelity of the strongest causal diffusion baselines across all datasets.
- Speed: Because temporal reasoning is performed once per frame, the overall generation pipeline is more than twice as fast, with latency dropping below the 100 ms threshold that many interactive applications target.
- Scalability: Experiments scaling the number of frames (up to 64) show that SCD’s runtime grows linearly with frame count, whereas the baseline’s cost grows super‑linearly due to repeated cross‑frame attention.
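The scaling contrast in the last bullet can be illustrated with a toy cost model. The window size and unit costs below are invented for illustration, not measured values from the paper:

```python
def baseline_cost(num_frames, num_steps):
    """Baseline causal diffuser: full cross-frame attention
    (quadratic in frame count) re-runs at every denoising step."""
    return num_steps * num_frames ** 2

def scd_cost(num_frames, num_steps, window=8):
    """SCD: a windowed causal encoder runs once (linear in frame
    count for a fixed window), plus per-frame decoder passes."""
    return num_frames * window + num_steps * num_frames
```

Under this model, doubling the frame count exactly doubles the SCD figure but quadruples the baseline's dominant term, matching the linear‑vs‑super‑linear trend reported in the experiments.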
Practical Implications
- Real‑time video synthesis: The reduced latency makes SCD viable for interactive tools such as AI‑assisted video editing, live‑stream overlays, or game asset generation where sub‑100 ms response times are critical.
- Edge deployment: The lightweight decoder can run on consumer GPUs or even high‑end mobile chips, while the encoder can be offloaded to a server or executed once and cached for repeated renders.
- Modular pipelines: Since the temporal encoder is decoupled, developers can swap in alternative causal transformers (e.g., with larger context windows or domain‑specific pre‑training) without retraining the diffusion decoder.
- Cost savings: Faster throughput directly translates to lower cloud‑compute bills for large‑scale video generation services (e.g., synthetic data creation for training autonomous‑driving models).
- Research reuse: The clear separation of concerns provides a clean test‑bed for studying causal reasoning in other generative domains like audio or text‑to‑video models.
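The encoder‑caching idea in the edge‑deployment bullet amounts to memoizing the expensive temporal pass per clip. A minimal sketch, where `encode_fn` and the clip‑id keying are assumptions rather than the released API:

```python
class CachedEncoder:
    """Runs the heavy temporal encoder once per clip and caches the
    per-frame embeddings so repeated decoder renders reuse them."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self._cache = {}
        self.calls = 0  # how many times the real encoder actually ran

    def __call__(self, clip_id, frames):
        if clip_id not in self._cache:
            self.calls += 1
            self._cache[clip_id] = self.encode_fn(frames)
        return self._cache[clip_id]
```

A server could hold the cache while a lightweight client re-runs only the frame‑wise decoder, e.g. when re-rendering the same clip with different decoder settings.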
Limitations & Future Work
- Encoder freezing: The current training recipe freezes the encoder after a short pre‑training phase, which may limit the model’s ability to adapt temporal representations for highly diverse video domains.
- Long‑range dependencies: While SCD handles moderate sequence lengths efficiently, extremely long videos (> 200 frames) still suffer from memory constraints in the transformer encoder.
- Domain generalization: The paper focuses on relatively clean benchmarks; performance on highly noisy, real‑world footage (e.g., handheld camera shake) remains to be explored.
- Future directions: The authors suggest (1) integrating memory‑efficient attention variants to push sequence length limits, (2) jointly fine‑tuning encoder and decoder with curriculum learning, and (3) extending the separable paradigm to multimodal diffusion (e.g., video‑plus‑audio generation).
Authors
- Xingjian Bai
- Guande He
- Zhengqi Li
- Eli Shechtman
- Xun Huang
- Zongze Wu
Paper Information
- arXiv ID: 2602.10095v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: February 10, 2026