[Paper] YoCausal: How Far is Video Generation from World Model? A Causality Perspective
Source: arXiv - 2605.30346v1
Overview
The paper asks a simple question: do modern video diffusion models (VDMs) actually understand cause and effect, or do they merely memorize temporal patterns? To answer it, the authors introduce YoCausal, a two‑level benchmark inspired by the “violation‑of‑expectation” experiments used in developmental psychology. Reversing real‑world videos in time yields natural counterfactuals, which let researchers probe first a model’s sense of the arrow of time and then its deeper causal reasoning.
Key Contributions
- YoCausal benchmark – a scalable, real‑world evaluation suite that requires no synthetic data generation.
- Reverse Surprise Index (RSI) – a metric that measures how surprised a VDM is by a temporally reversed clip, capturing its perception of temporal directionality.
- Causality Cognition Index (CCI) – a novel two‑step procedure that uses a vision‑language model (VLM) to split videos into causal vs. non‑causal groups, isolating genuine causal understanding from mere temporal bias.
- Comprehensive empirical study – 13 state‑of‑the‑art VDMs are evaluated on YoCausal, revealing that strong arrow‑of‑time detection does not guarantee causal comprehension.
- Human baseline – the authors collect human judgments to quantify the gap between current models and human‑level causal cognition.
Methodology
Dataset Construction
- Start with a large collection of everyday videos (e.g., cooking, sports, daily activities).
- For each clip, create a counterfactual version by simply reversing the temporal order. No extra labeling or simulation is needed, keeping the benchmark inexpensive and extensible.
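The reversal step above can be sketched in a few lines. This is a minimal illustration, assuming a clip is already decoded into a NumPy array of shape `(T, H, W, C)`; the helper names `reverse_clip` and `make_pair` are hypothetical and not taken from the paper's codebase.

```python
import numpy as np

def reverse_clip(frames: np.ndarray) -> np.ndarray:
    """Return a copy of the clip with its frame order reversed along axis 0 (time)."""
    if frames.ndim != 4:
        raise ValueError("expected frames with shape (T, H, W, C)")
    return frames[::-1].copy()

def make_pair(frames: np.ndarray) -> dict:
    """Bundle the original clip with its time-reversed counterfactual."""
    return {"forward": frames, "reversed": reverse_clip(frames)}

# Toy clip: 4 frames of 2x2 RGB, filled with increasing values.
clip = np.arange(4 * 2 * 2 * 3, dtype=np.float32).reshape(4, 2, 2, 3)
pair = make_pair(clip)
```

Only the frame order changes; pixel content is untouched, which is why no extra labeling is required.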
Level 1 – Arrow‑of‑Time Evaluation (RSI)
- Feed both the original and reversed clips into a VDM that has been trained to denoise video frames.
- Compute the denoising loss for each direction; a larger loss on the reversed clip indicates the model perceives a temporal “surprise.”
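- The Reverse Surprise Index is the normalized difference between the reversed‑clip loss and the original‑clip loss, so a positive value means the reversed clip is the more surprising one.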
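The Level 1 procedure can be sketched as follows. The summary does not give the exact normalization, so the form `(L_rev - L_fwd) / (L_rev + L_fwd)` used here is an assumption. The loss is also simplified: a real diffusion model is usually scored on noise prediction over sampled timesteps. `toy_denoiser` is a hypothetical stand-in for a VDM that has learned that brightness rises over time.

```python
import numpy as np

def denoising_loss(denoise_fn, clip, noise_std=0.1, seed=0):
    """Mean squared error between the denoiser's output and the clean clip."""
    rng = np.random.default_rng(seed)
    noisy = clip + noise_std * rng.standard_normal(clip.shape)
    return float(np.mean((denoise_fn(noisy) - clip) ** 2))

def reverse_surprise_index(loss_forward, loss_reversed):
    """Assumed normalization: (L_rev - L_fwd) / (L_rev + L_fwd), in [-1, 1]."""
    total = loss_forward + loss_reversed
    if total <= 0:
        raise ValueError("losses must be non-negative and not both zero")
    return (loss_reversed - loss_forward) / total

def toy_denoiser(noisy, drift=1.0):
    """Hypothetical model: predicts each frame as the previous frame plus `drift`."""
    estimate = noisy.copy()
    estimate[1:] = noisy[:-1] + drift
    return estimate

# Toy clip: 8 frames of 16x16 pixels whose brightness rises by 1.0 per frame.
clip = np.arange(8, dtype=np.float64).reshape(8, 1, 1, 1) * np.ones((8, 16, 16, 1))
loss_fwd = denoising_loss(toy_denoiser, clip)
loss_rev = denoising_loss(toy_denoiser, clip[::-1].copy())
rsi = reverse_surprise_index(loss_fwd, loss_rev)
print(f"forward={loss_fwd:.4f} reversed={loss_rev:.4f} RSI={rsi:.3f}")
```

With this normalization, an RSI near 0 means the model is indifferent to playback direction, while values approaching 1 mean the reversed clip is far harder to denoise.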
Level 2 – Causal Reasoning Evaluation (CCI)
- Use a pre‑trained vision‑language model (e.g., CLIP) to score how well a caption describing a causal relationship (e.g., “the ball falls because the hand releases it”) matches each video.
- Split the dataset into causal (high caption‑video alignment) and non‑causal (low alignment) subsets.
- Apply the RSI separately on each subset; the Causality Cognition Index is the gap between the causal and non‑causal RSI scores. A larger gap suggests the VDM is sensitive to genuine cause‑effect rather than just temporal regularities.
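A minimal sketch of the Level 2 computation, assuming per‑video RSI values and VLM alignment scores are already available. The threshold of 0.5, the use of a simple mean per subset, and the function name `causality_cognition_index` are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def causality_cognition_index(rsi_per_video, causal_scores, threshold=0.5):
    """Gap between mean RSI on causal videos and mean RSI on non-causal videos.

    A video counts as causal when its alignment score is >= threshold
    (the threshold value is an assumption made for this sketch).
    """
    rsi = np.asarray(rsi_per_video, dtype=float)
    scores = np.asarray(causal_scores, dtype=float)
    if rsi.shape != scores.shape:
        raise ValueError("need one alignment score per RSI value")
    causal = scores >= threshold
    if causal.all() or not causal.any():
        raise ValueError("both the causal and non-causal subsets must be non-empty")
    return float(rsi[causal].mean() - rsi[~causal].mean())

# Made-up numbers for four videos: two score as causal, two as non-causal.
rsi_values = [0.8, 0.6, 0.3, 0.1]
alignment = [0.9, 0.7, 0.2, 0.1]
cci = causality_cognition_index(rsi_values, alignment)
```

A model that reacts to reversal equally on both subsets would score close to zero here, however high its overall RSI.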
Human Baseline
- Human participants watch the same original/reversed pairs and rate how “natural” each direction feels. Their average scores provide a reference point for model performance.
Results & Findings
| Model (selected) | RSI (arrow‑of‑time) | CCI (causal gap) |
|---|---|---|
| VDM‑A (diffusion‑based) | 0.78 | 0.12 |
| VDM‑B (latent‑diffusion) | 0.81 | 0.09 |
| VDM‑C (flow‑based) | 0.73 | 0.05 |
| Human baseline | n/a | 0.68 |
- Arrow‑of‑time perception: Most VDMs achieve high RSI scores, meaning they can tell that a video is “backwards.”
- Causal cognition: The CCI values are far below the human baseline, indicating that models treat reversed videos similarly regardless of whether the original clip contains a clear cause‑effect chain.
- Gap analysis: Even the best‑performing VDM in the table (VDM‑A, CCI 0.12) reaches under one fifth of the human causal gap (0.68), highlighting substantial room for improvement.
Ablation studies show that larger model capacity or longer training does not automatically close the causal gap, suggesting that current diffusion objectives lack an explicit causal signal.
Practical Implications
| Area | Impact |
|---|---|
| Content creation tools | Video editors that rely on diffusion models for in‑painting or frame interpolation may produce temporally plausible but causally inconsistent results (e.g., a ball appearing to bounce before it is thrown). Understanding this limitation can guide UI designs that let users verify or correct causal mismatches. |
| Robotics & simulation | When VDMs are used to generate synthetic training data for robot perception, a lack of causal fidelity could lead to policies that fail in the real world (e.g., mis‑interpreting cause‑effect in manipulation tasks). |
| AI safety & alignment | Causal reasoning is a core component of robust decision‑making. The benchmark provides a concrete way to test whether generative models are merely “pattern‑matching” or truly modeling the underlying physics, informing safety‑critical deployments. |
| Benchmarking & research | YoCausal offers a low‑cost, extensible protocol that can be plugged into existing training pipelines, encouraging the community to incorporate causal objectives (e.g., contrastive forward‑backward losses) early in model development. |
Developers can start using the provided codebase to evaluate their own video models, compare against the published baselines, and iterate on loss functions or architectural tweaks aimed at improving CCI.
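As one illustration of what a contrastive forward‑backward loss might look like, the hinge‑style sketch below penalizes a model whenever the reversed clip is not at least `margin` harder to denoise than the original. The function name, the hinge form, and the margin value are assumptions for illustration; nothing here is taken from the paper or its codebase.

```python
def forward_backward_margin_loss(loss_forward, loss_reversed, margin=0.1):
    """Hinge penalty: zero once loss_reversed exceeds loss_forward by at least `margin`."""
    return max(0.0, margin + loss_forward - loss_reversed)

# Reversed clip already clearly harder: no penalty.
satisfied = forward_backward_margin_loss(0.2, 0.5)
# Reversed clip easier than the original: penalty grows with the violation.
violated = forward_backward_margin_loss(0.5, 0.2)
```

A term like this targets temporal direction only, so whether it would also raise CCI remains something to measure with the benchmark.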
Limitations & Future Work
- Scope of causal scenarios: The benchmark relies on textual captions to infer causality, which may miss subtle or multi‑step cause‑effect chains not captured by a single sentence.
- Dependence on VLM quality: The CCI’s separation of causal vs. non‑causal videos hinges on the vision‑language model’s alignment accuracy; biases in the VLM could propagate into the evaluation.
- Temporal granularity: Reversing the entire clip is a coarse counterfactual; finer‑grained manipulations (e.g., swapping sub‑events) could reveal more nuanced causal failures.
- Future directions: The authors suggest integrating explicit causal graph supervision, training with forward‑backward consistency losses, and expanding YoCausal to multimodal (audio‑visual) settings to better approximate how a world model would be learned from real‑world data.
Authors
- You‑Zhe Xie
- Yu‑Hsuan Li
- Jie‑Ying Lee
- Kaipeng Zhang
- Yu‑Lun Liu
- Zhixiang Wang
Paper Information
- arXiv ID: 2605.30346v1
- Categories: cs.CV
- Published: May 28, 2026