[Paper] YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Published: May 28, 2026 at 01:59 PM EDT

Source: arXiv - 2605.30346v1

Overview

The paper “YoCausal: How Far is Video Generation from World Model? A Causality Perspective” asks whether modern video diffusion models (VDMs) actually understand cause and effect or merely memorise temporal patterns. To answer this, the authors introduce YoCausal, a two‑level benchmark inspired by the “violation‑of‑expectation” experiments used in developmental psychology. Reversing real‑world videos in time yields natural counterfactuals that probe both a model’s sense of the arrow of time and its deeper causal reasoning.

Key Contributions

YoCausal benchmark – a scalable, real‑world evaluation suite that requires no synthetic data generation.
Reverse Surprise Index (RSI) – a metric that measures how surprised a VDM is by a temporally reversed clip, capturing its perception of temporal directionality.
Causality Cognition Index (CCI) – a metric computed in two steps: a vision‑language model (VLM) first splits videos into causal and non‑causal groups, and the RSI is then compared across the two groups, isolating genuine causal understanding from mere temporal bias.
Comprehensive empirical study – 13 state‑of‑the‑art VDMs are evaluated on YoCausal, revealing that strong arrow‑of‑time detection does not guarantee causal comprehension.
Human baseline – the authors collect human judgments to quantify the gap between current models and human‑level causal cognition.

Methodology

Dataset Construction
- Start with a large collection of everyday videos (e.g., cooking, sports, daily activities).
- For each clip, create a counterfactual version by simply reversing the temporal order. No extra labeling or simulation is needed, keeping the benchmark inexpensive and extensible.
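
The reversal step above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `make_counterfactual_pair` is hypothetical, and a clip is assumed to be an ordered sequence of already-decoded frames (integers stand in for frames here).

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def make_counterfactual_pair(
    frames: Sequence[Frame],
) -> Tuple[List[Frame], List[Frame]]:
    """Return (original, reversed) copies of a clip.

    Hypothetical helper: the clip is assumed to be an ordered
    sequence of decoded frames. No labels or simulation are needed.
    """
    original = list(frames)
    reversed_clip = original[::-1]  # same frames, opposite temporal order
    return original, reversed_clip


# Toy clip: integers stand in for decoded frames.
clip = [0, 1, 2, 3]
forward, backward = make_counterfactual_pair(clip)
print(forward)   # [0, 1, 2, 3]
print(backward)  # [3, 2, 1, 0]
```

The same negative-step slice works on a NumPy array whose first axis is time, so the idea carries over to real frame tensors.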
Level 1 – Arrow‑of‑Time Evaluation (RSI)
- Feed both the original and reversed clips into a VDM that has been trained to denoise video frames.
- Compute the denoising loss for each direction; a larger loss on the reversed clip indicates the model perceives a temporal “surprise.”
- The Reverse Surprise Index is the normalized difference between the two losses.
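
A rough sketch of how such an index could be computed is shown below. The summary does not state the exact normalization, so the formula used here, (L_rev - L_fwd) / (L_rev + L_fwd) over mean losses, is an assumption chosen only for illustration; the function name and the loss values are likewise hypothetical.

```python
from statistics import mean
from typing import Sequence


def reverse_surprise_index(
    loss_forward: Sequence[float],
    loss_reversed: Sequence[float],
) -> float:
    """Normalized gap between reversed-clip and forward-clip denoising loss.

    ASSUMPTION: normalization by the sum of the two mean losses.
    The paper's actual definition may differ.
    """
    if not loss_forward or len(loss_forward) != len(loss_reversed):
        raise ValueError("need equally sized, non-empty loss lists")
    l_fwd = mean(loss_forward)
    l_rev = mean(loss_reversed)
    denom = l_rev + l_fwd
    if denom == 0:
        return 0.0  # both losses are zero: no measurable surprise
    return (l_rev - l_fwd) / denom


# Made-up per-clip denoising losses, for illustration only.
fwd_losses = [0.10, 0.12]
rev_losses = [0.30, 0.32]
rsi = reverse_surprise_index(fwd_losses, rev_losses)
print(round(rsi, 3))  # 0.476
```

Under this assumed form, the score is positive when the reversed clip is harder to denoise, zero when the model is indifferent to direction, and bounded by 1 for non-negative losses.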
Level 2 – Causal Reasoning Evaluation (CCI)
- Use a pre‑trained vision‑language model (e.g., CLIP) to score how well a caption describing a causal relationship (e.g., “the ball falls because the hand releases it”) matches each video.
- Split the dataset into causal (high caption‑video alignment) and non‑causal (low alignment) subsets.
- Apply the RSI separately on each subset; the Causality Cognition Index is the gap between the causal and non‑causal RSI scores. A larger gap suggests the VDM is sensitive to genuine cause‑effect rather than just temporal regularities.
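
The split-then-compare procedure above might look like the following sketch. It assumes each clip already has a per-clip RSI and a VLM alignment score for its causal caption; the `ClipScore` type, the 0.5 threshold, the use of a per-subset mean, and the numbers are all hypothetical choices made for illustration.

```python
from statistics import mean
from typing import NamedTuple, Sequence


class ClipScore(NamedTuple):
    rsi: float           # per-clip reverse-surprise score (assumed given)
    causal_score: float  # VLM caption-video alignment (assumed given)


def causality_cognition_index(
    clips: Sequence[ClipScore],
    threshold: float = 0.5,  # ASSUMPTION: arbitrary split point
) -> float:
    """Gap between mean RSI on causal clips and on non-causal clips."""
    causal = [c.rsi for c in clips if c.causal_score >= threshold]
    non_causal = [c.rsi for c in clips if c.causal_score < threshold]
    if not causal or not non_causal:
        raise ValueError("both subsets must be non-empty")
    return mean(causal) - mean(non_causal)


# Made-up scores, for illustration only.
scores = [
    ClipScore(rsi=0.6, causal_score=0.9),
    ClipScore(rsi=0.8, causal_score=0.7),
    ClipScore(rsi=0.5, causal_score=0.2),
    ClipScore(rsi=0.5, causal_score=0.1),
]
cci = causality_cognition_index(scores)
print(round(cci, 3))  # 0.2
```

A value near zero would mean the model reacts to reversal equally in both subsets, which is the pattern the benchmark is designed to expose.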
Human Baseline
- Human participants watch the same original/reversed pairs and rate how “natural” each direction feels. Their average scores provide a reference point for model performance.

Results & Findings

Model (selected)	RSI (arrow‑of‑time)	CCI (causal gap)
VDM‑A (diffusion‑based)	0.78	0.12
VDM‑B (latent‑diffusion)	0.81	0.09
VDM‑C (flow‑based)	0.73	0.05
Human	n/a	0.68

Arrow‑of‑time perception: Most VDMs achieve high RSI scores, meaning they can tell that a video is “backwards.”
Causal cognition: The CCI values are far below the human baseline, indicating that models treat reversed videos similarly regardless of whether the original clip contains a clear cause‑effect chain.
Gap analysis: Even the best‑performing VDM listed above reaches a CCI of only 0.12 against a human score of 0.68, i.e. under one‑fifth of the human causal gap, highlighting substantial room for improvement.

Ablation studies show that larger model capacity or longer training does not automatically close the causal gap, suggesting that current diffusion objectives lack an explicit causal signal.

Practical Implications

Area	Impact
Content creation tools	Video editors that rely on diffusion models for in‑painting or frame interpolation may produce temporally plausible but causally inconsistent results (e.g., a ball appearing to bounce before it is thrown). Understanding this limitation can guide UI designs that let users verify or correct causal mismatches.
Robotics & simulation	When VDMs are used to generate synthetic training data for robot perception, a lack of causal fidelity could lead to policies that fail in the real world (e.g., mis‑interpreting cause‑effect in manipulation tasks).
AI safety & alignment	Causal reasoning is a core component of robust decision‑making. The benchmark provides a concrete way to test whether generative models are merely “pattern‑matching” or truly modeling the underlying physics, informing safety‑critical deployments.
Benchmarking & research	YoCausal offers a low‑cost, extensible protocol that can be plugged into existing training pipelines, encouraging the community to incorporate causal objectives (e.g., contrastive forward‑backward losses) early in model development.

Developers can start using the provided codebase to evaluate their own video models, compare against the published baselines, and iterate on loss functions or architectural tweaks aimed at improving CCI.
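
As one illustration of the kind of forward‑backward objective mentioned above, the sketch below uses a simple hinge (margin) penalty on scalar losses. This is a swapped-in stand-in, not a loss proposed in the paper: the function name, the margin value, and the hinge form are all assumptions, and a real training setup would express the same idea in an autodiff framework.

```python
def forward_backward_margin_loss(
    loss_forward: float,
    loss_reversed: float,
    margin: float = 0.1,  # ASSUMPTION: illustrative margin
) -> float:
    """Hinge penalty encouraging reversed clips to be 'more surprising'.

    The penalty is zero once the reversed-clip loss exceeds the
    forward-clip loss by at least `margin`, and grows linearly otherwise.
    """
    return max(0.0, margin - (loss_reversed - loss_forward))


# Model already separates the two directions: no penalty.
print(forward_backward_margin_loss(0.11, 0.31))  # 0.0

# Model is indifferent to direction: penalty equals the margin.
print(forward_backward_margin_loss(0.3, 0.3))    # 0.1
```

Such a term would typically be added to the usual denoising objective with a small weight; whether it improves CCI rather than only RSI is an open question that the benchmark could be used to test.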

Limitations & Future Work

Scope of causal scenarios: The benchmark relies on textual captions to infer causality, which may miss subtle or multi‑step cause‑effect chains not captured by a single sentence.
Dependence on VLM quality: The CCI’s separation of causal vs. non‑causal videos hinges on the vision‑language model’s alignment accuracy; biases in the VLM could propagate into the evaluation.
Temporal granularity: Reversing the entire clip is a coarse counterfactual; finer‑grained manipulations (e.g., swapping sub‑events) could reveal more nuanced causal failures.
Future directions: The authors suggest integrating explicit causal graph supervision, training with forward‑backward consistency losses, and expanding YoCausal to multimodal (audio‑visual) settings to better approximate real‑world world‑model learning.

Authors

You‑Zhe Xie
Yu‑Hsuan Li
Jie‑Ying Lee
Kaipeng Zhang
Yu‑Lun Liu
Zhixiang Wang

Paper Information

arXiv ID: 2605.30346v1
Categories: cs.CV
Published: May 28, 2026
