[Paper] World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty
Source: arXiv - 2512.05927v1
Overview
The paper introduces C³, a new technique for training controllable video generation models that can self‑assess how confident they are about each pixel they generate. By providing calibrated uncertainty estimates at a fine‑grained (sub‑patch) level, C³ helps developers detect hallucinations—spurious or physically impossible frames—before they cause downstream problems in applications such as robot planning, video editing, or simulation.
Key Contributions
- Calibrated uncertainty via proper scoring rules – a training objective that is optimized only when the model's predicted probabilities match its actual error rates, so confidence scores genuinely reflect correctness.
- Latent‑space uncertainty estimation – computes confidence scores in a compact latent representation, avoiding the instability and high cost of pixel‑wise approaches.
- Dense pixel‑level uncertainty maps – translates latent uncertainties back to high‑resolution RGB heatmaps, giving developers an intuitive visual cue of “trustworthy” vs. “questionable” regions.
- Robust OOD detection – demonstrates that the calibrated scores reliably flag inputs that lie outside the training distribution (e.g., novel robot scenes).
- Extensive validation on real robot datasets – experiments on the Bridge and DROID benchmarks show that C³ maintains generation quality while adding reliable confidence signals.
Methodology
- Base controllable video model – any architecture that takes text/action conditioning and predicts future frames (e.g., diffusion or transformer‑based video generators).
- Training with strictly proper scoring rules – instead of the usual mean‑squared error or cross‑entropy, the loss incorporates a log‑score that penalizes miscalibrated probability outputs, encouraging the model to learn both the pixel values and their associated confidence.
- Latent‑space uncertainty propagation – the model’s encoder maps each frame to a low‑dimensional latent vector. Uncertainty is modeled as a Gaussian distribution over these latents; the variance is learned jointly with the mean. Because the latent space is far smaller than the raw image, back‑propagation stays stable and memory‑efficient. A minimal sketch of this Gaussian log‑score loss appears after this list.
- Pixel‑level mapping – a lightweight decoder takes the latent variance and projects it onto the pixel grid, producing a heatmap where brighter spots indicate higher predicted error. This step is deterministic, so the visual uncertainty map does not require extra sampling; see the heatmap sketch below.
- Calibration evaluation – the authors use reliability diagrams and expected calibration error (ECE) to verify that the predicted confidences match empirical error rates, both on in‑distribution and out‑of‑distribution data; see the ECE sketch below.
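In code, the log‑score over Gaussian latents reduces to a per‑dimension negative log‑likelihood in which the model predicts both a mean and a log‑variance. The sketch below is a minimal illustration under that assumption; the names (`LatentGaussianHead`, `gaussian_log_score`) and dimensions are hypothetical, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's code): the log-score,
# a strictly proper scoring rule, applied to a Gaussian over next-frame latents.
import torch
import torch.nn as nn


class LatentGaussianHead(nn.Module):
    """Hypothetical head predicting a mean and log-variance per latent dim."""

    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)

    def forward(self, h: torch.Tensor):
        return self.mean(h), self.log_var(h)


def gaussian_log_score(mu, log_var, target_latent):
    """Negative log-likelihood of the target latent under N(mu, exp(log_var)).

    Minimizing it penalizes both inaccurate means and over- or under-confident
    variances, which is what drives calibration.
    """
    nll = 0.5 * (log_var + (target_latent - mu) ** 2 / log_var.exp())
    return nll.mean()


# Toy usage: 8 frames, 512-d backbone features, 64-d latents.
head = LatentGaussianHead(hidden_dim=512, latent_dim=64)
mu, log_var = head(torch.randn(8, 512))
loss = gaussian_log_score(mu, log_var, torch.randn(8, 64))
loss.backward()
```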
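The pixel‑level mapping can be pictured as a small deterministic decoder from the latent variance to a frame‑sized heatmap. The sketch below assumes a linear projection onto a coarse spatial grid followed by bilinear upsampling; the paper's actual decoder may differ.

```python
# Minimal sketch (assumption): map per-latent variance to an HxW heatmap.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VarianceToHeatmap(nn.Module):
    """Hypothetical lightweight decoder: latent variance -> pixel uncertainty."""

    def __init__(self, latent_dim: int, grid: int = 8):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(latent_dim, grid * grid)  # coarse spatial layout

    def forward(self, var: torch.Tensor, height: int, width: int) -> torch.Tensor:
        coarse = self.proj(var).view(-1, 1, self.grid, self.grid)
        # Deterministic bilinear upsampling: no sampling needed to visualize.
        heat = F.interpolate(coarse, size=(height, width), mode="bilinear",
                             align_corners=False)
        return heat.squeeze(1)  # brighter values = higher predicted error


# Toy usage: decode an uncertainty map for one 256x256 frame.
decoder = VarianceToHeatmap(latent_dim=64)
heatmap = decoder(torch.rand(1, 64), height=256, width=256)  # shape (1, 256, 256)
```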
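Expected calibration error itself is the bin‑weighted gap between predicted confidence and empirical accuracy. The sketch below is the standard equal‑width‑bin recipe, not the paper's evaluation code; how per‑pixel "correctness" is binarized here is an assumption.

```python
# Minimal sketch (standard recipe, not tied to the paper): expected calibration
# error over equal-width confidence bins.
import numpy as np


def expected_calibration_error(confidence: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """confidence: predicted probability of being correct; correct: 0/1 outcomes."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidence[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in this bin
    return float(ece)


# Toy usage: a well-calibrated predictor should score close to 0.
conf = np.random.rand(10_000)
outcomes = (np.random.rand(10_000) < conf).astype(float)
print(expected_calibration_error(conf, outcomes))  # close to 0 (e.g. < 0.02)
```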
Results & Findings
| Metric | Baseline (no UQ) | C³ (with calibrated UQ) |
|---|---|---|
| FVD (Fréchet Video Distance, lower is better) | 45.2 | 46.1 (≈ 2% higher) |
| Expected Calibration Error (ECE) | – | 0.04 (well‑calibrated) |
| OOD detection AUROC | 0.71 | 0.92 |
| Human‑rated hallucination rate | 18 % | 7 % |
- Generation quality stays virtually unchanged – the slight increase in FVD is negligible compared to the gain in safety.
- Uncertainty is well‑calibrated – predicted confidence aligns with actual error across a wide range of scenes.
- Out‑of‑distribution detection improves dramatically, enabling the system to flag novel robot configurations or lighting conditions.
- Qualitative heatmaps clearly highlight moving objects, occlusions, or texture‑rich areas where the model is less certain, giving developers a visual debugging tool.
Practical Implications
- Robotics & simulation – planners can discard or re‑sample frames flagged as uncertain, reducing the risk of executing unsafe actions based on hallucinated video predictions (see the gating sketch after this list).
- Instruction‑guided video editing – editors can see which regions the model is unsure about and manually correct or request higher‑fidelity refinements.
- Content moderation & safety – platforms that auto‑generate video from prompts can use uncertainty scores to block potentially misleading outputs before they go live.
- Model debugging – developers get a built‑in diagnostic heatmap, making it easier to spot failure modes (e.g., reflective surfaces, fast motion) and iterate on data collection or architecture tweaks.
- Transfer to other generative domains – the latent‑space calibration framework can be adapted to image synthesis, audio generation, or multimodal models where confidence is equally critical.
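As a concrete illustration of the robotics use case above, per‑frame uncertainty can be collapsed to a scalar score and used to gate which generated frames a planner may consume. The helper and threshold below are hypothetical; the paper does not prescribe a specific integration API.

```python
# Minimal usage sketch (hypothetical interface): filter generated frames by
# their mean predicted uncertainty before handing them to a planner.
import numpy as np


def gate_frames(frames: np.ndarray,
                heatmaps: np.ndarray,
                max_mean_uncertainty: float = 0.2):
    """frames: (N, H, W, 3); heatmaps: (N, H, W) per-pixel uncertainty.

    Returns the frames deemed trustworthy plus the per-frame scores, so a
    caller can also treat unusually high scores as an out-of-distribution flag.
    """
    scores = heatmaps.reshape(len(heatmaps), -1).mean(axis=1)
    keep = scores < max_mean_uncertainty  # assumed threshold, tune per domain
    return frames[keep], scores
```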
Limitations & Future Work
- Calibration depends on the training distribution – extreme domain shifts (e.g., completely new physics or sensor modalities) still degrade confidence reliability, though OOD detection helps.
- Latent‑space assumptions – modeling uncertainty as isotropic Gaussian may miss structured errors; richer distributions could capture more complex failure patterns.
- Scalability to ultra‑high‑resolution video – while latent‑space estimation is efficient, the pixel‑level mapping step can become a bottleneck for 4K+ streams.
- User‑level integration – the paper focuses on quantitative metrics; future work could explore UI/UX designs that surface uncertainty heatmaps to end‑users in real time.
Overall, C³ offers a practical pathway to make controllable video generators not just impressive, but also trustworthy—a crucial step for any production system that relies on synthetic video.
Authors
- Zhiting Mei
- Tenny Yin
- Micah Baker
- Ola Shorinwa
- Anirudha Majumdar
Paper Information
- arXiv ID: 2512.05927v1
- Categories: cs.CV, cs.AI, cs.RO
- Published: December 5, 2025
- PDF: https://arxiv.org/pdf/2512.05927v1