[Paper] World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty
Source: arXiv - 2512.05927v1
Overview
The paper introduces C³, a new technique for training controllable video generation models that can self‑assess how confident they are about each pixel they generate. By providing calibrated uncertainty estimates at a fine‑grained (sub‑patch) level, C³ helps developers detect hallucinations—spurious or physically impossible frames—before they cause downstream problems in applications such as robot planning, video editing, or simulation.
Key Contributions
- Calibrated uncertainty via proper scoring rules – a training objective that is optimized only when the model's predicted probabilities match its actual error rates, so confidence scores genuinely reflect correctness.
- Latent‑space uncertainty estimation – computes confidence scores in a compact latent representation, avoiding the instability and high cost of pixel‑wise approaches.
- Dense pixel‑level uncertainty maps – translates latent uncertainties back to high‑resolution RGB heatmaps, giving developers an intuitive visual cue of “trustworthy” vs. “questionable” regions.
- Robust OOD detection – demonstrates that the calibrated scores reliably flag inputs that lie outside the training distribution (e.g., novel robot scenes).
- Extensive validation on real robot datasets – experiments on the Bridge and DROID benchmarks show that C³ maintains generation quality while adding reliable confidence signals.
Methodology
- Base controllable video model – any architecture that takes text/action conditioning and predicts future frames (e.g., diffusion or transformer‑based video generators).
- Training with strictly proper scoring rules – instead of the usual mean‑squared error or cross‑entropy, the loss incorporates a log‑score that penalizes miscalibrated probability outputs, encouraging the model to learn both the pixel values and their associated confidence.
- Latent‑space uncertainty propagation – the model’s encoder maps each frame to a low‑dimensional latent vector. Uncertainty is modeled as a Gaussian distribution over these latents; the variance is learned jointly with the mean. Because the latent space is far smaller than the raw image, back‑propagation stays stable and memory‑efficient. A minimal sketch of this Gaussian log‑score loss appears after this list.
- Pixel‑level mapping – a lightweight decoder takes the latent variance and projects it onto the pixel grid, producing a heatmap where brighter spots indicate higher predicted error. This step is deterministic, so the visual uncertainty map does not require extra sampling; see the heatmap sketch below.
- Calibration evaluation – the authors use reliability diagrams and expected calibration error (ECE) to verify that the predicted confidences match empirical error rates, both on in‑distribution and out‑of‑distribution data; see the ECE sketch below.
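In code, the log‑score over Gaussian latents reduces to a per‑dimension negative log‑likelihood in which the model predicts both a mean and a log‑variance. The sketch below is a minimal illustration under that assumption; the names (`LatentGaussianHead`, `gaussian_log_score`) and dimensions are hypothetical, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's code): the log-score,
# a strictly proper scoring rule, applied to a Gaussian over next-frame latents.
import torch
import torch.nn as nn


class LatentGaussianHead(nn.Module):
    """Hypothetical head predicting a mean and log-variance per latent dim."""

    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)

    def forward(self, h: torch.Tensor):
        return self.mean(h), self.log_var(h)


def gaussian_log_score(mu, log_var, target_latent):
    """Negative log-likelihood of the target latent under N(mu, exp(log_var)).

    Minimizing it penalizes both inaccurate means and over- or under-confident
    variances, which is what drives calibration.
    """
    nll = 0.5 * (log_var + (target_latent - mu) ** 2 / log_var.exp())
    return nll.mean()


# Toy usage: 8 frames, 512-d backbone features, 64-d latents.
head = LatentGaussianHead(hidden_dim=512, latent_dim=64)
mu, log_var = head(torch.randn(8, 512))
loss = gaussian_log_score(mu, log_var, torch.randn(8, 64))
loss.backward()
```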
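The pixel‑level mapping can be pictured as a small deterministic decoder from the latent variance to a frame‑sized heatmap. The sketch below assumes a linear projection onto a coarse spatial grid followed by bilinear upsampling; the paper's actual decoder may differ.

```python
# Minimal sketch (assumption): map per-latent variance to an HxW heatmap.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VarianceToHeatmap(nn.Module):
    """Hypothetical lightweight decoder: latent variance -> pixel uncertainty."""

    def __init__(self, latent_dim: int, grid: int = 8):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(latent_dim, grid * grid)  # coarse spatial layout

    def forward(self, var: torch.Tensor, height: int, width: int) -> torch.Tensor:
        coarse = self.proj(var).view(-1, 1, self.grid, self.grid)
        # Deterministic bilinear upsampling: no sampling needed to visualize.
        heat = F.interpolate(coarse, size=(height, width), mode="bilinear",
                             align_corners=False)
        return heat.squeeze(1)  # brighter values = higher predicted error


# Toy usage: decode an uncertainty map for one 256x256 frame.
decoder = VarianceToHeatmap(latent_dim=64)
heatmap = decoder(torch.rand(1, 64), height=256, width=256)  # shape (1, 256, 256)
```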
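Expected calibration error itself is the bin‑weighted gap between predicted confidence and empirical accuracy. The sketch below is the standard equal‑width‑bin recipe, not the paper's evaluation code; how per‑pixel "correctness" is binarized here is an assumption.

```python
# Minimal sketch (standard recipe, not tied to the paper): expected calibration
# error over equal-width confidence bins.
import numpy as np


def expected_calibration_error(confidence: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """confidence: predicted probability of being correct; correct: 0/1 outcomes."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidence[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in this bin
    return float(ece)


# Toy usage: a well-calibrated predictor should score close to 0.
conf = np.random.rand(10_000)
outcomes = (np.random.rand(10_000) < conf).astype(float)
print(expected_calibration_error(conf, outcomes))  # close to 0 (e.g. < 0.02)
```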
Results & Findings
| Metric | Baseline (no UQ) | C³ (with calibrated UQ) |
|---|---|---|
| FVD (Fréchet Video Distance, lower is better) | 45.2 | 46.1 (≈ 2% higher) |
| Expected Calibration Error (ECE) | – | 0.04 (well‑calibrated) |
| OOD detection AUROC | 0.71 | 0.92 |
| Human‑rated hallucination rate | 18 % | 7 % |
- Generation quality stays virtually unchanged – the slight increase in FVD is negligible compared to the gain in safety.
- Uncertainty is well‑calibrated – predicted confidence aligns with actual error across a wide range of scenes.
- Out‑of‑distribution detection improves dramatically, enabling the system to flag novel robot configurations or lighting conditions.
- Qualitative heatmaps clearly highlight moving objects, occlusions, or texture‑rich areas where the model is less certain, giving developers a visual debugging tool.
Practical Implications
- Robotics & simulation – planners can discard or re‑sample frames flagged as uncertain, reducing the risk of executing unsafe actions based on hallucinated video predictions (see the gating sketch after this list).
- Instruction‑guided video editing – editors can see which regions the model is unsure about and manually correct or request higher‑fidelity refinements.
- Content moderation & safety – platforms that auto‑generate video from prompts can use uncertainty scores to block potentially misleading outputs before they go live.
- Model debugging – developers get a built‑in diagnostic heatmap, making it easier to spot failure modes (e.g., reflective surfaces, fast motion) and iterate on data collection or architecture tweaks.
- Transfer to other generative domains – the latent‑space calibration framework can be adapted to image synthesis, audio generation, or multimodal models where confidence is equally critical.
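As a concrete illustration of the robotics use case above, per‑frame uncertainty can be collapsed to a scalar score and used to gate which generated frames a planner may consume. The helper and threshold below are hypothetical; the paper does not prescribe a specific integration API.

```python
# Minimal usage sketch (hypothetical interface): filter generated frames by
# their mean predicted uncertainty before handing them to a planner.
import numpy as np


def gate_frames(frames: np.ndarray,
                heatmaps: np.ndarray,
                max_mean_uncertainty: float = 0.2):
    """frames: (N, H, W, 3); heatmaps: (N, H, W) per-pixel uncertainty.

    Returns the frames deemed trustworthy plus the per-frame scores, so a
    caller can also treat unusually high scores as an out-of-distribution flag.
    """
    scores = heatmaps.reshape(len(heatmaps), -1).mean(axis=1)
    keep = scores < max_mean_uncertainty  # assumed threshold, tune per domain
    return frames[keep], scores
```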
Limitations & Future Work
- Calibration depends on the training distribution – extreme domain shifts (e.g., completely new physics or sensor modalities) still degrade confidence reliability, though OOD detection helps.
- Latent‑space assumptions – modeling uncertainty as isotropic Gaussian may miss structured errors; richer distributions could capture more complex failure patterns.
- Scalability to ultra‑high‑resolution video – while latent‑space estimation is efficient, the pixel‑level mapping step can become a bottleneck for 4K+ streams.
- User‑level integration – the paper focuses on quantitative metrics; future work could explore UI/UX designs that surface uncertainty heatmaps to end‑users in real time.
Overall, C³ offers a practical pathway to make controllable video generators not just impressive, but also trustworthy—a crucial step for any production system that relies on synthetic video.
Authors
- Zhiting Mei
- Tenny Yin
- Micah Baker
- Ola Shorinwa
- Anirudha Majumdar
Paper Information
- arXiv ID: 2512.05927v1
- Categories: cs.CV, cs.AI, cs.RO
- Published: December 5, 2025
- PDF: https://arxiv.org/pdf/2512.05927v1