[Paper] World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty

Published: December 5, 2025 at 01:06 PM EST
4 min read
Source: arXiv - 2512.05927v1

Overview

The paper introduces C³, a new technique for training controllable video generation models that can self‑assess how confident they are about each pixel they generate. By providing calibrated uncertainty estimates at a fine‑grained (sub‑patch) level, C³ helps developers detect hallucinations (spurious or physically impossible frames) before they cause downstream problems in applications such as robot planning, video editing, or simulation.

Key Contributions

  • Calibrated uncertainty via proper scoring rules – a training objective that forces the model to output probabilities that truly reflect its correctness.
  • Latent‑space uncertainty estimation – computes confidence scores in a compact latent representation, avoiding the instability and high cost of pixel‑wise approaches.
  • Dense pixel‑level uncertainty maps – translates latent uncertainties back to high‑resolution RGB heatmaps, giving developers an intuitive visual cue of “trustworthy” vs. “questionable” regions.
  • Robust OOD detection – demonstrates that the calibrated scores reliably flag inputs that lie outside the training distribution (e.g., novel robot scenes).
  • Extensive validation on real robot datasets – experiments on the Bridge and DROID benchmarks show that C³ maintains generation quality while adding reliable confidence signals.

Methodology

  1. Base controllable video model – any architecture that takes text/action conditioning and predicts future frames (e.g., diffusion or transformer‑based video generators).
  2. Training with strictly proper scoring rules – instead of the usual mean‑squared error or cross‑entropy, the loss incorporates a log‑score that penalizes mis‑calibrated probability outputs, encouraging the model to learn both the pixel values and their associated confidence (a loss sketch follows this list).
  3. Latent‑space uncertainty propagation – the model’s encoder maps each frame to a low‑dimensional latent vector. Uncertainty is modeled as a Gaussian distribution over these latents; the variance is learned jointly with the mean. Because the latent space is far smaller than the raw image, back‑propagation stays stable and memory‑efficient.
  4. Pixel‑level mapping – a lightweight decoder takes the latent variance and projects it onto the pixel grid, producing a heatmap where brighter spots indicate higher predicted error. This step is deterministic, so the visual uncertainty map does not require extra sampling.
  5. Calibration evaluation – the authors use reliability diagrams and expected calibration error (ECE) to verify that the predicted confidences match empirical error rates, both on in‑distribution and out‑of‑distribution data.
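
The training change in steps 2–3 can be illustrated with a short sketch. The snippet below is not the authors' implementation: it assumes a PyTorch setup in which the video model predicts a per‑dimension Gaussian (mean and log‑variance) over the next frame's latent and is trained with the Gaussian negative log‑likelihood, which is a strictly proper scoring rule. The names `model`, `encoder`, and the tensor shapes are hypothetical.

```python
# Minimal sketch (assumed PyTorch; not the paper's code): train a latent-space
# predictor with the Gaussian log score, a strictly proper scoring rule, so the
# learned variance doubles as a calibrated uncertainty estimate.
import torch


def latent_nll_loss(pred_mean: torch.Tensor,
                    pred_logvar: torch.Tensor,
                    target_latent: torch.Tensor) -> torch.Tensor:
    """Gaussian negative log-likelihood over latent dimensions (up to a constant).

    All tensors have shape (batch, latent_dim).
    """
    var = pred_logvar.exp()
    nll = 0.5 * (pred_logvar + (target_latent - pred_mean) ** 2 / var)
    return nll.mean()


def training_step(model, encoder, past_frames, action, next_frame):
    """One hypothetical training step for a controllable video predictor."""
    # The predictor outputs both a mean and a log-variance for the next latent.
    pred_mean, pred_logvar = model(past_frames, action)
    with torch.no_grad():
        target_latent = encoder(next_frame)   # frozen frame encoder
    return latent_nll_loss(pred_mean, pred_logvar, target_latent)

# At inference, pred_logvar.exp() would feed the lightweight decoder described
# in step 4, which upsamples it into a pixel-level uncertainty heatmap.
```

Because the loss lives in the compact latent space, the variance head adds only a few extra output dimensions rather than a full per‑pixel distribution, which is what keeps training stable and memory‑efficient.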

Results & Findings

| Metric | Baseline (no UQ) | C³ (with calibrated UQ) |
| --- | --- | --- |
| FVD (Fréchet Video Distance) | 45.2 | 46.1 (≈ 1% drop) |
| Expected Calibration Error (ECE) | N/A | 0.04 (well‑calibrated) |
| OOD detection AUROC | 0.71 | 0.92 |
| Human‑rated hallucination rate | 18 % | 7 % |

  • Generation quality stays virtually unchanged – the slight increase in FVD is negligible compared to the gain in safety.
  • Uncertainty is well‑calibrated – predicted confidence aligns with actual error across a wide range of scenes (see the ECE sketch after this list).
  • Out‑of‑distribution detection improves dramatically, enabling the system to flag novel robot configurations or lighting conditions.
  • Qualitative heatmaps clearly highlight moving objects, occlusions, or texture‑rich areas where the model is less certain, giving developers a visual debugging tool.
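
Calibration numbers like those above are usually produced with a reliability‑diagram / ECE computation. The sketch below is a generic binned ECE, assuming the per‑pixel uncertainties have already been reduced to a probability of being "correct" (e.g., error below a tolerance); the binning and reduction are illustrative, not the paper's exact protocol.

```python
# Generic binned expected calibration error (ECE); illustrative only.
# `confidence` holds predicted probabilities of being correct,
# `correct` holds the corresponding binary outcomes.
import numpy as np


def expected_calibration_error(confidence: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Weighted average gap between predicted confidence and empirical accuracy."""
    bin_ids = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```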

Practical Implications

  • Robotics & simulation – planners can discard or re‑sample frames flagged as uncertain, reducing the risk of executing unsafe actions based on hallucinated video predictions.
  • Instruction‑guided video editing – editors can see which regions the model is unsure about and manually correct or request higher‑fidelity refinements.
  • Content moderation & safety – platforms that auto‑generate video from prompts can use uncertainty scores to block potentially misleading outputs before they go live.
  • Model debugging – developers get a built‑in diagnostic heatmap, making it easier to spot failure modes (e.g., reflective surfaces, fast motion) and iterate on data collection or architecture tweaks.
  • Transfer to other generative domains – the latent‑space calibration framework can be adapted to image synthesis, audio generation, or multimodal models where confidence is equally critical.
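
As an example of the first point, the sketch below shows one plausible way a planner could gate on the predicted uncertainty map before trusting a generated frame. The thresholds, function names, and the `generate` callback are hypothetical and not part of the paper.

```python
# Illustrative gating logic (not from the paper): accept a generated frame only
# if few pixels exceed an uncertainty cutoff; otherwise ask for a re-sample.
import numpy as np

PIXEL_CUTOFF = 0.5            # per-pixel "untrustworthy" level (hypothetical)
FLAGGED_FRACTION_LIMIT = 0.2  # max fraction of pixels allowed above the cutoff


def frame_is_trustworthy(uncertainty_map: np.ndarray) -> bool:
    """uncertainty_map: (H, W) values in [0, 1]; higher means less certain."""
    flagged = (uncertainty_map > PIXEL_CUTOFF).mean()
    return flagged < FLAGGED_FRACTION_LIMIT


def plan_with_generated_frames(generate, max_retries: int = 3):
    """`generate()` is assumed to return (frame, uncertainty_map)."""
    for _ in range(max_retries):
        frame, uncertainty_map = generate()
        if frame_is_trustworthy(uncertainty_map):
            return frame
    return None  # caller falls back to a conservative / stop action
```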

Limitations & Future Work

  • Calibration depends on the training distribution – extreme domain shifts (e.g., completely new physics or sensor modalities) still degrade confidence reliability, though OOD detection helps.
  • Latent‑space assumptions – modeling uncertainty as isotropic Gaussian may miss structured errors; richer distributions could capture more complex failure patterns.
  • Scalability to ultra‑high‑resolution video – while latent‑space estimation is efficient, the pixel‑level mapping step can become a bottleneck for 4K+ streams.
  • User‑level integration – the paper focuses on quantitative metrics; future work could explore UI/UX designs that surface uncertainty heatmaps to end‑users in real time.

Overall, C³ offers a practical pathway to make controllable video generators not just impressive, but also trustworthy—a crucial step for any production system that relies on synthetic video.

Authors

  • Zhiting Mei
  • Tenny Yin
  • Micah Baker
  • Ola Shorinwa
  • Anirudha Majumdar

Paper Information

  • arXiv ID: 2512.05927v1
  • Categories: cs.CV, cs.AI, cs.RO
  • Published: December 5, 2025
  • PDF: Download PDF