[Paper] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Published: December 11, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.10958v1

Overview

The paper WorldLens tackles a growing blind spot in generative driving world models: while they can produce photorealistic scenes, they often stumble on geometry, physics, and controllability. To bring rigor to this space, the authors present a comprehensive benchmark that evaluates models across the entire “world‑building” pipeline—from raw video generation to downstream autonomous‑driving tasks—paired with a large human‑annotated dataset and a learned evaluation agent.

Key Contributions

  • WorldLens Benchmark – a five‑dimensional suite (Generation, Reconstruction, Action‑Following, Downstream Task, Human Preference) that jointly measures visual realism, geometric consistency, physical plausibility, and functional reliability.
  • WorldLens‑26K Dataset – 26,000 human‑rated driving videos with numeric scores and textual rationales, covering a wide range of failure modes.
  • WorldLens‑Agent – a distilled evaluation model trained on the human annotations, capable of providing scalable, explainable scores for new world‑model outputs.
  • Comprehensive Empirical Study – systematic evaluation of several state‑of‑the‑art generative world models, revealing trade‑offs (e.g., texture quality vs. physics fidelity).
  • Open‑source Ecosystem – benchmark code, dataset, and evaluation agent are released to encourage reproducibility and future extensions.

Methodology

Defining Evaluation Axes

  • Generation: assesses raw video quality (e.g., sharpness, texture realism).
  • Reconstruction: checks whether the model can faithfully reproduce known 3D geometry and depth from the generated frames.
  • Action‑Following: measures whether the generated world reacts correctly to a prescribed vehicle control sequence (steering, throttle).
  • Downstream Task: evaluates performance of a downstream autonomous‑driving stack (e.g., perception, planning) when run inside the synthetic world.
  • Human Preference: captures subjective judgments of realism and plausibility via crowdsourced ratings.

Data Collection

  • Real‑world driving logs (sensor streams, control commands) are used as seeds.
  • Multiple generative world models synthesize 4D videos from these seeds.
  • Human annotators watch the videos, assign a 1‑10 score, and write short rationales (e.g., “car drifts off road despite no steering input”).
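
Taken together, each annotated clip pairs a generated video with its seed log, the control sequence it was conditioned on, and the human judgment. The sketch below shows one plausible record layout; the field names are assumptions for illustration, not the released dataset’s actual schema.

```python
# Hypothetical record layout for one WorldLens-26K annotation. The field names
# below are illustrative assumptions, not the released dataset's actual schema.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnnotatedClip:
    clip_id: str            # identifier of the generated video
    source_log_id: str      # real-world driving log used as the seed
    world_model: str        # generative model that synthesized the clip
    controls: List[Dict[str, float]]  # per-step control commands (steering, throttle)
    score: int              # human rating on the 1-10 scale
    rationale: str          # short free-text explanation of the score


example = AnnotatedClip(
    clip_id="clip_000123",
    source_log_id="log_000456",
    world_model="diffusion_model_A",
    controls=[{"steer": 0.0, "throttle": 0.3}, {"steer": 0.1, "throttle": 0.3}],
    score=4,
    rationale="car drifts off road despite no steering input",
)
print(example.score, "-", example.rationale)
```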

Training WorldLens‑Agent

  • A multimodal transformer ingests video frames, control signals, and optional depth maps.
  • Supervision comes from the numeric scores and rationales, enabling the model to predict both a scalar quality score and an explanatory text snippet.
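
A minimal sketch of this kind of two-headed evaluator, in PyTorch, is shown below. The layer sizes, the plain TransformerEncoder fusion, and the per-token rationale head are illustrative assumptions rather than the actual WorldLens‑Agent architecture, and the optional depth-map input is omitted for brevity.

```python
# Illustrative two-headed evaluator: predicts a scalar quality score plus token
# logits for a textual rationale. Not the authors' actual WorldLens-Agent.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WorldModelEvaluator(nn.Module):
    def __init__(self, frame_dim=512, ctrl_dim=4, d_model=256, vocab_size=8000):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d_model)  # per-frame visual features
        self.ctrl_proj = nn.Linear(ctrl_dim, d_model)    # per-step control signals
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(d_model, 1)               # scalar quality score
        self.rationale_head = nn.Linear(d_model, vocab_size)  # rationale token logits

    def forward(self, frame_feats, controls):
        # frame_feats: (B, T, frame_dim), controls: (B, T, ctrl_dim)
        tokens = self.frame_proj(frame_feats) + self.ctrl_proj(controls)
        h = self.encoder(tokens)        # (B, T, d_model)
        pooled = h.mean(dim=1)          # temporal average pooling
        score = self.score_head(pooled).squeeze(-1)
        return score, self.rationale_head(h)


# Joint supervision: regression on the human score, cross-entropy on rationale tokens.
model = WorldModelEvaluator()
frames, ctrls = torch.randn(2, 16, 512), torch.randn(2, 16, 4)
human_score = torch.tensor([7.0, 3.0])
rationale_tokens = torch.randint(0, 8000, (2, 16))
score, logits = model(frames, ctrls)
loss = F.mse_loss(score, human_score) + F.cross_entropy(
    logits.reshape(-1, 8000), rationale_tokens.reshape(-1)
)
loss.backward()
```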

Benchmark Execution

  • Each model is run through the five axes; scores are aggregated to produce a “World Fidelity” profile that highlights strengths and weaknesses.
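
As a toy illustration of what such a profile could look like, the snippet below collects normalized per-axis scores for one model, reports an overall figure, and flags the weakest axis. The [0, 1] normalization and the unweighted mean are assumptions for illustration; the benchmark’s actual aggregation rules may differ.

```python
# Toy "World Fidelity" profile: per-axis scores plus a simple aggregate.
# The weighting scheme is an assumption, not the benchmark's actual scoring rule.
from dataclasses import dataclass
from typing import Dict, Optional

AXES = ["generation", "reconstruction", "action_following",
        "downstream_task", "human_preference"]


@dataclass
class WorldFidelityProfile:
    model_name: str
    axis_scores: Dict[str, float]  # axis -> normalized score in [0, 1]

    def overall(self, weights: Optional[Dict[str, float]] = None) -> float:
        weights = weights or {a: 1.0 for a in AXES}
        return sum(weights[a] * self.axis_scores[a] for a in AXES) / sum(weights.values())


profile = WorldFidelityProfile(
    model_name="diffusion_model_A",
    axis_scores={"generation": 0.91, "reconstruction": 0.62, "action_following": 0.78,
                 "downstream_task": 0.55, "human_preference": 0.70},
)
print(f"overall fidelity: {profile.overall():.2f}")
print("weakest axis:", min(AXES, key=profile.axis_scores.get))
```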

Results & Findings

  • No Universal Winner: Models that excel in photorealistic texture (e.g., diffusion‑based generators) often produce physically impossible motions (e.g., cars sliding without friction). Conversely, geometry‑focused models maintain consistent depth but generate bland, low‑detail visuals.
  • Correlation with Human Judgment: WorldLens‑Agent’s predicted scores achieve a Pearson correlation of 0.78 with human ratings, and its generated rationales match annotator explanations in roughly 70% of cases as measured by BLEU‑4 (a minimal agreement-computation sketch follows this list).
  • Downstream Impact: When an autonomous‑driving perception stack is evaluated on the synthetic worlds, performance drops by up to 45% relative to real data for models with low physics fidelity, underscoring the practical cost of unrealistic dynamics.
  • Action‑Following Gap: Even the best‑performing model obeys only about 80% of prescribed control commands, indicating room for improvement in closed‑loop simulation.
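
For reference, the score-level agreement reported above is the kind of number produced by a standard Pearson correlation over paired ratings, as in the minimal sketch below; the sample values are made up, and the paper’s exact protocol (clip pairing, rationale tokenization for BLEU‑4) is not reproduced here.

```python
# Minimal sketch of score-level agreement between human raters and an automated
# evaluator; the ratings below are made up for illustration.
import numpy as np

human = np.array([8.0, 3.0, 6.5, 9.0, 4.0, 7.0])   # annotator scores (1-10 scale)
agent = np.array([7.5, 3.5, 6.0, 8.5, 5.0, 7.5])   # evaluator predictions

pearson_r = np.corrcoef(human, agent)[0, 1]
print(f"Pearson r = {pearson_r:.2f}")  # the paper reports r = 0.78 on the full dataset
```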

Practical Implications

  • Safer Simulation‑Based Development: Engineers can now quantitatively select world models that preserve physics, reducing the risk of “simulation‑to‑real” transfer failures in autonomous‑driving pipelines.
  • Benchmark‑Driven Model Design: The five‑axis framework encourages researchers to balance texture, geometry, and dynamics rather than optimizing a single visual metric.
  • Scalable Quality Assurance: WorldLens‑Agent provides an automated, explainable scoring service that can be integrated into CI pipelines for generative simulation tools, flagging unrealistic outputs before they reach downstream testing (a hypothetical gate is sketched after this list).
  • Cross‑Domain Adoption: Although focused on driving, the benchmark’s structure (generation → reconstruction → action → task → human) can be adapted to robotics, AR/VR, and any domain where synthetic worlds must be both believable and functional.
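
To make the CI idea concrete, a gate over generated clips might look like the sketch below. The `score_clip` stub and the 6.0 threshold are hypothetical placeholders; the released agent’s actual interface is not described in this summary.

```python
# Hypothetical CI gate: reject generated clips whose fidelity score falls below a
# threshold. `score_clip` is a stand-in for whatever scoring entry point the
# released evaluation agent exposes; it is not a real WorldLens API.
import sys
from typing import List, Tuple

FIDELITY_THRESHOLD = 6.0  # assumed minimum acceptable score on the 1-10 scale


def score_clip(path: str) -> Tuple[float, str]:
    # Placeholder: in practice this would run the evaluation agent on the clip.
    return 7.2, "plausible dynamics, minor texture flicker"


def main(clip_paths: List[str]) -> None:
    failures = []
    for path in clip_paths:
        score, rationale = score_clip(path)
        if score < FIDELITY_THRESHOLD:
            failures.append((path, score, rationale))
    for path, score, rationale in failures:
        print(f"FAIL {path}: {score:.1f} - {rationale}")
    sys.exit(1 if failures else 0)


if __name__ == "__main__":
    main(sys.argv[1:])
```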

Limitations & Future Work

  • Domain Specificity: The benchmark is built around urban driving scenarios; extending to off‑road, aerial, or indoor environments will require new data and possibly additional evaluation axes.
  • Annotation Cost: High‑quality human rationales are expensive to collect at scale; future work could explore semi‑supervised or active‑learning approaches to reduce labeling burden.
  • Agent Generalization: WorldLens‑Agent currently assumes access to control signals; handling purely generative models without explicit action inputs remains an open challenge.
  • Real‑World Validation: While downstream task performance drops are indicative, a full end‑to‑end real‑world deployment test (e.g., on a test vehicle) would cement the benchmark’s relevance.

WorldLens offers the first unified yardstick for measuring not just how pretty a generated driving world looks, but how faithfully it behaves—paving the way for more reliable simulation‑first development in autonomous systems.

Authors

  • Ao Liang
  • Lingdong Kong
  • Tianyi Yan
  • Hongsi Liu
  • Wesley Yang
  • Ziqi Huang
  • Wei Yin
  • Jialong Zuo
  • Yixuan Hu
  • Dekai Zhu
  • Dongyue Lu
  • Youquan Liu
  • Guangfeng Jiang
  • Linfeng Li
  • Xiangtai Li
  • Long Zhuo
  • Lai Xing Ng
  • Benoit R. Cottereau
  • Changxin Gao
  • Liang Pan
  • Wei Tsang Ooi
  • Ziwei Liu

Paper Information

  • arXiv ID: 2512.10958v1
  • Categories: cs.CV
  • Published: December 11, 2025