[Paper] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
Source: arXiv - 2512.10958v1
Overview
The paper WorldLens tackles a growing blind spot in generative driving world models: while they can produce photorealistic scenes, they often stumble on geometry, physics, and controllability. To bring rigor to this space, the authors present a comprehensive benchmark that evaluates models across the entire “world‑building” pipeline—from raw video generation to downstream autonomous‑driving tasks—paired with a large human‑annotated dataset and a learned evaluation agent.
Key Contributions
- WorldLens Benchmark – a five‑dimensional suite (Generation, Reconstruction, Action‑Following, Downstream Task, Human Preference) that jointly measures visual realism, geometric consistency, physical plausibility, and functional reliability.
- WorldLens‑26K Dataset – 26,000 human‑rated driving videos with numeric scores and textual rationales, covering a wide range of failure modes.
- WorldLens‑Agent – a distilled evaluation model trained on the human annotations, capable of providing scalable, explainable scores for new world‑model outputs.
- Comprehensive Empirical Study – systematic evaluation of several state‑of‑the‑art generative world models, revealing trade‑offs (e.g., texture quality vs. physics fidelity).
- Open‑source Ecosystem – benchmark code, dataset, and evaluation agent are released to encourage reproducibility and future extensions.
Methodology
Defining Evaluation Axes
- Generation: assesses raw video quality (e.g., sharpness, texture realism).
- Reconstruction: checks whether the model can faithfully reproduce known 3D geometry and depth from the generated frames.
- Action‑Following: measures whether the generated world reacts correctly to a prescribed vehicle control sequence (steering, throttle).
- Downstream Task: evaluates performance of a downstream autonomous‑driving stack (e.g., perception, planning) when run inside the synthetic world.
- Human Preference: captures subjective judgments of realism and plausibility via crowdsourced ratings.
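To make the five axes concrete, here is a minimal sketch of how per-model scores along them could be represented; the `AxisScores` class name, field layout, and 0-1 score range are assumptions for illustration, not the benchmark's actual API.

```python
# A minimal sketch of representing per-model scores on the five WorldLens axes.
# The class name, field layout, and 0-1 score range are illustrative
# assumptions, not the benchmark's actual API.
from dataclasses import dataclass, asdict


@dataclass
class AxisScores:
    generation: float        # raw video quality: sharpness, texture realism
    reconstruction: float    # consistency of 3D geometry / depth in generated frames
    action_following: float  # agreement between commanded controls and realized motion
    downstream_task: float   # perception / planning performance inside the synthetic world
    human_preference: float  # crowdsourced realism and plausibility ratings


scores = AxisScores(generation=0.82, reconstruction=0.61, action_following=0.78,
                    downstream_task=0.55, human_preference=0.70)
print(asdict(scores))
```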
Data Collection
- Real‑world driving logs (sensor streams, control commands) are used as seeds.
- Multiple generative world models synthesize 4D videos from these seeds.
- Human annotators watch the videos, assign a 1‑10 score, and write short rationales (e.g., “car drifts off road despite no steering input”).
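For illustration, a single annotation might be stored as a record like the following; the field names and schema are assumptions, not the released WorldLens‑26K format.

```python
# A minimal sketch of a single human annotation record. The schema and field
# names are assumptions for illustration, not the released dataset format.
from dataclasses import dataclass


@dataclass
class AnnotationRecord:
    video_id: str        # identifier of the generated driving video
    model_name: str      # which world model produced the video
    score: int           # human quality rating on a 1-10 scale
    rationale: str       # short free-text explanation of the rating

    def __post_init__(self) -> None:
        # Reject out-of-range ratings before they reach training code.
        if not 1 <= self.score <= 10:
            raise ValueError(f"score must be in [1, 10], got {self.score}")


example = AnnotationRecord(
    video_id="clip_000123",
    model_name="diffusion_world_model",
    score=4,
    rationale="car drifts off road despite no steering input",
)
```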
Training WorldLens‑Agent
- A multimodal transformer ingests video frames, control signals, and optional depth maps.
- Supervision comes from the numeric scores and rationales, enabling the model to predict both a scalar quality score and an explanatory text snippet.
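A minimal PyTorch-style sketch of such a multimodal scorer is shown below; the layer sizes, the additive fusion of modalities, and the omission of the rationale-generation head are simplifying assumptions, not the paper's actual architecture.

```python
# A minimal sketch of a multimodal scorer in the spirit of WorldLens-Agent.
# Sizes, additive fusion, and the missing rationale head are assumptions.
import torch
import torch.nn as nn


class MultimodalScorer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Per-frame visual features (e.g. from a frozen backbone) projected to d_model.
        self.frame_proj = nn.Linear(512, d_model)
        # Per-step control signals (steering, throttle) projected to d_model.
        self.ctrl_proj = nn.Linear(2, d_model)
        # Optional per-frame depth summary projected to d_model.
        self.depth_proj = nn.Linear(1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.score_head = nn.Linear(d_model, 1)  # scalar quality score

    def forward(self, frame_feats, controls, depth=None):
        # frame_feats: (B, T, 512), controls: (B, T, 2), depth: (B, T, 1) or None
        tokens = self.frame_proj(frame_feats) + self.ctrl_proj(controls)
        if depth is not None:
            tokens = tokens + self.depth_proj(depth)
        encoded = self.encoder(tokens)              # (B, T, d_model)
        pooled = encoded.mean(dim=1)                # temporal average pooling
        return self.score_head(pooled).squeeze(-1)  # (B,) predicted quality score


# Example: score a batch of 2 clips, each 16 frames long.
model = MultimodalScorer()
pred = model(torch.randn(2, 16, 512), torch.randn(2, 16, 2))
```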
Benchmark Execution
- Each model is run through the five axes; scores are aggregated to produce a “World Fidelity” profile that highlights strengths and weaknesses.
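One plausible way to combine per-axis scores into such a profile is a weighted mean, sketched below; the equal weights and the aggregation rule are assumptions for illustration, since the paper may combine axes differently.

```python
# A minimal sketch of aggregating per-axis scores into a "World Fidelity"
# profile. Equal weights and a weighted mean are illustrative assumptions.
from typing import Dict, Optional

AXES = ("generation", "reconstruction", "action_following",
        "downstream_task", "human_preference")


def fidelity_profile(axis_scores: Dict[str, float],
                     weights: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Return per-axis scores plus a weighted overall score."""
    weights = weights or {axis: 1.0 for axis in AXES}
    total = sum(weights[a] for a in AXES)
    overall = sum(axis_scores[a] * weights[a] for a in AXES) / total
    return {**axis_scores, "overall": overall}


profile = fidelity_profile({"generation": 0.82, "reconstruction": 0.61,
                            "action_following": 0.78, "downstream_task": 0.55,
                            "human_preference": 0.70})
print(profile)  # per-axis strengths and weaknesses plus an overall fidelity score
```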
Results & Findings
- No Universal Winner: Models that excel in photorealistic texture (e.g., diffusion‑based generators) often produce physically impossible motions (e.g., cars sliding without friction). Conversely, geometry‑focused models maintain consistent depth but generate bland, low‑detail visuals.
- Correlation with Human Judgment: WorldLens‑Agent's predicted scores achieve a Pearson correlation of 0.78 with human ratings, and its generated rationales match annotator explanations in ≈70 % of cases as measured by BLEU‑4 (a computation sketch follows this list).
- Downstream Impact: When an autonomous‑driving perception stack is evaluated on the synthetic worlds, performance drops by up to 45 % relative to real data for models with low physics fidelity, underscoring the practical cost of unrealistic dynamics.
- Action‑Following Gap: Even the best‑performing model obeys only ≈80 % of prescribed control commands, indicating room for improvement in closed‑loop simulation.
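The score and rationale agreement figures above can in principle be computed with standard tools; the sketch below uses SciPy's Pearson correlation and NLTK's BLEU‑4 on made-up data, so only the metric choices follow the summary, not the numbers.

```python
# A minimal sketch of agent-vs-human agreement metrics. The data are invented;
# only the metric choices (Pearson r for scores, BLEU-4 for rationales) follow
# the results described above.
from scipy.stats import pearsonr
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Per-video quality scores: agent predictions vs. human ratings (hypothetical).
agent_scores = [7.1, 4.2, 8.5, 3.0, 6.4]
human_scores = [7.0, 5.0, 8.0, 2.5, 6.0]
r, _ = pearsonr(agent_scores, human_scores)
print(f"Pearson r = {r:.2f}")

# Rationale agreement: BLEU-4 between a generated and a reference explanation.
reference = ["car", "drifts", "off", "road", "despite", "no", "steering", "input"]
candidate = ["the", "car", "drifts", "off", "the", "road", "without", "steering"]
bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4 = {bleu4:.2f}")
```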
Practical Implications
- Safer Simulation‑Based Development: Engineers can now quantitatively select world models that preserve physics, reducing the risk of “simulation‑to‑real” transfer failures in autonomous‑driving pipelines.
- Benchmark‑Driven Model Design: The five‑axis framework encourages researchers to balance texture, geometry, and dynamics rather than optimizing a single visual metric.
- Scalable Quality Assurance: WorldLens‑Agent provides an automated, explainable scoring service that can be integrated into CI pipelines for generative simulation tools, flagging unrealistic outputs before they reach downstream testing (see the sketch after this list).
- Cross‑Domain Adoption: Although focused on driving, the benchmark’s structure (generation → reconstruction → action → task → human) can be adapted to robotics, AR/VR, and any domain where synthetic worlds must be both believable and functional.
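As a hypothetical illustration of such a CI gate, the sketch below fails a build when any generated clip scores below a threshold; `score_clip` is a stand-in for whatever interface the released evaluation agent exposes, and the threshold is arbitrary.

```python
# A hypothetical sketch of gating a CI pipeline on automated fidelity scores.
# `score_clip` is a placeholder, not part of the released WorldLens tooling.
import sys


def score_clip(path: str) -> float:
    # Stand-in: replace with a real call to the evaluation agent.
    return 7.5


def ci_gate(clip_paths, threshold: float = 6.0) -> int:
    failures = [p for p in clip_paths if score_clip(p) < threshold]
    for path in failures:
        print(f"FAIL: {path} scored below {threshold}")
    return 1 if failures else 0  # non-zero exit code fails the CI job


if __name__ == "__main__":
    sys.exit(ci_gate(sys.argv[1:]))
```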
Limitations & Future Work
- Domain Specificity: The benchmark is built around urban driving scenarios; extending to off‑road, aerial, or indoor environments will require new data and possibly additional evaluation axes.
- Annotation Cost: High‑quality human rationales are expensive to collect at scale; future work could explore semi‑supervised or active‑learning approaches to reduce labeling burden.
- Agent Generalization: WorldLens‑Agent currently assumes access to control signals; handling purely generative models without explicit action inputs remains an open challenge.
- Real‑World Validation: While downstream task performance drops are indicative, a full end‑to‑end real‑world deployment test (e.g., on a test vehicle) would cement the benchmark’s relevance.
WorldLens offers the first unified yardstick for measuring not just how pretty a generated driving world looks, but how faithfully it behaves—paving the way for more reliable simulation‑first development in autonomous systems.
Authors
- Ao Liang
- Lingdong Kong
- Tianyi Yan
- Hongsi Liu
- Wesley Yang
- Ziqi Huang
- Wei Yin
- Jialong Zuo
- Yixuan Hu
- Dekai Zhu
- Dongyue Lu
- Youquan Liu
- Guangfeng Jiang
- Linfeng Li
- Xiangtai Li
- Long Zhuo
- Lai Xing Ng
- Benoit R. Cottereau
- Changxin Gao
- Liang Pan
- Wei Tsang Ooi
- Ziwei Liu
Paper Information
- arXiv ID: 2512.10958v1
- Categories: cs.CV
- Published: December 11, 2025