[Paper] WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

Published: February 9, 2026 at 01:09 PM EST

Source: arXiv - 2602.08971v1

Overview

The paper introduces WorldArena, the first large‑scale benchmark that evaluates embodied world models not only for how realistic their video predictions look, but also for how useful those predictions are when an agent has to think, plan, and act in a simulated environment. By unifying perceptual and functional assessments, the authors expose a hidden “perception‑functionality gap” that has major consequences for developers building next‑generation embodied AI systems.

Key Contributions

  • Unified benchmark (WorldArena) that simultaneously measures:
    1. Video perception quality (16 metrics covering fidelity, temporal consistency, semantics, etc.).
    2. Functional utility in three downstream roles: data engine, policy evaluator, and action planner.
    3. Human‑in‑the‑loop subjective evaluation for real‑world relevance.
  • EWMScore, a single interpretable index that aggregates multi‑dimensional results, making it easy to compare models at a glance.
  • Comprehensive evaluation of 14 state‑of‑the‑art embodied world models, revealing that high visual quality does not guarantee strong task performance.
  • Public leaderboard and open‑source code (https://worldarena.ai) to foster reproducible research and continuous progress.

Methodology

  1. Dataset & Scenarios – WorldArena builds on several widely used simulated environments (e.g., Habitat, AI2‑THOR) and defines a set of canonical tasks such as navigation, object search, and manipulation.
  2. Perceptual Scoring – For each predicted video sequence, the benchmark computes 16 metrics grouped into six sub‑dimensions (pixel‑level fidelity, motion smoothness, semantic consistency, etc.). These are standard computer‑vision scores (PSNR, SSIM, LPIPS) plus newer ones that capture temporal and object‑level coherence.
  3. Functional Evaluation – The same world model is plugged into three functional pipelines:
    • Data Engine – Generates synthetic experience for downstream RL agents; performance is measured by the agent’s learning curve.
    • Policy Evaluator – Scores candidate policies by simulating outcomes; accuracy is compared against ground‑truth optimal policies.
    • Action Planner – Directly selects actions for a task; success rate and efficiency are recorded.
  4. Human Judgment – A crowd‑sourced study asks participants to rank model outputs on realism and task usefulness, providing a subjective sanity check.
  5. EWMScore Calculation – All metrics are normalized, weighted (with weights derived from a small validation set to reflect practical importance), and summed into a single score on a 0–100 scale.
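The normalize–weight–sum recipe in step 5 can be sketched in a few lines. The metric names, weights, and per-model values below are purely hypothetical, and the paper's actual normalization and weighting scheme may differ:

```python
# Illustrative EWMScore-style aggregation (a sketch, not the paper's exact
# formula): assume each metric has already been min-max normalized to [0, 1]
# across models, flip "lower is better" metrics, weight, and rescale to 0-100.

def ewm_score(metrics, weights, lower_is_better=()):
    """Aggregate normalized metrics into a single 0-100 score."""
    total_w = sum(weights.values())
    score = 0.0
    for name, value in metrics.items():
        v = 1.0 - value if name in lower_is_better else value  # flip LPIPS-like metrics
        score += weights[name] * v
    return 100.0 * score / total_w

# Hypothetical per-model metrics (already normalized across models).
model_a = {"psnr": 0.95, "lpips": 0.10, "plan_success": 0.30}  # pretty, poor planner
model_b = {"psnr": 0.70, "lpips": 0.35, "plan_success": 0.80}  # plainer, strong planner
w = {"psnr": 0.2, "lpips": 0.2, "plan_success": 0.6}           # task-oriented weights

print(round(ewm_score(model_a, w, lower_is_better={"lpips"}), 1))  # → 55.0
print(round(ewm_score(model_b, w, lower_is_better={"lpips"}), 1))  # → 75.0
```

With functionally-oriented weights, the visually weaker model B outscores model A, which is exactly the kind of reshuffling the aggregated leaderboard reports.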

Results & Findings

  • Perception‑Functionality Gap – Models that top the visual metrics (e.g., very high PSNR) often rank near the bottom on functional tasks (≈30% success). Conversely, some models with modest video quality achieve competitive planning performance.
  • Role‑Specific Strengths – Certain architectures excel as data engines (producing diverse, high‑entropy trajectories) while others are better action planners (more accurate dynamics for short‑horizon predictions).
  • Human vs. Automated Scores – Human rankings correlate strongly with functional metrics (r ≈ 0.78) but only weakly with pure perceptual scores (r ≈ 0.32), underscoring the importance of task‑oriented evaluation.
  • EWMScore Rankings – The aggregated leaderboard shows a reshuffling of the “state‑of‑the‑art” order, with a few under‑appreciated models emerging as the most balanced.
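The human-vs-automated comparison above boils down to a rank correlation between two score lists. A tie-free Spearman correlation can be computed from scratch as below; the model scores are invented purely to illustrate the computation and are not the paper's data:

```python
# Spearman rank correlation for tie-free score lists: rho = 1 - 6*sum(d^2) / (n*(n^2-1)),
# where d is the per-item rank difference between the two lists.

def spearman(xs, ys):
    """Spearman correlation between two equal-length, tie-free score lists."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

human      = [9.1, 7.4, 8.2, 5.0, 6.3]       # hypothetical human preference scores
functional = [0.82, 0.61, 0.77, 0.40, 0.55]  # hypothetical task success rates
perceptual = [0.70, 0.95, 0.60, 0.88, 0.65]  # hypothetical PSNR-like scores

print(spearman(human, functional))  # → 1.0 (rankings agree perfectly here)
print(spearman(human, perceptual))  # negative: rankings diverge
```

A high correlation with functional scores and a weak (here negative) one with perceptual scores mirrors the r ≈ 0.78 vs. r ≈ 0.32 contrast the authors report.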

Practical Implications

  • Model Selection for Products – Developers building robotics or AR/VR agents should prioritize functional benchmarks (e.g., planning success) over raw video quality when choosing a world model.
  • Data‑Efficient Training – Using world models that score high as data engines can dramatically reduce the amount of real‑world interaction needed for RL agents, cutting costs for simulation‑to‑real pipelines.
  • Safety & Reliability – Functional evaluation surfaces failure modes (e.g., unrealistic object physics) that pure visual metrics miss, helping engineers build safer autonomous systems.
  • Standardized Reporting – The EWMScore provides a single, comparable figure that can be reported in product specs, similar to “FPS” for graphics or “BLEU” for translation.
  • Community Collaboration – The open leaderboard encourages continuous improvement and makes it easy for startups or open‑source projects to benchmark against academic baselines.

Limitations & Future Work

  • Simulation Bias – WorldArena relies on existing simulators; any domain gap between simulation and the real world may limit transferability of the findings.
  • Metric Weighting – The current weighting scheme for EWMScore is derived from a validation set and may not reflect every industry’s priorities (e.g., latency vs. accuracy).
  • Scalability of Human Evaluation – Subjective assessments are costly and may not scale to thousands of model submissions.
  • Future Directions – The authors suggest extending the benchmark to multi‑agent scenarios, incorporating real‑world sensor modalities (e.g., LiDAR), and exploring adaptive weighting that tailors EWMScore to specific application domains.

Authors

  • Yu Shang
  • Zhuohang Li
  • Yiding Ma
  • Weikang Su
  • Xin Jin
  • Ziyou Wang
  • Xin Zhang
  • Yinzhou Tang
  • Chen Gao
  • Wei Wu
  • Xihui Liu
  • Dhruv Shah
  • Zhaoxiang Zhang
  • Zhibo Chen
  • Jun Zhu
  • Yonghong Tian
  • Tat‑Seng Chua
  • Wenwu Zhu
  • Yong Li

Paper Information

  • arXiv ID: 2602.08971v1
  • Categories: cs.CV, cs.RO
  • Published: February 9, 2026