[Paper] PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies

Published: December 18, 2025 at 01:49 PM EST
4 min read
Source: arXiv - 2512.16881v1

Overview

The paper introduces PolaRiS, a framework that turns short video captures of real‑world scenes into high‑fidelity simulated environments, enabling fast, large‑scale “real‑to‑sim” evaluations of generalist robot policies. By bridging the visual and physical gaps between simulation and reality, PolaRiS offers a more reliable proxy for measuring robot performance without the time and cost of extensive real‑world rollouts.

Key Contributions

  • Neural scene reconstruction pipeline that converts brief video scans into interactive, physics‑aware simulation worlds.
  • Zero‑shot evaluation recipe that co‑trains policies on a mix of real and simulated data to close the remaining reality gap.
  • Empirical validation showing a significantly higher correlation between PolaRiS simulation scores and real‑world performance compared to existing simulators.
  • Scalable environment generation: a single video can produce a full 3D environment, dramatically reducing manual modeling effort.
  • Open‑source tooling that can be adopted by research labs and industry teams to democratize benchmarking of robotic foundation models.

Methodology

  1. Data Capture – Operators record a short (≈10 s) RGB‑D video of a target scene using a commodity depth camera.
  2. Neural Reconstruction – The video is fed into a neural implicit representation (e.g., a NeRF‑style model) that learns both geometry and appearance while also estimating material properties needed for physics simulation.
  3. Environment Export – The learned representation is converted into a mesh with collision primitives and physical parameters (mass, friction, etc.), which can be loaded into a standard robotics simulator (e.g., PyBullet, Isaac Gym); a loading sketch follows this list.
  4. Policy Co‑Training – Policies are trained on a mixture of real‑world trajectories and simulated rollouts from the reconstructed environments. A simple combination of domain randomization and an adversarial loss aligns the simulated observations with real sensor data; the batch mixing is sketched after this list.
  5. Zero‑Shot Evaluation – Once trained, the policy is dropped into any newly reconstructed environment without further fine‑tuning, and its performance is measured with standard task metrics such as success rate and time‑to‑completion (see the evaluation‑loop sketch after this list).
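
For step 3, the export boils down to a mesh plus estimated physical parameters that a standard simulator can consume. Below is a minimal sketch of loading such an asset into PyBullet; the file name, mass, and friction values are placeholders, not the paper's actual export format.

```python
import pybullet as p

# Minimal sketch: load a reconstructed scene mesh into a headless PyBullet server.
# "scene.obj" and the friction value stand in for whatever the reconstruction
# pipeline actually exports.
client = p.connect(p.DIRECT)

collision = p.createCollisionShape(p.GEOM_MESH, fileName="scene.obj")
visual = p.createVisualShape(p.GEOM_MESH, fileName="scene.obj")

scene_id = p.createMultiBody(
    baseMass=0.0,                      # static scene geometry
    baseCollisionShapeIndex=collision,
    baseVisualShapeIndex=visual,
)

# Apply the estimated physical parameters (here just lateral friction).
p.changeDynamics(scene_id, linkIndex=-1, lateralFriction=0.8)

# Step the simulation briefly to confirm the scene loads and stays stable.
for _ in range(240):
    p.stepSimulation()
p.disconnect(client)
```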
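
Step 4's co‑training can be pictured as drawing mixed batches from real and simulated trajectory buffers. The sketch below covers only the batch mixing, not the adversarial observation alignment, and the 50/50 ratio is an illustrative assumption.

```python
import random

def sample_cotraining_batch(real_trajs, sim_trajs, batch_size=32, sim_ratio=0.5):
    """Draw a mixed real/sim batch; the ratio is an illustrative assumption."""
    n_sim = int(batch_size * sim_ratio)
    n_real = batch_size - n_sim
    batch = random.sample(sim_trajs, n_sim) + random.sample(real_trajs, n_real)
    random.shuffle(batch)  # avoid ordering artifacts within the batch
    return batch
```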
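
Step 5 then amounts to rolling the frozen policy out in each reconstructed environment and aggregating the metrics. The gym‑style `env`/`policy` interface below is assumed for illustration and is not the paper's API.

```python
def evaluate_policy(policy, envs, max_steps=500):
    """Roll out a frozen policy in each environment (gym-style API assumed)."""
    successes, steps_taken = [], []
    for env in envs:
        obs = env.reset()
        success, t = False, max_steps
        for t in range(max_steps):
            action = policy.act(obs)               # no fine-tuning at evaluation time
            obs, reward, done, info = env.step(action)
            if done:
                success = info.get("success", False)
                break
        successes.append(success)
        steps_taken.append(t + 1)
    success_rate = sum(successes) / len(successes)
    mean_steps = sum(steps_taken) / len(steps_taken)
    return success_rate, mean_steps
```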

Results & Findings

  • Correlation boost: PolaRiS simulation scores correlated with real‑world success rates at r = 0.78, versus r ≈ 0.45 for conventional simulators (e.g., Habitat, iGibson); a minimal example of this computation follows the list.
  • Speedup: Evaluating a policy on 100 reconstructed scenes took ≈2 hours on a single GPU, whereas the same number of real‑world rollouts would require ≈150 hours of robot time.
  • Generalization: Policies co‑trained with PolaRiS data achieved a 12% higher success rate on unseen real‑world tasks than policies trained only on synthetic data.
  • Ease of creation: The authors generated 50 diverse kitchen and office environments, each from a video capture of under five minutes, demonstrating rapid scaling.
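
The correlation figure above is a Pearson correlation between per‑scene simulation scores and real‑world success rates. A minimal sketch with made‑up numbers (not the paper's data):

```python
from scipy.stats import pearsonr

# Made-up per-scene scores purely to illustrate the computation;
# the paper reports r = 0.78 for PolaRiS versus r ≈ 0.45 for prior simulators.
sim_scores  = [0.9, 0.4, 0.7, 0.2, 0.8, 0.6]
real_scores = [0.85, 0.5, 0.65, 0.3, 0.75, 0.55]

r, p_value = pearsonr(sim_scores, real_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```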

Practical Implications

  • Rapid benchmarking: Development teams can iterate on policy design and get near‑real performance feedback in minutes rather than days, accelerating the research‑to‑product pipeline.
  • Distributed evaluation: Because the reconstruction pipeline runs on commodity hardware, multiple labs (or even remote field sites) can contribute evaluation environments, fostering community‑wide benchmarking standards.
  • Cost reduction: Companies can cut down on expensive robot time and wear‑and‑tear by shifting most of the evaluation workload to simulation while retaining confidence that results transfer to the real world.
  • Foundation model validation: As large‑scale, multi‑task robot models emerge, PolaRiS offers a scalable “test‑bed” to verify that a single policy truly generalizes across varied, realistic settings.
  • Integration with CI/CD: The lightweight pipeline can be hooked into continuous integration systems, automatically generating new test scenes from field footage and flagging regressions in policy performance.
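
On the CI/CD point, a regression gate can be as simple as a pytest check that compares the current success rate on the reconstructed scenes against a stored baseline. The threshold and the `eval_harness` helpers below are hypothetical, not part of any released tooling.

```python
# test_policy_regression.py -- hypothetical CI gate run under pytest.
import json

BASELINE_FILE = "baseline_success.json"   # written by the last accepted run
TOLERANCE = 0.05                          # allowed drop before the build fails

def load_baseline():
    with open(BASELINE_FILE) as f:
        return json.load(f)["success_rate"]

def test_no_success_rate_regression():
    # load_policy / load_scenes / evaluate_policy stand in for a team's own
    # evaluation harness built on top of the reconstructed environments.
    from eval_harness import load_policy, load_scenes, evaluate_policy
    success_rate, _ = evaluate_policy(load_policy(), load_scenes())
    assert success_rate >= load_baseline() - TOLERANCE, (
        f"Success rate {success_rate:.2f} regressed more than {TOLERANCE} below baseline"
    )
```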

Limitations & Future Work

  • Reconstruction fidelity: Extremely reflective or transparent surfaces still challenge the neural rendering step, leading to occasional physics inaccuracies.
  • Sensor modality gap: The current pipeline focuses on RGB‑D; extending to tactile, force, or proprioceptive modalities will require additional modeling.
  • Scalability of physics: While geometry is captured well, fine‑grained material properties (e.g., compliance) are approximated, which may affect tasks involving delicate manipulation.
  • Future directions highlighted by the authors include:
    1. Incorporating multi‑view video and active scanning to improve reconstruction quality.
    2. Learning end‑to‑end simulators that directly predict dynamics from raw video.
    3. Building a public repository of reconstructed environments for community benchmarking.

Authors

  • Arhan Jain
  • Mingtong Zhang
  • Kanav Arora
  • William Chen
  • Marcel Torne
  • Muhammad Zubair Irshad
  • Sergey Zakharov
  • Yue Wang
  • Sergey Levine
  • Chelsea Finn
  • Wei‑Chiu Ma
  • Dhruv Shah
  • Abhishek Gupta
  • Karl Pertsch

Paper Information

  • arXiv ID: 2512.16881v1
  • Categories: cs.RO, cs.LG
  • Published: December 18, 2025
