[Paper] PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies

Published: December 18, 2025 at 01:49 PM EST
4 min read
Source: arXiv - 2512.16881v1

Overview

The paper introduces PolaRiS, a framework that turns short video captures of real‑world scenes into high‑fidelity simulated environments, enabling fast, large‑scale “real‑to‑sim” evaluations of generalist robot policies. By bridging the visual and physical gaps between simulation and reality, PolaRiS offers a more reliable proxy for measuring robot performance without the time and cost of extensive real‑world rollouts.

Key Contributions

  • Neural scene reconstruction pipeline that converts brief video scans into interactive, physics‑aware simulation worlds.
  • Zero‑shot evaluation recipe that co‑trains policies on a mix of real and simulated data to close the remaining reality gap.
  • Empirical validation showing a significantly higher correlation between PolaRiS simulation scores and real‑world performance compared to existing simulators.
  • Scalable environment generation: a single video can produce a full 3D environment, dramatically reducing manual modeling effort.
  • Open‑source tooling that can be adopted by research labs and industry teams to democratize benchmarking of robotic foundation models.

Methodology

  1. Data Capture – Operators record a short (≈10 s) RGB‑D video of a target scene using a commodity depth camera.
  2. Neural Reconstruction – The video is fed into a neural implicit representation (e.g., a NeRF‑style model) that learns both geometry and appearance while also estimating material properties needed for physics simulation.
  3. Environment Export – The learned representation is converted into a mesh with collision primitives and physical parameters (mass, friction, etc.), which can be loaded into a standard robotics simulator (e.g., PyBullet, Isaac Gym); a loading sketch follows this list.
  4. Policy Co‑Training – Policies are trained on a mixture of real‑world trajectories and simulated rollouts from the reconstructed environments. A simple combination of domain randomization and an adversarial loss aligns the simulated observations with real sensor data; the batch mixing is sketched after this list.
  5. Zero‑Shot Evaluation – Once trained, the policy is dropped into any newly reconstructed environment without further fine‑tuning, and its performance is measured with standard task metrics such as success rate and time‑to‑completion (see the evaluation‑loop sketch after this list).
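
For step 3, the export boils down to a mesh plus estimated physical parameters that a standard simulator can consume. Below is a minimal sketch of loading such an asset into PyBullet; the file name, mass, and friction values are placeholders, not the paper's actual export format.

```python
import pybullet as p

# Minimal sketch: load a reconstructed scene mesh into a headless PyBullet server.
# "scene.obj" and the friction value stand in for whatever the reconstruction
# pipeline actually exports.
client = p.connect(p.DIRECT)

collision = p.createCollisionShape(p.GEOM_MESH, fileName="scene.obj")
visual = p.createVisualShape(p.GEOM_MESH, fileName="scene.obj")

scene_id = p.createMultiBody(
    baseMass=0.0,                      # static scene geometry
    baseCollisionShapeIndex=collision,
    baseVisualShapeIndex=visual,
)

# Apply the estimated physical parameters (here just lateral friction).
p.changeDynamics(scene_id, linkIndex=-1, lateralFriction=0.8)

# Step the simulation briefly to confirm the scene loads and stays stable.
for _ in range(240):
    p.stepSimulation()
p.disconnect(client)
```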
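
Step 4's co‑training can be pictured as drawing mixed batches from real and simulated trajectory buffers. The sketch below covers only the batch mixing, not the adversarial observation alignment, and the 50/50 ratio is an illustrative assumption.

```python
import random

def sample_cotraining_batch(real_trajs, sim_trajs, batch_size=32, sim_ratio=0.5):
    """Draw a mixed real/sim batch; the ratio is an illustrative assumption."""
    n_sim = int(batch_size * sim_ratio)
    n_real = batch_size - n_sim
    batch = random.sample(sim_trajs, n_sim) + random.sample(real_trajs, n_real)
    random.shuffle(batch)  # avoid ordering artifacts within the batch
    return batch
```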
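
Step 5 then amounts to rolling the frozen policy out in each reconstructed environment and aggregating the metrics. The gym‑style `env`/`policy` interface below is assumed for illustration and is not the paper's API.

```python
def evaluate_policy(policy, envs, max_steps=500):
    """Roll out a frozen policy in each environment (gym-style API assumed)."""
    successes, steps_taken = [], []
    for env in envs:
        obs = env.reset()
        success, t = False, max_steps
        for t in range(max_steps):
            action = policy.act(obs)               # no fine-tuning at evaluation time
            obs, reward, done, info = env.step(action)
            if done:
                success = info.get("success", False)
                break
        successes.append(success)
        steps_taken.append(t + 1)
    success_rate = sum(successes) / len(successes)
    mean_steps = sum(steps_taken) / len(steps_taken)
    return success_rate, mean_steps
```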

Results & Findings

  • Correlation boost: PolaRiS simulation scores correlated with real‑world success rates at r = 0.78, versus r ≈ 0.45 for conventional simulators (e.g., Habitat, iGibson); a minimal example of this computation follows the list.
  • Speedup: Evaluating a policy on 100 reconstructed scenes took ≈2 hours on a single GPU, whereas the same number of real‑world rollouts would require ≈150 hours of robot time.
  • Generalization: Policies co‑trained with PolaRiS data achieved a 12% higher success rate on unseen real‑world tasks than policies trained only on synthetic data.
  • Ease of creation: The authors generated 50 diverse kitchen and office environments, each from a video capture of under five minutes, demonstrating rapid scaling.
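
The correlation figure above is a Pearson correlation between per‑scene simulation scores and real‑world success rates. A minimal sketch with made‑up numbers (not the paper's data):

```python
from scipy.stats import pearsonr

# Made-up per-scene scores purely to illustrate the computation;
# the paper reports r = 0.78 for PolaRiS versus r ≈ 0.45 for prior simulators.
sim_scores  = [0.9, 0.4, 0.7, 0.2, 0.8, 0.6]
real_scores = [0.85, 0.5, 0.65, 0.3, 0.75, 0.55]

r, p_value = pearsonr(sim_scores, real_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```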

Practical Implications

  • Rapid benchmarking: Development teams can iterate on policy design and get near‑real performance feedback in minutes rather than days, accelerating the research‑to‑product pipeline.
  • Distributed evaluation: Because the reconstruction pipeline runs on commodity hardware, multiple labs (or even remote field sites) can contribute evaluation environments, fostering community‑wide benchmarking standards.
  • Cost reduction: Companies can cut down on expensive robot time and wear‑and‑tear by shifting most of the evaluation workload to simulation while retaining confidence that results transfer to the real world.
  • Foundation model validation: As large‑scale, multi‑task robot models emerge, PolaRiS offers a scalable “test‑bed” to verify that a single policy truly generalizes across varied, realistic settings.
  • Integration with CI/CD: The lightweight pipeline can be hooked into continuous integration systems, automatically generating new test scenes from field footage and flagging regressions in policy performance.
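
On the CI/CD point, a regression gate can be as simple as a pytest check that compares the current success rate on the reconstructed scenes against a stored baseline. The threshold and the `eval_harness` helpers below are hypothetical, not part of any released tooling.

```python
# test_policy_regression.py -- hypothetical CI gate run under pytest.
import json

BASELINE_FILE = "baseline_success.json"   # written by the last accepted run
TOLERANCE = 0.05                          # allowed drop before the build fails

def load_baseline():
    with open(BASELINE_FILE) as f:
        return json.load(f)["success_rate"]

def test_no_success_rate_regression():
    # load_policy / load_scenes / evaluate_policy stand in for a team's own
    # evaluation harness built on top of the reconstructed environments.
    from eval_harness import load_policy, load_scenes, evaluate_policy
    success_rate, _ = evaluate_policy(load_policy(), load_scenes())
    assert success_rate >= load_baseline() - TOLERANCE, (
        f"Success rate {success_rate:.2f} regressed more than {TOLERANCE} below baseline"
    )
```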

Limitations & Future Work

  • Reconstruction fidelity: Extremely reflective or transparent surfaces still challenge the neural rendering step, leading to occasional physics inaccuracies.
  • Sensor modality gap: The current pipeline focuses on RGB‑D; extending to tactile, force, or proprioceptive modalities will require additional modeling.
  • Scalability of physics: While geometry is captured well, fine‑grained material properties (e.g., compliance) are approximated, which may affect tasks involving delicate manipulation.
  • Future directions highlighted by the authors include:
    1. Incorporating multi‑view video and active scanning to improve reconstruction quality.
    2. Learning end‑to‑end simulators that directly predict dynamics from raw video.
    3. Building a public repository of reconstructed environments for community benchmarking.

Authors

  • Arhan Jain
  • Mingtong Zhang
  • Kanav Arora
  • William Chen
  • Marcel Torne
  • Muhammad Zubair Irshad
  • Sergey Zakharov
  • Yue Wang
  • Sergey Levine
  • Chelsea Finn
  • Wei‑Chiu Ma
  • Dhruv Shah
  • Abhishek Gupta
  • Karl Pertsch

Paper Information

  • arXiv ID: 2512.16881v1
  • Categories: cs.RO, cs.LG
  • Published: December 18, 2025
