[Paper] RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots
Source: arXiv - 2603.04356v1
Overview
RoboCasa365 is a new, open‑source simulation benchmark that lets researchers and engineers train and evaluate “generalist” household robots at scale. By offering 365 everyday tasks spread across 2,500 richly varied kitchen layouts, together with thousands of hours of both human‑recorded and synthetic demonstrations, the platform fills a long‑standing gap: a reproducible, large‑scale yardstick for measuring how close we are to truly versatile home robots.
Key Contributions
- Massive task suite – 365 distinct kitchen‑related manipulation tasks (e.g., “make coffee”, “load dishwasher”, “store leftovers”).
- Diverse environments – 2,500 procedurally generated kitchen scenes covering different layouts, appliance models, and object placements.
- Huge demonstration corpus – more than 600 hours of real human tele‑operation data plus more than 1,600 hours of high‑fidelity synthetic demonstrations, all timestamped and annotated.
- Unified evaluation API – Standardized metrics for multi‑task learning, foundation‑model pre‑training, and lifelong learning scenarios, enabling fair head‑to‑head comparisons.
- Extensive baseline study – Systematic experiments with state‑of‑the‑art RL, imitation‑learning, and hybrid methods, dissecting the influence of task diversity, dataset size, and environment variation on generalization.
- Open‑source release – Full simulation code, data pipelines, and benchmark scripts are publicly available under a permissive license.
Methodology
RoboCasa365 builds on the existing RoboCasa physics‑based simulator (based on PyBullet/IsaacGym). The authors first procedurally generate a library of kitchen environments by randomizing:
- Layout geometry – cabinet positions, countertop dimensions, appliance locations.
- Object inventory – types, quantities, and initial poses of dishes, food items, utensils, etc.
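The procedural generation step above can be sketched as a seeded randomizer over layout geometry and object inventory. This is a minimal illustration, not the authors' actual pipeline; all class and field names here are assumptions.

```python
import random
from dataclasses import dataclass, field

@dataclass
class KitchenScene:
    """Minimal container for one randomized kitchen configuration (illustrative schema)."""
    counter_width_m: float
    appliance_positions: dict        # appliance name -> (x, y) placement
    objects: list = field(default_factory=list)

def generate_scene(rng: random.Random) -> KitchenScene:
    # Layout geometry: countertop dimensions and appliance locations.
    counter_width = rng.uniform(1.5, 4.0)
    appliances = {name: (rng.uniform(0.0, counter_width), 0.0)
                  for name in ("fridge", "stove", "dishwasher")}
    # Object inventory: types, quantities, and initial poses.
    inventory = [
        {"type": rng.choice(["cup", "plate", "pan", "apple"]),
         "pose": (rng.uniform(0.0, counter_width), rng.uniform(0.4, 0.8))}
        for _ in range(rng.randint(3, 10))
    ]
    return KitchenScene(counter_width, appliances, inventory)

# Seeding each scene makes the full 2,500-kitchen library reproducible.
scenes = [generate_scene(random.Random(seed)) for seed in range(2500)]
```

Because every scene is derived from an explicit seed, the same library can be regenerated exactly, which is what makes a procedurally generated benchmark reproducible across labs.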
Each environment is paired with a task definition that specifies a goal state (e.g., “cup placed on coaster”). For data collection, two pipelines are used:
- Human tele‑operation – Skilled operators control a virtual robot arm via a haptic device, producing high‑quality demonstrations.
- Synthetic generation – An automated planner (sampling‑based motion planner + grasp synthesis) creates additional trajectories, which are then refined with domain randomization to mimic human variability.
All demonstrations are stored as sequences of robot joint commands, RGB‑D observations, and semantic scene graphs. The benchmark defines three evaluation regimes:
| Regime | Goal | Typical Algorithm |
|---|---|---|
| Multi‑task learning | Train a single policy to solve all 365 tasks | Multi‑head RL / Task‑conditioned IL |
| Foundation model pre‑training | Pre‑train on the full demo corpus, then fine‑tune on a subset | Large‑scale behavior cloning + fine‑tuning |
| Lifelong learning | Incrementally add new tasks/environments without catastrophic forgetting | Continual RL / Replay buffers |
Performance is measured with success rate, time‑to‑completion, and a generalization score that penalizes over‑fitting to specific kitchen layouts.
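The paper does not spell out the generalization score's formula in this summary, so the version below, held‑out success divided by training‑layout success, is an assumed form that penalizes layout over‑fitting in the way described.

```python
def success_rate(outcomes: list) -> float:
    """Fraction of episodes that reached the goal state."""
    return sum(outcomes) / len(outcomes)

def generalization_score(train_outcomes: list, heldout_outcomes: list) -> float:
    """Assumed metric: ratio of held-out to training success, capped at 1.0.

    A policy that only memorizes its training kitchens scores near 0;
    a policy that transfers cleanly to unseen layouts scores near 1.
    """
    train_sr = success_rate(train_outcomes)
    if train_sr == 0.0:
        return 0.0
    return min(success_rate(heldout_outcomes) / train_sr, 1.0)
```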
Results & Findings
| Experiment | Key Variable | Outcome |
|---|---|---|
| Scaling demo data (0 → 2,200 h) | Dataset size | Success rate rose from ~22 % to ~58 % for a baseline behavior‑cloning model, with diminishing returns after ~1,500 h. |
| Varying environment diversity (500 → 2 500 kitchens) | Scene variation | Generalization score improved by ~30 % when training on the full set, confirming that visual and geometric diversity is crucial. |
| Multi‑task vs. single‑task training | Policy scope | A single universal policy achieved ~45 % average success across all tasks, beating a collection of 365 task‑specific policies (average ~38 %) while using the shared data more efficiently. |
| Lifelong learning with replay buffer | Catastrophic forgetting | Adding a modest replay buffer (5 % of past data) reduced forgetting from > 70 % drop to < 15 % when introducing 50 new tasks. |
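The replay‑buffer strategy from the lifelong‑learning experiment, retaining roughly 5 % of each past task's data and mixing it into new‑task batches, can be sketched as follows. The class and method names are illustrative, not the benchmark's API.

```python
import random

class ReplayBuffer:
    """Keep a small fraction of past demonstrations and mix them into
    new-task training batches to curb catastrophic forgetting."""

    def __init__(self, keep_fraction: float = 0.05, seed: int = 0):
        self.keep_fraction = keep_fraction
        self.rng = random.Random(seed)
        self.storage = []

    def add_task_data(self, demos: list) -> None:
        # After finishing a task, retain only a small random sample of it.
        k = max(1, int(len(demos) * self.keep_fraction))
        self.storage.extend(self.rng.sample(demos, k))

    def mixed_batch(self, new_demos: list, batch_size: int = 32,
                    replay_ratio: float = 0.5) -> list:
        # Interleave fresh demonstrations with replayed past data.
        n_replay = min(int(batch_size * replay_ratio), len(self.storage))
        batch = self.rng.sample(self.storage, n_replay) if n_replay else []
        batch += self.rng.sample(new_demos,
                                 min(batch_size - n_replay, len(new_demos)))
        return batch
```

With `keep_fraction=0.05` the buffer stores only 5 % of past data, matching the "modest replay buffer" setting the experiment reports.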
Overall, the authors conclude that both data scale and environment diversity are stronger predictors of generalization than sheer model size. Moreover, a single universal policy can be more data‑efficient than training many narrow experts, provided training leverages the benchmark's full task breadth.
Practical Implications
- Rapid prototyping for home‑robot startups – Developers can now benchmark new perception‑action pipelines against a realistic, varied kitchen suite before deploying on physical hardware, cutting down costly real‑world trial‑and‑error.
- Foundation‑model pre‑training pipelines – The massive demo corpus is ideal for training large‑scale imitation‑learning models (e.g., Diffusion‑based policies) that can later be fine‑tuned for specific household chores.
- Curriculum design for lifelong robots – Insights on replay‑buffer size and environment randomization give concrete guidelines for building robots that continuously acquire new skills without forgetting old ones.
- Standardized reporting – With a shared API and metrics, companies can publish “success rates on RoboCasa365” alongside real‑world demos, making progress comparable across the industry.
- Simulation‑to‑real transfer research – Because the synthetic demonstrations mimic human variability and the environments are highly diverse, the benchmark serves as a stress test for domain‑randomization and sim‑to‑real techniques, accelerating the path from simulation to a functional kitchen assistant.
Limitations & Future Work
- Simulation fidelity – While physics are reasonably accurate, certain tactile nuances (e.g., soft‑food deformation, precise friction) are still approximated, which may limit direct transfer to real‑world tasks involving delicate manipulation.
- Task scope – The benchmark focuses on kitchen environments; extending to other household domains (living rooms, bathrooms) would broaden applicability.
- Human data bias – The tele‑operated demonstrations come from a relatively small pool of operators, potentially encoding a narrow style of manipulation. Future releases could incorporate crowd‑sourced demos to increase behavioral diversity.
- Scalability of lifelong learning – Experiments added up to 50 new tasks; evaluating truly open‑ended curricula (hundreds of tasks over months) remains an open challenge.
- Benchmark evolution – The authors plan to release a “RoboCasa‑plus” version with dynamic objects (e.g., spilling liquids) and multi‑agent scenarios, which will further stress‑test generalist policies.
RoboCasa365 marks a significant step toward systematic, large‑scale evaluation of household robots. By lowering the barrier to reproducible benchmarking, it gives developers a concrete playground to iterate on algorithms that could one day turn the dream of a helpful kitchen robot into everyday reality.
Authors
- Soroush Nasiriany
- Sepehr Nasiriany
- Abhiram Maddukuri
- Yuke Zhu
Paper Information
- arXiv ID: 2603.04356v1
- Categories: cs.RO, cs.AI, cs.LG
- Published: March 4, 2026