[Paper] CRoSS: A Continual Robotic Simulation Suite for Scalable Reinforcement Learning with High Task Diversity and Realistic Physics Simulation
Source: arXiv - 2602.04868v1
Overview
The paper introduces CRoSS, a new benchmark suite that lets researchers train and evaluate continual reinforcement‑learning (CRL) agents on realistically simulated robots. By leveraging the Gazebo physics engine and a variety of sensor modalities, CRoSS offers a high‑fidelity, highly extensible platform for studying how agents can learn a sequence of tasks without forgetting earlier skills.
Key Contributions
- Two fully simulated robot platforms – a differential‑drive robot (lidar, camera, bumper) and a 7‑DoF robotic arm, covering both mobile‑robot and manipulation domains.
- Large task diversity – systematic variation of visual textures, arena layouts, and object properties yields hundreds of distinct line‑following, object‑pushing, and goal‑reaching tasks.
- Dual‑level control for the arm – high‑level Cartesian targets (mirroring the Continual World benchmark) and low‑level joint‑angle commands, plus a kinematics‑only mode that runs ~100× faster when physics isn’t needed.
- Containerized, reproducible setup – an Apptainer (formerly Singularity) image ships with all dependencies, enabling one‑click launches on Linux, HPC clusters, or cloud VMs.
- Baseline results – performance numbers for classic RL algorithms (DQN, PPO, SAC) across the full task suite, establishing reference points for future CRL work.
Methodology
- Simulation Environment – CRoSS builds on the open-source Gazebo simulator, which provides accurate rigid-body dynamics, contact modeling, and sensor noise. The two robots are defined in URDF files and equipped with plugins that expose raw sensor streams (e.g., lidar point clouds, RGB images) to the learning agent.
- Task Generation – For each robot, a parameter grid controls aspects such as arena size, line curvature, object shape, lighting, and texture. Sampling this grid yields a task sequence that the agent must master consecutively.
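The cross-product construction behind such a parameter grid can be sketched as follows. The parameter names and value ranges here are illustrative placeholders, not the actual CRoSS configuration:

```python
import itertools
import random

# Hypothetical parameter grid for the wheeled-robot line-following tasks;
# the real CRoSS parameter names and ranges may differ.
PARAM_GRID = {
    "arena_size":     ["small", "medium", "large"],
    "line_curvature": ["straight", "gentle", "sharp"],
    "texture":        ["wood", "tile", "carpet", "asphalt"],
    "lighting":       ["dim", "normal", "bright"],
}

def enumerate_tasks(grid):
    """Cross-product of all parameter values -> list of task dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]

def sample_task_sequence(grid, length, seed=0):
    """Draw a fixed-order task sequence for one continual-learning run."""
    tasks = enumerate_tasks(grid)
    rng = random.Random(seed)
    return rng.sample(tasks, length)

tasks = enumerate_tasks(PARAM_GRID)
print(len(tasks))  # 3 * 3 * 4 * 3 = 108 distinct task configurations
sequence = sample_task_sequence(PARAM_GRID, length=5)
```

Even this small grid already yields over a hundred distinct tasks, which is how systematic parameter variation produces the "hundreds of tasks" scale the paper reports.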
- Continual Learning Protocol – Agents train on one task until a performance threshold is met; the environment then switches to the next task without resetting the policy network. Metrics such as average return, forgetting rate, and forward transfer are logged.
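The train-until-threshold-then-switch loop can be sketched as below. The `train_step`/`evaluate` stubs are toy stand-ins for a real RL learner, not the paper's code:

```python
# Sketch of the continual-learning protocol: train on each task until a
# return threshold is met, then switch to the next task WITHOUT resetting
# the policy, logging performance on all tasks seen so far.

def run_continual_protocol(tasks, threshold, max_steps, train_step, evaluate):
    """After each task, record the evaluated return on every task seen so far."""
    history = []
    for i, task in enumerate(tasks):
        steps = 0
        while evaluate(task) < threshold and steps < max_steps:
            train_step(task)  # updates the single shared policy in place
            steps += 1
        history.append({t: evaluate(t) for t in tasks[: i + 1]})
    return history

# Toy learner: "skill" on each task grows linearly with training steps.
skill = {}
def train_step(task):
    skill[task] = skill.get(task, 0.0) + 0.1
def evaluate(task):
    return skill.get(task, 0.0)

hist = run_continual_protocol(["lineA", "lineB"], threshold=0.5,
                              max_steps=100, train_step=train_step,
                              evaluate=evaluate)
```

The per-task snapshots in `history` are exactly what is needed to compute forgetting and forward-transfer metrics after the run.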
- Baseline Algorithms – The authors implemented three off-the-shelf RL methods:
  - DQN (value-based, discrete actions for the wheeled robot)
  - PPO (policy-gradient, continuous actions for the arm)
  - SAC (soft actor-critic, continuous actions with entropy regularization)
  Each algorithm uses a modest neural architecture (2–3 hidden layers of 256 units each) and standard hyper-parameters, allowing fair comparison across tasks.
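To make the "2–3 hidden layers of 256 units" architecture concrete, here is a dependency-free forward-pass sketch; a real implementation would use a deep-learning framework, and the input/output dimensions below are illustrative, not taken from the paper:

```python
import math
import random

def make_mlp(sizes, seed=0):
    """Randomly initialized weights for an MLP, e.g. sizes=[obs_dim, 256, 256, act_dim]."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers

def forward(layers, x):
    """Forward pass: tanh hidden activations, linear output layer."""
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
             for row, bi in zip(w, b)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

# A policy head matching the "2 hidden layers of 256 units" description.
net = make_mlp([16, 256, 256, 4])
action_logits = forward(net, [0.5] * 16)
```

Keeping the architecture this small is what allows the same network budget to be reused fairly across all three algorithms and both robots.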
- Fast-Path Kinematics Mode – For manipulation tasks that don't require tactile feedback, the physics engine can be bypassed entirely. The arm's forward kinematics are computed analytically, cutting simulation time from ~30 ms per step to ~0.3 ms, which is useful for large-scale hyper-parameter sweeps.
Results & Findings
| Robot | Benchmark | Algorithm | Final Avg. Return | Forgetting (Δ after 5 tasks) |
|---|---|---|---|---|
| Wheeled | Line‑follow (100 variants) | DQN | 0.78 (normalized) | 0.12 |
| Wheeled | Object‑push (80 variants) | DQN | 0.71 | 0.18 |
| Arm (high‑level) | Goal‑reach (50 variants) | PPO | 0.84 | 0.09 |
| Arm (low‑level) | Goal‑reach (50 variants) | SAC | 0.88 | 0.07 |
- Learning curves show that agents quickly adapt to the first few tasks but experience a modest drop in performance on earlier tasks as the sequence progresses—typical of catastrophic forgetting.
- The policy-gradient methods (PPO, SAC), evaluated on the arm's continuous-control tasks, achieve higher final returns than the value-based DQN achieves on the wheeled-robot tasks, and they also exhibit lower forgetting rates.
- Kinematics‑only mode yields identical learning performance for the arm while reducing wall‑clock training time by roughly 100×, confirming that full physics simulation isn’t always necessary for certain CRL studies.
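The "Forgetting (Δ after 5 tasks)" column can be read as the average drop in per-task return between the moment a task finished training and the end of the sequence. The sketch below uses one common CRL definition of average forgetting with made-up toy numbers; the paper's exact formula and data may differ:

```python
def average_forgetting(returns_after_training, returns_at_end):
    """Mean drop in per-task return between just-after-training and the end
    of the whole task sequence (a common CRL forgetting metric; the paper's
    exact definition may differ)."""
    drops = [after - end
             for after, end in zip(returns_after_training, returns_at_end)]
    return sum(drops) / len(drops)

# Toy per-task returns for a 5-task sequence (illustrative numbers only).
just_after = [0.90, 0.88, 0.85, 0.86, 0.84]
at_end     = [0.76, 0.76, 0.71, 0.74, 0.76]
print(round(average_forgetting(just_after, at_end), 2))  # 0.12
```

A Δ of 0 would mean no performance was lost on earlier tasks; larger values indicate stronger catastrophic forgetting.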
Practical Implications
- Robotics developers can use CRoSS as a drop‑in testbed for continual‑learning pipelines before deploying on real hardware, reducing costly trial‑and‑error on physical robots.
- Simulation‑to‑real transfer is facilitated by the realistic sensor models (camera noise, lidar dropout) and physics, making policies trained in CRoSS a strong starting point for sim‑2‑real fine‑tuning.
- Benchmarking new CRL algorithms becomes more transparent: the containerized environment eliminates “works on my machine” issues and the task suite’s parameterization lets teams design custom curricula (e.g., curriculum learning, meta‑learning).
- Edge‑compute research benefits from the fast kinematics mode, enabling rapid iteration on lightweight models that could eventually run on embedded robot controllers.
Limitations & Future Work
- Simulation fidelity vs. speed trade‑off: while Gazebo offers high realism, it remains slower than pure kinematic simulators, which may limit large‑scale hyper‑parameter sweeps for physics‑heavy tasks.
- Sensor diversity is still bounded: the suite currently supports lidar, RGB camera, and bumper; adding tactile or force‑torque sensors would broaden applicability to more dexterous manipulation scenarios.
- Task ordering is fixed in the presented experiments; exploring adaptive curricula or adversarial task sequences could reveal deeper insights into continual learning dynamics.
- Real‑world validation is left for future work—bridging the gap between CRoSS policies and actual robot deployments will be essential to confirm the benchmark’s practical relevance.
Authors
- Yannick Denker
- Alexander Gepperth
Paper Information
- arXiv ID: 2602.04868v1
- Categories: cs.LG, cs.AI
- Published: February 4, 2026