[Paper] CRoSS: A Continual Robotic Simulation Suite for Scalable Reinforcement Learning with High Task Diversity and Realistic Physics Simulation
Source: arXiv - 2602.04868v1
Overview
The paper introduces CRoSS, a new benchmark suite that lets researchers train and evaluate continual reinforcement‑learning (CRL) agents on realistically simulated robots. By leveraging the Gazebo physics engine and a variety of sensor modalities, CRoSS offers a high‑fidelity, highly extensible platform for studying how agents can learn a sequence of tasks without forgetting earlier skills.
Key Contributions
- Two fully simulated robot platforms – a differential‑drive robot (lidar, camera, bumper) and a 7‑DoF robotic arm, covering both mobile‑robot and manipulation domains.
- Large task diversity – systematic variation of visual textures, arena layouts, and object properties yields hundreds of distinct line‑following, object‑pushing, and goal‑reaching tasks.
- Dual‑level control for the arm – high‑level Cartesian targets (mirroring the Continual World benchmark) and low‑level joint‑angle commands, plus a kinematics‑only mode that runs ~100× faster when physics isn’t needed.
- Containerized, reproducible setup – an Apptainer (formerly Singularity) image ships with all dependencies, enabling one‑click launches on Linux, HPC clusters, or cloud VMs.
- Baseline results – performance numbers for classic RL algorithms (DQN, PPO, SAC) across the full task suite, establishing reference points for future CRL work.
Methodology
- Simulation Environment – CRoSS builds on the open-source Gazebo simulator, which provides accurate rigid-body dynamics, contact modeling, and sensor noise. The two robots are defined in URDF files and equipped with plugins that expose raw sensor streams (e.g., lidar point clouds, RGB images) to the learning agent.
- Task Generation – For each robot, a parameter grid controls aspects such as arena size, line curvature, object shape, lighting, and texture. Sampling this grid yields a task sequence that the agent must master consecutively.
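The cross-product construction behind such a parameter grid can be sketched as follows. The parameter names and value ranges here are illustrative placeholders, not the actual CRoSS configuration:

```python
import itertools
import random

# Hypothetical parameter grid for the wheeled-robot line-following tasks;
# the real CRoSS parameter names and ranges may differ.
PARAM_GRID = {
    "arena_size":     ["small", "medium", "large"],
    "line_curvature": ["straight", "gentle", "sharp"],
    "texture":        ["wood", "tile", "carpet", "asphalt"],
    "lighting":       ["dim", "normal", "bright"],
}

def enumerate_tasks(grid):
    """Cross-product of all parameter values -> list of task dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]

def sample_task_sequence(grid, length, seed=0):
    """Draw a fixed-order task sequence for one continual-learning run."""
    tasks = enumerate_tasks(grid)
    rng = random.Random(seed)
    return rng.sample(tasks, length)

tasks = enumerate_tasks(PARAM_GRID)
print(len(tasks))  # 3 * 3 * 4 * 3 = 108 distinct task configurations
sequence = sample_task_sequence(PARAM_GRID, length=5)
```

Even this small grid already yields over a hundred distinct tasks, which is how systematic parameter variation produces the "hundreds of tasks" scale the paper reports.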
- Continual Learning Protocol – Agents train on one task until a performance threshold is met; the environment then switches to the next task without resetting the policy network. Metrics such as average return, forgetting rate, and forward transfer are logged.
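The train-until-threshold-then-switch loop can be sketched as below. The `train_step`/`evaluate` stubs are toy stand-ins for a real RL learner, not the paper's code:

```python
# Sketch of the continual-learning protocol: train on each task until a
# return threshold is met, then switch to the next task WITHOUT resetting
# the policy, logging performance on all tasks seen so far.

def run_continual_protocol(tasks, threshold, max_steps, train_step, evaluate):
    """After each task, record the evaluated return on every task seen so far."""
    history = []
    for i, task in enumerate(tasks):
        steps = 0
        while evaluate(task) < threshold and steps < max_steps:
            train_step(task)  # updates the single shared policy in place
            steps += 1
        history.append({t: evaluate(t) for t in tasks[: i + 1]})
    return history

# Toy learner: "skill" on each task grows linearly with training steps.
skill = {}
def train_step(task):
    skill[task] = skill.get(task, 0.0) + 0.1
def evaluate(task):
    return skill.get(task, 0.0)

hist = run_continual_protocol(["lineA", "lineB"], threshold=0.5,
                              max_steps=100, train_step=train_step,
                              evaluate=evaluate)
```

The per-task snapshots in `history` are exactly what is needed to compute forgetting and forward-transfer metrics after the run.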
- Baseline Algorithms – The authors implemented three off-the-shelf RL methods:
  - DQN (value-based, discrete actions for the wheeled robot)
  - PPO (policy-gradient, continuous actions for the arm)
  - SAC (soft actor-critic, continuous actions with entropy regularization)
  Each algorithm uses a modest neural architecture (2–3 hidden layers of 256 units each) and standard hyper-parameters, allowing fair comparison across tasks.
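To make the "2–3 hidden layers of 256 units" architecture concrete, here is a dependency-free forward-pass sketch; a real implementation would use a deep-learning framework, and the input/output dimensions below are illustrative, not taken from the paper:

```python
import math
import random

def make_mlp(sizes, seed=0):
    """Randomly initialized weights for an MLP, e.g. sizes=[obs_dim, 256, 256, act_dim]."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers

def forward(layers, x):
    """Forward pass: tanh hidden activations, linear output layer."""
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
             for row, bi in zip(w, b)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

# A policy head matching the "2 hidden layers of 256 units" description.
net = make_mlp([16, 256, 256, 4])
action_logits = forward(net, [0.5] * 16)
```

Keeping the architecture this small is what allows the same network budget to be reused fairly across all three algorithms and both robots.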
- Fast-Path Kinematics Mode – For manipulation tasks that don't require tactile feedback, the physics engine can be bypassed entirely. The arm's forward kinematics are computed analytically, cutting simulation time from ~30 ms per step to ~0.3 ms, which is useful for large-scale hyper-parameter sweeps.
Results & Findings
| Robot | Benchmark | Algorithm | Final Avg. Return | Forgetting (Δ after 5 tasks) |
|---|---|---|---|---|
| Wheeled | Line‑follow (100 variants) | DQN | 0.78 (normalized) | 0.12 |
| Wheeled | Object‑push (80 variants) | DQN | 0.71 | 0.18 |
| Arm (high‑level) | Goal‑reach (50 variants) | PPO | 0.84 | 0.09 |
| Arm (low‑level) | Goal‑reach (50 variants) | SAC | 0.88 | 0.07 |
- Learning curves show that agents quickly adapt to the first few tasks but experience a modest drop in performance on earlier tasks as the sequence progresses—typical of catastrophic forgetting.
- The policy-gradient methods (PPO, SAC), evaluated on the arm's continuous-control tasks, achieve higher final returns than the value-based DQN achieves on the wheeled-robot tasks, and they also exhibit lower forgetting rates.
- Kinematics‑only mode yields identical learning performance for the arm while reducing wall‑clock training time by roughly 100×, confirming that full physics simulation isn’t always necessary for certain CRL studies.
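The "Forgetting (Δ after 5 tasks)" column can be read as the average drop in per-task return between the moment a task finished training and the end of the sequence. The sketch below uses one common CRL definition of average forgetting with made-up toy numbers; the paper's exact formula and data may differ:

```python
def average_forgetting(returns_after_training, returns_at_end):
    """Mean drop in per-task return between just-after-training and the end
    of the whole task sequence (a common CRL forgetting metric; the paper's
    exact definition may differ)."""
    drops = [after - end
             for after, end in zip(returns_after_training, returns_at_end)]
    return sum(drops) / len(drops)

# Toy per-task returns for a 5-task sequence (illustrative numbers only).
just_after = [0.90, 0.88, 0.85, 0.86, 0.84]
at_end     = [0.76, 0.76, 0.71, 0.74, 0.76]
print(round(average_forgetting(just_after, at_end), 2))  # 0.12
```

A Δ of 0 would mean no performance was lost on earlier tasks; larger values indicate stronger catastrophic forgetting.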
Practical Implications
- Robotics developers can use CRoSS as a drop‑in testbed for continual‑learning pipelines before deploying on real hardware, reducing costly trial‑and‑error on physical robots.
- Simulation‑to‑real transfer is facilitated by the realistic sensor models (camera noise, lidar dropout) and physics, making policies trained in CRoSS a strong starting point for sim‑2‑real fine‑tuning.
- Benchmarking new CRL algorithms becomes more transparent: the containerized environment eliminates “works on my machine” issues and the task suite’s parameterization lets teams design custom curricula (e.g., curriculum learning, meta‑learning).
- Edge‑compute research benefits from the fast kinematics mode, enabling rapid iteration on lightweight models that could eventually run on embedded robot controllers.
Limitations & Future Work
- Simulation fidelity vs. speed trade‑off: while Gazebo offers high realism, it remains slower than pure kinematic simulators, which may limit large‑scale hyper‑parameter sweeps for physics‑heavy tasks.
- Sensor diversity is still bounded: the suite currently supports lidar, RGB camera, and bumper; adding tactile or force‑torque sensors would broaden applicability to more dexterous manipulation scenarios.
- Task ordering is fixed in the presented experiments; exploring adaptive curricula or adversarial task sequences could reveal deeper insights into continual learning dynamics.
- Real‑world validation is left for future work—bridging the gap between CRoSS policies and actual robot deployments will be essential to confirm the benchmark’s practical relevance.
Authors
- Yannick Denker
- Alexander Gepperth
Paper Information
- arXiv ID: 2602.04868v1
- Categories: cs.LG, cs.AI
- Published: February 4, 2026