[Paper] Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning
Source: arXiv - 2603.16842v1
Overview
The paper Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning investigates a surprisingly simple trick borrowed from statistical physics: intermittently forcing an RL agent back to a fixed “reset” state. The authors show that this stochastic resetting can dramatically speed up policy learning—both in classic tabular grid worlds and in modern deep RL settings—without altering the optimal solution.
Key Contributions
- Theoretical bridge: Connects the physics concept of stochastic resetting (used to optimise first‑passage times) with reinforcement‑learning dynamics that evolve through experience.
- Empirical evidence in tabular domains: Demonstrates that resetting reduces the number of updates a policy needs to converge, even when it does not improve raw search speed for a naïve diffusive agent.
- Deep RL validation: Shows that random resets improve performance on a continuous‑control benchmark with sparse rewards, where exploration is otherwise extremely hard.
- Mechanistic insight: Argues that resetting truncates long, low‑information trajectories, thereby sharpening value‑propagation and accelerating temporal‑difference learning—while leaving the optimal policy unchanged.
- Practical recipe: Provides a lightweight, tunable hyper‑parameter (reset probability) that can be dropped into existing RL pipelines with minimal code changes.
Methodology
- Tabular experiments – The authors use small grid‑worlds (e.g., 5×5 mazes) where the state‑action value table is updated via standard Q‑learning. After each episode, with probability p the agent is teleported back to a designated “reset” cell; otherwise it starts from a random start state.
- Deep RL experiments – They adopt a continuous‑control task (a 2‑D navigation problem with sparse goal reward) and train a Soft Actor‑Critic (SAC) agent equipped with a neural‑network value function. After each episode, the environment may reset the agent to a fixed origin with probability p.
- Metrics – Convergence speed is measured by the number of environment steps needed for the policy’s average return to reach a pre‑defined threshold. They also track the distribution of episode lengths and the variance of TD‑errors.
- Baselines – Comparisons are made against (i) vanilla RL without resets, (ii) increased temporal discounting, and (iii) curriculum‑style start‑state sampling.
All experiments are repeated across multiple random seeds, and the reset probability p is swept from 0 (no reset) to 0.5 to study its effect.
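The tabular setup above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the grid layout, reward (1 only at the goal), and hyper-parameter values are assumptions; only the mechanism — with probability p the episode starts from a fixed reset cell, otherwise from a random state — follows the paper's description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 5x5 grid world: states are (row, col); sparse reward at the goal.
N = 5
GOAL = (N - 1, N - 1)
RESET_STATE = (0, 0)                           # designated "reset" cell
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, a):
    r, c = state
    dr, dc = ACTIONS[a]
    ns = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    return ns, (1.0 if ns == GOAL else 0.0), ns == GOAL

def q_learning(p_reset, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    Q = np.zeros((N, N, len(ACTIONS)))
    for _ in range(episodes):
        # Stochastic resetting: with probability p the episode starts from the
        # fixed reset cell, otherwise from a random start state.
        if rng.random() < p_reset:
            s = RESET_STATE
        else:
            s = (int(rng.integers(N)), int(rng.integers(N)))
        done = s == GOAL
        while not done:
            a = int(rng.integers(len(ACTIONS))) if rng.random() < eps \
                else int(np.argmax(Q[s]))
            ns, r, done = step(s, a)
            # Standard TD(0) update; resetting changes only the start-state
            # distribution, not the update rule or the optimal policy.
            Q[s][a] += alpha * (r + gamma * np.max(Q[ns]) * (not done) - Q[s][a])
            s = ns
    return Q

Q = q_learning(p_reset=0.1)
```

Sweeping `p_reset` from 0 to 0.5, as in the paper, and timing convergence of the greedy policy reproduces the kind of comparison reported in the results.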
Results & Findings
- Faster convergence in tabular Q‑learning – Even modest reset probabilities (p ≈ 0.1) cut the required learning steps by ~30‑40 % compared to the baseline. The optimal policy remains identical; only the learning trajectory is shortened.
- Benefit despite unchanged first‑passage time – In some mazes, resetting does not reduce the expected time for a random walker to reach the goal, yet it still speeds up policy learning—highlighting a mechanism distinct from classic first‑passage optimisation.
- Deep RL gains in sparse‑reward settings – For the continuous navigation task, SAC with resets reaches the target success rate ~2× faster than vanilla SAC. The improvement is most pronounced when the reward is extremely sparse (only at the goal).
- Reduced TD‑error variance – Resetting truncates long, uninformative trajectories, leading to tighter TD‑error distributions and more stable gradient updates.
- Robustness to reset frequency – Too high a reset probability (p > 0.4) can degrade performance by limiting exposure to diverse states, but a sweet spot around 0.1–0.2 works across tasks.
Practical Implications
- Plug‑and‑play exploration aid – Adding a stochastic reset is as simple as inserting a conditional env.reset() call after each episode; no changes to the learning algorithm or network architecture are required.
- Sparse‑reward problems – In robotics, autonomous navigation, or any domain where meaningful feedback is rare, resets can dramatically shorten the “cold‑start” phase.
- Curriculum design alternative – Instead of hand‑crafting a curriculum of increasingly difficult start states, a random reset provides an automatic way to keep the agent near informative regions of the state space.
- Hyper‑parameter tuning – The reset probability can be treated like a learning‑rate schedule: start low, increase early in training, then decay as the policy stabilises.
- Compatibility with existing frameworks – The technique works with both on‑policy (e.g., PPO) and off‑policy (e.g., DQN, SAC) algorithms, making it a broadly applicable tool for RL engineers.
Limitations & Future Work
- State‑dependence not explored – The paper only investigates a single, fixed reset state. Adaptive or learned reset locations could further improve efficiency.
- Scalability to high‑dimensional tasks – Experiments are limited to modest grid worlds and a low‑dimensional navigation benchmark; it remains unclear how resets behave in complex domains like Atari or MuJoCo.
- Potential bias in non‑ergodic environments – In environments where certain states are only reachable via long trajectories, frequent resets might prevent the agent from ever discovering them.
- Theoretical analysis – While the authors provide intuition, a formal convergence proof for stochastic resetting in deep RL is still open.
Future research directions include learning optimal reset policies, integrating resets with intrinsic‑motivation signals, and extending the analysis to multi‑agent or hierarchical RL settings.
Authors
- Jello Zhou
- Vudtiwat Ngampruetikorn
- David J. Schwab
Paper Information
- arXiv ID: 2603.16842v1
- Categories: cs.LG, cond-mat.dis-nn, cond-mat.stat-mech, eess.SY, physics.bio-ph
- Published: March 17, 2026