[Paper] Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation

Published: January 12, 2026 at 01:53 PM EST
4 min read
Source: arXiv - 2601.07821v1

Overview

The paper “Failure‑Aware RL: Reliable Offline‑to‑Online Reinforcement Learning with Self‑Recovery for Real‑World Manipulation” tackles a roadblock that keeps many robotics teams from deploying RL‑based controllers in the field: the risk of intervention‑requiring failures (IR failures) such as spilling liquids or breaking fragile objects during the learning phase. By combining a safety‑oriented world model with an offline‑trained recovery policy, the authors present a framework—FARL—that dramatically cuts down on such costly mishaps while still improving task performance.

Key Contributions

  • FailureBench – a new benchmark suite that injects realistic failure scenarios (e.g., object breakage, spills) into standard manipulation tasks, forcing algorithms to handle human‑intervention cases.
  • FARL paradigm – an offline‑to‑online RL pipeline that explicitly reasons about failure risk using a world‑model‑based safety critic and a self‑recovery policy learned from offline data.
  • Safety‑aware exploration – the safety critic predicts the probability of an IR failure for each candidate action, allowing the agent to reject risky actions before they are executed (a minimal sketch of this filtering step follows the list).
  • Self‑recovery mechanism – when a failure is unavoidable, the recovery policy intervenes to bring the system back to a safe state without human assistance.
  • Empirical validation – extensive simulation and real‑world robot experiments show a 73 % reduction in IR failures and an average 11 % boost in task performance compared with standard offline‑to‑online RL baselines.
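
To make the safety‑aware exploration idea concrete, here is a minimal sketch of how a world‑model‑based safety critic might score and filter candidate actions. The interfaces (`world_model.step`, `world_model.sample_policy_action`) and the threshold value are illustrative assumptions, not the paper's actual API.

```python
import random

# Assumed tolerance for predicted IR-failure probability (illustrative value).
RISK_THRESHOLD = 0.05


def estimate_risk(world_model, state, action, horizon=5, n_samples=20):
    """Monte Carlo estimate of IR-failure probability via imagined rollouts.

    Assumes `world_model.step(state, action)` returns a predicted next state
    and a learned failure likelihood in [0, 1], and that the model can also
    sample follow-up actions from the current policy.
    """
    failures = 0
    for _ in range(n_samples):
        s, a = state, action
        for _ in range(horizon):
            s, failure_prob = world_model.step(s, a)
            if random.random() < failure_prob:
                failures += 1
                break
            a = world_model.sample_policy_action(s)  # assumed helper
    return failures / n_samples


def filter_safe_actions(world_model, state, candidate_actions):
    """Keep only candidates whose predicted IR-failure risk is acceptable."""
    return [a for a in candidate_actions
            if estimate_risk(world_model, state, a) <= RISK_THRESHOLD]
```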

Methodology

  1. Offline Data Collection – The robot first gathers a dataset of safe trajectories and a separate set of failure episodes (e.g., dropping a cup).
  2. World‑Model Training – A dynamics model is learned from the offline data to predict future states and the likelihood of entering a failure region.
  3. Safety Critic – Using the world model, the safety critic evaluates each candidate action during online exploration, outputting a risk score. Actions whose risk exceeds a threshold are filtered out.
  4. Recovery Policy – A policy trained offline on the failure episodes learns how to undo or mitigate a failure (e.g., pick up a spilled object, re‑grasp a dropped item). When the safety critic flags an unavoidable failure, the recovery policy is invoked automatically.
  5. Online Fine‑Tuning – The main task policy continues to improve via standard RL updates, but only on actions that passed the safety check, so that learning proceeds while causing far fewer additional IR failures (see the sketch after this list).
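
Putting the five steps together, the online phase roughly amounts to the loop sketched below. This is a sketch under assumed interfaces (`task_policy`, `safety_critic`, `recovery_policy`, a gym‑style `env`), not the authors' implementation.

```python
def online_finetune_episode(env, task_policy, safety_critic, recovery_policy,
                            replay_buffer, risk_threshold=0.05):
    """One episode of safety-filtered online fine-tuning (illustrative only)."""
    state = env.reset()
    done = False
    while not done:
        # Propose candidate actions from the current task policy.
        candidates = task_policy.propose(state, n=16)

        # Step 3: discard candidates whose predicted IR-failure risk is too high.
        safe = [a for a in candidates
                if safety_critic.risk(state, a) <= risk_threshold]

        if safe:
            action = task_policy.select(state, safe)
        else:
            # Step 4: no safe task action remains, so hand control to the
            # offline-trained recovery policy to return to a safe state.
            action = recovery_policy.act(state)

        next_state, reward, done, _ = env.step(action)

        # Step 5: only safety-checked transitions feed the RL update.
        if safe:
            replay_buffer.add(state, action, reward, next_state, done)
            task_policy.update(replay_buffer)

        state = next_state
```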

All components are modular, so developers can swap in alternative world‑model architectures (e.g., ensembles, diffusion models) or recovery strategies without redesigning the whole pipeline.
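
One way to picture that modularity is to hide each component behind a small interface, so an ensemble or diffusion world model can be dropped in without touching the rest of the pipeline. The `Protocol` definitions below are an assumed design sketch, not code from the paper.

```python
from typing import Protocol, Tuple


class WorldModel(Protocol):
    """Any dynamics model that predicts the next state and a failure likelihood."""

    def step(self, state, action) -> Tuple[object, float]: ...


class SafetyCritic(Protocol):
    """Scores the IR-failure risk of taking `action` in `state`."""

    def risk(self, state, action) -> float: ...


class RecoveryPolicy(Protocol):
    """Maps a (near-)failure state to an action that restores safe operation."""

    def act(self, state): ...
```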

Results & Findings

| Setting | IR‑Failure Reduction | Performance Gain* |
| --- | --- | --- |
| Simulation (pick‑and‑place) | 71 % | +9 % |
| Real‑world robot (water‑pouring) | 73 % | +11 % |
| Generalization to unseen objects | 68 % | +8 % success rate |

*Performance measured as task‑specific success rate (e.g., correctly placing an object).

Key Takeaways

  • The safety critic reliably predicts high‑risk actions, cutting down on costly human interventions.
  • The recovery policy restores safe operation in >90 % of failure cases, eliminating the need for manual resets.
  • Even with the safety filter in place, the main policy still receives enough diverse experience to improve beyond the offline baseline, suggesting that the presumed trade‑off between safe exploration and learning progress is not inevitable.

Practical Implications

  • Reduced downtime – Manufacturing cells can let robots continue learning on the fly without frequent human stops for resets or clean‑ups.
  • Lower operational risk – Service robots (e.g., kitchen assistants) can self‑detect and mitigate spills or breakages, improving safety for users and property.
  • Cost‑effective data collection – Teams can safely gather online experience in the field, accelerating the transition from simulation to deployment.
  • Plug‑and‑play safety layer – Because FARL’s safety critic and recovery policy are decoupled from the task policy, existing RL controllers can be retrofitted with minimal code changes (see the wrapper sketch below).
  • Regulatory friendliness – Demonstrating a quantifiable reduction in hazardous failures helps satisfy safety certifications for collaborative robots.
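
The "plug‑and‑play safety layer" point can be illustrated with a thin wrapper around an existing controller. The class and method names here (`controller.act`, `safety_critic.risk`, `recovery_policy.act`) are assumptions for illustration, not a published API.

```python
class SafetyWrappedController:
    """Retrofits an existing controller with a risk check and recovery fallback."""

    def __init__(self, controller, safety_critic, recovery_policy,
                 risk_threshold=0.05):
        self.controller = controller
        self.safety_critic = safety_critic
        self.recovery_policy = recovery_policy
        self.risk_threshold = risk_threshold  # assumed tolerance, tuned per task

    def act(self, state):
        action = self.controller.act(state)
        if self.safety_critic.risk(state, action) > self.risk_threshold:
            # Fall back to the recovery behaviour instead of a human reset.
            return self.recovery_policy.act(state)
        return action
```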

Limitations & Future Work

  • Model fidelity – The safety critic relies on the accuracy of the learned world model; in highly stochastic environments (e.g., deformable objects) prediction errors can still let risky actions slip through.
  • Recovery scope – The current recovery policy handles a predefined set of failure types; extending it to arbitrary, unforeseen failures remains an open challenge.
  • Scalability to high‑dimensional tasks – Experiments focus on manipulation with a handful of objects; scaling to complex multi‑robot or mobile manipulation scenarios may require more efficient risk‑evaluation strategies.
  • Human‑in‑the‑loop fallback – While FARL reduces IR failures, the system still assumes a human can intervene if the safety filter fails—future work could explore fully autonomous self‑repair without any external supervision.

Overall, FARL offers a pragmatic roadmap for bringing the promise of reinforcement‑learning‑driven robots into real‑world settings where safety and reliability are non‑negotiable.

Authors

  • Huanyu Li
  • Kun Lei
  • Sheng Zang
  • Kaizhe Hu
  • Yongyuan Liang
  • Bo An
  • Xiaoli Li
  • Huazhe Xu

Paper Information

  • arXiv ID: 2601.07821v1
  • Categories: cs.RO, cs.AI, cs.LG
  • Published: January 12, 2026