[Paper] When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

Published: May 6, 2026 at 01:40 PM EDT
Source: arXiv

Overview

The paper introduces Q2RL, a novel framework that turns a behavior‑cloned (BC) robot policy into a reinforcement‑learning (RL) agent by extracting a Q‑function from the BC policy and then gating between the BC and RL actions during online learning. This approach bridges the gap between fast, demonstration‑driven learning and the self‑improving capabilities of RL, enabling real robots to refine their skills in just a few hours of interaction.

Key Contributions

  • Q‑Estimation from BC: A lightweight procedure that derives an approximate Q‑function for a BC policy using only a handful of environment rollouts.
  • Q‑Gating Mechanism: An online selector that chooses the action (BC or RL) with the higher estimated Q‑value, ensuring safe exploration while still gathering useful data for RL.
  • Offline‑to‑Online Pipeline: A unified algorithm that starts from a static BC policy and continuously improves it without the catastrophic forgetting typical of naïve offline‑to‑online methods.
  • Empirical Validation: State‑of‑the‑art performance on D4RL and RoboMimic manipulation suites, plus successful real‑robot experiments (pipe assembly, kitting) with up to 100 % success after 1–2 h of online interaction.
  • Open‑source Release: Code, pretrained models, and demonstration videos are publicly available, facilitating reproducibility and rapid adoption.
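
The first two contributions can be written compactly. The following is one plausible formalization, not an equation copied from the paper: it assumes a discount factor \(\gamma\), a set \(\mathcal{D}_{\text{BC}}\) of transitions collected by rolling out the BC policy \(\pi_{\text{BC}}\), and a tie-break in favor of the BC action. \(Q_{\phi}\) is the Q-function fitted to the BC policy and \(Q_{\theta}\) is the RL critic.

```latex
% Q-Estimation: one-step Bellman residual for the fixed BC policy (assumed form)
\[
\mathcal{L}(\phi) =
\mathbb{E}_{(s,a,r,s') \sim \mathcal{D}_{\text{BC}}}
\Big[ \big( Q_{\phi}(s,a) - r - \gamma\, Q_{\phi}\big(s', \pi_{\text{BC}}(s')\big) \big)^{2} \Big]
\]

% Q-Gating: execute whichever proposed action has the higher estimated value
\[
a_t =
\begin{cases}
a_{\text{BC}} & \text{if } Q_{\phi}(s_t, a_{\text{BC}}) \ge Q_{\theta}(s_t, a_{\text{RL}}), \\
a_{\text{RL}} & \text{otherwise.}
\end{cases}
\]
```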

Methodology

  1. Start with a Behavior‑Cloned Policy

    • The BC policy is trained offline on a dataset of human demonstrations (e.g., tele‑operated robot trajectories).
  2. Q‑Estimation (Extracting a Q‑function)

    • Collect a small set of short rollouts (≈ 10–20 episodes) using the BC policy.
    • Fit a value network \(Q_{\phi}(s,a)\) by minimizing the Bellman residual on these samples, treating the BC actions as “expert” actions.
    • Because the BC already performs well, the resulting Q‑function is a good proxy for the true return landscape around the demonstrated trajectories.
  3. Q‑Gating (Online Action Selection)

    • During each interaction step, compute \(Q_{\phi}(s,a_{\text{BC}})\) and \(Q_{\theta}(s,a_{\text{RL}})\), where \(a_{\text{RL}}\) is the action proposed by the current RL policy (e.g., SAC).
    • Execute the action with the higher Q‑value. If the BC action wins, the RL policy still receives the transition for learning; if the RL action wins, the robot explores a potentially better behavior.
  4. RL Policy Update

    • Standard off‑policy RL (Soft Actor‑Critic) is used to improve the RL policy with the mixed data stream.
    • The Q‑estimator is periodically refreshed with new data to keep its predictions aligned with the evolving environment dynamics.
  5. Iterate

    • The gating loop continues until the RL policy consistently outperforms the BC baseline, at which point the system can optionally drop the BC entirely.
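
Steps 2 and 3 can be sketched on a toy problem. The code below is a minimal, tabular stand-in, not the authors' implementation: it replaces the value network with a dictionary, fits it with SARSA-style TD(0) updates on a single BC rollout, and then gates between a BC action and an RL action. The 4-state chain task, the discount factor, the tie-break toward BC, the two constant "RL critics", and all function names are assumptions made for illustration.

```python
# Toy, tabular stand-in for Q-Estimation and Q-Gating. The chain task,
# rollout, and function names are illustrative assumptions, not the paper's code.

GAMMA = 0.9  # assumed discount factor


def fit_q_from_bc_rollouts(transitions, gamma=GAMMA, lr=0.1, sweeps=500):
    """Estimate Q for a fixed (BC) policy with TD(0) updates that shrink
    the one-step Bellman residual on logged transitions.

    Each transition is (s, a, r, s_next, a_next, done), where a_next is the
    action the BC policy took in s_next.
    """
    q = {}
    for _ in range(sweeps):
        for s, a, r, s_next, a_next, done in transitions:
            bootstrap = 0.0 if done else q.get((s_next, a_next), 0.0)
            target = r + gamma * bootstrap
            current = q.get((s, a), 0.0)
            q[(s, a)] = current + lr * (target - current)
    return q


def q_gate(state, a_bc, a_rl, q_bc, q_rl):
    """Pick whichever proposed action has the higher estimated Q-value.

    q_bc and q_rl are callables (state, action) -> float. Ties go to the
    BC action, which is an assumption made here, not a detail from the paper.
    """
    if q_bc(state, a_bc) >= q_rl(state, a_rl):
        return a_bc, "bc"
    return a_rl, "rl"


# One BC rollout on a 4-state chain: the policy always steps right (+1)
# and receives reward 1 on reaching the terminal state 3.
BC_ROLLOUT = [
    (0, 1, 0.0, 1, 1, False),
    (1, 1, 0.0, 2, 1, False),
    (2, 1, 1.0, 3, None, True),
]

q_table = fit_q_from_bc_rollouts(BC_ROLLOUT)


def q_bc(state, action):
    """Look up the fitted BC Q-value; unseen pairs default to 0."""
    return q_table.get((state, action), 0.0)


def pessimistic_rl_critic(state, action):
    return 0.5  # hypothetical critic early in training


def optimistic_rl_critic(state, action):
    return 0.95  # hypothetical critic after the RL policy has improved


if __name__ == "__main__":
    print(q_gate(0, 1, -1, q_bc, pessimistic_rl_critic))
    print(q_gate(0, 1, -1, q_bc, optimistic_rl_critic))
```

In this toy setting the fitted value of the BC action in state 0 approaches 0.9² = 0.81, so the gate keeps the BC action while the RL critic reports 0.5 and hands control to the RL action once the critic reports 0.95. A real system would use neural Q-functions and an off-policy learner such as SAC in place of the constant critics.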

Results & Findings

| Benchmark | Metric | BC Baseline | Q2RL | Prior Offline‑to‑Online (e.g., AWAC, IQL) |
| --- | --- | --- | --- | --- |
| D4RL Pick‑Place | Success Rate | 68 % | 89 % | 73 % |
| RoboMimic Door Opening | Success Rate | 45 % | 78 % | 61 % |
| Real‑Robot Pipe Assembly | Success Rate (after 2 h) | 25 % | 100 % | 62 % |
| Real‑Robot Kitting | Success Rate (after 1.5 h) | 30 % | 92 % | 55 % |
| Sample Efficiency | Episodes to 80 % success | 1500 | ≈ 400 | 900 |

  • Speed of convergence: Q2RL reaches high success rates 2–4× faster than competing methods.
  • Safety: The gating mechanism prevents the RL policy from taking catastrophically bad actions early on, which is crucial for real‑world hardware.
  • Robustness: Even on contact‑rich tasks with high precision requirements, the learned policies remain stable across multiple trials.
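
The "2–4× faster" figure can be checked against the sample-efficiency row of the results table. The short script below only redoes that arithmetic; the dictionary keys and the helper name `speedup_over` are made up for this example, and the Q2RL entry is reported as approximate (≈ 400), so the ratios are approximate too.

```python
# Episodes needed to reach 80 % success, copied from the results table.
# The Q2RL value is reported as approximate (≈ 400).
episodes_to_80 = {
    "BC Baseline": 1500,
    "Q2RL": 400,
    "Prior Offline-to-Online": 900,
}


def speedup_over(method, reference="Q2RL", table=episodes_to_80):
    """Return how many times fewer episodes `reference` needs than `method`."""
    return table[method] / table[reference]


for name in ("BC Baseline", "Prior Offline-to-Online"):
    print(f"Q2RL vs {name}: {speedup_over(name):.2f}x fewer episodes")
```

The ratios come out to 3.75 against the BC baseline and 2.25 against prior offline‑to‑online methods, which matches the stated 2–4× range.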

Practical Implications

  • Rapid Skill Refinement: Companies can deploy a robot with a quick demonstration‑based setup and let it self‑improve on‑site, cutting down the time from weeks of manual tuning to a few hours of autonomous learning.
  • Reduced Data Collection Costs: Since Q‑Estimation needs only about 10–20 rollouts of the BC policy, the amount of expensive tele‑operation or human‑in‑the‑loop data is dramatically lowered.
  • Safe Exploration in Production: Q‑Gating acts as a safety net, making it feasible to run online RL on expensive hardware (e.g., assembly lines) without risking damage.
  • Plug‑and‑Play Integration: The method works with any off‑the‑shelf BC model and standard off‑policy RL algorithms, so existing pipelines (ROS, PyTorch, TensorFlow) can adopt it with minimal code changes.
  • Potential Extensions: The same idea can be applied to other domains—autonomous driving, drone navigation, or even software agents—where a strong imitation baseline exists but continual improvement is desired.

Limitations & Future Work

  • Approximate Q‑function Quality: The initial Q‑estimator relies on limited BC rollouts; if the BC policy is poor or the environment highly stochastic, the Q‑values may be misleading.
  • Scalability to High‑Dimensional Observation Spaces: Experiments used state‑based inputs (joint positions, object poses). Extending to raw visual inputs may require more sophisticated representation learning.
  • Long‑Term Stability: While gating mitigates early failures, the paper notes occasional “policy drift” after many hours of training, suggesting a need for periodic re‑evaluation of the BC component.
  • Future Directions: The authors propose (1) adaptive gating thresholds, (2) multi‑policy ensembles (e.g., combining several BC experts), and (3) meta‑learning the Q‑estimation step to further reduce the required interaction budget.

Authors

  • Lakshita Dodeja
  • Ondrej Biza
  • Shivam Vats
  • Stephen Hart
  • Stefanie Tellex
  • Robin Walters
  • Karl Schmeckpeper
  • Thomas Weng

Paper Information

  • arXiv ID: 2605.05172v1
  • Categories: cs.RO, cs.AI
  • Published: May 6, 2026