[Paper] When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

Published: May 6, 2026 at 01:40 PM EDT

Source: arXiv - 2605.05172v1

Overview

The paper introduces Q2RL, a novel framework that turns a behavior‑cloned (BC) robot policy into a reinforcement‑learning (RL) agent by extracting a Q‑function from the BC policy and then gating between the BC and RL actions during online learning. This approach bridges the gap between fast, demonstration‑driven learning and the self‑improving capabilities of RL, enabling real robots to refine their skills in just a few hours of interaction.

Key Contributions

Q‑Estimation from BC: A lightweight procedure that derives an approximate Q‑function for a BC policy using only a handful of environment rollouts.
Q‑Gating Mechanism: An online selector that chooses the action (BC or RL) with the higher estimated Q‑value, ensuring safe exploration while still gathering useful data for RL.
Offline‑to‑Online Pipeline: A unified algorithm that starts from a static BC policy and continuously improves it without the catastrophic forgetting typical of naïve offline‑to‑online methods.
Empirical Validation: State‑of‑the‑art performance on D4RL and RoboMimic manipulation suites, plus successful real‑robot experiments (pipe assembly, kitting) with up to 100 % success after 1–2 h of online interaction.
Open‑source Release: Code, pretrained models, and demonstration videos are publicly available, facilitating reproducibility and rapid adoption.

Methodology

1. Start with a Behavior‑Cloned Policy
   - The BC policy is trained offline on a dataset of human demonstrations (e.g., tele‑operated robot trajectories).
2. Q‑Estimation (Extracting a Q‑function)
   - Collect a small set of short rollouts (≈ 10–20 episodes) using the BC policy.
   - Fit a value network \(Q_{\phi}(s,a)\) by minimizing the Bellman residual on these samples, treating the BC actions as “expert” actions.
   - Because the BC already performs well, the resulting Q‑function is a good proxy for the true return landscape around the demonstrated trajectories.
3. Q‑Gating (Online Action Selection)
   - During each interaction step, compute \(Q_{\phi}(s,a_{\text{BC}})\) and \(Q_{\theta}(s,a_{\text{RL}})\), where \(a_{\text{RL}}\) is the action proposed by the current RL policy (e.g., SAC).
   - Execute the action with the higher Q‑value. If the BC action wins, the RL policy still receives the transition for learning; if the RL action wins, the robot explores a potentially better behavior.
4. RL Policy Update
   - Standard off‑policy RL (Soft Actor‑Critic) is used to improve the RL policy with the mixed data stream.
   - The Q‑estimator is periodically refreshed with new data to keep its predictions aligned with the evolving environment dynamics.
5. Iterate
   - The gating loop continues until the RL policy consistently outperforms the BC baseline, at which point the system can optionally drop the BC entirely.
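The Q‑Estimation and Q‑Gating steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: a tabular, SARSA‑style policy evaluation stands in for the neural Bellman‑residual fit, and the names `fit_q_tabular`, `q_gate`, the 3‑state chain, and the RL critic values are all hypothetical.

```python
import numpy as np

def fit_q_tabular(transitions, n_states, n_actions, gamma=0.9, lr=0.5, sweeps=200):
    """Evaluate the BC policy by repeatedly shrinking the Bellman residual.

    transitions: (s, a, r, s_next, a_next, done) tuples from BC rollouts,
    where a_next is the action the BC policy took in s_next.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, a_next, done in transitions:
            target = r if done else r + gamma * q[s_next, a_next]
            q[s, a] += lr * (target - q[s, a])  # step toward the Bellman target
    return q

def q_gate(q_bc_value, q_rl_value, a_bc, a_rl):
    """Return (action, source) for the proposal with the higher Q-value."""
    if q_rl_value > q_bc_value:
        return a_rl, "rl"
    return a_bc, "bc"  # ties fall back to the BC action

# Toy 3-state chain (assumed): BC always picks action 0, reward 1 at the end.
bc_rollout = [
    (0, 0, 0.0, 1, 0, False),
    (1, 0, 0.0, 2, 0, False),
    (2, 0, 1.0, 2, 0, True),
]
q_bc = fit_q_tabular(bc_rollout, n_states=3, n_actions=2)

# Hypothetical RL critic estimates for the RL action (action 1) in each state.
q_rl = np.array([0.5, 0.95, 0.2])
choices = [q_gate(q_bc[s, 0], q_rl[s], a_bc=0, a_rl=1) for s in range(3)]
print(choices)  # [(0, 'bc'), (1, 'rl'), (0, 'bc')]
```

In this toy run the fitted BC values are roughly 0.81, 0.9, and 1.0 along the chain, so the gate hands control to the RL action only in the middle state, where its estimate (0.95) exceeds the BC estimate. In a real system both values would come from learned critics and every executed transition would be added to the replay buffer.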

Results & Findings

Benchmark	Metric	BC Baseline	Q2RL	Prior Offline‑to‑Online (e.g., AWAC, IQL)
D4RL Pick‑Place	Success Rate	68 %	89 %	73 %
RoboMimic Door Opening	Success Rate	45 %	78 %	61 %
Real‑Robot Pipe Assembly	Success Rate (after 2 h)	25 %	100 %	62 %
Real‑Robot Kitting	Success Rate (after 1.5 h)	30 %	92 %	55 %
Sample Efficiency	Episodes to 80 % success	1500	≈ 400	900

Speed of convergence: Q2RL reaches high success rates 2–4× faster than competing methods.
Safety: The gating mechanism prevents the RL policy from taking catastrophically bad actions early on, which is crucial for real‑world hardware.
Robustness: Even on contact‑rich tasks with high precision requirements, the learned policies remain stable across multiple trials.
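The 2–4× convergence figure can be checked against the sample‑efficiency row of the table. The short calculation below uses only the episode counts reported there (with "≈ 400" taken as 400); the dictionary keys are labels chosen for illustration.

```python
# Episodes needed to reach 80 % success, as listed in the results table.
episodes_to_80 = {
    "BC baseline": 1500,
    "Prior offline-to-online": 900,
    "Q2RL": 400,  # reported as "≈ 400"
}

# Speedup of Q2RL relative to each other column.
speedups = {
    name: n / episodes_to_80["Q2RL"]
    for name, n in episodes_to_80.items()
    if name != "Q2RL"
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
# BC baseline: 3.75x
# Prior offline-to-online: 2.25x
```

Both ratios fall inside the stated 2–4× range.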

Practical Implications

Rapid Skill Refinement: Companies can deploy a robot with a quick demonstration‑based setup and let it self‑improve on‑site, cutting down the time from weeks of manual tuning to a few hours of autonomous learning.
Reduced Data Collection Costs: Since Q‑Estimation needs only about 10–20 rollouts of the existing BC policy, little additional tele‑operation or human‑in‑the‑loop data is required beyond the original demonstrations.
Safe Exploration in Production: Q‑Gating acts as a safety net, making it more feasible to run online RL on expensive hardware (e.g., assembly lines) with a lower risk of damage.
Plug‑and‑Play Integration: The method works with any off‑the‑shelf BC model and standard off‑policy RL algorithms, so existing pipelines (ROS, PyTorch, TensorFlow) can adopt it with minimal code changes.
Potential Extensions: The same idea can be applied to other domains—autonomous driving, drone navigation, or even software agents—where a strong imitation baseline exists but continual improvement is desired.

Limitations & Future Work

Approximate Q‑function Quality: The initial Q‑estimator relies on limited BC rollouts; if the BC policy is poor or the environment highly stochastic, the Q‑values may be misleading.
Scalability to High‑Dimensional Observation Spaces: Experiments used state‑based inputs (joint positions, object poses). Extending to raw visual inputs may require more sophisticated representation learning.
Long‑Term Stability: While gating mitigates early failures, the paper notes occasional “policy drift” after many hours of training, suggesting a need for periodic re‑evaluation of the BC component.
Future Directions: The authors propose (1) adaptive gating thresholds, (2) multi‑policy ensembles (e.g., combining several BC experts), and (3) meta‑learning the Q‑estimation step to further reduce the required interaction budget.
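To see why a small rollout budget can make value estimates unreliable, it helps to quantify the noise directly. The sketch below is an illustration only: it uses plain Monte Carlo discounted returns rather than the Bellman‑residual fit described in the paper, and the helper names, the sparse‑reward episodes, and the 50 % success rate are assumptions.

```python
import math

def discounted_return(rewards, gamma=0.9):
    """Discounted sum of one episode's rewards, computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mean_and_stderr(values):
    """Sample mean and standard error of the mean (n - 1 in the variance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

# Assumed data: 10 three-step rollouts with a sparse terminal reward,
# half of which succeed.
episodes = [[0.0, 0.0, 1.0]] * 5 + [[0.0, 0.0, 0.0]] * 5
returns = [discounted_return(ep) for ep in episodes]

mean, stderr = mean_and_stderr(returns)
print(f"estimated start-state value: {mean:.3f} +/- {stderr:.3f}")
```

With these assumed numbers the estimate is about 0.405 with a standard error of about 0.135, roughly a third of the value itself. A gate comparing two Q‑values that differ by less than this margin could easily pick the wrong action, which is consistent with the concern raised about poor BC policies and stochastic environments.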

Authors

Lakshita Dodeja
Ondrej Biza
Shivam Vats
Stephen Hart
Stefanie Tellex
Robin Walters
Karl Schmeckpeper
Thomas Weng

Paper Information

arXiv ID: 2605.05172v1
Categories: cs.RO, cs.AI
Published: May 6, 2026
