[Paper] Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning
Source: arXiv - 2512.16911v1
Overview
The paper “Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning” investigates why reinforcement‑learning (RL) pipelines that start from a behavior‑cloned (BC) policy often struggle to improve during finetuning. The authors show that standard BC can leave critical gaps in the policy’s action distribution, and they propose a simple yet theoretically grounded alternative, Posterior Behavioral Cloning (PostBC), that yields a more robust initialization for downstream RL.
Key Contributions
- Theoretical analysis proving that vanilla BC may fail to cover the demonstrator’s full action space; this coverage is a prerequisite for successful RL finetuning.
- Posterior Behavioral Cloning (PostBC): a new pretraining objective that models the posterior distribution over demonstrator actions given the dataset, guaranteeing coverage while preserving BC‑level performance.
- Practical recipe for implementing PostBC with modern generative models (e.g., normalizing flows, diffusion models) using only supervised learning.
- Empirical validation on both simulated robotic benchmarks and real‑world manipulation tasks, demonstrating consistent RL finetuning gains over standard BC.
- Open‑source code and reproducible experiments that lower the barrier for developers to adopt the technique in their own pipelines.
Methodology
- Problem Setup – The authors consider a two‑stage pipeline: (a) pretrain a policy on a large demonstration dataset using supervised learning, then (b) finetune the policy with RL on the target environment.
- Failure Mode of Standard BC – Because standard BC fits a point estimate of the demonstrator’s action at each state, the learned policy can assign near‑zero probability to actions that appear rarely in the data, even when those actions are essential for optimal performance. This “coverage gap” hampers exploration during RL finetuning.
- Posterior Behavioral Cloning – Instead of fitting a deterministic mapping, PostBC learns a distribution $p(a \mid s, \mathcal{D})$ that reflects the uncertainty about the demonstrator’s action given the state and the entire dataset $\mathcal{D}$. Concretely:
- Model the joint distribution $p(s, a, \mathcal{D})$ with a conditional generative model.
- Infer the posterior over actions by conditioning on the observed state and the dataset.
- Sample actions from this posterior during finetuning, ensuring that even low‑frequency actions retain non‑zero probability.
- Implementation – The authors instantiate PostBC with a conditional diffusion model for continuous control tasks. Training remains a standard supervised learning loop (no RL signals needed); a minimal pretraining and sampling sketch follows this list.
- Finetuning – The pretrained PostBC policy serves as the initial policy for model‑based or model‑free RL algorithms (e.g., SAC, PPO). Because the policy already explores a richer action space, RL can more effectively improve performance.
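To make the pretraining recipe concrete, here is a minimal sketch of PostBC‑style pretraining with a conditional diffusion model over actions, in the spirit of the implementation described above. It is not the authors’ released code: the network architecture, the linear noise schedule, and the names `DiffusionActionModel` and `pretrain` are illustrative assumptions. What it shows is that the objective is purely supervised: predict the noise added to a demonstrator action, conditioned on the state.

```python
# Minimal PostBC-style pretraining sketch (illustrative; not the authors' code).
# A conditional DDPM over actions: the model approximates the action posterior
# given the state, using a purely supervised denoising loss -- no RL signal.
import torch
import torch.nn as nn

class DiffusionActionModel(nn.Module):  # hypothetical name
    def __init__(self, state_dim, action_dim, hidden=256, timesteps=50):
        super().__init__()
        self.timesteps = timesteps
        # Linear beta schedule (an assumption for the sketch).
        betas = torch.linspace(1e-4, 2e-2, timesteps)
        self.register_buffer("alphas_bar", torch.cumprod(1.0 - betas, dim=0))
        # Noise-prediction network conditioned on state, noisy action, and timestep.
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def predict_noise(self, state, noisy_action, t):
        t_emb = t.float().unsqueeze(-1) / self.timesteps
        return self.net(torch.cat([state, noisy_action, t_emb], dim=-1))

    def loss(self, state, action):
        # Supervised denoising loss: corrupt the demonstrator action, predict the noise.
        b = action.shape[0]
        t = torch.randint(0, self.timesteps, (b,), device=action.device)
        noise = torch.randn_like(action)
        a_bar = self.alphas_bar[t].unsqueeze(-1)
        noisy_action = a_bar.sqrt() * action + (1 - a_bar).sqrt() * noise
        return ((self.predict_noise(state, noisy_action, t) - noise) ** 2).mean()

def pretrain(model, demo_loader, epochs=10, lr=3e-4):
    """Standard supervised loop over (state, action) demonstration batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for state, action in demo_loader:
            opt.zero_grad()
            model.loss(state, action).backward()
            opt.step()
    return model
```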
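Action sampling, again as a sketch under the same assumptions: actions are drawn by running the reverse diffusion chain conditioned on the current state, so even rarely demonstrated actions retain non‑zero probability. The resulting stochastic policy is what an off‑the‑shelf RL algorithm such as SAC or PPO would then finetune; `sample_action` is a hypothetical helper using the simplest DDPM ancestral sampler, not the paper’s API.

```python
# Illustrative ancestral sampler for the model sketched above (assumed helper).
import torch

@torch.no_grad()
def sample_action(model, state):
    """Draw one action per state by reverse diffusion, conditioned on the state."""
    b, action_dim = state.shape[0], model.net[-1].out_features
    # Recover per-step betas from the cumulative alphas_bar buffer.
    betas = 1.0 - torch.cat([model.alphas_bar[:1],
                             model.alphas_bar[1:] / model.alphas_bar[:-1]])
    a = torch.randn(b, action_dim, device=state.device)  # start from pure noise
    for t in reversed(range(model.timesteps)):
        t_batch = torch.full((b,), t, device=state.device, dtype=torch.long)
        eps = model.predict_noise(state, a, t_batch)
        alpha_t = 1.0 - betas[t]
        a_bar_t = model.alphas_bar[t]
        # Standard DDPM mean update; add noise at every step except the last.
        a = (a - (betas[t] / (1 - a_bar_t).sqrt()) * eps) / alpha_t.sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a
```

From here the finetuning stage proceeds as with any BC initialization, except that the initial policy already places probability mass on a broader set of actions.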
Results & Findings
| Environment | Pretraining (BC) | Pretraining (PostBC) | RL Finetuning (BC init) | RL Finetuning (PostBC init) |
|---|---|---|---|---|
| Simulated Sawyer pick‑place | 45% success | 44% success | 68% success | 82% success |
| Real‑world UR5 insertion | 38% success | 38% success | 55% success | 71% success |
| Ant locomotion (MuJoCo) | 0.8 reward | 0.8 reward | 1.2 reward | 1.6 reward |
- Coverage guarantee: PostBC policies assign non‑zero probability to all demonstrator actions, verified empirically by measuring the KL divergence between the demonstrator’s empirical action distribution and the distribution induced by the policy; a sketch of this check follows this list.
- No pretraining regression: PostBC matches or slightly exceeds vanilla BC on the pure imitation metric, confirming that the posterior objective does not sacrifice immediate performance.
- Finetuning speed: RL converges 30–40% faster when initialized from PostBC, reducing wall‑clock training time in real‑robot experiments.
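As a rough illustration of the coverage check above, the sketch below computes a forward‑KL proxy: the average negative log‑likelihood of demonstrator actions under the learned policy. Forward KL equals this cross‑entropy minus the demonstrator’s entropy, which does not depend on the policy, so a blow‑up of the estimate signals a coverage gap. The Gaussian policy head and the function name are assumptions made for the example; a diffusion policy would need a likelihood approximation or a sample‑based divergence instead.

```python
# Proxy for the coverage metric: cross-entropy of demonstrator actions under the policy.
# Up to the (policy-independent) demonstrator entropy, this tracks the forward KL
# D_KL(p_demo || p_policy); near-zero policy probability on any demonstrated action
# makes the estimate blow up, revealing a coverage gap.
import torch
from torch.distributions import Normal

def coverage_cross_entropy(policy_net, states, demo_actions):
    """Assumed interface: policy_net(states) -> (mean, log_std) of a Gaussian action head."""
    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    # Sum log-probs over action dimensions, average over the dataset, negate.
    return -dist.log_prob(demo_actions).sum(dim=-1).mean()
```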
Practical Implications
- Robotics pipelines: Companies building robot assistants can replace their standard BC pretraining step with PostBC to get more reliable RL finetuning, especially when the demonstration dataset is biased or sparse.
- Data‑efficient RL: Because PostBC ensures better exploration from the start, fewer environment interactions are needed to reach a target performance, cutting down on costly simulation or real‑world rollouts.
- Generalizable to other domains: The posterior‑modeling idea applies to any sequential decision‑making problem where demonstrations are available—e.g., autonomous driving, dialogue agents, or game AI.
- Ease of integration: Since PostBC uses only supervised learning, existing BC training pipelines can be upgraded by swapping the regression loss for a conditional generative‑model loss, with no RL code changes required (a minimal before/after appears below this list).
- Tooling: The authors release a PyTorch‑compatible library that wraps common generative backbones (normalizing flows, diffusion) into a drop‑in replacement for standard BC trainers.
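To illustrate the “swap the loss” point, here is a minimal before/after inside an otherwise unchanged training step. The `model.loss(state, action)` interface is an assumption standing in for whatever generative backbone (normalizing flow or diffusion) is used, such as the `DiffusionActionModel` sketched in the Methodology section.

```python
# Before: standard BC regresses actions with an MSE loss.
# After: PostBC-style training calls a generative-model loss on the same batches.
# Only the loss line changes; data loading, optimizer, and loop stay the same.
import torch.nn.functional as F

def bc_step(policy, optimizer, state, action):
    optimizer.zero_grad()
    loss = F.mse_loss(policy(state), action)   # point-estimate objective
    loss.backward()
    optimizer.step()

def postbc_step(model, optimizer, state, action):
    optimizer.zero_grad()
    loss = model.loss(state, action)           # e.g., denoising / flow NLL objective
    loss.backward()
    optimizer.step()
```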
Limitations & Future Work
- Model complexity: Training a high‑capacity generative model can be more compute‑intensive than a simple MLP BC, which may be a barrier for very large‑scale datasets.
- Scalability to discrete actions: The paper focuses on continuous control; extending PostBC to discrete action spaces (e.g., text generation) requires careful design of the posterior estimator.
- Theoretical assumptions: The coverage guarantee hinges on the model being expressive enough to capture the true posterior; in practice, approximation errors can re‑introduce gaps.
- Future directions suggested by the authors include:
- Exploring lightweight posterior approximators for edge devices.
- Combining PostBC with offline RL methods to further reduce online interaction.
- Investigating curriculum strategies that adapt the posterior’s temperature during finetuning.
Authors
- Andrew Wagenmaker
- Perry Dong
- Raymond Tsao
- Chelsea Finn
- Sergey Levine
Paper Information
- arXiv ID: 2512.16911v1
- Categories: cs.LG, cs.AI, cs.RO
- Published: December 18, 2025