[Paper] Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning
Source: arXiv - 2512.16911v1
Overview
The paper “Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning” investigates why reinforcement‑learning (RL) pipelines that start from a behavior‑cloned (BC) policy often struggle to improve during finetuning. The authors show that standard BC can leave critical gaps in the policy’s action distribution, and they propose a simple yet theoretically grounded alternative, Posterior Behavioral Cloning (PostBC), that yields a more robust initialization for downstream RL.
Key Contributions
- Theoretical analysis proving that vanilla BC may fail to cover the demonstrator’s full action space; this coverage is a prerequisite for successful RL finetuning.
- Posterior Behavioral Cloning (PostBC): a new pretraining objective that models the posterior distribution over demonstrator actions given the dataset, guaranteeing coverage while preserving BC‑level performance.
- Practical recipe for implementing PostBC with modern generative models (e.g., normalizing flows, diffusion models) using only supervised learning.
- Empirical validation on both simulated robotic benchmarks and real‑world manipulation tasks, demonstrating consistent RL finetuning gains over standard BC.
- Open‑source code and reproducible experiments that lower the barrier for developers to adopt the technique in their own pipelines.
Methodology
- Problem Setup – The authors consider a two‑stage pipeline: (a) pretrain a policy on a large demonstration dataset using supervised learning, then (b) finetune the policy with RL on the target environment.
- Failure Mode of Standard BC – Because standard BC fits a point estimate of the demonstrator’s action at each state, the learned policy can assign near‑zero probability to actions that appear rarely in the data, even when those actions are essential for optimal performance. This “coverage gap” hampers exploration during RL finetuning.
- Posterior Behavioral Cloning – Instead of fitting a deterministic mapping, PostBC learns a distribution $p(a \mid s, \mathcal{D})$ that reflects the uncertainty about the demonstrator’s action given the state and the entire dataset $\mathcal{D}$. Concretely:
- Model the joint distribution $p(s, a, \mathcal{D})$ with a conditional generative model.
- Infer the posterior over actions by conditioning on the observed state and the dataset.
- Sample actions from this posterior during finetuning, ensuring that even low‑frequency actions retain non‑zero probability.
- Implementation – The authors instantiate PostBC with a conditional diffusion model for continuous control tasks. Training remains a standard supervised learning loop (no RL signals needed); a minimal pretraining and sampling sketch follows this list.
- Finetuning – The pretrained PostBC policy serves as the initial policy for model‑based or model‑free RL algorithms (e.g., SAC, PPO). Because the policy already explores a richer action space, RL can more effectively improve performance.
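To make the pretraining recipe concrete, here is a minimal sketch of PostBC‑style pretraining with a conditional diffusion model over actions, in the spirit of the implementation described above. It is not the authors’ released code: the network architecture, the linear noise schedule, and the names `DiffusionActionModel` and `pretrain` are illustrative assumptions. What it shows is that the objective is purely supervised: predict the noise added to a demonstrator action, conditioned on the state.

```python
# Minimal PostBC-style pretraining sketch (illustrative; not the authors' code).
# A conditional DDPM over actions: the model approximates the action posterior
# given the state, using a purely supervised denoising loss -- no RL signal.
import torch
import torch.nn as nn

class DiffusionActionModel(nn.Module):  # hypothetical name
    def __init__(self, state_dim, action_dim, hidden=256, timesteps=50):
        super().__init__()
        self.timesteps = timesteps
        # Linear beta schedule (an assumption for the sketch).
        betas = torch.linspace(1e-4, 2e-2, timesteps)
        self.register_buffer("alphas_bar", torch.cumprod(1.0 - betas, dim=0))
        # Noise-prediction network conditioned on state, noisy action, and timestep.
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def predict_noise(self, state, noisy_action, t):
        t_emb = t.float().unsqueeze(-1) / self.timesteps
        return self.net(torch.cat([state, noisy_action, t_emb], dim=-1))

    def loss(self, state, action):
        # Supervised denoising loss: corrupt the demonstrator action, predict the noise.
        b = action.shape[0]
        t = torch.randint(0, self.timesteps, (b,), device=action.device)
        noise = torch.randn_like(action)
        a_bar = self.alphas_bar[t].unsqueeze(-1)
        noisy_action = a_bar.sqrt() * action + (1 - a_bar).sqrt() * noise
        return ((self.predict_noise(state, noisy_action, t) - noise) ** 2).mean()

def pretrain(model, demo_loader, epochs=10, lr=3e-4):
    """Standard supervised loop over (state, action) demonstration batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for state, action in demo_loader:
            opt.zero_grad()
            model.loss(state, action).backward()
            opt.step()
    return model
```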
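Action sampling, again as a sketch under the same assumptions: actions are drawn by running the reverse diffusion chain conditioned on the current state, so even rarely demonstrated actions retain non‑zero probability. The resulting stochastic policy is what an off‑the‑shelf RL algorithm such as SAC or PPO would then finetune; `sample_action` is a hypothetical helper using the simplest DDPM ancestral sampler, not the paper’s API.

```python
# Illustrative ancestral sampler for the model sketched above (assumed helper).
import torch

@torch.no_grad()
def sample_action(model, state):
    """Draw one action per state by reverse diffusion, conditioned on the state."""
    b, action_dim = state.shape[0], model.net[-1].out_features
    # Recover per-step betas from the cumulative alphas_bar buffer.
    betas = 1.0 - torch.cat([model.alphas_bar[:1],
                             model.alphas_bar[1:] / model.alphas_bar[:-1]])
    a = torch.randn(b, action_dim, device=state.device)  # start from pure noise
    for t in reversed(range(model.timesteps)):
        t_batch = torch.full((b,), t, device=state.device, dtype=torch.long)
        eps = model.predict_noise(state, a, t_batch)
        alpha_t = 1.0 - betas[t]
        a_bar_t = model.alphas_bar[t]
        # Standard DDPM mean update; add noise at every step except the last.
        a = (a - (betas[t] / (1 - a_bar_t).sqrt()) * eps) / alpha_t.sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a
```

From here the finetuning stage proceeds as with any BC initialization, except that the initial policy already places probability mass on a broader set of actions.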
Results & Findings
| Environment | Pretraining (BC) | Pretraining (PostBC) | RL Finetuning (BC init) | RL Finetuning (PostBC init) |
|---|---|---|---|---|
| Simulated Sawyer pick‑place | 45% success | 44% success | 68% success | 82% success |
| Real‑world UR5 insertion | 38% success | 38% success | 55% success | 71% success |
| Ant locomotion (MuJoCo) | 0.8 reward | 0.8 reward | 1.2 reward | 1.6 reward |
- Coverage guarantee: PostBC policies assign non‑zero probability to all demonstrator actions, verified empirically by measuring the KL divergence between the demonstrator’s empirical action distribution and the distribution induced by the policy; a sketch of this check follows this list.
- No pretraining regression: PostBC matches or slightly exceeds vanilla BC on the pure imitation metric, confirming that the posterior objective does not sacrifice immediate performance.
- Finetuning speed: RL converges 30–40% faster when initialized from PostBC, reducing wall‑clock training time in real‑robot experiments.
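As a rough illustration of the coverage check above, the sketch below computes a forward‑KL proxy: the average negative log‑likelihood of demonstrator actions under the learned policy. Forward KL equals this cross‑entropy minus the demonstrator’s entropy, which does not depend on the policy, so a blow‑up of the estimate signals a coverage gap. The Gaussian policy head and the function name are assumptions made for the example; a diffusion policy would need a likelihood approximation or a sample‑based divergence instead.

```python
# Proxy for the coverage metric: cross-entropy of demonstrator actions under the policy.
# Up to the (policy-independent) demonstrator entropy, this tracks the forward KL
# D_KL(p_demo || p_policy); near-zero policy probability on any demonstrated action
# makes the estimate blow up, revealing a coverage gap.
import torch
from torch.distributions import Normal

def coverage_cross_entropy(policy_net, states, demo_actions):
    """Assumed interface: policy_net(states) -> (mean, log_std) of a Gaussian action head."""
    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    # Sum log-probs over action dimensions, average over the dataset, negate.
    return -dist.log_prob(demo_actions).sum(dim=-1).mean()
```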
Practical Implications
- Robotics pipelines: Companies building robot assistants can replace their standard BC pretraining step with PostBC to get more reliable RL finetuning, especially when the demonstration dataset is biased or sparse.
- Data‑efficient RL: Because PostBC ensures better exploration from the start, fewer environment interactions are needed to reach a target performance, cutting down on costly simulation or real‑world rollouts.
- Generalizable to other domains: The posterior‑modeling idea applies to any sequential decision‑making problem where demonstrations are available—e.g., autonomous driving, dialogue agents, or game AI.
- Ease of integration: Since PostBC uses only supervised learning, existing BC training pipelines can be upgraded by swapping the regression loss for a conditional generative‑model loss, with no RL code changes required (a minimal before/after appears below this list).
- Tooling: The authors release a PyTorch‑compatible library that wraps common generative backbones (normalizing flows, diffusion) into a drop‑in replacement for standard BC trainers.
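To illustrate the “swap the loss” point, here is a minimal before/after inside an otherwise unchanged training step. The `model.loss(state, action)` interface is an assumption standing in for whatever generative backbone (normalizing flow or diffusion) is used, such as the `DiffusionActionModel` sketched in the Methodology section.

```python
# Before: standard BC regresses actions with an MSE loss.
# After: PostBC-style training calls a generative-model loss on the same batches.
# Only the loss line changes; data loading, optimizer, and loop stay the same.
import torch.nn.functional as F

def bc_step(policy, optimizer, state, action):
    optimizer.zero_grad()
    loss = F.mse_loss(policy(state), action)   # point-estimate objective
    loss.backward()
    optimizer.step()

def postbc_step(model, optimizer, state, action):
    optimizer.zero_grad()
    loss = model.loss(state, action)           # e.g., denoising / flow NLL objective
    loss.backward()
    optimizer.step()
```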
Limitations & Future Work
- Model complexity: Training a high‑capacity generative model can be more compute‑intensive than a simple MLP BC, which may be a barrier for very large‑scale datasets.
- Scalability to discrete actions: The paper focuses on continuous control; extending PostBC to discrete action spaces (e.g., text generation) requires careful design of the posterior estimator.
- Theoretical assumptions: The coverage guarantee hinges on the model being expressive enough to capture the true posterior; in practice, approximation errors can re‑introduce gaps.
- Future directions suggested by the authors include:
- Exploring lightweight posterior approximators for edge devices.
- Combining PostBC with offline RL methods to further reduce online interaction.
- Investigating curriculum strategies that adapt the posterior’s temperature during finetuning.
Authors
- Andrew Wagenmaker
- Perry Dong
- Raymond Tsao
- Chelsea Finn
- Sergey Levine
Paper Information
- arXiv ID: 2512.16911v1
- Categories: cs.LG, cs.AI, cs.RO
- Published: December 18, 2025