[Paper] BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames
Source: arXiv - 2602.15010v1
Overview
The paper introduces Big Picture Policies (BPP), a new way to teach robots to remember the right moments from their past observations without getting confused by irrelevant details. By letting a vision‑language model pick out a handful of “keyframes” that actually matter for a task, BPP lets imitation‑learning policies reason over long histories while staying robust when deployed in the real world.
Key Contributions
- Identifies the root cause of spurious history‑dependence: limited coverage of possible observation sequences during training leads policies to over‑fit to accidental cues.
- Proposes a compact “keyframe” representation: a vision‑language model extracts a minimal set of task‑relevant frames from any rollout, dramatically shrinking the history space.
- Integrates keyframes into imitation learning: the policy conditions on these selected frames instead of the full raw history, preserving expressivity while improving generalization.
- Extensive empirical validation: experiments on 4 real‑world manipulation tasks (e.g., object search, multi‑step assembly) and 3 simulated benchmarks show success‑rate gains of up to 53 percentage points over strong baselines.
- Open‑source implementation and dataset (released alongside the paper) to enable reproducibility and further research.
Methodology
- Collect Demonstrations – Standard tele‑operated or scripted trajectories are recorded as sequences of RGB‑D images, robot states, and language instructions.
- Keyframe Detection – A pretrained vision‑language model (e.g., CLIP‑based) scores each frame for relevance to the task description. The top‑K frames (typically 3–5) are kept as the episode's "big picture".
- History Projection – Instead of feeding the entire raw observation stream to the policy, the selected keyframes are encoded (image + language embeddings) and concatenated with the current observation.
- Imitation Learning – A standard behavior‑cloning loss is applied on this compact representation. No extra regularization is needed because the keyframe set already mitigates distribution shift.
- Deployment – At runtime the robot continuously re‑evaluates the relevance scores and updates its keyframe buffer, ensuring that the policy always conditions on the most informative past moments.
The approach is deliberately simple: it swaps a huge, noisy history for a tiny, semantically meaningful one, letting existing imitation‑learning pipelines work unchanged.
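The keyframe-selection and history-projection steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes per-frame image embeddings and a text embedding have already been computed by a CLIP-style encoder, and the function names (`select_keyframes`, `policy_input`) are hypothetical.

```python
import numpy as np

def select_keyframes(frame_embs: np.ndarray, text_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Score each frame by cosine similarity to the task description; keep the top-K.

    frame_embs: (T, D) per-frame image embeddings (assumed precomputed, e.g. by CLIP).
    text_emb:   (D,) embedding of the language instruction.
    Returns the indices of the K most relevant frames, in temporal order.
    """
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    scores = frames @ text                # (T,) relevance of each frame to the task
    top_k = np.argsort(scores)[-k:]      # indices of the K highest-scoring frames
    return np.sort(top_k)                 # restore temporal order for the policy

def policy_input(frame_embs: np.ndarray, text_emb: np.ndarray,
                 current_obs_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """History projection: concatenate keyframe embeddings with the current observation."""
    idx = select_keyframes(frame_embs, text_emb, k)
    return np.concatenate([frame_embs[idx].ravel(), current_obs_emb])

# Toy usage with random stand-in embeddings: 10 history frames, 8-dim features.
rng = np.random.default_rng(0)
history = rng.normal(size=(10, 8))
task = rng.normal(size=8)
obs = rng.normal(size=8)
x = policy_input(history, task, obs, k=3)
print(x.shape)  # (3 keyframes * 8 dims + 8-dim current obs,) = (32,)
```

At deployment, the same scoring would be re-run as new frames arrive, so the keyframe buffer tracks the most informative past moments rather than a fixed window.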
Results & Findings
| Environment | Baseline (BC) | BPP | Absolute Gain |
|---|---|---|---|
| Real‑world "Find‑and‑Pick" (kitchen) | 32 % | 85 % | +53 pp |
| Real‑world "Drawer‑Open‑Then‑Place" | 41 % | 78 % | +37 pp |
| Simulated "Multi‑Object Sorting" | 58 % | 71 % | +13 pp |
| Simulated "Long‑Horizon Assembly" | 44 % | 66 % | +22 pp |
Key observations:
- Robustness to out‑of‑distribution histories – When test trajectories contain a novel ordering of sub‑tasks, BPP's success rate drops far less than the baselines'.
- Ablation of keyframe count – Using 1–2 frames hurts performance (insufficient context), while >6 frames yields diminishing returns and re‑introduces spurious correlations.
- Comparison with regularization tricks (e.g., dropout, data augmentation) – Those methods improve marginally, but cannot match the distribution‑coverage benefit of keyframe projection.
Practical Implications
- Simpler data collection – Teams no longer need to engineer exhaustive history coverage; a modest set of demonstrations suffices because BPP abstracts away irrelevant variations.
- Lower compute and memory footprint – Policies only process a handful of frames per step, enabling deployment on edge devices or low‑power robot controllers.
- Improved reliability for search‑and‑retrieve tasks – Warehouse robots, home assistants, and inspection drones can remember “where they already looked” without over‑fitting to visual quirks of the training environment.
- Plug‑and‑play with existing pipelines – Since BPP sits on top of any imitation‑learning backbone (Transformer, CNN‑RNN, etc.), teams can adopt it without rewriting their training code.
- Potential for human‑in‑the‑loop debugging – The extracted keyframes are human‑readable, making it easier to diagnose why a policy failed (e.g., it missed a crucial intermediate state).
Limitations & Future Work
- Dependence on a strong vision‑language model – If the pretrained model mis‑scores relevance (e.g., in highly cluttered scenes), keyframe selection can miss critical events.
- Fixed K value – The current implementation uses a static number of keyframes; adaptive strategies could better handle tasks with highly variable lengths.
- Limited to visual‑language cues – Tasks that rely heavily on proprioceptive or force feedback may need additional modalities for keyframe extraction.
- Scalability to extremely long horizons – While BPP reduces history size, extremely long‑horizon tasks (e.g., multi‑room navigation) may still require hierarchical keyframe selection, which the authors leave for future exploration.
Overall, BPP offers a pragmatic bridge between the need for long‑term memory in robot policies and the practical constraints of data collection and model robustness, making it a compelling tool for developers building the next generation of autonomous manipulators.
Authors
- Max Sobol Mark
- Jacky Liang
- Maria Attarian
- Chuyuan Fu
- Debidatta Dwibedi
- Dhruv Shah
- Aviral Kumar
Paper Information
- arXiv ID: 2602.15010v1
- Categories: cs.RO, cs.LG
- Published: February 16, 2026