[Paper] BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames

Published: February 16, 2026 at 01:49 PM EST
5 min read
Source: arXiv - 2602.15010v1

Overview

The paper introduces Big Picture Policies (BPP), a new way to teach robots to remember the right moments from their past observations without getting confused by irrelevant details. By letting a vision‑language model pick out a handful of “keyframes” that actually matter for a task, BPP lets imitation‑learning policies reason over long histories while staying robust when deployed in the real world.

Key Contributions

  • Identifies the root cause of spurious history‑dependence: limited coverage of possible observation sequences during training leads policies to over‑fit to accidental cues.
  • Proposes a compact “keyframe” representation: a vision‑language model extracts a minimal set of task‑relevant frames from any rollout, dramatically shrinking the history space.
  • Integrates keyframes into imitation learning: the policy conditions on these selected frames instead of the full raw history, preserving expressivity while improving generalization.
  • Extensive empirical validation: experiments on 4 real‑world manipulation tasks (e.g., object search, multi‑step assembly) and 3 simulated benchmarks show up to 70 % higher success rates versus strong baselines.
  • Open‑source implementation and dataset (released alongside the paper) to enable reproducibility and further research.

Methodology

  1. Collect Demonstrations – Standard tele‑operated or scripted trajectories are recorded as sequences of RGB‑D images, robot states, and language instructions.
  2. Keyframe Detection – A pretrained vision‑language model (e.g., CLIP‑based) scores each frame for relevance to the task description. The top‑K frames (typically 3–5) are kept as the big picture of the episode.
  3. History Projection – Instead of feeding the entire raw observation stream to the policy, the selected keyframes are encoded (image + language embeddings) and concatenated with the current observation.
  4. Imitation Learning – A standard behavior‑cloning loss is applied on this compact representation. No extra regularization is needed because the keyframe set already mitigates distribution shift.
  5. Deployment – At runtime the robot continuously re‑evaluates the relevance scores and updates its keyframe buffer, ensuring that the policy always conditions on the most informative past moments.
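Steps 2–3 can be sketched compactly. The snippet below is a minimal illustration, not the paper's implementation: it assumes frame and instruction embeddings have already been computed (e.g., by a CLIP-style encoder) and simply scores frames by cosine similarity, keeping the top-K in temporal order. The function name and array shapes are illustrative.

```python
import numpy as np

def select_keyframes(frame_embeddings, text_embedding, k=4):
    """Score each frame against the task description and keep the
    top-k most relevant frames as the episode's 'big picture'.

    frame_embeddings: (num_frames, dim) array of image embeddings
    text_embedding:   (dim,) embedding of the language instruction
    Returns (indices in temporal order, per-frame relevance scores).
    """
    frames = np.asarray(frame_embeddings, dtype=float)
    text = np.asarray(text_embedding, dtype=float)
    # Cosine similarity between every frame and the instruction.
    scores = frames @ text / (
        np.linalg.norm(frames, axis=1) * np.linalg.norm(text) + 1e-8
    )
    # Indices of the k highest-scoring frames, restored to temporal order
    # so the policy still sees a chronologically ordered history.
    top = np.sort(np.argsort(scores)[-k:])
    return top.tolist(), scores
```

The selected frames (plus the language embedding) then replace the raw observation stream as the policy's history input.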

The approach is deliberately simple: it swaps a huge, noisy history for a tiny, semantically meaningful one, letting existing imitation‑learning pipelines work unchanged.
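The runtime behavior in step 5 amounts to maintaining a small buffer of the highest-scoring frames seen so far. A minimal sketch under that assumption (the class and its interface are hypothetical, not from the paper):

```python
import heapq

class KeyframeBuffer:
    """Running buffer that keeps the k most task-relevant frames seen so far."""

    def __init__(self, k=4):
        self.k = k
        self._heap = []  # min-heap of (score, step, frame); lowest score evicted first

    def update(self, frame, score, step):
        """Offer a new frame; it displaces the least relevant buffered frame
        only if its relevance score is higher."""
        item = (score, step, frame)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)

    def frames(self):
        # Return buffered frames in temporal order for the policy input.
        return [f for _, _, f in sorted(self._heap, key=lambda x: x[1])]
```

Because the buffer holds at most k frames, the policy's per-step input stays constant-size no matter how long the episode runs.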

Results & Findings

| Environment | Baseline (BC) | BPP | Gain (pp) |
| --- | --- | --- | --- |
| Real‑world “Find‑and‑Pick” (kitchen) | 32 % | 85 % | +53 |
| Real‑world “Drawer‑Open‑Then‑Place” | 41 % | 78 % | +37 |
| Simulated “Multi‑Object Sorting” | 58 % | 71 % | +13 |
| Simulated “Long‑Horizon Assembly” | 44 % | 66 % | +22 |

Key observations:

  • Robustness to out‑of‑distribution histories – When test trajectories contain a novel ordering of sub‑tasks, BPP’s success rate drops far less than the baselines’.
  • Ablation of keyframe count – Using 1–2 frames hurts performance (insufficient context), while >6 frames yields diminishing returns and re‑introduces spurious correlations.
  • Comparison with regularization tricks (e.g., dropout, data augmentation) – These methods yield only marginal improvements and cannot match the distribution‑coverage benefit of keyframe projection.

Practical Implications

  • Simpler data collection – Engineers no longer need to curate exhaustive history coverage; a modest set of demonstrations suffices because BPP abstracts away irrelevant variations.
  • Lower compute and memory footprint – Policies only process a handful of frames per step, enabling deployment on edge devices or low‑power robot controllers.
  • Improved reliability for search‑and‑retrieve tasks – Warehouse robots, home assistants, and inspection drones can remember “where they already looked” without over‑fitting to visual quirks of the training environment.
  • Plug‑and‑play with existing pipelines – Since BPP sits on top of any imitation‑learning backbone (Transformer, CNN‑RNN, etc.), teams can adopt it without rewriting their training code.
  • Potential for hybrid human‑in‑the‑loop debugging – The extracted keyframes are human‑readable, making it easier to diagnose why a policy failed (e.g., missed a crucial intermediate state).
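Because BPP only changes what the policy is conditioned on, dropping it into an existing behavior-cloning loop can be as small as swapping the feature builder. The sketch below is illustrative only: a linear policy stands in for any backbone, and all names and shapes are assumptions rather than the paper's API.

```python
import numpy as np

def build_input(keyframe_embs, obs_emb):
    # BPP-style input: flattened keyframe embeddings + current observation.
    return np.concatenate([np.ravel(keyframe_embs), np.ravel(obs_emb)])

def bc_step(W, keyframe_embs, obs_emb, expert_action, lr=0.1):
    """One behavior-cloning update on the compact BPP representation.

    W is a linear policy (action_dim x input_dim); any differentiable
    backbone could take its place. Returns (updated W, MSE loss).
    """
    x = build_input(keyframe_embs, obs_emb)
    pred = W @ x
    err = pred - expert_action
    # Gradient of the MSE loss w.r.t. W is 2 * err * x^T.
    W_new = W - lr * 2.0 * np.outer(err, x)
    return W_new, float(np.mean(err ** 2))
```

The training loop itself is unchanged; only `build_input` differs from a standard full-history pipeline.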

Limitations & Future Work

  • Dependence on a strong vision‑language model – If the pretrained model mis‑scores relevance (e.g., in highly cluttered scenes), keyframe selection can miss critical events.
  • Fixed K value – The current implementation uses a static number of keyframes; adaptive strategies could better handle tasks with highly variable lengths.
  • Limited to visual‑language cues – Tasks that rely heavily on proprioceptive or force feedback may need additional modalities for keyframe extraction.
  • Scalability to extremely long horizons – While BPP reduces history size, extremely long‑horizon tasks (e.g., multi‑room navigation) may still require hierarchical keyframe selection, which the authors leave for future exploration.

Overall, BPP offers a pragmatic bridge between the need for long‑term memory in robot policies and the practical constraints of data collection and model robustness, making it a compelling tool for developers building the next generation of autonomous manipulators.

Authors

  • Max Sobol Mark
  • Jacky Liang
  • Maria Attarian
  • Chuyuan Fu
  • Debidatta Dwibedi
  • Dhruv Shah
  • Aviral Kumar

Paper Information

  • arXiv ID: 2602.15010v1
  • Categories: cs.RO, cs.LG
  • Published: February 16, 2026