[Paper] Training-Time Action Conditioning for Efficient Real-Time Chunking
Source: arXiv - 2512.05964v1
Overview
The paper introduces training‑time action conditioning as a lightweight alternative to the commonly used inference‑time inpainting for real‑time chunking (RTC) in vision‑language‑action (VLA) robots. By simulating the inference delay during training and conditioning the model on already‑executed action prefixes, the authors eliminate the extra compute that inpainting normally adds, while preserving the smooth, reactive behavior needed for on‑the‑fly robot control.
Key Contributions
- Training‑time RTC formulation: Shows that conditioning on action prefixes during training can replace inference‑time inpainting without architectural changes.
- Zero‑overhead inference: The method adds no extra runtime cost, making it ideal for latency‑sensitive applications.
- Empirical validation: Demonstrates superior performance under high inference delays in simulation and parity with state‑of‑the‑art RTC on real‑world tasks (box building, espresso making).
- Minimal implementation effort: Requires only a few extra lines of training code, positioning it as a drop‑in replacement for existing pipelines.
Methodology
- Simulating Delay at Training:
  - During each training step, the training procedure simulates a fixed inference latency (e.g., 0.6 s) as if it had already elapsed.
  - The model receives as input the prefix of actions that would have been executed during that interval.
- Action Prefix Conditioning:
  - The VLA model predicts the next chunk of actions conditioned on both the visual-language context and the already-executed prefix.
  - No special inpainting module is needed; the conditioning is handled by the same transformer-style encoder-decoder used for standard VLA training.
- Training Loop Adjustments:
  - A small wrapper samples a random delay length and slices the ground-truth action sequence accordingly (see the sketch after this list).
  - The loss is computed on the predicted chunk versus the true future actions, exactly as in standard supervised learning.
- Inference:
  - At runtime, the robot simply feeds the most recently executed actions (the prefix) into the model and receives the next chunk.
  - Because the model was already trained to expect this prefix, no extra computation is required.
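The full recipe fits in a few lines of training code. Below is a minimal, PyTorch-style sketch of one training step, assuming a policy that accepts an `action_prefix` argument and using a plain regression loss as a stand-in for whatever objective the actual VLA uses; all names, shapes, and the `max_delay_steps` parameter are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of training-time action conditioning (illustrative only;
# the paper's actual model, loss, and data format may differ).
import torch
import torch.nn.functional as F

def training_step(policy, batch, max_delay_steps, optimizer):
    """One supervised step with a simulated inference delay.

    batch["obs"]     : vision-language context for the policy
    batch["actions"] : ground-truth action chunk, shape (B, H, action_dim)
    """
    actions = batch["actions"]                        # (B, H, action_dim)

    # Sample how many actions would already have executed while the new
    # chunk was being computed (the simulated inference delay).
    d = int(torch.randint(0, max_delay_steps + 1, (1,)))

    prefix = actions[:, :d]                           # treated as already executed
    target = actions[:, d:]                           # what the policy must predict

    # The policy conditions on the context *and* the executed prefix;
    # no inpainting module is involved at any point.
    pred = policy(batch["obs"], action_prefix=prefix)

    loss = F.mse_loss(pred, target)                   # standard supervised loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At runtime the same interface is used directly, e.g. `next_chunk = policy(obs, action_prefix=actions_executed_since_request)`; because training already covered every prefix length up to `max_delay_steps`, no extra inference-time machinery is needed.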
Results & Findings
| Setting | Metric | Inference‑time RTC | Training‑time RTC |
|---|---|---|---|
| Simulated delay = 0.2 s | Success rate (box building) | 92 % | 93 % |
| Simulated delay = 0.6 s | Success rate (box building) | 78 % | 84 % |
| Real‑world espresso task (π₀.₆ VLA) | Task completion time | 5.1 s | 5.0 s |
| Real‑world espresso task | CPU usage (per inference) | 12 % | 5 % |
- Higher robustness to latency: Training-time RTC outperforms the baseline as the inference delay grows, confirming that the model learns to account for the actions executed while a new chunk is being computed.
- No speed penalty: In real‑robot experiments, the wall‑clock time to generate each chunk remains unchanged, but CPU load drops dramatically because the inpainting step is gone.
- Task performance parity: Success rates and qualitative smoothness of robot trajectories are essentially identical to the state‑of‑the‑art inference‑time approach.
Practical Implications
- Deployments on edge devices: Robots with limited compute (e.g., mobile manipulators, warehouse bots) can now run RTC without sacrificing latency budgets.
- Simplified pipelines: Engineers can remove the inpainting sub‑module, reducing code complexity and potential bugs.
- Scalable multi‑robot fleets: Lower per‑robot CPU demand translates into cost savings when scaling to dozens or hundreds of units.
- Easier integration with existing VLA frameworks: Since the method only touches the training script, teams can adopt it in existing PyTorch/TensorFlow codebases with minimal refactoring, as sketched below.
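To make the "training script only" point concrete, here is a hypothetical drop-in wrapper around an existing action-chunk dataset; class and field names (`DelayConditionedDataset`, `action_prefix`, `target_actions`) are assumptions for illustration, not part of the paper or any framework.

```python
# Hypothetical dataset wrapper: each sample gains a simulated-delay prefix
# and the remaining target chunk; model and optimizer code stay unchanged.
import random
from torch.utils.data import Dataset

class DelayConditionedDataset(Dataset):
    def __init__(self, base_dataset, max_delay_steps):
        self.base = base_dataset                     # yields {"obs", "actions"}
        self.max_delay_steps = max_delay_steps

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        sample = dict(self.base[idx])
        actions = sample["actions"]                  # (H, action_dim)
        d = random.randint(0, self.max_delay_steps)  # simulated inference delay
        sample["action_prefix"] = actions[:d]        # already-executed actions
        sample["target_actions"] = actions[d:]       # chunk the model must predict
        return sample
```

One practical detail: because the prefix length varies per sample, batching requires either padding in the collate function or sharing one sampled delay across the batch.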
Limitations & Future Work
- Fixed delay assumption: The current formulation assumes a constant simulated delay during training. Real systems may experience variable latencies; extending the method to stochastic delay distributions is an open question.
- Generalization to non‑chunked policies: The study focuses on chunk‑based controllers; applying the same principle to continuous‑time policies (e.g., diffusion‑based planners) remains unexplored.
- Long‑horizon dependencies: While prefix conditioning helps with short‑term latency, very long horizons might still benefit from explicit inpainting or hierarchical planning.
Overall, training‑time action conditioning offers a pragmatic, low‑overhead path to real‑time robot control, making it an attractive option for developers looking to push VLA models into production environments.
Authors
- Kevin Black
- Allen Z. Ren
- Michael Equi
- Sergey Levine
Paper Information
- arXiv ID: 2512.05964v1
- Categories: cs.RO, cs.AI
- Published: December 5, 2025
- PDF: https://arxiv.org/pdf/2512.05964v1