[Paper] Training-Time Action Conditioning for Efficient Real-Time Chunking
Source: arXiv - 2512.05964v1
Overview
The paper introduces training‑time action conditioning as a lightweight alternative to the commonly used inference‑time inpainting for real‑time chunking (RTC) in vision‑language‑action (VLA) robots. By simulating the inference delay during training and conditioning the model on already‑executed action prefixes, the authors eliminate the extra compute that inpainting normally adds, while preserving the smooth, reactive behavior needed for on‑the‑fly robot control.
Key Contributions
- Training‑time RTC formulation: Shows that conditioning on action prefixes during training can replace inference‑time inpainting without architectural changes.
- Zero‑overhead inference: The method adds no extra runtime cost, making it ideal for latency‑sensitive applications.
- Empirical validation: Demonstrates superior performance under high inference delays in simulation and parity with state‑of‑the‑art RTC on real‑world tasks (box building, espresso making).
- Minimal implementation effort: Requires only a few extra lines of training code, positioning it as a drop‑in replacement for existing pipelines.
Methodology
- Simulating Delay at Training:
  - During each training step, the training procedure simulates a fixed inference latency (e.g., 0.6 s) as if it had already elapsed.
  - The model receives as input the prefix of actions that would have been executed during that interval.
- Action Prefix Conditioning:
  - The VLA model predicts the next chunk of actions conditioned on both the visual-language context and the already-executed prefix.
  - No special inpainting module is needed; the conditioning is handled by the same transformer-style encoder-decoder used for standard VLA training.
- Training Loop Adjustments:
  - A small wrapper samples a random delay length and slices the ground-truth action sequence accordingly (see the sketch after this list).
  - The loss is computed on the predicted chunk versus the true future actions, exactly as in standard supervised learning.
- Inference:
  - At runtime, the robot simply feeds the most recently executed actions (the prefix) into the model and receives the next chunk.
  - Because the model was already trained to expect this prefix, no extra computation is required.
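The full recipe fits in a few lines of training code. Below is a minimal, PyTorch-style sketch of one training step, assuming a policy that accepts an `action_prefix` argument and using a plain regression loss as a stand-in for whatever objective the actual VLA uses; all names, shapes, and the `max_delay_steps` parameter are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of training-time action conditioning (illustrative only;
# the paper's actual model, loss, and data format may differ).
import torch
import torch.nn.functional as F

def training_step(policy, batch, max_delay_steps, optimizer):
    """One supervised step with a simulated inference delay.

    batch["obs"]     : vision-language context for the policy
    batch["actions"] : ground-truth action chunk, shape (B, H, action_dim)
    """
    actions = batch["actions"]                        # (B, H, action_dim)

    # Sample how many actions would already have executed while the new
    # chunk was being computed (the simulated inference delay).
    d = int(torch.randint(0, max_delay_steps + 1, (1,)))

    prefix = actions[:, :d]                           # treated as already executed
    target = actions[:, d:]                           # what the policy must predict

    # The policy conditions on the context *and* the executed prefix;
    # no inpainting module is involved at any point.
    pred = policy(batch["obs"], action_prefix=prefix)

    loss = F.mse_loss(pred, target)                   # standard supervised loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At runtime the same interface is used directly, e.g. `next_chunk = policy(obs, action_prefix=actions_executed_since_request)`; because training already covered every prefix length up to `max_delay_steps`, no extra inference-time machinery is needed.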
Results & Findings
| Setting | Metric | Inference‑time RTC | Training‑time RTC |
|---|---|---|---|
| Simulated delay = 0.2 s | Success rate (box building) | 92 % | 93 % |
| Simulated delay = 0.6 s | Success rate (box building) | 78 % | 84 % |
| Real‑world espresso task (π₀.₆ VLA) | Task completion time | 5.1 s | 5.0 s |
| Real‑world espresso task | CPU usage (per inference) | 12 % | 5 % |
- Higher robustness to latency: Training-time RTC outperforms the baseline as the inference delay grows, confirming that the model learns to account for the actions executed while a new chunk is being computed.
- No speed penalty: In real‑robot experiments, the wall‑clock time to generate each chunk remains unchanged, but CPU load drops dramatically because the inpainting step is gone.
- Task performance parity: Success rates and qualitative smoothness of robot trajectories are essentially identical to the state‑of‑the‑art inference‑time approach.
Practical Implications
- Deployments on edge devices: Robots with limited compute (e.g., mobile manipulators, warehouse bots) can now run RTC without sacrificing latency budgets.
- Simplified pipelines: Engineers can remove the inpainting sub‑module, reducing code complexity and potential bugs.
- Scalable multi‑robot fleets: Lower per‑robot CPU demand translates into cost savings when scaling to dozens or hundreds of units.
- Easier integration with existing VLA frameworks: Since the method only touches the training script, teams can adopt it in existing PyTorch/TensorFlow codebases with minimal refactoring, as sketched below.
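To make the "training script only" point concrete, here is a hypothetical drop-in wrapper around an existing action-chunk dataset; class and field names (`DelayConditionedDataset`, `action_prefix`, `target_actions`) are assumptions for illustration, not part of the paper or any framework.

```python
# Hypothetical dataset wrapper: each sample gains a simulated-delay prefix
# and the remaining target chunk; model and optimizer code stay unchanged.
import random
from torch.utils.data import Dataset

class DelayConditionedDataset(Dataset):
    def __init__(self, base_dataset, max_delay_steps):
        self.base = base_dataset                     # yields {"obs", "actions"}
        self.max_delay_steps = max_delay_steps

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        sample = dict(self.base[idx])
        actions = sample["actions"]                  # (H, action_dim)
        d = random.randint(0, self.max_delay_steps)  # simulated inference delay
        sample["action_prefix"] = actions[:d]        # already-executed actions
        sample["target_actions"] = actions[d:]       # chunk the model must predict
        return sample
```

One practical detail: because the prefix length varies per sample, batching requires either padding in the collate function or sharing one sampled delay across the batch.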
Limitations & Future Work
- Fixed delay assumption: The current formulation assumes a constant simulated delay during training. Real systems may experience variable latencies; extending the method to stochastic delay distributions is an open question.
- Generalization to non‑chunked policies: The study focuses on chunk‑based controllers; applying the same principle to continuous‑time policies (e.g., diffusion‑based planners) remains unexplored.
- Long‑horizon dependencies: While prefix conditioning helps with short‑term latency, very long horizons might still benefit from explicit inpainting or hierarchical planning.
Overall, training‑time action conditioning offers a pragmatic, low‑overhead path to real‑time robot control, making it an attractive option for developers looking to push VLA models into production environments.
Authors
- Kevin Black
- Allen Z. Ren
- Michael Equi
- Sergey Levine
Paper Information
- arXiv ID: 2512.05964v1
- Categories: cs.RO, cs.AI
- Published: December 5, 2025
- PDF: https://arxiv.org/pdf/2512.05964v1