[Paper] Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Source: arXiv - 2601.09708v1
Overview
The paper Fast‑ThinkAct tackles a core bottleneck in Vision‑Language‑Action (VLA) systems: they can reason well about a scene but often do so with long, verbose chain‑of‑thought (CoT) explanations that make inference painfully slow. By teaching a model to generate compact, “verbalizable” latent plans and then execute them directly, the authors achieve near‑state‑of‑the‑art performance while slashing latency by up to 89 %. This makes VLA agents far more practical for real‑time robotics and interactive AI applications.
Key Contributions
- Latent CoT Distillation: Introduces a teacher‑student framework that compresses full textual reasoning traces into short latent representations without losing planning quality.
- Preference‑Guided Objective: Aligns latent plans with the actual manipulation trajectories the robot will follow, ensuring the distilled reasoning stays grounded in physical feasibility.
- Reasoning‑Enhanced Policy Learning: Couples the compact latent planner with a downstream policy network, allowing the agent to translate “thoughts” into actions in a single forward pass.
- Broad Empirical Validation: Demonstrates the approach on multiple embodied manipulation benchmarks (e.g., ALFRED, RoboTHOR) and reasoning tasks, showing comparable or superior success rates to full‑CoT baselines.
- Latency Reduction: Achieves up to 89.3 % lower inference time compared with the best existing VLA reasoning pipelines, while preserving long‑horizon planning, few‑shot adaptation, and failure recovery capabilities.
Methodology
- Teacher Model (Full CoT Generator):
  - A large multimodal transformer (e.g., GPT‑4‑style) is first trained to produce detailed textual reasoning chains that map a visual‑language prompt to a sequence of manipulation actions.
- Student Model (Latent Planner):
  - A smaller transformer learns to predict a latent vector that implicitly encodes the same plan.
  - The student is trained via knowledge distillation: the latent vector is forced to reconstruct the teacher’s CoT (using a lightweight decoder) while also being directly supervised by the ground‑truth action trajectory.
- Preference‑Guided Loss:
  - The loss combines two terms: (a) language alignment (how well the latent plan can be verbalized back into the teacher’s CoT) and (b) trajectory alignment (how closely the resulting robot motion matches the expert demonstration). A minimal sketch of this dual objective appears after this list.
  - This dual objective ensures the latent plan is both explainable (it can be turned back into words) and executable (it respects physics and task constraints).
- Policy Integration:
  - The latent planner’s output is fed into a standard reinforcement‑learning‑style policy network that maps the latent plan and current observations to low‑level motor commands.
  - Because the latent plan is a fixed‑size vector, the whole pipeline runs in a single forward pass, eliminating the multi‑step decoding overhead of full CoT generation (see the inference sketch after this list).
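Below is a minimal PyTorch-style sketch of the distillation-with-preference objective described above. The module names (`LatentPlanner`, `CoTDecoder`, `ActionHead`), the dimensions, and the equal weighting of the two loss terms are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentPlanner(nn.Module):
    """Student: compresses fused vision-language features into a fixed-size latent plan."""
    def __init__(self, feat_dim=512, plan_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, plan_dim))

    def forward(self, fused_feat):
        return self.net(fused_feat)


class CoTDecoder(nn.Module):
    """Lightweight decoder that verbalizes a latent plan back into teacher CoT tokens."""
    def __init__(self, plan_dim=128, vocab_size=32000, max_len=32):
        super().__init__()
        self.max_len, self.vocab_size = max_len, vocab_size
        self.proj = nn.Linear(plan_dim, max_len * vocab_size)

    def forward(self, plan):
        return self.proj(plan).view(-1, self.max_len, self.vocab_size)


class ActionHead(nn.Module):
    """Maps the latent plan plus the current observation to a short action trajectory."""
    def __init__(self, plan_dim=128, obs_dim=64, horizon=8, act_dim=7):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(nn.Linear(plan_dim + obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, horizon * act_dim))

    def forward(self, plan, obs):
        return self.net(torch.cat([plan, obs], dim=-1)).view(-1, self.horizon, self.act_dim)


def training_step(planner, decoder, head, batch, alpha=0.5):
    """One step of the dual objective: language alignment plus trajectory alignment."""
    plan = planner(batch["fused_feat"])                       # compact latent plan
    cot_logits = decoder(plan)                                # reconstruct the teacher's CoT
    pred_traj = head(plan, batch["obs"])                      # predict the manipulation trajectory
    lang_loss = F.cross_entropy(cot_logits.flatten(0, 1),     # (a) language alignment
                                batch["teacher_tokens"].flatten())
    traj_loss = F.mse_loss(pred_traj, batch["expert_traj"])   # (b) trajectory alignment
    return alpha * lang_loss + (1 - alpha) * traj_loss
```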
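At inference time the latency savings come from skipping autoregressive CoT generation entirely: the plan is one vector prediction. A sketch, reusing the hypothetical modules above:

```python
@torch.no_grad()
def act(planner, head, fused_feat, obs):
    """Single forward pass: fused vision-language features -> latent plan -> action trajectory.
    No token-by-token reasoning is decoded, which is where the reported speedup comes from."""
    plan = planner(fused_feat)
    return head(plan, obs)


# Example with dummy inputs (batch of 1):
# planner, head = LatentPlanner(), ActionHead()
# actions = act(planner, head, torch.randn(1, 512), torch.randn(1, 64))
```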
Results & Findings
| Benchmark | Success-Rate Gain vs. Full‑CoT Baseline | Latency Reduction vs. Full‑CoT |
|---|---|---|
| ALFRED (long‑horizon tasks) | +2.1 % | ≈ 85 % |
| RoboTHOR (few‑shot adaptation) | +1.8 % | ≈ 89 % |
| Custom failure‑recovery suite | +3.4 % | ≈ 88 % |
- Performance parity: Fast‑ThinkAct matches or slightly exceeds the success rates of the best explicit‑CoT models, confirming that compact latent reasoning does not sacrifice planning quality.
- Speed gains: Average inference time per episode drops from ~2.5 s (full CoT) to ~0.3 s, a reduction that makes real‑time deployment feasible on edge devices.
- Robustness: The latent planner retains the ability to recover from execution errors, thanks to the trajectory‑alignment loss that teaches the model to anticipate and correct deviations.
Practical Implications
- Real‑time Robotics: Service robots, warehouse pickers, and autonomous drones can now incorporate sophisticated visual‑language reasoning without the latency that previously forced them to rely on reactive, shallow policies.
- Edge Deployment: Because the student model is lightweight and the reasoning step is a single vector prediction, the entire system fits comfortably on modern GPU‑accelerated edge hardware (e.g., NVIDIA Jetson).
- Explainability on Demand: Developers can optionally invoke the decoder to “verbalize” the latent plan for debugging or user‑facing explanations, striking a balance between speed and interpretability (see the sketch after this list).
- Rapid Prototyping: The few‑shot adaptation capability means new tasks (e.g., a new kitchen appliance) can be taught with only a handful of demonstrations, accelerating product iteration cycles.
- Failure‑Safe Operations: The built‑in recovery reasoning reduces the need for external safety monitors, simplifying system integration in safety‑critical environments.
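As a concrete illustration of the explainability-on-demand point above, the verbalization decoder can be invoked only when a human-readable rationale is requested, so it adds no cost to the normal control loop. This reuses the hypothetical `LatentPlanner` and `CoTDecoder` from the Methodology sketch; the `tokenizer` argument is a placeholder for whatever detokenizer a deployment uses.

```python
@torch.no_grad()
def explain(planner, decoder, fused_feat, tokenizer=None):
    """Decode the latent plan back into text, only when debugging or explaining to a user."""
    plan = planner(fused_feat)
    token_ids = decoder(plan).argmax(dim=-1)            # greedy verbalization of the plan
    if tokenizer is None:
        return token_ids                                # raw token ids if no detokenizer is given
    return tokenizer.decode(token_ids[0].tolist())      # placeholder detokenizer call
```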
Limitations & Future Work
- Domain Transfer: The current experiments focus on indoor manipulation; extending to outdoor or highly dynamic scenes may require additional visual grounding mechanisms.
- Scalability of Teacher: Training the large teacher model still demands substantial compute; future work could explore self‑supervised or synthetic data to reduce this cost.
- Explainability Trade‑off: While the latent plan can be decoded, the fidelity of the verbalization degrades as the latent space becomes more compressed; improving this “explain‑on‑request” pathway is an open challenge.
- Multi‑Agent Scenarios: The framework assumes a single embodied agent; adapting the latent planning paradigm to coordinated multi‑robot tasks is a promising direction.
Fast‑ThinkAct demonstrates that efficient latent reasoning can bring the best of both worlds—deep, language‑guided planning and real‑time action execution—into the hands of developers building the next generation of embodied AI systems.
Authors
- Chi-Pin Huang
- Yunze Man
- Zhiding Yu
- Min-Hung Chen
- Jan Kautz
- Yu-Chiang Frank Wang
- Fu-En Yang
Paper Information
- arXiv ID: 2601.09708v1
- Categories: cs.CV, cs.AI, cs.LG, cs.RO
- Published: January 14, 2026