[Paper] When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

Published: June 7, 2026 at 05:49 AM EDT

Source: arXiv - 2606.08542v1

Overview

Exploratory manipulation often turns an apparently failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain.

We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination.

We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.
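To make the drawer example concrete, here is a minimal Python sketch of what a programmatic classifier written from a one-line reading heuristic might look like. Everything in it is an illustrative assumption and not the paper's implementation: the TraceStep fields, the force and displacement thresholds, the heuristic wording ("high pull force with almost no drawer displacement means the drawer is locked"), and the exact-match definition of chain accuracy are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TraceStep:
    """One proprioception sample (hypothetical fields, not from the paper)."""
    pull_force_n: float   # measured pull force, in newtons
    drawer_disp_m: float  # drawer displacement, in meters


def predict_chain(trace: Sequence[TraceStep],
                  force_thresh_n: float = 10.0,
                  disp_thresh_m: float = 0.01) -> List[str]:
    """Apply an assumed one-line heuristic to a trace.

    Assumed heuristic: if any sample shows a strong pull while the drawer
    barely moves, the probe revealed a locked drawer, so the chain must
    start with opening the lock. The thresholds are made-up values.
    """
    blocked = any(
        step.pull_force_n >= force_thresh_n and step.drawer_disp_m < disp_thresh_m
        for step in trace
    )
    return ["lock-open", "drawer-pull"] if blocked else ["drawer-pull"]


def chain_accuracy(predicted: Sequence[List[str]],
                   expected: Sequence[List[str]]) -> float:
    """Fraction of traces whose predicted chain equals the expected chain.

    Exact-match scoring is an assumption; the paper may define it differently.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Two toy traces: a blocked pull (locked drawer) and a free pull.
locked_trace = [TraceStep(12.0, 0.001), TraceStep(15.0, 0.002)]
free_trace = [TraceStep(3.0, 0.05), TraceStep(4.0, 0.20)]

predictions = [predict_chain(locked_trace), predict_chain(free_trace)]
labels = [["lock-open", "drawer-pull"], ["drawer-pull"]]
print(predictions, chain_accuracy(predictions, labels))
```

The sketch reads only proprioception, which keeps it short; in the setting described above, the frozen VLM instead receives the raw trace together with the DRH as a prompt entry.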

Key Contributions

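According to the abstract, the paper makes the following contributions:

A formalization of Exploratory Manipulation Trace QA (EMT-QA): predicting the minimal-success action chain from synchronized video and proprioception.
Evidence that state-of-the-art VLMs and embodied multimodal LLMs do not reliably recover the chain from raw video, raw proprioception, or their combination.
Closed-Loop Trace Distillation, in which a per-task coding agent distills a one-line Distilled Reading Heuristic (DRH) from labeled training traces.
Chain-accuracy gains of +0.38 to +0.47 over the best raw-modality baseline across three simulator and two real-robot tasks.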

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

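Because the DRH is supplied as a prompt entry to a frozen VLM, it requires no agent and no weight updates at inference. The abstract also reports that the same DRH can serve as the sole specification for one-shot programmatic classifiers that match the prompted VLM.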

Authors

Haizhou Ge
Yufei Jia
Yue Li
Zhixing Chen
Lu Shi
Lei Han
Guyue Zhou
Ruqi Huang

Paper Information

arXiv ID: 2606.08542v1
Categories: cs.RO, cs.AI, cs.CV
Published: June 7, 2026
