[Paper] Point What You Mean: Visually Grounded Instruction Policy

Published: December 21, 2025 at 07:44 PM EST
3 min read

Source: arXiv - 2512.18933v1

Overview

The paper presents Point‑VLA, a plug‑and‑play policy that enriches language instructions for Vision‑Language‑Action (VLA) agents with explicit visual cues such as bounding‑box “points”. By giving the model a pixel‑level hint about which object to act on, the system dramatically reduces referential ambiguity—especially in cluttered or out‑of‑distribution (OOD) environments—while keeping the underlying VLA architecture unchanged.

Key Contributions

  • Visually grounded instruction policy: Introduces a lightweight “point‑and‑tell” interface that couples natural‑language commands with bounding‑box coordinates (a minimal data sketch follows this list).
  • Automatic annotation pipeline: Builds a scalable dataset of paired language‑plus‑point instructions with minimal human labeling, leveraging pretrained object detectors and language models.
  • Plug‑and‑play design: Point‑VLA can be dropped into any existing text‑only VLA model (e.g., CLIP‑based policies) without retraining the visual encoder.
  • Robust empirical gains: Shows consistent performance improvements on real‑world referring tasks, particularly under heavy visual clutter and on unseen object categories.
  • Generalization analysis: Demonstrates that pixel‑level grounding helps the policy extrapolate to novel scenes and objects better than pure text prompts.
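
As a rough illustration of the point‑and‑tell interface, the sketch below pairs a language command with the pixel coordinates of a target box. The class and field names are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GroundedInstruction:
    """A language command paired with an explicit visual cue (hypothetical container)."""
    text: str                                     # natural-language command
    box_xyxy: Tuple[float, float, float, float]   # target bounding box in pixel coordinates

# Example: the user draws (or clicks to create) a box around the target mug.
cmd = GroundedInstruction(
    text="pick up the red mug",
    box_xyxy=(212.0, 148.0, 276.0, 230.0),
)
```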

Methodology

  1. Base VLA model – The authors start from a standard Vision‑Language‑Action architecture that consumes an RGB frame and a textual instruction, then outputs low‑level control (e.g., robot arm velocities).
  2. Point augmentation – During inference, the user (or an upstream perception module) supplies a bounding box around the target object. The box coordinates are encoded as a small 2‑D positional embedding and concatenated with the language token embeddings (a minimal embedding sketch follows this list).
  3. Training data generation
    • A pretrained object detector scans large video‑instruction datasets and proposes candidate boxes.
    • A language model rewrites the original instruction to reference the detected object (e.g., “pick up the red mug” → “pick up the red mug inside box #3”).
    • Only a tiny verification step by a human annotator is needed to filter out obvious errors, keeping the pipeline cheap (a pipeline sketch also follows this list).
  4. Fine‑tuning – The augmented instruction (text + point) is fed to the VLA policy, which is fine‑tuned on the newly created dataset. Because the visual encoder is frozen, training is fast and memory‑efficient.
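
A minimal sketch of step 2, assuming the box is encoded by a small MLP into one extra “point token” that is prepended to the language token embeddings; the embedding size, normalization, and module names are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PointEmbedder(nn.Module):
    """Maps a bounding box (x1, y1, x2, y2) to a single token embedding (illustrative sketch)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, box_xyxy: torch.Tensor, img_wh: torch.Tensor) -> torch.Tensor:
        # Normalize pixel coordinates to [0, 1] so the cue is resolution-agnostic.
        norm_box = box_xyxy / img_wh.repeat(1, 2)   # (B, 4) divided elementwise by [w, h, w, h]
        return self.mlp(norm_box).unsqueeze(1)      # (B, 1, embed_dim)

def augment_instruction(text_tokens: torch.Tensor, point_token: torch.Tensor) -> torch.Tensor:
    """Prepend the point token to the language token embeddings before the policy backbone."""
    # text_tokens: (B, T, D); point_token: (B, 1, D) -> (B, T + 1, D)
    return torch.cat([point_token, text_tokens], dim=1)
```

Consistent with step 4, only the policy head and an embedder like this one would be updated during fine‑tuning, while the visual encoder stays frozen.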
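
Step 3 could be prototyped roughly as below: a detector proposes boxes, a language model rewrites the command to reference the chosen box, and low‑confidence samples are flagged for the human verification pass. All helper functions here are hypothetical stand‑ins, not components released with the paper.

```python
from typing import Dict, List

# --- Hypothetical stand-ins; a real pipeline would call a pretrained detector and an LLM. ---
def run_detector(frame) -> List[Dict]:
    """Stub detector returning candidate boxes with ids, coordinates, and confidence scores."""
    return [{"id": 3, "xyxy": (212.0, 148.0, 276.0, 230.0), "score": 0.91}]

def rewrite_instruction(text: str, box_id: int) -> str:
    """Stub language-model rewrite that grounds the command in a specific box."""
    return f"{text} inside box #{box_id}"

def build_grounded_dataset(frames: List, instructions: List[str]) -> List[Dict]:
    """Pair each instruction with a detected box and a grounded rewrite of the command."""
    dataset = []
    for frame, text in zip(frames, instructions):
        boxes = run_detector(frame)
        if not boxes:
            continue                                     # skip frames with no detections
        target = max(boxes, key=lambda b: b["score"])    # pick the most confident candidate
        dataset.append({
            "frame": frame,
            "text": rewrite_instruction(text, target["id"]),
            "box": target["xyxy"],
            "flag_for_human": target["score"] < 0.5,     # route uncertain samples to a human check
        })
    return dataset
```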

Results & Findings

Scenario                                           Text-only VLA   Point-VLA (Ours)   Gain (pp)
Clean tabletop (in-distribution)                   78% success     86% success        +8
Cluttered kitchen (OOD objects)                    45% success     68% success        +23
Novel object categories (never seen in training)   31% success     55% success        +24
  • Success metric: task‑completion rate (e.g., “pick up the target”, “push the correct block”).
  • Ablation: Removing the point embedding drops performance back to the text‑only baseline, confirming that the visual cue is the driver of improvement.
  • Generalization: Point‑VLA maintains >60 % success on scenes with completely new layouts, whereas the baseline collapses below 40 %.

Practical Implications

  • Robotics UI: Developers can build simple “click‑to‑act” interfaces for tele‑operation or assistive robots: the user clicks on the target in a camera feed, and the robot executes the command on that object (a click‑to‑box sketch follows this list).
  • Data‑efficient scaling: The automatic annotation pipeline means you can generate thousands of grounded instructions from existing video logs without costly manual labeling, accelerating product development cycles.
  • Improved safety: By explicitly pointing to the intended object, the system reduces accidental interactions with nearby items—a critical factor for household or warehouse robots.
  • Cross‑modal debugging: The bounding‑box overlay provides an interpretable hook for developers to see exactly what the policy is attending to, simplifying troubleshooting of failure cases.
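
For the click‑to‑act idea in the first bullet above, one simple approach is to expand a single click into a fixed‑size box around the clicked pixel and hand it to the grounded policy together with the typed or spoken command. The box size and clamping convention are assumptions, not part of the paper.

```python
from typing import Tuple

def click_to_box(x: float, y: float, img_w: int, img_h: int,
                 half_size: float = 40.0) -> Tuple[float, float, float, float]:
    """Expand a user click into a clamped, fixed-size bounding box (assumed UI convention)."""
    x1 = max(0.0, x - half_size)
    y1 = max(0.0, y - half_size)
    x2 = min(float(img_w), x + half_size)
    y2 = min(float(img_h), y + half_size)
    return (x1, y1, x2, y2)

# Example: a click at (244, 189) in a 640x480 camera frame becomes a box that can be
# paired with "pick up the red mug" and passed to the grounded policy.
box = click_to_box(244, 189, img_w=640, img_h=480)
```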

Limitations & Future Work

  • Dependence on detector quality: If the upstream object detector mis‑localizes or fails to detect an object, the policy inherits that error.
  • Bounding‑box granularity: Very small or heavily occluded items still pose challenges; richer masks or keypoint cues could help.
  • Human‑in‑the‑loop requirement: While the annotation pipeline is cheap, fully autonomous generation of high‑quality points in the wild remains an open problem.
  • Future directions suggested by the authors include exploring multimodal points (e.g., depth or segmentation masks), extending to multi‑object instructions, and integrating learned attention mechanisms that can infer points from ambiguous language when a detector is unavailable.

Authors

  • Hang Yu
  • Juntu Zhao
  • Yufeng Liu
  • Kaiyu Li
  • Cheng Ma
  • Di Zhang
  • Yingdong Hu
  • Guang Chen
  • Junyuan Xie
  • Junliang Guo
  • Junqiao Zhao
  • Yang Gao

Paper Information

  • arXiv ID: 2512.18933v1
  • Categories: cs.CV, cs.RO
  • Published: December 22, 2025