[Paper] Point What You Mean: Visually Grounded Instruction Policy

Published: December 21, 2025 at 07:44 PM EST
3 min read

Source: arXiv - 2512.18933v1

Overview

The paper presents Point‑VLA, a plug‑and‑play policy that enriches language instructions for Vision‑Language‑Action (VLA) agents with explicit visual cues such as bounding‑box “points”. By giving the model a pixel‑level hint about which object to act on, the system dramatically reduces referential ambiguity—especially in cluttered or out‑of‑distribution (OOD) environments—while keeping the underlying VLA architecture unchanged.

Key Contributions

  • Visually grounded instruction policy: Introduces a lightweight “point‑and‑tell” interface that couples natural‑language commands with bounding‑box coordinates (a minimal data sketch follows this list).
  • Automatic annotation pipeline: Builds a scalable dataset of paired language‑plus‑point instructions with minimal human labeling, leveraging pretrained object detectors and language models.
  • Plug‑and‑play design: Point‑VLA can be dropped into any existing text‑only VLA model (e.g., CLIP‑based policies) without retraining the visual encoder.
  • Robust empirical gains: Shows consistent performance improvements on real‑world referring tasks, particularly under heavy visual clutter and on unseen object categories.
  • Generalization analysis: Demonstrates that pixel‑level grounding helps the policy extrapolate to novel scenes and objects better than pure text prompts.
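
As a rough illustration of the point‑and‑tell interface, the sketch below pairs a language command with the pixel coordinates of a target box. The class and field names are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GroundedInstruction:
    """A language command paired with an explicit visual cue (hypothetical container)."""
    text: str                                     # natural-language command
    box_xyxy: Tuple[float, float, float, float]   # target bounding box in pixel coordinates

# Example: the user draws (or clicks to create) a box around the target mug.
cmd = GroundedInstruction(
    text="pick up the red mug",
    box_xyxy=(212.0, 148.0, 276.0, 230.0),
)
```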

Methodology

  1. Base VLA model – The authors start from a standard Vision‑Language‑Action architecture that consumes an RGB frame and a textual instruction, then outputs low‑level control (e.g., robot arm velocities).
  2. Point augmentation – During inference, the user (or an upstream perception module) supplies a bounding box around the target object. The box coordinates are encoded as a small 2‑D positional embedding and concatenated with the language token embeddings (a minimal embedding sketch follows this list).
  3. Training data generation
    • A pretrained object detector scans large video‑instruction datasets and proposes candidate boxes.
    • A language model rewrites the original instruction to reference the detected object (e.g., “pick up the red mug” → “pick up the red mug inside box #3”).
    • Only a tiny verification step by a human annotator is needed to filter out obvious errors, keeping the pipeline cheap (a pipeline sketch also follows this list).
  4. Fine‑tuning – The augmented instruction (text + point) is fed to the VLA policy, which is fine‑tuned on the newly created dataset. Because the visual encoder is frozen, training is fast and memory‑efficient.
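
A minimal sketch of step 2, assuming the box is encoded by a small MLP into one extra “point token” that is prepended to the language token embeddings; the embedding size, normalization, and module names are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PointEmbedder(nn.Module):
    """Maps a bounding box (x1, y1, x2, y2) to a single token embedding (illustrative sketch)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, box_xyxy: torch.Tensor, img_wh: torch.Tensor) -> torch.Tensor:
        # Normalize pixel coordinates to [0, 1] so the cue is resolution-agnostic.
        norm_box = box_xyxy / img_wh.repeat(1, 2)   # (B, 4) divided elementwise by [w, h, w, h]
        return self.mlp(norm_box).unsqueeze(1)      # (B, 1, embed_dim)

def augment_instruction(text_tokens: torch.Tensor, point_token: torch.Tensor) -> torch.Tensor:
    """Prepend the point token to the language token embeddings before the policy backbone."""
    # text_tokens: (B, T, D); point_token: (B, 1, D) -> (B, T + 1, D)
    return torch.cat([point_token, text_tokens], dim=1)
```

Consistent with step 4, only the policy head and an embedder like this one would be updated during fine‑tuning, while the visual encoder stays frozen.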
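
Step 3 could be prototyped roughly as below: a detector proposes boxes, a language model rewrites the command to reference the chosen box, and low‑confidence samples are flagged for the human verification pass. All helper functions here are hypothetical stand‑ins, not components released with the paper.

```python
from typing import Dict, List

# --- Hypothetical stand-ins; a real pipeline would call a pretrained detector and an LLM. ---
def run_detector(frame) -> List[Dict]:
    """Stub detector returning candidate boxes with ids, coordinates, and confidence scores."""
    return [{"id": 3, "xyxy": (212.0, 148.0, 276.0, 230.0), "score": 0.91}]

def rewrite_instruction(text: str, box_id: int) -> str:
    """Stub language-model rewrite that grounds the command in a specific box."""
    return f"{text} inside box #{box_id}"

def build_grounded_dataset(frames: List, instructions: List[str]) -> List[Dict]:
    """Pair each instruction with a detected box and a grounded rewrite of the command."""
    dataset = []
    for frame, text in zip(frames, instructions):
        boxes = run_detector(frame)
        if not boxes:
            continue                                     # skip frames with no detections
        target = max(boxes, key=lambda b: b["score"])    # pick the most confident candidate
        dataset.append({
            "frame": frame,
            "text": rewrite_instruction(text, target["id"]),
            "box": target["xyxy"],
            "flag_for_human": target["score"] < 0.5,     # route uncertain samples to a human check
        })
    return dataset
```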

Results & Findings

Scenario                                           Text-only VLA   Point-VLA (Ours)   Gain (pp)
Clean tabletop (in-distribution)                   78% success     86% success        +8
Cluttered kitchen (OOD objects)                    45% success     68% success        +23
Novel object categories (never seen in training)   31% success     55% success        +24
  • Success metric: task‑completion rate (e.g., “pick up the target”, “push the correct block”).
  • Ablation: Removing the point embedding drops performance back to the text‑only baseline, confirming that the visual cue is the driver of improvement.
  • Generalization: Point‑VLA maintains >60 % success on scenes with completely new layouts, whereas the baseline collapses below 40 %.

Practical Implications

  • Robotics UI: Developers can build simple “click‑to‑act” interfaces for tele‑operation or assistive robots: the user clicks on the target in a camera feed, and the robot executes the command on that object (a click‑to‑box sketch follows this list).
  • Data‑efficient scaling: The automatic annotation pipeline means you can generate thousands of grounded instructions from existing video logs without costly manual labeling, accelerating product development cycles.
  • Improved safety: By explicitly pointing to the intended object, the system reduces accidental interactions with nearby items—a critical factor for household or warehouse robots.
  • Cross‑modal debugging: The bounding‑box overlay provides an interpretable hook for developers to see exactly what the policy is attending to, simplifying troubleshooting of failure cases.
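
For the click‑to‑act idea in the first bullet above, one simple approach is to expand a single click into a fixed‑size box around the clicked pixel and hand it to the grounded policy together with the typed or spoken command. The box size and clamping convention are assumptions, not part of the paper.

```python
from typing import Tuple

def click_to_box(x: float, y: float, img_w: int, img_h: int,
                 half_size: float = 40.0) -> Tuple[float, float, float, float]:
    """Expand a user click into a clamped, fixed-size bounding box (assumed UI convention)."""
    x1 = max(0.0, x - half_size)
    y1 = max(0.0, y - half_size)
    x2 = min(float(img_w), x + half_size)
    y2 = min(float(img_h), y + half_size)
    return (x1, y1, x2, y2)

# Example: a click at (244, 189) in a 640x480 camera frame becomes a box that can be
# paired with "pick up the red mug" and passed to the grounded policy.
box = click_to_box(244, 189, img_w=640, img_h=480)
```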

Limitations & Future Work

  • Dependence on detector quality: If the upstream object detector mis‑localizes or fails to detect an object, the policy inherits that error.
  • Bounding‑box granularity: Very small or heavily occluded items still pose challenges; richer masks or keypoint cues could help.
  • Human‑in‑the‑loop requirement: While the annotation pipeline is cheap, fully autonomous generation of high‑quality points in the wild remains an open problem.
  • Future directions suggested by the authors include exploring multimodal points (e.g., depth or segmentation masks), extending to multi‑object instructions, and integrating learned attention mechanisms that can infer points from ambiguous language when a detector is unavailable.

Authors

  • Hang Yu
  • Juntu Zhao
  • Yufeng Liu
  • Kaiyu Li
  • Cheng Ma
  • Di Zhang
  • Yingdong Hu
  • Guang Chen
  • Junyuan Xie
  • Junliang Guo
  • Junqiao Zhao
  • Yang Gao

Paper Information

  • arXiv ID: 2512.18933v1
  • Categories: cs.CV, cs.RO
  • Published: December 22, 2025