[Paper] Open-Vocabulary 3D Instruction Ambiguity Detection

Published: January 9, 2026 at 01:17 PM EST
3 min read
Source: arXiv - 2601.05991v1

Overview

The paper introduces Open‑Vocabulary 3D Instruction Ambiguity Detection, a new task that asks a model to decide whether a natural‑language command can be interpreted in exactly one way inside a given 3D environment. By building the Ambi3D benchmark (~700 scenes, ~22k instructions) and a two‑stage detection system called AmbiVer, the authors expose a blind spot in current embodied‑AI pipelines: they assume instructions are unambiguous and jump straight to execution, which is risky in safety‑critical domains such as surgery, robotics, and autonomous navigation.

Key Contributions

  • Task definition – Formalizes “open‑vocabulary 3D instruction ambiguity detection,” shifting focus from execution to verification.
  • Ambi3D benchmark – Large‑scale dataset with diverse indoor/outdoor 3D scenes and human‑written instructions, each labeled as ambiguous or unambiguous.
  • Empirical gap analysis – Shows that state‑of‑the‑art 3D Large Language Models (LLMs) and vision‑language models (VLMs) perform poorly on ambiguity detection.
  • AmbiVer framework – A two‑stage pipeline that (1) gathers multi‑view visual evidence from the scene and (2) feeds this evidence to a VLM to decide ambiguity.
  • Open resources – Code, data, and evaluation scripts released publicly for reproducibility and community extension.

Methodology

  1. Scene & Instruction Pairing – Each 3D scene is rendered from several camera viewpoints. Human annotators write free‑form commands (e.g., “Pick up the red bottle”) and label whether the command uniquely identifies an object/action in that scene.
  2. Baseline Models – The authors test existing 3D‑LLMs (e.g., CLIP‑based models, Point‑BERT) that directly ingest the instruction and a single scene representation.
  3. AmbiVer Two‑Stage Design
    • Evidence Collection – A lightweight visual search module samples a set of candidate objects/views that could satisfy the instruction, producing a small gallery of image patches.
    • VLM Reasoning – A pretrained vision‑language model (e.g., BLIP‑2, Flamingo) receives the instruction plus the collected visual evidence and outputs a binary “ambiguous / unambiguous” decision, effectively grounding the language in concrete visual cues before judging clarity (a minimal sketch of this pipeline follows the list).
  4. Training & Evaluation – The VLM is fine‑tuned on the Ambi3D training split with a cross‑entropy loss; performance is measured by accuracy and precision/recall on the held‑out test set.
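
The paper describes AmbiVer at a high level rather than with pseudocode; the sketch below shows one way the two stages could be wired together. The `Evidence` record, `scene.search`, and the `vlm.generate` call are assumed placeholder interfaces, not the authors' released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    """One candidate object/view rendered as an image patch (hypothetical record)."""
    patch_path: str  # rendered image of a candidate object or viewpoint
    label: str       # open-vocabulary label proposed by the visual search module


def collect_evidence(scene, instruction: str, max_candidates: int = 8) -> List[Evidence]:
    """Stage 1 (assumed interface): sample candidate objects/views that could
    satisfy the instruction, producing a small gallery of image patches."""
    candidates = scene.search(instruction)  # lightweight visual search (placeholder API)
    return [Evidence(c.render(), c.label) for c in candidates[:max_candidates]]


def detect_ambiguity(vlm, scene, instruction: str) -> bool:
    """Stage 2: ground the instruction in the collected evidence, then ask a
    pretrained VLM for a binary ambiguous / unambiguous decision."""
    gallery = collect_evidence(scene, instruction)
    prompt = (
        f"Instruction: {instruction}\n"
        f"Candidates: {[e.label for e in gallery]}\n"
        "Does the instruction refer to exactly one candidate? Answer YES or NO."
    )
    answer = vlm.generate(prompt, images=[e.patch_path for e in gallery])  # placeholder VLM call
    return "NO" in answer.strip().upper()  # NO -> zero or multiple referents -> ambiguous
```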

Results & Findings

Model                            Accuracy (Ambi3D)
3D‑LLM baseline (single view)    ~58 %
VLM with single view             ~62 %
AmbiVer (two‑stage)              78 %
Human upper bound                ~92 %

  • Baselines struggle: Even the strongest 3D‑LLMs misclassify nearly half of ambiguous commands, confirming that current embodied agents lack a “self‑check” before acting.
  • Evidence matters: Providing multiple visual perspectives boosts VLM performance by ~10 % absolute, demonstrating that ambiguity often hinges on hidden objects or occlusions.
  • Error patterns: Most failures involve subtle spatial relations (“to the left of the chair”) or synonyms (“vial” vs. “bottle”), suggesting future work should improve relational reasoning and lexical grounding.
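
Since ambiguity detection is scored as binary classification, results like those above come down to standard metrics over per‑instruction predictions. A minimal sketch, assuming scikit‑learn and hypothetical label arrays (1 = ambiguous, 0 = unambiguous):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical held-out labels: 1 = ambiguous, 0 = unambiguous.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # detector decisions on the same instructions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # of commands flagged ambiguous, how many truly are
print("recall   :", recall_score(y_true, y_pred))      # of truly ambiguous commands, how many were caught
```

Precision and recall matter here because, in a verify‑then‑act setting, a missed ambiguous command (low recall) is typically costlier than an unnecessary clarification request.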

Practical Implications

  • Safety‑critical robotics: Before a robot executes a hand‑off command in a lab or operating room, an ambiguity detector can flag uncertain instructions, prompting clarification from a human operator.
  • Voice‑controlled assistants: Smart home devices could ask follow‑up questions (“Did you mean the blue mug on the top shelf?”) instead of acting on vague commands, reducing user frustration.
  • Autonomous navigation: Drones receiving high‑level goals (“Inspect the tower”) can verify that the target is uniquely identifiable in the current 3D map, avoiding wasted flights.
  • Human‑in‑the‑loop AI: Embodied agents equipped with AmbiVer can adopt a “verify‑then‑act” workflow, improving trustworthiness and compliance with regulatory standards for AI safety.
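
A minimal sketch of what such a “verify‑then‑act” loop could look like, with `detector.is_ambiguous`, `agent.execute`, `agent.ask_user`, and `agent.refuse` as hypothetical interfaces standing in for an AmbiVer‑style checker and the surrounding agent stack:

```python
def verify_then_act(agent, detector, scene, instruction: str, max_rounds: int = 3) -> None:
    """Gate execution on an ambiguity check instead of acting on the first parse.
    All callables here (detector, agent.*) are hypothetical placeholders."""
    for _ in range(max_rounds):
        if not detector.is_ambiguous(scene, instruction):
            agent.execute(scene, instruction)  # exactly one interpretation: safe to act
            return
        # Ambiguous: ask the human operator for a clarified instruction.
        instruction = agent.ask_user(
            f"The command '{instruction}' matches more than one possible target. "
            "Can you be more specific?"
        )
    agent.refuse("Could not resolve the instruction unambiguously; no action taken.")
```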

Limitations & Future Work

  • Scene diversity: Ambi3D focuses mainly on indoor, synthetic environments; real‑world outdoor or cluttered settings may expose new ambiguity modes.
  • Language coverage: The benchmark uses English instructions; multilingual or domain‑specific vocabularies (medical jargon, industrial terminology) remain untested.
  • Scalability of evidence collection: The current multi‑view sampling is heuristic; scaling to high‑resolution, large‑scale environments could incur latency.
  • Future directions: Integrating relational graph reasoning, expanding to video‑based instructions, and exploring active clarification dialogs are promising next steps.

Authors

  • Jiayu Ding
  • Haoran Tang
  • Ge Li

Paper Information

  • arXiv ID: 2601.05991v1
  • Categories: cs.AI
  • Published: January 9, 2026