[Paper] Open-Vocabulary 3D Instruction Ambiguity Detection
Source: arXiv - 2601.05991v1
Overview
The paper introduces Open-Vocabulary 3D Instruction Ambiguity Detection, a new task that asks a model to decide whether a natural-language command can be interpreted in exactly one way within a given 3D environment. By building the Ambi3D benchmark (~700 scenes, ~22k instructions) and a two-stage detection system called AmbiVer, the authors expose a blind spot in current embodied-AI pipelines: these pipelines assume instructions are unambiguous and jump straight to execution, which is risky in safety-critical domains such as surgery, robotics, and autonomous navigation.
Key Contributions
- Task definition – Formalizes “open‑vocabulary 3D instruction ambiguity detection,” shifting focus from execution to verification.
- Ambi3D benchmark – Large‑scale dataset with diverse indoor/outdoor 3D scenes and human‑written instructions, each labeled as ambiguous or unambiguous.
- Empirical gap analysis – Shows that state‑of‑the‑art 3D Large Language Models (LLMs) and vision‑language models (VLMs) perform poorly on ambiguity detection.
- AmbiVer framework – A two‑stage pipeline that (1) gathers multi‑view visual evidence from the scene and (2) feeds this evidence to a VLM to decide ambiguity.
- Open resources – Code, data, and evaluation scripts released publicly for reproducibility and community extension.
Methodology
- Scene & Instruction Pairing – Each 3D scene is rendered from several camera viewpoints. Human annotators write free‑form commands (e.g., “Pick up the red bottle”) and label whether the command uniquely identifies an object/action in that scene.
- Baseline Models – The authors test existing 3D‑LLMs (e.g., CLIP‑based models, Point‑BERT) that directly ingest the instruction and a single scene representation.
- AmbiVer Two-Stage Design (a minimal pipeline sketch follows this list)
  - Evidence Collection – A lightweight visual search module samples a set of candidate objects/views that could satisfy the instruction, producing a small gallery of image patches.
  - VLM Reasoning – A pretrained vision-language model (e.g., BLIP-2, Flamingo) receives the instruction plus the collected visual evidence and outputs a binary "ambiguous / unambiguous" decision, effectively grounding the language in concrete visual cues before judging clarity.
- Training & Evaluation – The VLM is fine-tuned on the Ambi3D training split with a cross-entropy loss; performance is measured by accuracy, precision, and recall on the held-out test set (a training/evaluation sketch also follows this list).
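The paper is summarized here without code, so the following is a minimal Python sketch of the two-stage idea, not the authors' implementation. The `Sample` record mirrors the scene/instruction/label pairing described above, while `renderer.render_views`, `searcher.find_candidate_patches`, and `vlm.generate` are hypothetical interfaces standing in for the evidence-collection module and the pretrained VLM.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    scene_id: str      # identifier of the 3D scene
    instruction: str   # free-form command, e.g. "Pick up the red bottle"
    label: int         # 1 = ambiguous, 0 = unambiguous (Ambi3D ground truth)

def detect_ambiguity(scene_id: str, instruction: str, renderer, searcher, vlm) -> bool:
    """AmbiVer-style two-stage check (illustrative sketch, placeholder interfaces)."""
    # Stage 1: evidence collection. Render several viewpoints and keep the
    # image patches that could plausibly satisfy the instruction.
    views = renderer.render_views(scene_id, num_views=6)
    patches = searcher.find_candidate_patches(views, instruction, top_k=8)

    # Stage 2: VLM reasoning. Prompt the VLM with the instruction plus the
    # evidence gallery and read out a binary decision.
    prompt = (
        f"Instruction: {instruction}\n"
        "Given the attached views, is the target uniquely identifiable? "
        "Answer 'ambiguous' or 'unambiguous'."
    )
    answer = vlm.generate(images=patches, text=prompt).strip().lower()
    # "unambiguous" contains "ambiguous" as a substring, so check the
    # negative form explicitly before the positive one.
    if "unambiguous" in answer:
        return False
    return "ambiguous" in answer
```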
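For the fine-tuning and evaluation step, a sketch under similar assumptions: the model is taken to map (evidence patches, instruction) to two-class logits, batches carry `images`, `instructions`, and 0/1 `labels`, and "ambiguous" is treated as the positive class for precision/recall. Names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def train_step(model, batch, optimizer):
    """One fine-tuning step with cross-entropy over the two ambiguity classes."""
    logits = model(batch["images"], batch["instructions"])  # shape: (B, 2)
    loss = F.cross_entropy(logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def evaluate(model, loader):
    """Accuracy, precision, and recall on the held-out split."""
    tp = fp = fn = correct = total = 0
    for batch in loader:
        preds = model(batch["images"], batch["instructions"]).argmax(dim=-1)
        labels = batch["labels"]
        correct += (preds == labels).sum().item()
        total += labels.numel()
        tp += ((preds == 1) & (labels == 1)).sum().item()
        fp += ((preds == 1) & (labels == 0)).sum().item()
        fn += ((preds == 0) & (labels == 1)).sum().item()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return correct / total, precision, recall
```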
Results & Findings
| Model | Accuracy on Ambi3D (%) |
|---|---|
| 3D-LLM baseline (single view) | ~58 |
| VLM with single view | ~62 |
| AmbiVer (two-stage) | 78 |
| Human upper bound | ~92 |
- Baselines struggle: Even the strongest 3D-LLMs misclassify nearly half of ambiguous commands, confirming that current embodied agents lack a "self-check" before acting.
- Evidence matters: Providing multiple visual perspectives boosts VLM performance by roughly 10 percentage points, demonstrating that ambiguity often hinges on hidden objects or occlusions.
- Error patterns: Most failures involve subtle spatial relations (“to the left of the chair”) or synonyms (“vial” vs. “bottle”), suggesting future work should improve relational reasoning and lexical grounding.
Practical Implications
- Safety‑critical robotics: Before a robot executes a hand‑off command in a lab or operating room, an ambiguity detector can flag uncertain instructions, prompting clarification from a human operator.
- Voice‑controlled assistants: Smart home devices could ask follow‑up questions (“Did you mean the blue mug on the top shelf?”) instead of acting on vague commands, reducing user frustration.
- Autonomous navigation: Drones receiving high‑level goals (“Inspect the tower”) can verify that the target is uniquely identifiable in the current 3D map, avoiding wasted flights.
- Human‑in‑the‑loop AI: Embodied agents equipped with AmbiVer can adopt a “verify‑then‑act” workflow, improving trustworthiness and compliance with regulatory standards for AI safety.
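As a rough illustration of this "verify-then-act" idea (not the authors' implementation), an agent could gate execution on the ambiguity detector and fall back to asking for clarification; `detector`, `executor`, and `ask_user` below are hypothetical callables.

```python
def verify_then_act(scene_id: str, instruction: str, detector, executor, ask_user):
    """Gate execution on an ambiguity check, asking for clarification when needed."""
    # Keep asking until the (detector-judged) instruction has a unique reading.
    while detector(scene_id, instruction):
        instruction = ask_user(
            f"The command '{instruction}' matches more than one target. "
            "Can you be more specific?"
        )
    executor(scene_id, instruction)  # act only on an unambiguous instruction
```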
Limitations & Future Work
- Scene diversity: Ambi3D focuses mainly on indoor, synthetic environments; real‑world outdoor or cluttered settings may expose new ambiguity modes.
- Language coverage: The benchmark uses English instructions; multilingual or domain‑specific vocabularies (medical jargon, industrial terminology) remain untested.
- Scalability of evidence collection: The current multi‑view sampling is heuristic; scaling to high‑resolution, large‑scale environments could incur latency.
- Future directions: Integrating relational graph reasoning, expanding to video‑based instructions, and exploring active clarification dialogs are promising next steps.
Authors
- Jiayu Ding
- Haoran Tang
- Ge Li
Paper Information
- arXiv ID: 2601.05991v1
- Categories: cs.AI
- Published: January 9, 2026