[Paper] Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Published: May 8, 2026 at 01:50 PM EDT

Source: arXiv - 2605.08064v1

Overview

The paper introduces Proxy3D, a way to feed 3D information into vision‑language models (VLMs) without the heavy computational cost of full 3D pipelines. It clusters semantically aware features into a compact set of “proxies” that live in 3D space. With this representation, the authors report strong spatial reasoning on tasks such as 3D VQA and grounding while processing only short sequences of video frames.

Key Contributions

Compact 3D proxy representation – a small, fixed‑size set of semantic‑geometric clusters that capture the essential 3D structure of a scene.
Semantic‑aware clustering pipeline – combines a semantic encoder (e.g., CLIP‑style) with a geometric encoder (depth/point cloud) to produce clusters that respect both appearance and shape.
SpaceSpan dataset – a curated collection of video‑text pairs with explicit 3D spatial annotations, used to align the proxy representations with existing VLMs.
Multi‑stage training strategy – first pre‑trains the proxy encoder, then fine‑tunes the VLM on SpaceSpan, and finally adapts to downstream tasks, preserving the efficiency of short vision sequences.
State‑of‑the‑art results on several spatial‑intelligence benchmarks (3D VQA, visual grounding, spatial reasoning) while using far fewer frames than competing methods.

Methodology

Input & Feature Extraction
- The system receives a short video clip (e.g., 4–8 frames).
- A semantic encoder (typically a frozen CLIP image encoder) extracts high‑level visual tokens.
- A geometric encoder (e.g., a depth estimator or a lightweight point‑cloud network) provides per‑pixel 3D coordinates.
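When the geometric encoder is a depth estimator, one common way to obtain per‑pixel 3D coordinates is pinhole back‑projection. The sketch below assumes metric depth and known camera intrinsics (`fx`, `fy`, `cx`, `cy`); the function names and toy values are illustrative and are not taken from the paper.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth `depth` to a camera-frame 3D point (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_map_to_points(depth_map, fx, fy, cx, cy):
    """Convert a depth map (list of rows) into one 3D point per pixel."""
    return [
        [backproject(u, v, d, fx, fy, cx, cy) for u, d in enumerate(row)]
        for v, row in enumerate(depth_map)
    ]


# Toy 2x2 depth map with every pixel 2 units away; the intrinsics are made-up values.
points = depth_map_to_points([[2.0, 2.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each entry of `points` can then be paired with the semantic token at the same pixel.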
Semantic‑Aware Clustering
- Each pixel is represented by a concatenation of its semantic token and 3D coordinate.
- A differentiable clustering algorithm (e.g., learnable K‑means or a transformer‑based set encoder) groups these vectors into N proxies, where N is a small, tunable number such as 32.
- The resulting proxies are “semantic‑geometric centroids” that summarize the scene’s objects, surfaces, and spatial relations.
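To make the clustering step concrete, here is a minimal sketch that uses plain (non‑differentiable) k‑means as a stand‑in for the learnable clustering described above. The feature sizes, the helper names, and the toy data are assumptions for illustration only.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_into_proxies(features, num_proxies, iterations=10, seed=0):
    """Plain k-means: returns (proxies, assignment) for a list of feature vectors."""
    rng = random.Random(seed)
    proxies = [list(f) for f in rng.sample(features, num_proxies)]
    assignment = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each feature goes to its nearest proxy.
        assignment = [
            min(range(num_proxies), key=lambda k: squared_distance(f, proxies[k]))
            for f in features
        ]
        # Update step: each proxy moves to the mean of its members.
        for k in range(num_proxies):
            members = [f for f, a in zip(features, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                proxies[k] = mean_vector(members)
    return proxies, assignment


# Each per-pixel vector = semantic token (2-D here) concatenated with an (x, y, z) coordinate.
semantic = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
coords = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [2.0, 0.0, 3.0], [2.1, 0.0, 3.0]]
features = [s + c for s, c in zip(semantic, coords)]

proxies, assignment = cluster_into_proxies(features, num_proxies=2)
```

In this toy scene the first two pixels end up sharing one proxy and the last two share the other, because they agree in both appearance and position. A learnable variant would replace the hard nearest‑proxy assignment with a soft one so that gradients can flow.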
Proxy‑to‑Language Alignment
- The proxies are projected into the same embedding space as the language tokens of the VLM.
- Using the SpaceSpan dataset, the model learns to attend from text queries to the relevant proxy or proxies via cross‑modal attention layers.
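The attention from a text query to the proxies can be sketched as single‑head scaled dot‑product attention. This version omits the learned query, key, and value projections a real cross‑modal layer would have, and the vectors are toy values, so treat it as an illustration of the mechanism rather than the paper's layer.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, proxies):
    """Scaled dot-product attention of one query vector over the proxy vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * p for q, p in zip(query, proxy)) / scale for proxy in proxies]
    weights = softmax(scores)
    # Attended output: weighted sum of the proxies (they act as both keys and values).
    output = [
        sum(w * proxy[i] for w, proxy in zip(weights, proxies))
        for i in range(len(query))
    ]
    return weights, output


# Three toy proxies already projected into a 4-D "language" space (assumed values).
proxies = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
query = [0.0, 4.0, 0.0, 0.0]  # a text query that lines up with the second proxy
weights, output = attend(query, proxies)
```

The second proxy receives the largest weight because it has the highest dot product with the query.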
Multi‑Stage Training
- Stage 1: Freeze the VLM, train the proxy encoder to produce stable clusters.
- Stage 2: Fine‑tune the cross‑modal attention on SpaceSpan, encouraging the VLM to treat proxies as visual tokens.
- Stage 3: Transfer to downstream tasks (3D VQA, grounding) with minimal additional fine‑tuning.

The fixed‑size proxy set and the avoidance of full 3D reconstruction keep inference fast: the reported latency is ≈35 ms per clip on a single GPU.

Results & Findings

Benchmark	Prior Art (full 3D pipeline)	Proxy3D (short sequence)	Gain
3D Visual Question Answering (3D‑VQA), accuracy	71.2 %	73.8 %	+2.6 pp
Visual Grounding (3D‑Ref), IoU	58.4 %	60.1 %	+1.7 pp
Spatial Reasoning (NLVR‑3D)	64.5 %	66.0 %	+1.5 pp
Inference latency (per clip)	≈120 ms	≈35 ms	≈3.4× faster
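
The gain figures in the table are differences in percentage points, not relative improvements. A quick check using only the table's own numbers makes the distinction explicit; the dictionary keys are shorthand labels.

```python
# (prior art, Proxy3D) scores in percent, copied from the table above.
rows = {
    "3D-VQA accuracy": (71.2, 73.8),
    "3D-Ref IoU": (58.4, 60.1),
    "NLVR-3D": (64.5, 66.0),
}

gains = {}
for name, (prior, proxy3d) in rows.items():
    absolute_pp = round(proxy3d - prior, 1)                   # percentage points
    relative_pct = round(100 * (proxy3d - prior) / prior, 1)  # percent of the prior score
    gains[name] = (absolute_pp, relative_pct)

# Latency ratio: prior-art milliseconds per clip divided by Proxy3D milliseconds per clip.
speedup = round(120 / 35, 1)
```

Relative to the prior scores, the three improvements come to roughly 3.7 %, 2.9 %, and 2.3 %, and the latency ratio is about 3.4×.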

Key takeaways

Efficiency: Using only 4–8 frames, Proxy3D matches or exceeds methods that process full video streams or dense point clouds.
Scalability: The proxy count can be tuned; even with as few as 16 proxies the model retains >90 % of peak performance.
Generalization: The same proxy encoder works across diverse tasks without task‑specific redesign.

Practical Implications

Real‑time AR/VR assistants: Developers can embed spatial reasoning into head‑mounted devices without draining the battery or requiring heavy SLAM pipelines.
Robotics perception: A robot can query “Is the cup on the table?” using a handful of camera frames, enabling faster decision loops.
Multimodal search engines: Indexing video content with Proxy3D embeddings yields compact, spatially‑aware vectors that improve retrieval for queries like “show me scenes where a person is standing behind a car.”
Cost‑effective cloud services: Since the proxy representation is tiny (a few KB per clip), large‑scale VLM APIs can add 3D awareness without exploding storage or bandwidth.
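
Whether the “few KB per clip” figure holds depends on the embedding width and numeric precision, neither of which is stated here. A back‑of‑the‑envelope sketch, with `embed_dim` and `bytes_per_value` as assumed parameters:

```python
def proxy_payload_bytes(num_proxies, embed_dim, bytes_per_value=2):
    """Bytes needed to store num_proxies vectors of embed_dim values each."""
    return num_proxies * embed_dim * bytes_per_value


# 32 proxies stored as 16-bit floats; both embedding widths are hypothetical.
small = proxy_payload_bytes(num_proxies=32, embed_dim=64)    # 4096 bytes (4 KiB)
large = proxy_payload_bytes(num_proxies=32, embed_dim=1024)  # 65536 bytes (64 KiB)
```

Under these assumptions the payload stays in the single‑digit kilobyte range only for narrow embeddings, although even the wider case is small next to raw frames or dense point clouds.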

Limitations & Future Work

Depth estimation dependency: The quality of geometric proxies hinges on the accuracy of the depth/point‑cloud encoder; noisy depth can degrade clustering.
Fixed proxy count: While tunable, a static number may struggle with highly cluttered scenes where more granularity is needed.
Dataset bias: SpaceSpan, though diverse, still reflects the distribution of indoor‑centric video data; performance on outdoor or aerial footage remains to be validated.
Future directions suggested by the authors include adaptive proxy allocation (dynamic N per scene), integrating learned 3D priors from large point‑cloud datasets, and extending the approach to multimodal streams (audio + vision).

Authors

Jerry Jiang
Haowen Sun
Denis Gudovskiy
Yohei Nakata
Tomoyuki Okuno
Kurt Keutzer
Wenzhao Zheng

Paper Information

arXiv ID: 2605.08064v1
Categories: cs.CV
Published: May 8, 2026
