[Paper] PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Published: June 4, 2026 at 01:59 PM EDT

Source: arXiv - 2606.06485v1

Overview

Recent advances in 3D multimodal large language models (3D‑MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D‑MLLMs remain largely object‑centric, limiting their ability to model fine‑grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part‑aware 3D‑MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes.

To enable training and evaluation of part‑aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part‑level annotations and language instructions. We further develop Part‑Aware 3D Representation Learning to enrich 3D visual representations with fine‑grained part‑level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object‑part queries. Extensive experiments show that our method substantially improves part‑level question answering and referring segmentation, while also achieving strong performance across object‑level vision‑language tasks.
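
The idea of grounding a part through a hierarchical object‑part query can be illustrated with a toy sketch. This is not the paper's implementation: the function name `ground_part`, the dot‑product scoring, the threshold, and the two‑dimensional features are all assumptions made for illustration. The sketch first selects the points belonging to the referred object, then searches for the part only inside that object.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ground_part(
    point_feats: List[Sequence[float]],
    object_query: Sequence[float],
    part_query: Sequence[float],
    threshold: float = 0.0,
) -> Tuple[List[bool], List[bool]]:
    """Hypothetical two-stage grounding (illustrative only).

    Stage 1: a point is in the object mask if it matches the object query.
    Stage 2: a point is in the part mask if it is inside the object mask
             AND matches the part query.
    """
    object_mask = [dot(f, object_query) > threshold for f in point_feats]
    part_mask = [
        in_obj and dot(f, part_query) > threshold
        for f, in_obj in zip(point_feats, object_mask)
    ]
    return object_mask, part_mask


# Toy features (assumed): dim 0 ~ "chair-ness", dim 1 ~ "leg-ness".
points = [
    [1.0, 1.0],    # chair leg
    [1.0, -1.0],   # chair seat
    [-1.0, 1.0],   # table leg
    [-1.0, -1.0],  # floor
]
object_mask, part_mask = ground_part(
    points, object_query=[1.0, 0.0], part_query=[0.0, 1.0]
)
print(object_mask)
print(part_mask)
```

In this toy setup the table leg also matches the part query, but it is excluded because it falls outside the object mask. That is the point of conditioning the part query on the object.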

Key Contributions

Introduces a part‑aware 3D‑MLLM framework (PAR3D).
Provides the ScenePart dataset with part‑level annotations and language instructions.
Develops part‑aware representation learning and hierarchical segmentation query generation.
Demonstrates significant improvements on part‑level QA and referring segmentation, plus strong object‑level performance.

Methodology

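This summary does not detail the methodology. At a high level, the abstract describes two components: Part‑Aware 3D Representation Learning, which enriches 3D visual representations with fine‑grained part‑level semantics, and Hierarchical Segmentation Query Generation, which grounds part targets via hierarchical object‑part queries. Both are trained and evaluated with the synthetic ScenePart dataset. Please refer to the full paper for details.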

Practical Implications

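The abstract motivates the work by embodied interaction: existing 3D‑MLLMs are largely object‑centric, while interacting with a 3D environment requires modeling fine‑grained part structures. By understanding and grounding parts as well as whole objects, PAR3D targets that gap within computer vision (cs.CV).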

Authors

Shaohui Dai
Yansong Qu
You Shen
Shengchuan Zhang
Liujuan Cao

Paper Information

arXiv ID: 2606.06485v1
Categories: cs.CV
Published: June 4, 2026
