[Paper] ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Source: arXiv - 2601.11514v1
Overview
ShapeR addresses a weakness shared by many 3D‑generation pipelines: they assume perfectly captured, clean scans. In the wild, however, developers have to work with handheld video, noisy SLAM tracks, and partially occluded objects. This paper introduces a conditional 3D shape generator that turns ordinary, casually captured image sequences into accurate, metric‑scale meshes, opening the door to on‑device AR, robotics, and e‑commerce use cases.
Key Contributions
- Casual‑capture pipeline – Combines off‑the‑shelf visual‑inertial SLAM, 3‑D object detectors, and vision‑language models to harvest sparse geometry, multi‑view imagery, and textual captions for each object.
- Rectified‑Flow Transformer – A novel transformer architecture trained with rectified flow that can condition on heterogeneous modalities (points, images, text) and synthesize high‑fidelity metric meshes.
- Robust training regime – Introduces on‑the‑fly compositional augmentations, a curriculum that mixes object‑level and scene‑level datasets, and explicit background‑clutter handling to bridge the domain gap between lab data and wild captures.
- New benchmark – Provides a 178‑object, 7‑scene “in‑the‑wild” evaluation suite with ground‑truth geometry, the first public testbed for casual‑capture 3‑D generation.
- State‑of‑the‑art performance – Achieves a 2.7× reduction in Chamfer distance over the previous best method, demonstrating markedly better shape fidelity under real‑world conditions.
Methodology
- Data acquisition – A user records a short video of a scene with a handheld device. An off‑the‑shelf visual‑inertial SLAM system (e.g., ORB‑SLAM3) supplies a sparse point cloud and camera poses, and a 3D object detector (e.g., a 3D variant of Mask R‑CNN) isolates each object’s region in 3D space.
- Multi‑modal conditioning
- Sparse geometry: The SLAM points that fall inside the detected bounding box become a rough point scaffold.
- Multi‑view images: Using the estimated poses, the system crops the corresponding RGB frames, giving the model several viewpoints.
- Textual caption: A vision‑language model (e.g., a CLIP‑based captioner) generates a short description (“red wooden chair”) that provides semantic context.
- Rectified‑Flow Transformer – The three modalities are embedded separately (a PointNet‑style encoder for geometry, a CNN for images, a text transformer for the caption) and concatenated into a unified token sequence. The transformer is trained with a rectified‑flow objective, learning a continuous, diffusion‑like transport from noise to a dense point cloud guided by the conditioning tokens; a standard surface‑reconstruction step then converts that point cloud into a mesh. A minimal sketch of this conditioning and training step follows the list below.
- Robustness tricks (sketched in code after this list)
- Compositional augmentations: Randomly paste objects into new backgrounds, perturb point density, and simulate motion blur on the images during training.
- Curriculum learning: Start with clean, isolated object datasets, then gradually introduce cluttered scene data, letting the model adapt to increasing difficulty.
- Background handling: An auxiliary mask predictor separates foreground from background points, preventing the transformer from being confused by stray SLAM points.
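
The conditioning and generation steps above can be pictured with a short PyTorch‑style sketch. Everything here is an illustrative assumption rather than the paper’s actual architecture: the module names (`TokenEmbedders`, `VelocityTransformer`), feature sizes, and the choice of a raw dense point cloud as the generation target are placeholders. The point is the overall shape of the computation: embed each modality into a shared token space, concatenate those tokens with noisy shape tokens, and regress the rectified‑flow velocity.

```python
# Minimal PyTorch sketch of conditioning-token assembly and one rectified-flow
# training step. Module names, feature sizes, and the dense-point-cloud target
# are illustrative placeholders, not the paper's actual architecture.
import torch
import torch.nn as nn

D = 256  # shared token width (assumed)

class TokenEmbedders(nn.Module):
    """Embed each conditioning modality into a shared token space."""
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, D), nn.ReLU(), nn.Linear(D, D))
        self.image_proj = nn.Linear(512, D)  # pooled per-view image features (assumed dim)
        self.text_proj = nn.Linear(512, D)   # caption embedding (assumed dim)

    def forward(self, slam_points, view_feats, caption_feat):
        # slam_points:  (B, Np, 3)   sparse SLAM points inside the object's 3D box
        # view_feats:   (B, Nv, 512) one feature vector per cropped view
        # caption_feat: (B, 512)     embedding of the generated caption
        return torch.cat([
            self.point_mlp(slam_points),            # (B, Np, D)
            self.image_proj(view_feats),            # (B, Nv, D)
            self.text_proj(caption_feat)[:, None],  # (B, 1,  D)
        ], dim=1)

class VelocityTransformer(nn.Module):
    """Predict the rectified-flow velocity for noisy shape points,
    conditioned on the multi-modal tokens via one shared sequence."""
    def __init__(self):
        super().__init__()
        self.shape_in = nn.Linear(3, D)
        self.time_emb = nn.Linear(1, D)
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.shape_out = nn.Linear(D, 3)

    def forward(self, x_t, t, cond_tokens):
        # x_t: (B, N, 3) noisy dense point cloud at flow time t in [0, 1]
        h = self.shape_in(x_t) + self.time_emb(t[:, None, None])
        seq = torch.cat([cond_tokens, h], dim=1)
        out = self.backbone(seq)[:, -x_t.shape[1]:]  # keep only the shape tokens
        return self.shape_out(out)                   # predicted velocity

def rectified_flow_loss(model, embed, batch):
    """x_t = (1 - t) * noise + t * data; the target velocity is (data - noise)."""
    x1 = batch["dense_points"]                     # (B, N, 3) ground-truth shape samples
    cond = embed(batch["slam_points"], batch["view_feats"], batch["caption_feat"])
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # per-example flow time
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
    v_pred = model(x_t, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()
```

At inference time, such a model would be integrated from pure noise (t = 0) to t = 1 with a few Euler steps, after which a surface‑reconstruction step (e.g., Poisson reconstruction) turns the dense points into a mesh, as described above.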
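
The data‑side robustness tricks are equally easy to state in code. Below is a small sketch of the point‑density perturbation, motion‑blur simulation, and curriculum mixing from the robustness list; all probabilities, kernel sizes, and the linear ramp schedule are assumed values rather than the paper’s settings, and the background compositing and auxiliary mask predictor are omitted for brevity.

```python
# Illustrative sketch of on-the-fly augmentations and curriculum mixing.
# Probabilities, kernel sizes, and the ramp schedule are assumed values,
# not the paper's configuration.
import random
import torch
import torch.nn.functional as F

def perturb_point_density(points: torch.Tensor, keep_min=0.3, keep_max=1.0):
    """Randomly subsample the SLAM point scaffold to mimic sparse tracking."""
    n = points.shape[0]
    keep = max(1, int(n * random.uniform(keep_min, keep_max)))
    idx = torch.randperm(n)[:keep]
    return points[idx]

def simulate_motion_blur(image: torch.Tensor, max_kernel=9):
    """Apply a horizontal box blur of random width to a (C, H, W) float image."""
    k = random.randrange(3, max_kernel + 1, 2)      # odd kernel width
    kernel = torch.ones(image.shape[0], 1, 1, k) / k
    return F.conv2d(image[None], kernel, padding=(0, k // 2),
                    groups=image.shape[0])[0]

def sample_training_example(object_data, scene_data, step, total_steps):
    """Curriculum: start with clean object-level data, then mix in cluttered
    scene-level captures with increasing probability."""
    p_scene = min(1.0, step / (0.5 * total_steps))  # assumed linear ramp
    source = scene_data if random.random() < p_scene else object_data
    sample = dict(random.choice(source))
    sample["points"] = perturb_point_density(sample["points"])
    sample["views"] = [simulate_motion_blur(v) for v in sample["views"]]
    return sample
```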
Results & Findings
| Metric | ShapeR | Prior SOTA (e.g., NeuralRecon‑Cond) |
|---|---|---|
| Chamfer distance (×10⁻³, lower is better) | 1.8 | 4.9 |
| F‑score @ 1 mm (higher is better) | 0.71 | 0.44 |
| GPU inference time (lower is better) | 0.42 s | 0.68 s |
- Quantitative: ShapeR reduces Chamfer distance by 2.7× and improves the F‑score substantially, confirming tighter geometry recovery (both metrics are defined in the sketch after this list).
- Qualitative: Visual examples show faithful reconstruction of thin legs, reflective surfaces, and partially occluded parts that previous methods either smooth away or miss entirely.
- Ablation: Removing any modality (e.g., dropping the caption) degrades performance by ~15 %, highlighting the synergy of geometry + vision + language.
- Generalization: On the new “in‑the‑wild” benchmark, ShapeR maintains >80 % of its lab‑test performance, whereas baselines drop below 50 %.
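
For reference, the two geometry metrics in the table are standard point‑set measures. The sketch below assumes the symmetric Chamfer distance (sum of the two directed mean nearest‑neighbor distances) and an F‑score at a fixed 1 mm threshold, computed between sampled point sets; the paper’s exact variant, sampling density, and the ×10⁻³ scaling convention may differ.

```python
# Symmetric Chamfer distance and F-score between two point sets, as commonly
# used for shape evaluation. The exact variant (squared vs. plain distances,
# sum vs. mean of the two directions) is an assumption.
import torch

def chamfer_and_fscore(pred: torch.Tensor, gt: torch.Tensor, tau: float = 0.001):
    """pred: (N, 3), gt: (M, 3) points in metres; tau: distance threshold (1 mm)."""
    d = torch.cdist(pred, gt)             # (N, M) pairwise distances
    d_pred_to_gt = d.min(dim=1).values    # nearest GT point for each prediction
    d_gt_to_pred = d.min(dim=0).values    # nearest prediction for each GT point

    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()

    precision = (d_pred_to_gt < tau).float().mean()  # predicted points near GT
    recall = (d_gt_to_pred < tau).float().mean()     # GT points covered by prediction
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer.item(), fscore.item()
```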
Practical Implications
- AR/VR content creation – Developers can let users scan objects with a phone and instantly obtain metric meshes for placement in mixed‑reality scenes, without requiring expensive turntables or LiDAR.
- Robotics perception – Service robots can build up a database of manipulable objects on‑the‑fly, using the generated meshes for grasp planning and collision checking.
- E‑commerce & digital twins – Retailers can generate product models from quick video demos, dramatically cutting the time and cost of 3‑D catalog creation.
- Edge deployment – Because the pipeline relies on lightweight SLAM and detection modules already common on mobile devices, the heavy lifting (the transformer) can run on a modest GPU or, with minor latency‑oriented optimizations, on a modern mobile‑AI accelerator.
Limitations & Future Work
- Sparse point dependence – Extremely low‑texture scenes still produce insufficient SLAM points, leading to coarse reconstructions.
- Caption quality – The method assumes the language model yields accurate object names; ambiguous or erroneous captions can misguide the shape prior.
- Scale to large scenes – Current experiments focus on single objects; extending the approach to reconstruct entire rooms with many interacting objects remains an open challenge.
- Real‑time constraints – While inference is sub‑second on a desktop GPU, achieving true real‑time performance on mobile hardware will require model pruning or distillation.
The authors suggest exploring self‑supervised point densification, tighter integration of language grounding, and hierarchical scene‑level generation as next steps.
Authors
- Yawar Siddiqui
- Duncan Frost
- Samir Aroudj
- Armen Avetisyan
- Henry Howard-Jenkins
- Daniel DeTone
- Pierre Moulon
- Qirui Wu
- Zhengqin Li
- Julian Straub
- Richard Newcombe
- Jakob Engel
Paper Information
- arXiv ID: 2601.11514v1
- Categories: cs.CV, cs.LG
- Published: January 16, 2026