[Paper] ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Source: arXiv - 2601.11514v1
Overview
ShapeR addresses a weakness shared by many 3D‑generation pipelines: they assume perfectly captured, clean scans. In the wild, however, developers have to work with handheld video, noisy SLAM tracks, and partially occluded objects. This paper introduces a conditional 3D shape generator that turns ordinary, casually captured image sequences into accurate, metric‑scale meshes, opening the door to on‑device AR, robotics, and e‑commerce use cases.
Key Contributions
- Casual‑capture pipeline – Combines off‑the‑shelf visual‑inertial SLAM, 3‑D object detectors, and vision‑language models to harvest sparse geometry, multi‑view imagery, and textual captions for each object.
- Rectified‑Flow Transformer – A novel transformer architecture trained with rectified flow that can condition on heterogeneous modalities (points, images, text) and synthesize high‑fidelity metric meshes.
- Robust training regime – Introduces on‑the‑fly compositional augmentations, a curriculum that mixes object‑level and scene‑level datasets, and explicit background‑clutter handling to bridge the domain gap between lab data and wild captures.
- New benchmark – Provides a 178‑object, 7‑scene “in‑the‑wild” evaluation suite with ground‑truth geometry, the first public testbed for casual‑capture 3‑D generation.
- State‑of‑the‑art performance – Achieves a 2.7× reduction in Chamfer distance over the previous best method, demonstrating markedly better shape fidelity under real‑world conditions.
Methodology
- Data acquisition – A user records a short video of a scene with a handheld device. An off‑the‑shelf visual‑inertial SLAM system (e.g., ORB‑SLAM3) supplies a sparse point cloud and camera poses, and a 3D object detector (e.g., a 3D variant of Mask R‑CNN) isolates each object’s region in 3D space.
- Multi‑modal conditioning
- Sparse geometry: The SLAM points that fall inside the detected bounding box become a rough point scaffold.
- Multi‑view images: Using the estimated poses, the system crops the corresponding RGB frames, giving the model several viewpoints.
- Textual caption: A vision‑language model (e.g., a CLIP‑based captioner) generates a short description (“red wooden chair”) that provides semantic context.
- Rectified‑Flow Transformer – The three modalities are embedded separately (a PointNet‑style encoder for geometry, a CNN for images, a text transformer for the caption) and concatenated into a unified token sequence. The transformer is trained with a rectified‑flow objective, learning a continuous, diffusion‑like transport from noise to a dense point cloud guided by the conditioning tokens; a standard surface‑reconstruction step then converts that point cloud into a mesh. A minimal sketch of this conditioning and training step follows the list below.
- Robustness tricks (sketched in code after this list)
- Compositional augmentations: Randomly paste objects into new backgrounds, perturb point density, and simulate motion blur on the images during training.
- Curriculum learning: Start with clean, isolated object datasets, then gradually introduce cluttered scene data, letting the model adapt to increasing difficulty.
- Background handling: An auxiliary mask predictor separates foreground from background points, preventing the transformer from being confused by stray SLAM points.
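
The conditioning and generation steps above can be pictured with a short PyTorch‑style sketch. Everything here is an illustrative assumption rather than the paper’s actual architecture: the module names (`TokenEmbedders`, `VelocityTransformer`), feature sizes, and the choice of a raw dense point cloud as the generation target are placeholders. The point is the overall shape of the computation: embed each modality into a shared token space, concatenate those tokens with noisy shape tokens, and regress the rectified‑flow velocity.

```python
# Minimal PyTorch sketch of conditioning-token assembly and one rectified-flow
# training step. Module names, feature sizes, and the dense-point-cloud target
# are illustrative placeholders, not the paper's actual architecture.
import torch
import torch.nn as nn

D = 256  # shared token width (assumed)

class TokenEmbedders(nn.Module):
    """Embed each conditioning modality into a shared token space."""
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, D), nn.ReLU(), nn.Linear(D, D))
        self.image_proj = nn.Linear(512, D)  # pooled per-view image features (assumed dim)
        self.text_proj = nn.Linear(512, D)   # caption embedding (assumed dim)

    def forward(self, slam_points, view_feats, caption_feat):
        # slam_points:  (B, Np, 3)   sparse SLAM points inside the object's 3D box
        # view_feats:   (B, Nv, 512) one feature vector per cropped view
        # caption_feat: (B, 512)     embedding of the generated caption
        return torch.cat([
            self.point_mlp(slam_points),            # (B, Np, D)
            self.image_proj(view_feats),            # (B, Nv, D)
            self.text_proj(caption_feat)[:, None],  # (B, 1,  D)
        ], dim=1)

class VelocityTransformer(nn.Module):
    """Predict the rectified-flow velocity for noisy shape points,
    conditioned on the multi-modal tokens via one shared sequence."""
    def __init__(self):
        super().__init__()
        self.shape_in = nn.Linear(3, D)
        self.time_emb = nn.Linear(1, D)
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.shape_out = nn.Linear(D, 3)

    def forward(self, x_t, t, cond_tokens):
        # x_t: (B, N, 3) noisy dense point cloud at flow time t in [0, 1]
        h = self.shape_in(x_t) + self.time_emb(t[:, None, None])
        seq = torch.cat([cond_tokens, h], dim=1)
        out = self.backbone(seq)[:, -x_t.shape[1]:]  # keep only the shape tokens
        return self.shape_out(out)                   # predicted velocity

def rectified_flow_loss(model, embed, batch):
    """x_t = (1 - t) * noise + t * data; the target velocity is (data - noise)."""
    x1 = batch["dense_points"]                     # (B, N, 3) ground-truth shape samples
    cond = embed(batch["slam_points"], batch["view_feats"], batch["caption_feat"])
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # per-example flow time
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
    v_pred = model(x_t, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()
```

At inference time, such a model would be integrated from pure noise (t = 0) to t = 1 with a few Euler steps, after which a surface‑reconstruction step (e.g., Poisson reconstruction) turns the dense points into a mesh, as described above.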
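
The data‑side robustness tricks are equally easy to state in code. Below is a small sketch of the point‑density perturbation, motion‑blur simulation, and curriculum mixing from the robustness list; all probabilities, kernel sizes, and the linear ramp schedule are assumed values rather than the paper’s settings, and the background compositing and auxiliary mask predictor are omitted for brevity.

```python
# Illustrative sketch of on-the-fly augmentations and curriculum mixing.
# Probabilities, kernel sizes, and the ramp schedule are assumed values,
# not the paper's configuration.
import random
import torch
import torch.nn.functional as F

def perturb_point_density(points: torch.Tensor, keep_min=0.3, keep_max=1.0):
    """Randomly subsample the SLAM point scaffold to mimic sparse tracking."""
    n = points.shape[0]
    keep = max(1, int(n * random.uniform(keep_min, keep_max)))
    idx = torch.randperm(n)[:keep]
    return points[idx]

def simulate_motion_blur(image: torch.Tensor, max_kernel=9):
    """Apply a horizontal box blur of random width to a (C, H, W) float image."""
    k = random.randrange(3, max_kernel + 1, 2)      # odd kernel width
    kernel = torch.ones(image.shape[0], 1, 1, k) / k
    return F.conv2d(image[None], kernel, padding=(0, k // 2),
                    groups=image.shape[0])[0]

def sample_training_example(object_data, scene_data, step, total_steps):
    """Curriculum: start with clean object-level data, then mix in cluttered
    scene-level captures with increasing probability."""
    p_scene = min(1.0, step / (0.5 * total_steps))  # assumed linear ramp
    source = scene_data if random.random() < p_scene else object_data
    sample = dict(random.choice(source))
    sample["points"] = perturb_point_density(sample["points"])
    sample["views"] = [simulate_motion_blur(v) for v in sample["views"]]
    return sample
```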
Results & Findings
| Metric | ShapeR | Prior SOTA (e.g., NeuralRecon‑Cond) |
|---|---|---|
| Chamfer distance (×10⁻³, lower is better) | 1.8 | 4.9 |
| F‑score @ 1 mm (higher is better) | 0.71 | 0.44 |
| GPU inference time (lower is better) | 0.42 s | 0.68 s |
- Quantitative: ShapeR reduces Chamfer distance by 2.7× and improves the F‑score substantially, confirming tighter geometry recovery (both metrics are defined in the sketch after this list).
- Qualitative: Visual examples show faithful reconstruction of thin legs, reflective surfaces, and partially occluded parts that previous methods either smooth away or miss entirely.
- Ablation: Removing any modality (e.g., dropping the caption) degrades performance by ~15 %, highlighting the synergy of geometry + vision + language.
- Generalization: On the new “in‑the‑wild” benchmark, ShapeR maintains >80 % of its lab‑test performance, whereas baselines drop below 50 %.
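
For reference, the two geometry metrics in the table are standard point‑set measures. The sketch below assumes the symmetric Chamfer distance (sum of the two directed mean nearest‑neighbor distances) and an F‑score at a fixed 1 mm threshold, computed between sampled point sets; the paper’s exact variant, sampling density, and the ×10⁻³ scaling convention may differ.

```python
# Symmetric Chamfer distance and F-score between two point sets, as commonly
# used for shape evaluation. The exact variant (squared vs. plain distances,
# sum vs. mean of the two directions) is an assumption.
import torch

def chamfer_and_fscore(pred: torch.Tensor, gt: torch.Tensor, tau: float = 0.001):
    """pred: (N, 3), gt: (M, 3) points in metres; tau: distance threshold (1 mm)."""
    d = torch.cdist(pred, gt)             # (N, M) pairwise distances
    d_pred_to_gt = d.min(dim=1).values    # nearest GT point for each prediction
    d_gt_to_pred = d.min(dim=0).values    # nearest prediction for each GT point

    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()

    precision = (d_pred_to_gt < tau).float().mean()  # predicted points near GT
    recall = (d_gt_to_pred < tau).float().mean()     # GT points covered by prediction
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer.item(), fscore.item()
```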
Practical Implications
- AR/VR content creation – Developers can let users scan objects with a phone and instantly obtain metric meshes for placement in mixed‑reality scenes, without requiring expensive turntables or LiDAR.
- Robotics perception – Service robots can build up a database of manipulable objects on‑the‑fly, using the generated meshes for grasp planning and collision checking.
- E‑commerce & digital twins – Retailers can generate product models from quick video demos, dramatically cutting the time and cost of 3‑D catalog creation.
- Edge deployment – Because the pipeline relies on lightweight SLAM and detection modules already common on mobile devices, the heavy lifting (the transformer) can run on a modest GPU or, with minor latency‑oriented optimizations, on a modern mobile‑AI accelerator.
Limitations & Future Work
- Sparse point dependence – Extremely low‑texture scenes still produce insufficient SLAM points, leading to coarse reconstructions.
- Caption quality – The method assumes the language model yields accurate object names; ambiguous or erroneous captions can misguide the shape prior.
- Scale to large scenes – Current experiments focus on single objects; extending the approach to reconstruct entire rooms with many interacting objects remains an open challenge.
- Real‑time constraints – While inference is sub‑second on a desktop GPU, achieving true real‑time performance on mobile hardware will require model pruning or distillation.
The authors suggest exploring self‑supervised point densification, tighter integration of language grounding, and hierarchical scene‑level generation as next steps.
Authors
- Yawar Siddiqui
- Duncan Frost
- Samir Aroudj
- Armen Avetisyan
- Henry Howard-Jenkins
- Daniel DeTone
- Pierre Moulon
- Qirui Wu
- Zhengqin Li
- Julian Straub
- Richard Newcombe
- Jakob Engel
Paper Information
- arXiv ID: 2601.11514v1
- Categories: cs.CV, cs.LG
- Published: January 16, 2026