[Paper] ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
Source: arXiv - 2603.16866v1
Overview
ManiTwin introduces an end‑to‑end pipeline that turns a single 2‑D image into a fully simulation‑ready 3‑D object twin, complete with physical properties, language captions, functional tags, and manipulation proposals. By scaling this process to 100 000 diverse assets, the authors provide a new “plug‑and‑play” resource that can instantly feed robotic‑manipulation simulators, scene‑generation tools, and vision‑language benchmarks.
Key Contributions
- Automated asset creation: A single‑image‑to‑twin workflow that outputs mesh, texture, collision, mass, friction, and semantic annotations without manual modeling.
- ManiTwin‑100K dataset: 100 K high‑fidelity, manipulation‑ready digital twins covering everyday objects, industrial parts, and abstract shapes.
- Rich multimodal metadata: Each twin ships with natural‑language descriptions, functional labels (e.g., “graspable”, “pourable”), and a set of verified manipulation proposals (grasp poses, push trajectories).
- Open‑source pipeline & web portal: The codebase, data, and a demo UI are publicly released, enabling researchers and engineers to extend or customize the asset generation process.
- Demonstrated utility: Benchmarks show ManiTwin‑100K improves data diversity for simulation‑based policy training, random scene synthesis, and visual‑question‑answering (VQA) generation compared to prior 3‑D object collections.
Methodology
- Image Ingestion & Shape Reconstruction – A pretrained depth‑estimation network predicts a coarse point cloud from a single RGB image. The point cloud is refined with a differentiable marching‑cubes module to produce a watertight mesh.
- Physical Property Estimation – A lightweight regression model predicts mass, center‑of‑mass, and friction coefficients from visual cues (material texture, shape). These values are validated against a physics engine (PyBullet) to ensure stable simulation.
- Semantic Enrichment – A language model (GPT‑3.5‑style) generates concise object descriptions and functional tags. A separate classifier maps visual features to a taxonomy of manipulation affordances (graspable, hinge, pourable, etc.).
- Manipulation Proposal Generation – Using a grasp synthesis library (e.g., Dex‑Net) and a motion‑planning module, the pipeline samples feasible grasp poses and push trajectories, then runs a short physics rollout to verify success. Verified proposals are stored alongside the asset.
- Dataset Assembly – Assets are automatically packaged into a unified format (URDF + JSON metadata) and uploaded to a cloud bucket. A validation script checks mesh integrity, annotation completeness, and simulation stability across a random subset.
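The paper specifies the unified format only as "URDF + JSON metadata"; the exact JSON schema is not given, so the sketch below uses illustrative field names to show what one packaged twin might carry.

```python
import json

# Illustrative metadata record for one digital twin. Field names are
# hypothetical -- the paper states URDF + JSON but not a concrete schema.
def make_twin_metadata(asset_id, caption, tags, mass_kg, friction, proposals):
    """Assemble the JSON sidecar that accompanies an asset's URDF file."""
    return {
        "asset_id": asset_id,
        "urdf": f"{asset_id}.urdf",           # mesh, collision, inertial data
        "caption": caption,                   # natural-language description
        "functional_tags": tags,              # e.g. ["graspable", "pourable"]
        "physics": {"mass_kg": mass_kg, "friction": friction},
        "manipulation_proposals": proposals,  # verified grasps / pushes
    }

record = make_twin_metadata(
    "mug_0001",
    "A white ceramic mug with a curved handle.",
    ["graspable", "pourable"],
    mass_kg=0.31,
    friction=0.6,
    proposals=[{"type": "grasp", "pose": [0.0, 0.02, 0.08, 0, 0, 0, 1]}],
)
print(json.dumps(record, indent=2))
```

Keeping physics and affordance annotations in a sidecar like this (rather than only inside the URDF) is what makes the assets easy to index and filter at dataset scale.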
The entire pipeline runs on a single GPU workstation and can produce a new twin in ~30 seconds, making it practical for on‑demand dataset expansion.
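The five stages above compose into a linear pipeline. The sketch below is schematic: every function is a stub standing in for the paper's actual modules (the depth network, the property regressor, the language model, Dex‑Net‑style grasp synthesis, and the physics validator).

```python
# Schematic of the five-stage pipeline; each stage is a placeholder stub.
def reconstruct_mesh(image):          # stage 1: image -> watertight mesh
    return {"vertices": [], "faces": [], "source_image": image}

def estimate_physics(mesh):           # stage 2: visual cues -> mass/friction
    return {"mass_kg": 1.0, "friction": 0.5}

def enrich_semantics(mesh):           # stage 3: caption + affordance tags
    return {"caption": "an object", "tags": ["graspable"]}

def propose_manipulations(mesh, physics):  # stage 4: verified grasps/pushes
    return [{"type": "grasp", "verified": True}]

def package_asset(mesh, physics, semantics, proposals):  # stage 5: URDF + JSON
    return {"mesh": mesh, "physics": physics,
            "semantics": semantics, "proposals": proposals}

def image_to_twin(image):
    """Run one image through all five stages and return the packaged twin."""
    mesh = reconstruct_mesh(image)
    physics = estimate_physics(mesh)
    semantics = enrich_semantics(mesh)
    proposals = propose_manipulations(mesh, physics)
    return package_asset(mesh, physics, semantics, proposals)

twin = image_to_twin("mug.png")
print(sorted(twin))  # ['mesh', 'physics', 'proposals', 'semantics']
```

Because each stage only consumes the previous stage's output, any module can be swapped (e.g., a different image‑to‑shape model) without touching the rest of the pipeline, which is the extensibility the authors highlight.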
Results & Findings
| Metric | ManiTwin‑100K result (vs. prior 3‑D collections) |
|---|---|
| Mesh quality (Hausdorff distance) | 0.018 m (lower is better) |
| Simulation stability (collision‑free steps) | 99.2 % of assets pass 10 s physics test |
| Diversity (shape & texture entropy) | 1.35× higher than ShapeNetCore |
| Policy learning speed‑up | 2.1× fewer simulation episodes to reach 80 % success on a pick‑and‑place benchmark |
| VQA data generation | 3× more unique question‑answer pairs per object due to richer functional tags |
Qualitative inspections show that objects retain fine details (e.g., handles, hinges) and that the generated manipulation proposals are physically plausible—grasp points land on stable regions, and push trajectories respect object mass.
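The mesh‑quality metric in the table is the Hausdorff distance between the reconstructed and reference surfaces. In practice it is computed over dense surface samples; the stdlib‑only toy version below operates on plain point lists to show the definition.

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite 3-D point sets."""
    def directed(src, dst):
        # For each source point, distance to its nearest destination point;
        # the directed Hausdorff distance is the worst such case.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(hausdorff(recon, truth))  # 1.0
```

A value of 0.018 m therefore means the worst‑matched sample point on a reconstructed mesh lies within about 1.8 cm of the reference surface.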
Practical Implications
- Robotics developers can instantly populate simulators (e.g., Isaac Gym, PyBullet) with realistic, ready‑to‑use objects, cutting months of manual asset creation.
- Simulation‑based RL pipelines benefit from richer training environments, leading to faster convergence and better transfer to real‑world robots.
- Synthetic data pipelines for computer vision (object detection, VQA, affordance prediction) gain a scalable source of labeled 3‑D scenes, reducing reliance on costly real‑world annotation.
- Product design & AR/VR teams can generate quick digital twins from catalog photos, enabling rapid prototyping of interaction scenarios.
- Open‑source community can extend the pipeline to niche domains (medical tools, aerospace parts) by swapping the image‑to‑shape model or the affordance taxonomy.
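Because each twin ships as a URDF plus JSON metadata, spawning one in a simulator reduces to a single loader call. The helper below builds keyword arguments for PyBullet's `pybullet.loadURDF`; the metadata key names are illustrative, not the dataset's actual schema.

```python
# Turn a twin's JSON metadata into keyword arguments for a URDF loader
# (e.g. pybullet.loadURDF). The "urdf" metadata key is illustrative.
def build_spawn_kwargs(metadata, position=(0.0, 0.0, 0.1)):
    return {
        "fileName": metadata["urdf"],
        "basePosition": list(position),
        "useFixedBase": False,
    }

meta = {"asset_id": "mug_0001", "urdf": "mug_0001.urdf"}
kwargs = build_spawn_kwargs(meta)
print(kwargs["fileName"])  # mug_0001.urdf

# With pybullet installed and connected, one would then call:
#   body_id = pybullet.loadURDF(**kwargs)
```

This is the "plug‑and‑play" workflow the paper targets: no manual mesh cleanup or inertial tuning before an asset can be simulated.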
Limitations & Future Work
- Single‑view reconstruction may miss occluded geometry; complex objects (e.g., with internal cavities) sometimes yield incomplete meshes.
- Physical property estimation relies on visual cues and can mispredict mass for visually similar but materially different items (e.g., plastic vs. metal).
- Affordance taxonomy is fixed; adding new functional categories requires retraining the classifier.
- Scalability beyond 100 K: While the pipeline is fast, storage and bandwidth for distributing massive asset bundles become bottlenecks.
Future directions include multi‑view fusion to improve shape fidelity, integrating tactile simulation for better affordance grounding, and building a cloud‑based asset‑as‑a‑service platform where developers can request custom twins on demand.
Authors
- Kaixuan Wang
- Tianxing Chen
- Jiawei Liu
- Honghao Su
- Shaolong Zhu
- Minxuan Wang
- Zixuan Li
- Yue Chen
- Huan‑ang Gao
- Yusen Qin
- Jiawei Wang
- Qixuan Zhang
- Lan Xu
- Jingyi Yu
- Yao Mu
- Ping Luo
Paper Information
- arXiv ID: 2603.16866v1
- Categories: cs.RO, cs.AI, cs.GR, cs.LG, cs.SE
- Published: March 17, 2026