[Paper] ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
Source: arXiv - 2603.16866v1
Overview
ManiTwin introduces an end‑to‑end pipeline that turns a single 2‑D image into a fully simulation‑ready 3‑D object twin, complete with physical properties, language captions, functional tags, and manipulation proposals. By scaling this process to 100 000 diverse assets, the authors provide a new “plug‑and‑play” resource that can instantly feed robotic‑manipulation simulators, scene‑generation tools, and vision‑language benchmarks.
Key Contributions
- Automated asset creation: A single‑image‑to‑twin workflow that outputs mesh, texture, collision, mass, friction, and semantic annotations without manual modeling.
- ManiTwin‑100K dataset: 100 K high‑fidelity, manipulation‑ready digital twins covering everyday objects, industrial parts, and abstract shapes.
- Rich multimodal metadata: Each twin ships with natural‑language descriptions, functional labels (e.g., “graspable”, “pourable”), and a set of verified manipulation proposals (grasp poses, push trajectories).
- Open‑source pipeline & web portal: The codebase, data, and a demo UI are publicly released, enabling researchers and engineers to extend or customize the asset generation process.
- Demonstrated utility: Benchmarks show ManiTwin‑100K improves data diversity for simulation‑based policy training, random scene synthesis, and visual‑question‑answering (VQA) generation compared to prior 3‑D object collections.
Methodology
- Image Ingestion & Shape Reconstruction – A pretrained depth‑estimation network predicts a coarse point cloud from a single RGB image. The point cloud is refined with a differentiable marching‑cubes module to produce a watertight mesh.
- Physical Property Estimation – A lightweight regression model predicts mass, center‑of‑mass, and friction coefficients from visual cues (material texture, shape). These values are validated against a physics engine (PyBullet) to ensure stable simulation.
- Semantic Enrichment – A language model (GPT‑3.5‑style) generates concise object descriptions and functional tags. A separate classifier maps visual features to a taxonomy of manipulation affordances (graspable, hinge, pourable, etc.).
- Manipulation Proposal Generation – Using a grasp synthesis library (e.g., Dex‑Net) and a motion‑planning module, the pipeline samples feasible grasp poses and push trajectories, then runs a short physics rollout to verify success. Verified proposals are stored alongside the asset.
- Dataset Assembly – Assets are automatically packaged into a unified format (URDF + JSON metadata) and uploaded to a cloud bucket. A validation script checks mesh integrity, annotation completeness, and simulation stability across a random subset.
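The paper specifies the unified format only as "URDF + JSON metadata"; the exact JSON schema is not given, so the sketch below uses illustrative field names to show what one packaged twin might carry.

```python
import json

# Illustrative metadata record for one digital twin. Field names are
# hypothetical -- the paper states URDF + JSON but not a concrete schema.
def make_twin_metadata(asset_id, caption, tags, mass_kg, friction, proposals):
    """Assemble the JSON sidecar that accompanies an asset's URDF file."""
    return {
        "asset_id": asset_id,
        "urdf": f"{asset_id}.urdf",           # mesh, collision, inertial data
        "caption": caption,                   # natural-language description
        "functional_tags": tags,              # e.g. ["graspable", "pourable"]
        "physics": {"mass_kg": mass_kg, "friction": friction},
        "manipulation_proposals": proposals,  # verified grasps / pushes
    }

record = make_twin_metadata(
    "mug_0001",
    "A white ceramic mug with a curved handle.",
    ["graspable", "pourable"],
    mass_kg=0.31,
    friction=0.6,
    proposals=[{"type": "grasp", "pose": [0.0, 0.02, 0.08, 0, 0, 0, 1]}],
)
print(json.dumps(record, indent=2))
```

Keeping physics and affordance annotations in a sidecar like this (rather than only inside the URDF) is what makes the assets easy to index and filter at dataset scale.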
The entire pipeline runs on a single GPU workstation and can produce a new twin in ~30 seconds, making it practical for on‑demand dataset expansion.
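The five stages above compose into a linear pipeline. The sketch below is schematic: every function is a stub standing in for the paper's actual modules (the depth network, the property regressor, the language model, Dex‑Net‑style grasp synthesis, and the physics validator).

```python
# Schematic of the five-stage pipeline; each stage is a placeholder stub.
def reconstruct_mesh(image):          # stage 1: image -> watertight mesh
    return {"vertices": [], "faces": [], "source_image": image}

def estimate_physics(mesh):           # stage 2: visual cues -> mass/friction
    return {"mass_kg": 1.0, "friction": 0.5}

def enrich_semantics(mesh):           # stage 3: caption + affordance tags
    return {"caption": "an object", "tags": ["graspable"]}

def propose_manipulations(mesh, physics):  # stage 4: verified grasps/pushes
    return [{"type": "grasp", "verified": True}]

def package_asset(mesh, physics, semantics, proposals):  # stage 5: URDF + JSON
    return {"mesh": mesh, "physics": physics,
            "semantics": semantics, "proposals": proposals}

def image_to_twin(image):
    """Run one image through all five stages and return the packaged twin."""
    mesh = reconstruct_mesh(image)
    physics = estimate_physics(mesh)
    semantics = enrich_semantics(mesh)
    proposals = propose_manipulations(mesh, physics)
    return package_asset(mesh, physics, semantics, proposals)

twin = image_to_twin("mug.png")
print(sorted(twin))  # ['mesh', 'physics', 'proposals', 'semantics']
```

Because each stage only consumes the previous stage's output, any module can be swapped (e.g., a different image‑to‑shape model) without touching the rest of the pipeline, which is the extensibility the authors highlight.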
Results & Findings
| Metric | ManiTwin‑100K result (vs. prior 3‑D collections) |
|---|---|
| Mesh quality (Hausdorff distance) | 0.018 m (lower is better) |
| Simulation stability (collision‑free steps) | 99.2 % of assets pass 10 s physics test |
| Diversity (shape & texture entropy) | 1.35× higher than ShapeNetCore |
| Policy learning speed‑up | 2.1× fewer simulation episodes to reach 80 % success on a pick‑and‑place benchmark |
| VQA data generation | 3× more unique question‑answer pairs per object due to richer functional tags |
Qualitative inspections show that objects retain fine details (e.g., handles, hinges) and that the generated manipulation proposals are physically plausible—grasp points land on stable regions, and push trajectories respect object mass.
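The mesh‑quality metric in the table is the Hausdorff distance between the reconstructed and reference surfaces. In practice it is computed over dense surface samples; the stdlib‑only toy version below operates on plain point lists to show the definition.

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite 3-D point sets."""
    def directed(src, dst):
        # For each source point, distance to its nearest destination point;
        # the directed Hausdorff distance is the worst such case.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(hausdorff(recon, truth))  # 1.0
```

A value of 0.018 m therefore means the worst‑matched sample point on a reconstructed mesh lies within about 1.8 cm of the reference surface.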
Practical Implications
- Robotics developers can instantly populate simulators (e.g., Isaac Gym, PyBullet) with realistic, ready‑to‑use objects, cutting months of manual asset creation.
- Simulation‑based RL pipelines benefit from richer training environments, leading to faster convergence and better transfer to real‑world robots.
- Synthetic data pipelines for computer vision (object detection, VQA, affordance prediction) gain a scalable source of labeled 3‑D scenes, reducing reliance on costly real‑world annotation.
- Product design & AR/VR teams can generate quick digital twins from catalog photos, enabling rapid prototyping of interaction scenarios.
- Open‑source community can extend the pipeline to niche domains (medical tools, aerospace parts) by swapping the image‑to‑shape model or the affordance taxonomy.
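Because each twin ships as a URDF plus JSON metadata, spawning one in a simulator reduces to a single loader call. The helper below builds keyword arguments for PyBullet's `pybullet.loadURDF`; the metadata key names are illustrative, not the dataset's actual schema.

```python
# Turn a twin's JSON metadata into keyword arguments for a URDF loader
# (e.g. pybullet.loadURDF). The "urdf" metadata key is illustrative.
def build_spawn_kwargs(metadata, position=(0.0, 0.0, 0.1)):
    return {
        "fileName": metadata["urdf"],
        "basePosition": list(position),
        "useFixedBase": False,
    }

meta = {"asset_id": "mug_0001", "urdf": "mug_0001.urdf"}
kwargs = build_spawn_kwargs(meta)
print(kwargs["fileName"])  # mug_0001.urdf

# With pybullet installed and connected, one would then call:
#   body_id = pybullet.loadURDF(**kwargs)
```

This is the "plug‑and‑play" workflow the paper targets: no manual mesh cleanup or inertial tuning before an asset can be simulated.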
Limitations & Future Work
- Single‑view reconstruction may miss occluded geometry; complex objects (e.g., with internal cavities) sometimes yield incomplete meshes.
- Physical property estimation relies on visual cues and can mispredict mass for visually similar but materially different items (e.g., plastic vs. metal).
- Affordance taxonomy is fixed; adding new functional categories requires retraining the classifier.
- Scalability beyond 100 K: While the pipeline is fast, storage and bandwidth for distributing massive asset bundles become bottlenecks.
Future directions include multi‑view fusion to improve shape fidelity, integrating tactile simulation for better affordance grounding, and building a cloud‑based asset‑as‑a‑service platform where developers can request custom twins on demand.
Authors
- Kaixuan Wang
- Tianxing Chen
- Jiawei Liu
- Honghao Su
- Shaolong Zhu
- Minxuan Wang
- Zixuan Li
- Yue Chen
- Huan‑ang Gao
- Yusen Qin
- Jiawei Wang
- Qixuan Zhang
- Lan Xu
- Jingyi Yu
- Yao Mu
- Ping Luo
Paper Information
- arXiv ID: 2603.16866v1
- Categories: cs.RO, cs.AI, cs.GR, cs.LG, cs.SE
- Published: March 17, 2026