[Paper] Image2Garment: Simulation-ready Garment Generation from a Single Image

Published: January 14, 2026 at 12:47 PM EST
4 min read
Source: arXiv - 2601.09658v1

Overview

The paper presents Image2Garment, a feed‑forward system that can turn a single photograph of clothing into a simulation‑ready 3‑D garment—complete with geometry, material composition, and physical fabric parameters. By leveraging a fine‑tuned vision‑language model and a tiny physics‑measurement dataset, the authors bypass the costly multi‑view capture and iterative optimization pipelines that have dominated the field.

Key Contributions

  • Single‑image, simulation‑ready pipeline: Generates full garment meshes and the underlying material physics from just one RGB image.
  • Vision‑language fine‑tuning for fabric semantics: Adapts a large pre‑trained model (e.g., CLIP) to predict fabric attributes (e.g., weave, stretch, thickness) directly from real‑world photos.
  • Two new datasets:
    • FTAG – a curated collection of fashion images annotated with material composition and high‑level fabric attributes.
    • T2P – a compact set of measured fabric specimens linking those attributes to concrete physics parameters (e.g., Young’s modulus, damping).
  • Lightweight physics‑parameter predictor: A small neural network that maps the predicted attributes to the numerical values required by standard cloth simulators.
  • State‑of‑the‑art accuracy: Demonstrates superior material composition estimation and higher‑fidelity simulated drape compared with prior image‑to‑garment methods.

Methodology

  1. Data Collection

    • FTAG: ~10k fashion images scraped from online catalogs, manually labeled with material tags (cotton, polyester, silk, etc.) and descriptive attributes (knit vs. woven, stretch level, thickness).
    • T2P: 200 physical fabric swatches measured in a lab to obtain elastic modulus, shear modulus, density, and damping coefficients.
  2. Vision‑Language Model Fine‑Tuning

    • Start from a pre‑trained CLIP‑style dual encoder, i.e., paired image and text encoders.
    • Train on FTAG using a contrastive loss that aligns image embeddings with textual attribute descriptors, enabling the model to output a probability distribution over material classes and a vector of continuous fabric attributes (a sketch of this alignment follows the list).
  3. Physics Parameter Estimation

    • Feed the attribute vector into a shallow MLP (3–4 layers, < 500k parameters).
    • Supervise with the T2P measurements, learning a mapping from high‑level attributes to low‑level physics constants required by a typical Position‑Based Dynamics (PBD) or Finite Element Method (FEM) cloth simulator (see the MLP sketch after this list).
  4. Garment Geometry Recovery

    • Use an existing single‑image 3‑D reconstruction network (e.g., SMPL‑based body estimator + silhouette‑driven mesh refinement) to obtain the garment’s shape.
    • The recovered mesh is then augmented with the predicted physics parameters, yielding a fully simulation‑ready asset.
  5. End‑to‑End Inference

    • At test time, a single forward pass through the vision‑language model and the MLP produces both the material description and the physics constants, eliminating any iterative optimization (the final sketch below traces this flow).
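
To make step 2 concrete, here is a minimal PyTorch sketch of the contrastive alignment described above. The paper does not release code; the `FabricHeads` class, its two-head structure, and all dimensions are illustrative assumptions, with only the CLIP-style symmetric contrastive objective taken from the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FabricHeads(nn.Module):
    """Hypothetical prediction heads on top of a CLIP-style image embedding:
    one distribution over material classes, one vector of continuous attributes."""
    def __init__(self, embed_dim: int, n_materials: int, n_attrs: int):
        super().__init__()
        self.material_head = nn.Linear(embed_dim, n_materials)  # cotton, polyester, silk, ...
        self.attr_head = nn.Linear(embed_dim, n_attrs)          # stretch, thickness, weave, ...

    def forward(self, img_emb: torch.Tensor):
        material_logits = self.material_head(img_emb)   # softmax later -> class probabilities
        attrs = torch.sigmoid(self.attr_head(img_emb))  # continuous attributes in [0, 1]
        return material_logits, attrs

def clip_style_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE objective: each image should match the text embedding
    of its own attribute description and no other description in the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```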
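Step 3's attribute‑to‑physics mapping is equally compact. The sketch below assumes an 8‑dimensional attribute vector and four output constants matching the T2P measurements (elastic modulus, shear modulus, density, damping); the layer widths are hypothetical, chosen only to stay well under the stated 500k‑parameter budget.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhysicsParamMLP(nn.Module):
    """Shallow attribute-to-physics mapping (step 3). The paper states only
    3-4 layers and < 500k parameters; these widths (~70k params) are illustrative."""
    def __init__(self, n_attrs: int = 8, n_params: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_attrs, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_params),  # elastic modulus, shear modulus, density, damping
        )

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        # Softplus keeps every predicted physical constant strictly positive.
        return F.softplus(self.net(attrs))

# Supervision against the T2P lab measurements would look like:
#   loss = F.mse_loss(mlp(attribute_vector), measured_physics_params)
```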
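Finally, step 5's single forward pass can be pictured as a plain composition of the three components. Everything here, from the `SimReadyGarment` fields to the function names, is a hypothetical illustration of the feed‑forward flow, not the paper's actual API.

```python
from dataclasses import dataclass
import torch

@dataclass
class SimReadyGarment:
    """A simulation-ready asset: recovered mesh plus predicted fabric physics.
    Field names are illustrative, not the paper's schema."""
    vertices: torch.Tensor   # (V, 3) garment mesh positions
    faces: torch.Tensor      # (F, 3) triangle indices
    elastic_modulus: float   # Pa
    shear_modulus: float     # Pa
    density: float           # areal density for the cloth solver
    damping: float           # damping coefficient

@torch.no_grad()
def infer_garment(image, vl_model, physics_mlp, geometry_net) -> SimReadyGarment:
    """One feed-forward pass, no iterative optimization (step 5). vl_model wraps
    the fine-tuned encoder plus heads; all three arguments are stand-ins."""
    _material_logits, attrs = vl_model(image)   # step 2: fabric semantics
    params = physics_mlp(attrs)                 # step 3: physics constants
    vertices, faces = geometry_net(image)       # step 4: mesh recovery
    elastic, shear, density, damping = params.squeeze(0).tolist()
    return SimReadyGarment(vertices, faces, elastic, shear, density, damping)
```

The resulting object could then be handed to any PBD or FEM cloth solver that accepts per‑garment material constants, which is what makes the asset "simulation‑ready."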

Results & Findings

Metric | Image2Garment | Prior Single‑View Methods
--- | --- | ---
Material composition accuracy (top‑1) | 92.4 % | 78.1 %
Fabric attribute MAE (e.g., stretch, thickness) | 0.07 | 0.15
Simulation drape error (RMSE vs. real‑world scan) | 1.8 mm | 3.4 mm
Inference time (per garment) | ≈120 ms (GPU) | 2–5 s (iterative)
  • The fine‑tuned vision‑language model outperforms a vanilla ResNet classifier by a large margin on material detection.
  • When the predicted physics parameters are fed into a standard cloth simulator (e.g., NVIDIA Flex), the resulting drape matches real‑world reference scans noticeably better than baselines that only predict geometry.
  • Ablation studies confirm that the two‑stage attribute‑to‑physics mapping is more data‑efficient than trying to learn physics parameters directly from images.

Practical Implications

  • E‑commerce & Virtual Try‑On: Retailers can automatically generate physically accurate 3‑D garments for AR/VR fitting rooms without costly multi‑camera rigs.
  • Game & Film Production: Artists can import a single concept sketch or photo and instantly obtain a cloth asset that behaves realistically under animation, cutting down on manual rigging and tweaking.
  • Digital Twin for Apparel Manufacturing: Designers can simulate how a new fabric will drape on a body before committing to physical prototypes, accelerating material selection and reducing waste.
  • Open‑Source Tooling: Because the pipeline is feed‑forward and relies on lightweight models, it can be packaged as a plug‑in for popular engines (Unity, Unreal) or integrated into pipelines like Blender.

Limitations & Future Work

  • Dataset Scope: FTAG covers common consumer fabrics but lacks exotic or highly engineered textiles (e.g., smart fabrics, composites). Extending the attribute taxonomy would broaden applicability.
  • Body Pose Dependency: The geometry recovery step assumes a reasonably upright pose; extreme occlusions or non‑standard body shapes can degrade mesh quality.
  • Physics Model Simplicity: The current mapping targets standard linear elastic parameters; viscoelastic or anisotropic behaviors are not captured. Future work could incorporate richer constitutive models and learn them from dynamic video data.
  • Real‑World Validation: While drape error is measured against lab scans, user studies on perceived realism in interactive settings are still pending.

Image2Garment demonstrates that a clever combination of vision‑language semantics and a tiny physics dataset can bring high‑fidelity cloth simulation within reach of any developer who only has a single product photo. The approach opens the door to scalable, physics‑aware virtual clothing pipelines across retail, entertainment, and design.

Authors

  • Selim Emir Can
  • Jan Ackermann
  • Kiyohiro Nakayama
  • Ruofan Liu
  • Tong Wu
  • Yang Zheng
  • Hugo Bertiche
  • Menglei Chai
  • Thabo Beeler
  • Gordon Wetzstein

Paper Information

  • arXiv ID: 2601.09658v1
  • Categories: cs.CV
  • Published: January 14, 2026