[Paper] Image2Garment: Simulation-ready Garment Generation from a Single Image
Source: arXiv - 2601.09658v1
Overview
The paper presents Image2Garment, a feed‑forward system that can turn a single photograph of clothing into a simulation‑ready 3‑D garment—complete with geometry, material composition, and physical fabric parameters. By leveraging a fine‑tuned vision‑language model and a tiny physics‑measurement dataset, the authors bypass the costly multi‑view capture and iterative optimization pipelines that have dominated the field.
Key Contributions
- Single‑image, simulation‑ready pipeline: Generates full garment meshes and the underlying material physics from just one RGB image.
- Vision‑language fine‑tuning for fabric semantics: Adapts a large pre‑trained model (e.g., CLIP) to predict fabric attributes (e.g., weave, stretch, thickness) directly from real‑world photos.
- Two new datasets:
  - FTAG – a curated collection of fashion images annotated with material composition and high‑level fabric attributes.
  - T2P – a compact set of measured fabric specimens linking those attributes to concrete physics parameters (e.g., Young’s modulus, damping).
- Lightweight physics‑parameter predictor: A small neural network that maps the predicted attributes to the numerical values required by standard cloth simulators.
- State‑of‑the‑art accuracy: Demonstrates superior material composition estimation and higher‑fidelity simulated drape compared with prior image‑to‑garment methods.
Methodology
- Data Collection
  - FTAG: ~10k fashion images scraped from online catalogs, manually labeled with material tags (cotton, polyester, silk, etc.) and descriptive attributes (knit vs. woven, stretch level, thickness).
  - T2P: 200 physical fabric swatches measured in a lab to obtain elastic modulus, shear modulus, density, and damping coefficients.
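To make the T2P data concrete, here is a minimal sketch of how one measured swatch could pair high‑level attributes with lab‑measured physics constants. The field names and example values are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class FabricSwatch:
    """One T2P-style specimen linking fabric attributes to measured physics.
    Field names are illustrative assumptions, not the paper's schema."""
    material: str             # e.g., "cotton", "polyester", "silk"
    weave: str                # "knit" or "woven"
    stretch: float            # normalized stretch-level label in [0, 1]
    thickness_mm: float       # measured sheet thickness
    youngs_modulus_pa: float  # elastic modulus from lab measurement
    shear_modulus_pa: float
    density_kg_m2: float      # area density of the fabric sheet
    damping: float

# Placeholder example entry (values are not measurements from the paper).
swatch = FabricSwatch("cotton", "woven", stretch=0.2, thickness_mm=0.35,
                      youngs_modulus_pa=8e5, shear_modulus_pa=3e5,
                      density_kg_m2=0.15, damping=0.02)
```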
- Vision‑Language Model Fine‑Tuning
  - Start from a pre‑trained CLIP‑like image–text encoder pair.
  - Train on FTAG using a contrastive loss that aligns image embeddings with textual attribute descriptors, enabling the model to output a probability distribution over material classes and a vector of continuous fabric attributes.
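The fine‑tuning objective can be pictured as a standard CLIP‑style symmetric contrastive loss between image embeddings and embeddings of the attribute descriptors. The PyTorch sketch below assumes this formulation; the paper's exact loss, temperature, and prediction heads may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE aligning image embeddings with embeddings
    of textual attribute descriptors (a sketch of the FTAG fine-tuning
    objective; the exact loss and temperature here are assumptions)."""
    img = F.normalize(img_emb, dim=-1)            # (B, D) unit-norm image embeddings
    txt = F.normalize(txt_emb, dim=-1)            # (B, D) unit-norm text embeddings
    logits = img @ txt.t() / temperature          # (B, B) pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Two lightweight heads on the aligned image embedding (a softmax classifier for material classes and a regressor for the continuous attributes) would then produce the outputs described above.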
- Physics Parameter Estimation
  - Feed the attribute vector into a shallow MLP (3–4 layers, < 500k parameters).
  - Supervise with the T2P measurements, learning a mapping from high‑level attributes to low‑level physics constants required by a typical Position‑Based Dynamics (PBD) or Finite Element Method (FEM) cloth simulator.
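The predictor itself can be as simple as the following PyTorch module; the hidden width and the specific output constants are assumptions chosen to stay within the stated 3–4 layer, < 500k‑parameter budget.

```python
import torch.nn as nn

class AttributeToPhysicsMLP(nn.Module):
    """Shallow MLP from the continuous fabric-attribute vector to the physics
    constants a PBD/FEM cloth solver expects (sketch; sizes are assumptions)."""
    def __init__(self, n_attributes=8, n_physics=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_attributes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_physics),  # e.g., Young's modulus, shear modulus, density, damping
        )

    def forward(self, attributes):
        return self.net(attributes)
```

Training then reduces to a plain regression loss against the 200 T2P measurements, e.g. `F.mse_loss(model(attributes), measured_constants)`.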
- Garment Geometry Recovery
  - Use an existing single‑image 3‑D reconstruction network (e.g., an SMPL‑based body estimator combined with silhouette‑driven mesh refinement) to obtain the garment’s shape.
  - The recovered mesh is then augmented with the predicted physics parameters, yielding a fully simulation‑ready asset.
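The end product can be thought of as geometry plus physics in one bundle. The container below is a hypothetical illustration of that packaging, not the paper's asset format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SimulationReadyGarment:
    """Recovered garment mesh plus predicted physics parameters, ready to hand
    to a PBD/FEM cloth solver (hypothetical container, not the paper's format)."""
    vertices: np.ndarray        # (V, 3) mesh vertex positions
    faces: np.ndarray           # (F, 3) triangle indices
    youngs_modulus_pa: float
    shear_modulus_pa: float
    density_kg_m2: float
    damping: float

def attach_physics(vertices, faces, physics_vec):
    """Combine the reconstructed geometry with the MLP's physics prediction."""
    E, G, rho, d = (float(x) for x in physics_vec)
    return SimulationReadyGarment(vertices, faces, E, G, rho, d)
```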
- End‑to‑End Inference
  - At test time, a single forward pass through the vision‑language model and the MLP produces both the material description and the physics constants, eliminating any iterative optimization.
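Putting the pieces together, inference could look like the sketch below. Every model handle (vlm_encoder, attribute_head, physics_mlp, geometry_net) is a hypothetical stand‑in for the corresponding component described above, and attach_physics is the helper from the previous sketch.

```python
import torch

@torch.no_grad()
def image_to_garment(image, vlm_encoder, attribute_head, physics_mlp, geometry_net):
    """Single feed-forward pass from one RGB image to a simulation-ready asset.
    All callables are hypothetical stand-ins for the paper's components."""
    emb = vlm_encoder(image)                           # fine-tuned VLM image embedding
    material_probs, attributes = attribute_head(emb)   # material distribution + attribute vector
    physics = physics_mlp(attributes)                  # e.g., moduli, density, damping
    vertices, faces = geometry_net(image)              # single-image garment mesh recovery
    garment = attach_physics(vertices, faces, physics.squeeze(0))
    return material_probs, garment                     # material estimate + simulation-ready asset
```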
Results & Findings
| Metric | Image2Garment | Prior Single‑View Methods |
|---|---|---|
| Material composition accuracy (top‑1) | 92.4 % | 78.1 % |
| Fabric attribute MAE (e.g., stretch, thickness) | 0.07 | 0.15 |
| Simulation drape error (RMSE vs. real‑world scan) | 1.8 mm | 3.4 mm |
| Inference time (per garment) | ≈120 ms (GPU) | 2–5 s (iterative) |
- The fine‑tuned vision‑language model outperforms a vanilla ResNet classifier by a large margin on material detection.
- When the predicted physics parameters are fed into a standard cloth simulator (e.g., NVIDIA Flex), the resulting drape matches real‑world reference scans noticeably better than baselines that only predict geometry.
- Ablation studies confirm that the two‑stage attribute‑to‑physics mapping is more data‑efficient than trying to learn physics parameters directly from images.
Practical Implications
- E‑commerce & Virtual Try‑On: Retailers can automatically generate physically accurate 3‑D garments for AR/VR fitting rooms without costly multi‑camera rigs.
- Game & Film Production: Artists can import a single concept sketch or photo and instantly obtain a cloth asset that behaves realistically under animation, cutting down on manual rigging and tweaking.
- Digital Twin for Apparel Manufacturing: Designers can simulate how a new fabric will drape on a body before committing to physical prototypes, accelerating material selection and reducing waste.
- Open‑Source Tooling: Because the pipeline is feed‑forward and relies on lightweight models, it can be packaged as a plug‑in for popular engines (Unity, Unreal) or integrated into content‑creation tools such as Blender.
Limitations & Future Work
- Dataset Scope: FTAG covers common consumer fabrics but lacks exotic or highly engineered textiles (e.g., smart fabrics, composites). Extending the attribute taxonomy would broaden applicability.
- Body Pose Dependency: The geometry recovery step assumes a reasonably upright pose; extreme occlusions or non‑standard body shapes can degrade mesh quality.
- Physics Model Simplicity: The current mapping targets standard linear elastic parameters; viscoelastic or anisotropic behaviors are not captured. Future work could incorporate richer constitutive models and learn them from dynamic video data.
- Real‑World Validation: While drape error is measured against lab scans, user studies on perceived realism in interactive settings are still pending.
Image2Garment demonstrates that a clever combination of vision‑language semantics and a tiny physics dataset can bring high‑fidelity cloth simulation within reach of any developer who only has a single product photo. The approach opens the door to scalable, physics‑aware virtual clothing pipelines across retail, entertainment, and design.
Authors
- Selim Emir Can
- Jan Ackermann
- Kiyohiro Nakayama
- Ruofan Liu
- Tong Wu
- Yang Zheng
- Hugo Bertiche
- Menglei Chai
- Thabo Beeler
- Gordon Wetzstein
Paper Information
- arXiv ID: 2601.09658v1
- Categories: cs.CV
- Published: January 14, 2026