[Paper] Image2Garment: Simulation-ready Garment Generation from a Single Image

Published: January 14, 2026 at 12:47 PM EST
4 min read
Source: arXiv - 2601.09658v1

Overview

The paper presents Image2Garment, a feed‑forward system that can turn a single photograph of clothing into a simulation‑ready 3‑D garment—complete with geometry, material composition, and physical fabric parameters. By leveraging a fine‑tuned vision‑language model and a tiny physics‑measurement dataset, the authors bypass the costly multi‑view capture and iterative optimization pipelines that have dominated the field.

Key Contributions

  • Single‑image, simulation‑ready pipeline: Generates full garment meshes and the underlying material physics from just one RGB image.
  • Vision‑language fine‑tuning for fabric semantics: Adapts a large pre‑trained model (e.g., CLIP) to predict fabric attributes (e.g., weave, stretch, thickness) directly from real‑world photos.
  • Two new datasets:
    • FTAG – a curated collection of fashion images annotated with material composition and high‑level fabric attributes.
    • T2P – a compact set of measured fabric specimens linking those attributes to concrete physics parameters (e.g., Young’s modulus, damping).
  • Lightweight physics‑parameter predictor: A small neural network that maps the predicted attributes to the numerical values required by standard cloth simulators.
  • State‑of‑the‑art accuracy: Demonstrates superior material composition estimation and higher‑fidelity simulated drape compared with prior image‑to‑garment methods.

Methodology

  1. Data Collection

    • FTAG: ~10k fashion images scraped from online catalogs, manually labeled with material tags (cotton, polyester, silk, etc.) and descriptive attributes (knit vs. woven, stretch level, thickness).
    • T2P: 200 physical fabric swatches measured in a lab to obtain elastic modulus, shear modulus, density, and damping coefficients.
  2. Vision‑Language Model Fine‑Tuning

    • Start from a pre‑trained CLIP‑style dual encoder, i.e., paired image and text encoders.
    • Train on FTAG using a contrastive loss that aligns image embeddings with textual attribute descriptors, enabling the model to output a probability distribution over material classes and a vector of continuous fabric attributes (a sketch of this alignment follows the list).
  3. Physics Parameter Estimation

    • Feed the attribute vector into a shallow MLP (3–4 layers, < 500k parameters).
    • Supervise with the T2P measurements, learning a mapping from high‑level attributes to low‑level physics constants required by a typical Position‑Based Dynamics (PBD) or Finite Element Method (FEM) cloth simulator (see the MLP sketch after this list).
  4. Garment Geometry Recovery

    • Use an existing single‑image 3‑D reconstruction network (e.g., SMPL‑based body estimator + silhouette‑driven mesh refinement) to obtain the garment’s shape.
    • The recovered mesh is then augmented with the predicted physics parameters, yielding a fully simulation‑ready asset.
  5. End‑to‑End Inference

    • At test time, a single forward pass through the vision‑language model and the MLP produces both the material description and the physics constants, eliminating any iterative optimization (the final sketch below traces this flow).
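
To make step 2 concrete, here is a minimal PyTorch sketch of the contrastive alignment described above. The paper does not release code; the `FabricHeads` class, its two-head structure, and all dimensions are illustrative assumptions, with only the CLIP-style symmetric contrastive objective taken from the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FabricHeads(nn.Module):
    """Hypothetical prediction heads on top of a CLIP-style image embedding:
    one distribution over material classes, one vector of continuous attributes."""
    def __init__(self, embed_dim: int, n_materials: int, n_attrs: int):
        super().__init__()
        self.material_head = nn.Linear(embed_dim, n_materials)  # cotton, polyester, silk, ...
        self.attr_head = nn.Linear(embed_dim, n_attrs)          # stretch, thickness, weave, ...

    def forward(self, img_emb: torch.Tensor):
        material_logits = self.material_head(img_emb)   # softmax later -> class probabilities
        attrs = torch.sigmoid(self.attr_head(img_emb))  # continuous attributes in [0, 1]
        return material_logits, attrs

def clip_style_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE objective: each image should match the text embedding
    of its own attribute description and no other description in the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```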
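Step 3's attribute‑to‑physics mapping is equally compact. The sketch below assumes an 8‑dimensional attribute vector and four output constants matching the T2P measurements (elastic modulus, shear modulus, density, damping); the layer widths are hypothetical, chosen only to stay well under the stated 500k‑parameter budget.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhysicsParamMLP(nn.Module):
    """Shallow attribute-to-physics mapping (step 3). The paper states only
    3-4 layers and < 500k parameters; these widths (~70k params) are illustrative."""
    def __init__(self, n_attrs: int = 8, n_params: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_attrs, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_params),  # elastic modulus, shear modulus, density, damping
        )

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        # Softplus keeps every predicted physical constant strictly positive.
        return F.softplus(self.net(attrs))

# Supervision against the T2P lab measurements would look like:
#   loss = F.mse_loss(mlp(attribute_vector), measured_physics_params)
```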
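Finally, step 5's single forward pass can be pictured as a plain composition of the three components. Everything here, from the `SimReadyGarment` fields to the function names, is a hypothetical illustration of the feed‑forward flow, not the paper's actual API.

```python
from dataclasses import dataclass
import torch

@dataclass
class SimReadyGarment:
    """A simulation-ready asset: recovered mesh plus predicted fabric physics.
    Field names are illustrative, not the paper's schema."""
    vertices: torch.Tensor   # (V, 3) garment mesh positions
    faces: torch.Tensor      # (F, 3) triangle indices
    elastic_modulus: float   # Pa
    shear_modulus: float     # Pa
    density: float           # areal density for the cloth solver
    damping: float           # damping coefficient

@torch.no_grad()
def infer_garment(image, vl_model, physics_mlp, geometry_net) -> SimReadyGarment:
    """One feed-forward pass, no iterative optimization (step 5). vl_model wraps
    the fine-tuned encoder plus heads; all three arguments are stand-ins."""
    _material_logits, attrs = vl_model(image)   # step 2: fabric semantics
    params = physics_mlp(attrs)                 # step 3: physics constants
    vertices, faces = geometry_net(image)       # step 4: mesh recovery
    elastic, shear, density, damping = params.squeeze(0).tolist()
    return SimReadyGarment(vertices, faces, elastic, shear, density, damping)
```

The resulting object could then be handed to any PBD or FEM cloth solver that accepts per‑garment material constants, which is what makes the asset "simulation‑ready."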

Results & Findings

Metric | Image2Garment | Prior Single‑View Methods
--- | --- | ---
Material composition accuracy (top‑1) | 92.4 % | 78.1 %
Fabric attribute MAE (e.g., stretch, thickness) | 0.07 | 0.15
Simulation drape error (RMSE vs. real‑world scan) | 1.8 mm | 3.4 mm
Inference time (per garment) | ≈120 ms (GPU) | 2–5 s (iterative)
  • The fine‑tuned vision‑language model outperforms a vanilla ResNet classifier by a large margin on material detection.
  • When the predicted physics parameters are fed into a standard cloth simulator (e.g., NVIDIA Flex), the resulting drape matches real‑world reference scans noticeably better than baselines that only predict geometry.
  • Ablation studies confirm that the two‑stage attribute‑to‑physics mapping is more data‑efficient than trying to learn physics parameters directly from images.

Practical Implications

  • E‑commerce & Virtual Try‑On: Retailers can automatically generate physically accurate 3‑D garments for AR/VR fitting rooms without costly multi‑camera rigs.
  • Game & Film Production: Artists can import a single concept sketch or photo and instantly obtain a cloth asset that behaves realistically under animation, cutting down on manual rigging and tweaking.
  • Digital Twin for Apparel Manufacturing: Designers can simulate how a new fabric will drape on a body before committing to physical prototypes, accelerating material selection and reducing waste.
  • Open‑Source Tooling: Because the pipeline is feed‑forward and relies on lightweight models, it can be packaged as a plug‑in for popular engines (Unity, Unreal) or integrated into pipelines like Blender.

Limitations & Future Work

  • Dataset Scope: FTAG covers common consumer fabrics but lacks exotic or highly engineered textiles (e.g., smart fabrics, composites). Extending the attribute taxonomy would broaden applicability.
  • Body Pose Dependency: The geometry recovery step assumes a reasonably upright pose; extreme occlusions or non‑standard body shapes can degrade mesh quality.
  • Physics Model Simplicity: The current mapping targets standard linear elastic parameters; viscoelastic or anisotropic behaviors are not captured. Future work could incorporate richer constitutive models and learn them from dynamic video data.
  • Real‑World Validation: While drape error is measured against lab scans, user studies on perceived realism in interactive settings are still pending.

Image2Garment demonstrates that a clever combination of vision‑language semantics and a tiny physics dataset can bring high‑fidelity cloth simulation within reach of any developer who only has a single product photo. The approach opens the door to scalable, physics‑aware virtual clothing pipelines across retail, entertainment, and design.

Authors

  • Selim Emir Can
  • Jan Ackermann
  • Kiyohiro Nakayama
  • Ruofan Liu
  • Tong Wu
  • Yang Zheng
  • Hugo Bertiche
  • Menglei Chai
  • Thabo Beeler
  • Gordon Wetzstein

Paper Information

  • arXiv ID: 2601.09658v1
  • Categories: cs.CV
  • Published: January 14, 2026