[Paper] Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Source: arXiv - 2604.19748v1
Overview
The paper introduces Tstars‑Tryon 1.0, a commercial‑grade virtual try‑on system that can realistically dress a person in a wide variety of fashion items—from shirts and dresses to accessories—even under challenging real‑world conditions such as extreme poses, harsh lighting, or motion blur. By combining a carefully engineered model pipeline with a massive data engine, the authors achieve near‑real‑time inference, and the system already serves millions of users on the Taobao app.
Key Contributions
- Robustness across “in‑the‑wild” scenarios – high success rate on extreme poses, low‑light, motion‑blurred, and occluded inputs.
- Photorealistic output – fine‑grained texture, material, and structural fidelity while suppressing typical AI artifacts (e.g., blurry seams, ghosting).
- Multi‑category, multi‑image composition – supports up to 6 reference images and 8 fashion categories, with controllable person identity and background.
- Speed‑optimized inference – engineered to run in near real‑time (≈30 fps on a single GPU), suitable for large‑scale consumer apps.
- End‑to‑end system design – unified architecture, scalable data pipeline, and a multi‑stage training regime that together enable commercial deployment.
- Public benchmark & dataset – the authors release a comprehensive benchmark to spur further research in realistic virtual try‑on.
Methodology
Data Engine & Pre‑processing
- Collected >10 M garment‑person pairs from e‑commerce platforms.
- Automated cleaning, pose normalization, and illumination balancing to ensure diverse yet high‑quality training samples.
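The cleaning and pose‑normalization steps can be illustrated with a minimal sketch. The `quality` field, the threshold value, and the 2‑D keypoint format below are assumptions for illustration only; the paper does not specify its filtering criteria.

```python
import numpy as np

def normalize_pose(keypoints):
    """Center 2-D pose keypoints (N x 2) and scale them to a canonical frame
    so the farthest joint lies at unit distance from the body center."""
    kp = np.asarray(keypoints, dtype=float)
    center = kp.mean(axis=0)
    scale = np.linalg.norm(kp - center, axis=1).max()
    return (kp - center) / max(scale, 1e-8)

def clean_pairs(pairs, min_quality=0.5):
    """Keep only garment-person pairs whose quality score passes a threshold
    (a stand-in for the paper's automated cleaning stage)."""
    return [p for p in pairs if p["quality"] >= min_quality]

pairs = [
    {"id": 1, "quality": 0.9, "pose": [[0, 0], [2, 0], [1, 3]]},
    {"id": 2, "quality": 0.2, "pose": [[0, 0], [1, 1], [2, 2]]},
]
kept = clean_pairs(pairs)
norm = normalize_pose(kept[0]["pose"])
```

Normalizing poses to a shared canonical frame is a common way to make warping networks invariant to camera distance and framing.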
Model Architecture
- Coarse Stage: A conditional diffusion model predicts a rough layout of the garment on the target body, handling pose warping and occlusion.
- Refinement Stage: A high‑resolution GAN (with spatial‑aware attention) injects texture details, material cues (e.g., silk sheen, denim weave), and corrects edge artifacts.
- Control Module: A lightweight encoder lets users specify identity (face, body shape) and background, enabling seamless multi‑image composition.
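The coarse‑to‑fine flow above can be sketched at a toy scale. The two functions below are placeholders for the diffusion and refinement networks, not the paper's actual models; the blending and detail‑injection arithmetic is invented purely to show how the stages compose.

```python
import numpy as np

def coarse_stage(person, garment):
    """Stand-in for the conditional diffusion model: predict a rough
    garment layout on the body (here, a simple blend of the two inputs)."""
    return 0.5 * person + 0.5 * garment

def refinement_stage(coarse, garment):
    """Stand-in for the high-resolution refinement network: re-inject
    texture detail taken from the garment reference, then clamp to [0, 1]."""
    detail = garment - garment.mean()  # placeholder for texture/material cues
    return np.clip(coarse + 0.3 * detail, 0.0, 1.0)

# Tiny grayscale "images" standing in for real inputs.
person = np.full((4, 4), 0.4)
garment = np.full((4, 4), 0.8)
out = refinement_stage(coarse_stage(person, garment), garment)
```

The key design point is that the coarse stage only has to solve geometry (pose warping, occlusion), leaving appearance fidelity to a second, specialized stage.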
Training Paradigm
- Stage‑1: Self‑supervised pose‑guided warping using synthetic overlays.
- Stage‑2: Paired adversarial training on the cleaned dataset to learn realistic texture transfer.
- Stage‑3: Fine‑tuning with reinforcement‑style loss that penalizes visual artifacts detected by a pre‑trained perceptual quality network.
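One way to read the three stages is as a single objective whose terms are switched on progressively. The weights below are hypothetical, chosen only to illustrate the schedule; the paper does not publish its loss coefficients.

```python
def total_loss(recon, adv, percep, stage):
    """Hypothetical staged objective: reconstruction-only warping in stage 1,
    adversarial texture transfer added in stage 2, and a perceptual artifact
    penalty added in stage 3. Weight values are illustrative assumptions."""
    weights = {
        1: (1.0, 0.0, 0.0),  # self-supervised warping only
        2: (1.0, 0.5, 0.0),  # + adversarial realism term
        3: (1.0, 0.5, 0.2),  # + perceptual artifact penalty
    }
    w_r, w_a, w_p = weights[stage]
    return w_r * recon + w_a * adv + w_p * percep
```

Staging the terms this way avoids the common failure mode where an adversarial loss destabilizes training before the warping network produces plausible layouts.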
Inference Optimizations
- Model pruning and quantization reduce memory footprint.
- TensorRT‑based kernel fusion and asynchronous pipeline scheduling cut latency to < 30 ms per request on a V100 GPU.
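The paper names pruning and quantization without detailing the scheme; a standard symmetric per‑tensor int8 quantization, sketched below, is one plausible instance. This is an assumption about the technique, not the authors' implementation.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map float weights onto
    [-127, 127] using a single scale factor derived from the max magnitude."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Int8 storage quarters the memory footprint relative to float32, which is typically where most of the reported footprint reduction comes from; kernel fusion and pipelining then address latency rather than memory.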
Results & Findings
| Metric | Tstars‑Tryon 1.0 | Prior SOTA (e.g., VITON‑HD) |
|---|---|---|
| Success Rate (valid try‑on) | 96.8 % | 84.3 % |
| FID (image quality) | 12.4 | 21.7 |
| LPIPS (perceptual similarity) | 0.098 | 0.167 |
| Inference Latency (GPU) | ≈30 ms | 180 ms |
| Supported Categories | 8 (apparel + accessories) | 3–4 |
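For readers unfamiliar with the FID numbers in the table: FID is the Fréchet distance between Gaussians fitted to deep features of real and generated images (lower is better). The sketch below simplifies the standard formula by assuming diagonal covariances, which avoids the matrix square root of the full definition; real FID uses full Inception‑feature covariances.

```python
import numpy as np

def fid_diagonal(feats_a, feats_b):
    """Frechet distance between two feature sets under a diagonal-covariance
    simplification: ||mu_a - mu_b||^2 + sum(var_a + var_b - 2*sqrt(var_a*var_b))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    var_a, var_b = feats_a.var(axis=0), feats_b.var(axis=0)
    mean_term = float(((mu_a - mu_b) ** 2).sum())
    cov_term = float((var_a + var_b - 2.0 * np.sqrt(var_a * var_b)).sum())
    return mean_term + cov_term

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(1000, 8))
identical = fid_diagonal(a, a)        # a set compared with itself scores ~0
shifted = fid_diagonal(a, a + 1.0)    # shifting every feature by 1 adds ~8
```

A drop from 21.7 to 12.4 therefore means the generated images' feature statistics sit markedly closer to those of real photographs.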
- Robustness: The system maintained > 95 % success on extreme‑pose test sets (e.g., squat, side‑view) and on low‑light images where previous methods failed catastrophically.
- Realism: Human evaluators preferred Tstars‑Tryon outputs 82 % of the time over competing methods, citing natural drape and accurate material shine.
- Scalability: Deployed on Taobao, the service handled > 10 M daily requests with < 0.5 % error rate, confirming that the engineering optimizations translate to production stability.
Practical Implications
- E‑commerce Integration: Retailers can embed a “see‑it‑on‑me” button that instantly visualizes any garment on a shopper’s uploaded photo, reducing return rates and increasing conversion.
- Personalized Styling Apps: Developers can build virtual wardrobes that mix‑and‑match items across categories (e.g., shoes + bags) while preserving the user’s face and background, enabling richer AR experiences.
- Content Creation: Marketing teams can generate high‑quality look‑book images without costly photoshoots, simply by feeding product catalog images into the model.
- Edge Deployment: The low‑latency inference pipeline makes it feasible to run the model on powerful edge devices (e.g., modern smartphones) for offline try‑on, preserving user privacy.
Limitations & Future Work
- Extreme Occlusions: While robust, the system still struggles when large body parts are fully hidden (e.g., a person holding a large object).
- Fine‑Material Physics: Dynamic fabrics (e.g., flowing scarves) are approximated rather than physically simulated, limiting realism for highly animated garments.
- Cross‑Domain Generalization: Performance drops when the input garment comes from a style or lighting condition not represented in the training data (e.g., underwater photography).
- Future Directions: The authors plan to integrate physics‑based cloth simulation for better drape, expand the dataset to cover more exotic lighting conditions, and explore on‑device model distillation to further reduce latency.
Authors
- Mengting Chen
- Zhengrui Chen
- Yongchao Du
- Zuan Gao
- Taihang Hu
- Jinsong Lan
- Chao Lin
- Yefeng Shen
- Xingjian Wang
- Zhao Wang
- Zhengtao Wu
- Xiaoli Xu
- Zhengze Xu
- Hao Yan
- Mingzhou Zhang
- Jun Zheng
- Qinye Zhou
- Xiaoyong Zhu
- Bo Zheng
Paper Information
- arXiv ID: 2604.19748v1
- Categories: cs.CV
- Published: April 21, 2026