[Paper] CORAL: Correspondence Alignment for Improved Virtual Try-On
Source: arXiv - 2602.17636v1
Overview
The paper CORAL: Correspondence Alignment for Improved Virtual Try‑On tackles a long‑standing problem in virtual try‑on (VTON) systems: preserving fine‑grained garment details when the model has never seen the exact person‑garment pair before. By dissecting how Diffusion Transformers (DiTs) attend to a person’s body and a target clothing item, the authors devise a way to explicitly align the attention mechanism with reliable correspondence cues, leading to sharper, more realistic try‑on results.
Key Contributions
- Insight into DiT attention: Shows that accurate person‑garment matching hinges on precise query‑key interactions inside the full‑3D attention layers.
- CORAL framework: Introduces a two‑part loss scheme that (1) distills external correspondence signals into the attention map and (2) minimizes entropy to make the attention distribution more decisive.
- VLM‑based evaluation protocol: Proposes a vision‑language model (e.g., CLIP) driven metric that correlates better with human preference than traditional pixel‑wise scores.
- Empirical gains: Demonstrates consistent improvements over strong DiT baselines in both overall shape transfer and preservation of local garment textures.
- Extensive ablations: Validates each component’s contribution and provides practical guidance for integrating CORAL into existing VTON pipelines.
Methodology
- Diagnosing the problem – The authors first visualized attention maps of a vanilla DiT VTON model and observed that the person‑garment correspondence is noisy, especially for small details like seams or patterns.
- External correspondence source – They generate reliable matches between body regions and garment patches using a pre‑trained dense correspondence estimator (e.g., a CNN‑based flow model).
- Correspondence Distillation Loss – During training, the model’s attention scores are encouraged to align with these external matches. Concretely, the loss penalizes divergence between the soft attention distribution and a binary mask derived from the correspondence map.
- Entropy Minimization Loss – To avoid diffuse attention (high entropy), an additional term pushes the attention distribution to be peaked, making the model “confident” about which garment region should attend to which body part.
- Training pipeline – The two losses are added to the standard diffusion‑based reconstruction loss, and the whole system is trained end‑to‑end on unpaired person‑garment datasets.
- Evaluation with VLMs – Instead of relying solely on L1/LPIPS, the authors query a large‑scale vision‑language model (e.g., CLIP) with prompts like “the person is wearing the red floral dress” and rank generated images by similarity to the prompt, yielding a metric that aligns better with user judgments.
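The two auxiliary losses above can be sketched in a few lines. This is an illustrative toy, not the paper's actual formulation: the function names, the single‑head attention over toy tokens, and the loss weight are all my own assumptions. The distillation term is written as a cross‑entropy between the soft attention and a (row‑normalized) binary correspondence mask, and the entropy term penalizes diffuse attention rows.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correspondence_distillation_loss(attn, match_mask, eps=1e-8):
    """Cross-entropy between the soft attention distribution and a
    binary correspondence mask, normalized per query row so it forms
    a valid target distribution."""
    target = match_mask / (match_mask.sum(axis=-1, keepdims=True) + eps)
    return -(target * np.log(attn + eps)).sum(axis=-1).mean()

def entropy_loss(attn, eps=1e-8):
    """Mean Shannon entropy of each query's attention row; minimizing
    this pushes the distribution to be peaked ("decisive")."""
    return -(attn * np.log(attn + eps)).sum(axis=-1).mean()

# Toy example: 3 garment-token queries attending over 4 person tokens.
scores = np.array([[4.0, 0.1, 0.0, 0.2],
                   [0.0, 3.5, 0.1, 0.0],
                   [0.2, 0.0, 0.1, 3.0]])
attn = softmax(scores)

# Binary mask from the (assumed) external dense-correspondence estimator:
# a 1 marks the person token each garment token should attend to.
mask = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)

l_distill = correspondence_distillation_loss(attn, mask)
l_entropy = entropy_loss(attn)
# Placeholder weight; in training these terms would be added to the
# standard diffusion reconstruction loss.
total_aux = l_distill + 0.1 * l_entropy
```

Sharpening the attention (e.g., scaling the logits up) drives both terms toward zero when the peaks agree with the mask, which is the behavior the two losses are meant to reward jointly.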
Results & Findings
- Quantitative boost: CORAL improves CLIP‑based preference scores by ~4–6 % over the baseline DiT, while also achieving lower LPIPS (better perceptual similarity).
- Detail preservation: Visual comparisons show sharper collars, cuffs, and pattern continuity that were previously blurred or misplaced.
- Robustness to pose variation: The model maintains alignment even when the target person adopts extreme poses, thanks to the explicit query‑key matching.
- Ablation outcomes: Removing the entropy loss leads to scattered attention and degraded texture fidelity; dropping the correspondence distillation reduces the alignment accuracy, confirming both components are essential.
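The CLIP‑based preference metric used in these comparisons can be sketched as a cosine‑similarity ranking. The embeddings below are toy vectors standing in for outputs of a vision‑language model's image and text encoders; the function names and numbers are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_prompt(image_embeds, prompt_embed):
    """Return image indices sorted best-to-worst by similarity to the
    text-prompt embedding (e.g., 'the person is wearing the red
    floral dress')."""
    sims = [cosine_similarity(e, prompt_embed) for e in image_embeds]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)

prompt = np.array([1.0, 0.0, 1.0])       # toy text embedding
images = [np.array([0.9, 0.1, 1.1]),     # close match to the prompt
          np.array([0.0, 1.0, 0.0]),     # poor match
          np.array([1.0, 0.5, 0.8])]     # partial match
order = rank_by_prompt(images, prompt)   # best-to-worst image indices
```

Averaging such prompt similarities over a test set gives a scalar preference score, which is the kind of number the "~4–6%" improvement above refers to.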
Practical Implications
- E‑commerce try‑on: Retailers can deploy CORAL‑enhanced VTON engines to give shoppers more realistic previews, potentially reducing return rates.
- AR/VR fashion apps: Developers building live‑fit experiences can leverage the correspondence‑aligned attention to render garments with fine details in real time.
- Design iteration tools: Fashion designers can quickly prototype how a new pattern drapes on diverse body shapes without manually tweaking alignment.
- Integration path: Since CORAL builds on existing DiT architectures, teams can adopt it by adding the two loss terms and a lightweight correspondence extractor—no need to redesign the entire diffusion pipeline.
Limitations & Future Work
- Dependency on external correspondences: The quality of the alignment hinges on the pre‑trained correspondence estimator; errors there propagate into the attention map.
- Computational overhead: Computing dense correspondences and the extra loss terms adds modest training time, though inference remains comparable to vanilla DiT.
- Unpaired data focus: The method is evaluated primarily on unpaired datasets; extending it to paired or multi‑garment scenarios could further broaden its applicability.
- Future directions: The authors suggest exploring self‑supervised correspondence learning within the diffusion model itself, and testing the approach on video‑based try‑on where temporal consistency becomes critical.
Authors
- Jiyoung Kim
- Youngjin Shin
- Siyoon Jin
- Dahyun Chung
- Jisu Nam
- Tongmin Kim
- Jongjae Park
- Hyeonwoo Kang
- Seungryong Kim
Paper Information
- arXiv ID: 2602.17636v1
- Categories: cs.CV
- Published: February 19, 2026