[Paper] CORAL: Correspondence Alignment for Improved Virtual Try-On

Published: February 19, 2026

Source: arXiv - 2602.17636v1

Overview

The paper CORAL: Correspondence Alignment for Improved Virtual Try‑On tackles a long‑standing problem in virtual try‑on (VTON) systems: preserving fine‑grained garment details when the model has never seen the exact person‑garment pair before. By dissecting how Diffusion Transformers (DiTs) attend to a person’s body and a target clothing item, the authors devise a way to explicitly align the attention mechanism with reliable correspondence cues, leading to sharper, more realistic try‑on results.

Key Contributions

  • Insight into DiT attention: Shows that accurate person‑garment matching hinges on precise query‑key interactions inside the full‑3D attention layers.
  • CORAL framework: Introduces a two‑part loss scheme that (1) distills external correspondence signals into the attention map and (2) minimizes entropy to make the attention distribution more decisive.
  • VLM‑based evaluation protocol: Proposes a vision‑language‑model‑driven metric (e.g., built on CLIP) that correlates better with human preference than traditional pixel‑wise scores.
  • Empirical gains: Demonstrates consistent improvements over strong DiT baselines in both overall shape transfer and preservation of local garment textures.
  • Extensive ablations: Validates each component’s contribution and provides practical guidance for integrating CORAL into existing VTON pipelines.

Methodology

  1. Diagnosing the problem – The authors first visualized attention maps of a vanilla DiT VTON model and observed that the person‑garment correspondence is noisy, especially for small details like seams or patterns.
  2. External correspondence source – They generate reliable matches between body regions and garment patches using a pre‑trained dense correspondence estimator (e.g., a CNN‑based flow model).
  3. Correspondence Distillation Loss – During training, the model’s attention scores are encouraged to align with these external matches. Concretely, the loss penalizes divergence between the soft attention distribution and a binary mask derived from the correspondence map.
  4. Entropy Minimization Loss – To avoid diffuse attention (high entropy), an additional term pushes the attention distribution to be peaked, making the model “confident” about which garment region should attend to which body part.
  5. Training pipeline – The two losses are added to the standard diffusion‑based reconstruction loss, and the whole system is trained end‑to‑end on unpaired person‑garment datasets.
  6. Evaluation with VLMs – Instead of relying solely on L1/LPIPS, the authors query a large‑scale vision‑language model (e.g., CLIP) with prompts like “the person is wearing the red floral dress” and rank generated images by similarity to the prompt, yielding a metric that aligns better with user judgments.
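The paper's reference implementation is not given here; the two auxiliary losses described in steps 3–4 can be sketched roughly as follows (NumPy, with `attn_logits` standing in for one attention head's query‑key logits and `corr_mask` a hypothetical binary mask derived from the external correspondence map — the names, normalization, and the 0.1 weight are illustrative assumptions, not the authors' code):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(attn_logits, corr_mask, eps=1e-8):
    """Cross-entropy between the model's attention distribution and a
    target distribution obtained by normalizing the binary correspondence mask."""
    p = softmax(attn_logits)                                        # model attention
    q = corr_mask / (corr_mask.sum(axis=-1, keepdims=True) + eps)   # target
    return -(q * np.log(p + eps)).sum(axis=-1).mean()

def entropy_loss(attn_logits, eps=1e-8):
    """Shannon entropy of the attention distribution; minimizing it
    pushes attention to be peaked ('confident') rather than diffuse."""
    p = softmax(attn_logits)
    return -(p * np.log(p + eps)).sum(axis=-1).mean()

# Toy example: 2 body-region queries attending over 4 garment-patch keys.
logits = np.array([[4.0, 0.0, 0.0, 0.0],   # peaked on key 0
                   [1.0, 1.0, 1.0, 1.0]])  # diffuse
mask = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0]], dtype=float)
total_aux = distillation_loss(logits, mask) + 0.1 * entropy_loss(logits)
```

In training, these terms would be added to the standard diffusion reconstruction loss with small weights, per step 5; the weighting shown is purely for illustration.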

Results & Findings

  • Quantitative boost: CORAL improves CLIP‑based preference scores by ~4–6 % over the baseline DiT, while also achieving lower LPIPS (better perceptual similarity).
  • Detail preservation: Visual comparisons show sharper collars, cuffs, and pattern continuity that were previously blurred or misplaced.
  • Robustness to pose variation: The model maintains alignment even when the target person adopts extreme poses, thanks to the explicit query‑key matching.
  • Ablation outcomes: Removing the entropy loss leads to scattered attention and degraded texture fidelity; dropping the correspondence distillation reduces the alignment accuracy, confirming both components are essential.
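The exact protocol behind the CLIP‑based preference score is not spelled out in this summary; a minimal sketch of the prompt‑similarity ranking step, assuming image and prompt embeddings have already been extracted with a CLIP‑style encoder (the 3‑D embeddings below are synthetic stand‑ins):

```python
import numpy as np

def cosine_sim(prompt_emb, image_embs):
    """Cosine similarity between one prompt embedding and a batch of image embeddings."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    return imgs @ p

def rank_by_prompt(image_embs, prompt_emb):
    """Rank generated try-on images by similarity to a text prompt such as
    'the person is wearing the red floral dress' (higher similarity = preferred)."""
    sims = cosine_sim(prompt_emb, image_embs)
    order = np.argsort(-sims)  # best-matching image first
    return order, sims

# Synthetic embeddings: image 1 points closest to the prompt direction.
prompt = np.array([1.0, 0.0, 0.0])
images = np.stack([np.array([0.2, 1.0, 0.0]),
                   np.array([0.9, 0.1, 0.0]),
                   np.array([0.0, 0.0, 1.0])])
order, sims = rank_by_prompt(images, prompt)
# order[0] == 1: the second image best matches the prompt
```

With real embeddings (e.g., from `open_clip` or Hugging Face `transformers`), the same ranking would be computed over actual generated images, and averaging the similarities per method gives a preference score of the kind reported above.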

Practical Implications

  • E‑commerce try‑on: Retailers can deploy CORAL‑enhanced VTON engines to give shoppers more realistic previews, potentially reducing return rates.
  • AR/VR fashion apps: Developers building live‑fit experiences can leverage the correspondence‑aligned attention to render garments with fine details in real time.
  • Design iteration tools: Fashion designers can quickly prototype how a new pattern drapes on diverse body shapes without manually tweaking alignment.
  • Integration path: Since CORAL builds on existing DiT architectures, teams can adopt it by adding the two loss terms and a lightweight correspondence extractor—no need to redesign the entire diffusion pipeline.

Limitations & Future Work

  • Dependency on external correspondences: The quality of the alignment hinges on the pre‑trained correspondence estimator; errors there propagate into the attention map.
  • Computational overhead: Computing dense correspondences and the extra loss terms adds modest training time, though inference remains comparable to vanilla DiT.
  • Unpaired data focus: The method is evaluated primarily on unpaired datasets; extending it to paired or multi‑garment scenarios could further broaden its applicability.
  • Future directions: The authors suggest exploring self‑supervised correspondence learning within the diffusion model itself, and testing the approach on video‑based try‑on where temporal consistency becomes critical.

Authors

  • Jiyoung Kim
  • Youngjin Shin
  • Siyoon Jin
  • Dahyun Chung
  • Jisu Nam
  • Tongmin Kim
  • Jongjae Park
  • Hyeonwoo Kang
  • Seungryong Kim

Paper Information

  • arXiv ID: 2602.17636v1
  • Categories: cs.CV
  • Published: February 19, 2026