[Paper] CORAL: Correspondence Alignment for Improved Virtual Try-On
Source: arXiv - 2602.17636v1
Overview
The paper CORAL: Correspondence Alignment for Improved Virtual Try‑On tackles a long‑standing problem in virtual try‑on (VTON) systems: preserving fine‑grained garment details when the model has never seen the exact person‑garment pair before. By dissecting how Diffusion Transformers (DiTs) attend to a person’s body and a target clothing item, the authors devise a way to explicitly align the attention mechanism with reliable correspondence cues, leading to sharper, more realistic try‑on results.
Key Contributions
- Insight into DiT attention: Shows that accurate person‑garment matching hinges on precise query‑key interactions inside the full‑3D attention layers.
- CORAL framework: Introduces a two‑part loss scheme that (1) distills external correspondence signals into the attention map and (2) minimizes entropy to make the attention distribution more decisive.
- VLM‑based evaluation protocol: Proposes a vision‑language model (e.g., CLIP) driven metric that correlates better with human preference than traditional pixel‑wise scores.
- Empirical gains: Demonstrates consistent improvements over strong DiT baselines in both overall shape transfer and preservation of local garment textures.
- Extensive ablations: Validates each component’s contribution and provides practical guidance for integrating CORAL into existing VTON pipelines.
Methodology
- Diagnosing the problem – The authors first visualized attention maps of a vanilla DiT VTON model and observed that the person‑garment correspondence is noisy, especially for small details like seams or patterns.
- External correspondence source – They generate reliable matches between body regions and garment patches using a pre‑trained dense correspondence estimator (e.g., a CNN‑based flow model).
- Correspondence Distillation Loss – During training, the model’s attention scores are encouraged to align with these external matches. Concretely, the loss penalizes divergence between the soft attention distribution and a binary mask derived from the correspondence map.
- Entropy Minimization Loss – To avoid diffuse attention (high entropy), an additional term pushes the attention distribution to be peaked, making the model “confident” about which garment region should attend to which body part.
- Training pipeline – The two losses are added to the standard diffusion‑based reconstruction loss, and the whole system is trained end‑to‑end on unpaired person‑garment datasets.
- Evaluation with VLMs – Instead of relying solely on L1/LPIPS, the authors query a large‑scale vision‑language model (e.g., CLIP) with prompts like “the person is wearing the red floral dress” and rank generated images by similarity to the prompt, yielding a metric that aligns better with user judgments.
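The two auxiliary losses above can be sketched in a few lines. This is an illustrative toy, not the paper's actual formulation: the function names, the single‑head attention over toy tokens, and the loss weight are all my own assumptions. The distillation term is written as a cross‑entropy between the soft attention and a (row‑normalized) binary correspondence mask, and the entropy term penalizes diffuse attention rows.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correspondence_distillation_loss(attn, match_mask, eps=1e-8):
    """Cross-entropy between the soft attention distribution and a
    binary correspondence mask, normalized per query row so it forms
    a valid target distribution."""
    target = match_mask / (match_mask.sum(axis=-1, keepdims=True) + eps)
    return -(target * np.log(attn + eps)).sum(axis=-1).mean()

def entropy_loss(attn, eps=1e-8):
    """Mean Shannon entropy of each query's attention row; minimizing
    this pushes the distribution to be peaked ("decisive")."""
    return -(attn * np.log(attn + eps)).sum(axis=-1).mean()

# Toy example: 3 garment-token queries attending over 4 person tokens.
scores = np.array([[4.0, 0.1, 0.0, 0.2],
                   [0.0, 3.5, 0.1, 0.0],
                   [0.2, 0.0, 0.1, 3.0]])
attn = softmax(scores)

# Binary mask from the (assumed) external dense-correspondence estimator:
# a 1 marks the person token each garment token should attend to.
mask = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)

l_distill = correspondence_distillation_loss(attn, mask)
l_entropy = entropy_loss(attn)
# Placeholder weight; in training these terms would be added to the
# standard diffusion reconstruction loss.
total_aux = l_distill + 0.1 * l_entropy
```

Sharpening the attention (e.g., scaling the logits up) drives both terms toward zero when the peaks agree with the mask, which is the behavior the two losses are meant to reward jointly.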
Results & Findings
- Quantitative boost: CORAL improves CLIP‑based preference scores by ~4–6 % over the baseline DiT, while also achieving lower LPIPS (better perceptual similarity).
- Detail preservation: Visual comparisons show sharper collars, cuffs, and pattern continuity that were previously blurred or misplaced.
- Robustness to pose variation: The model maintains alignment even when the target person adopts extreme poses, thanks to the explicit query‑key matching.
- Ablation outcomes: Removing the entropy loss leads to scattered attention and degraded texture fidelity; dropping the correspondence distillation reduces the alignment accuracy, confirming both components are essential.
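The CLIP‑based preference metric used in these comparisons can be sketched as a cosine‑similarity ranking. The embeddings below are toy vectors standing in for outputs of a vision‑language model's image and text encoders; the function names and numbers are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_prompt(image_embeds, prompt_embed):
    """Return image indices sorted best-to-worst by similarity to the
    text-prompt embedding (e.g., 'the person is wearing the red
    floral dress')."""
    sims = [cosine_similarity(e, prompt_embed) for e in image_embeds]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)

prompt = np.array([1.0, 0.0, 1.0])       # toy text embedding
images = [np.array([0.9, 0.1, 1.1]),     # close match to the prompt
          np.array([0.0, 1.0, 0.0]),     # poor match
          np.array([1.0, 0.5, 0.8])]     # partial match
order = rank_by_prompt(images, prompt)   # best-to-worst image indices
```

Averaging such prompt similarities over a test set gives a scalar preference score, which is the kind of number the "~4–6%" improvement above refers to.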
Practical Implications
- E‑commerce try‑on: Retailers can deploy CORAL‑enhanced VTON engines to give shoppers more realistic previews, potentially reducing return rates.
- AR/VR fashion apps: Developers building live‑fit experiences can leverage the correspondence‑aligned attention to render garments with fine details in real time.
- Design iteration tools: Fashion designers can quickly prototype how a new pattern drapes on diverse body shapes without manually tweaking alignment.
- Integration path: Since CORAL builds on existing DiT architectures, teams can adopt it by adding the two loss terms and a lightweight correspondence extractor—no need to redesign the entire diffusion pipeline.
Limitations & Future Work
- Dependency on external correspondences: The quality of the alignment hinges on the pre‑trained correspondence estimator; errors there propagate into the attention map.
- Computational overhead: Computing dense correspondences and the extra loss terms adds modest training time, though inference remains comparable to vanilla DiT.
- Unpaired data focus: The method is evaluated primarily on unpaired datasets; extending it to paired or multi‑garment scenarios could further broaden its applicability.
- Future directions: The authors suggest exploring self‑supervised correspondence learning within the diffusion model itself, and testing the approach on video‑based try‑on where temporal consistency becomes critical.
Authors
- Jiyoung Kim
- Youngjin Shin
- Siyoon Jin
- Dahyun Chung
- Jisu Nam
- Tongmin Kim
- Jongjae Park
- Hyeonwoo Kang
- Seungryong Kim
Paper Information
- arXiv ID: 2602.17636v1
- Categories: cs.CV
- Published: February 19, 2026