[Paper] HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Source: arXiv - 2603.02210v1
Overview
The paper introduces HiFi‑Inpaint, a reference‑based image inpainting system that can seamlessly insert products into human photos while preserving the product's fine details. By combining a new attention module, a detail‑focused loss, and a 40,000‑image dataset, the authors push the realism of generated human‑product images, an essential capability for advertising, e‑commerce, and virtual try‑on experiences.
Key Contributions
- Shared Enhancement Attention (SEA) – a lightweight attention block that explicitly aligns and sharpens product features from a reference image during the inpainting process.
- Detail‑Aware Loss (DAL) – a training objective that penalizes errors in high‑frequency (edge/texture) components, forcing the network to reproduce crisp product details.
- HP‑Image‑40K dataset – a publicly released collection of 40,000 human‑product pairs generated via self‑synthesis pipelines and automatically filtered for quality, filling a long‑standing data gap.
- State‑of‑the‑art performance – quantitative (higher PSNR/SSIM, lower LPIPS) and qualitative results that outperform prior reference‑based inpainting methods on both synthetic and real‑world benchmarks.
Methodology
- Reference‑guided pipeline – The model receives a target image with a masked region (where the product should appear) and a reference product image.
- Shared Enhancement Attention – SEA extracts multi‑scale feature maps from both inputs, computes cross‑attention scores, and injects the most relevant product details back into the masked region. This shared attention is applied at several decoder stages, ensuring that fine textures (e.g., fabric weave, logo embossing) survive the generation process.
- Detail‑Aware Loss – Instead of supervising only on raw RGB pixels, DAL first runs a high‑pass filter (e.g., Laplacian) on both the generated and ground‑truth images to obtain high‑frequency maps. The loss combines an L1 term on these maps with the usual reconstruction loss, encouraging the network to match edges and textures pixel‑by‑pixel.
- Training on HP‑Image‑40K – The dataset provides paired (masked target, reference, ground‑truth) samples. Automatic filtering removes low‑quality syntheses, allowing the model to learn from diverse poses, lighting, and product categories.
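The cross‑attention at the heart of SEA can be sketched in a few lines. This is a simplified, NumPy‑only illustration: the paper's module operates on multi‑scale decoder features with learned query/key/value projections, which are omitted here, and the function name `sea_cross_attention` is ours, not the authors'.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sea_cross_attention(target_feats, ref_feats):
    """Sketch of SEA's cross-attention (learned projections omitted).

    target_feats: (N_t, d) features from the masked target region (queries).
    ref_feats:    (N_r, d) features from the reference product image (keys/values).
    Returns enhanced target features, shape (N_t, d).
    """
    d = target_feats.shape[-1]
    # relevance of each reference feature to each masked-region location
    scores = target_feats @ ref_feats.T / np.sqrt(d)   # (N_t, N_r)
    attn = softmax(scores, axis=-1)
    injected = attn @ ref_feats                        # pull in matched product detail
    # residual connection: keep the target's own context, add reference detail
    return target_feats + injected
```

In the full model this block would be applied at several decoder stages, so coarse structure and fine texture are both aligned with the reference.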
The overall architecture remains a standard encoder‑decoder with skip connections; the novelty lies in the SEA modules and DAL supervision that together steer the network toward high‑fidelity detail preservation.
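As a concrete illustration of DAL, the sketch below combines an L1 reconstruction term with an L1 term on Laplacian‑filtered high‑frequency maps, following the paper's description. The specific kernel, the weighting `lam`, and the function names are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

# 3x3 Laplacian kernel: a simple high-pass filter that isolates edges/texture
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def high_pass(img):
    """Valid 3x3 Laplacian convolution over a single-channel image."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * LAPLACIAN)
    return out

def detail_aware_loss(pred, gt, lam=0.5):
    """L1 reconstruction plus L1 on high-frequency maps (DAL-style)."""
    recon = np.mean(np.abs(pred - gt))                      # raw-pixel term
    hf = np.mean(np.abs(high_pass(pred) - high_pass(gt)))   # edge/texture term
    return recon + lam * hf
```

Note that a constant brightness offset leaves the high‑frequency term untouched, so the extra penalty fires specifically on mismatched edges and textures rather than on global color shifts.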
Results & Findings
| Metric | HiFi‑Inpaint | Prior Ref‑Inpaint (e.g., RFR‑Inpaint) |
|---|---|---|
| PSNR ↑ | 31.8 dB | 29.4 dB |
| SSIM ↑ | 0.94 | 0.90 |
| LPIPS ↓ | 0.12 | 0.18 |
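For readers unfamiliar with the table's headline metric, PSNR follows directly from mean squared error. A minimal sketch, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to ground truth."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A uniform pixel error of 0.1 gives an MSE of 0.01 and hence 20 dB, which puts the ~2.4 dB gap in the table into perspective: each dB corresponds to a multiplicative reduction in squared error.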
- Visual quality: Side‑by‑side comparisons show HiFi‑Inpaint retaining sharp logos, stitching patterns, and reflective surfaces that other methods blur or distort.
- Robustness to pose & lighting: The model consistently inserts products across varied human poses and complex backgrounds, thanks to SEA’s ability to focus on the most relevant reference features.
- Ablation studies: Removing SEA drops PSNR by ~1.2 dB, while omitting DAL increases LPIPS by ~0.05, confirming that both components are essential for detail fidelity.
Practical Implications
- E‑commerce catalog generation – Retailers can automatically generate model shots for new products without costly photo shoots, ensuring the product’s texture and branding stay intact.
- Virtual try‑on & AR – Apps that overlay clothing, accessories, or gadgets onto users' camera feeds can leverage HiFi‑Inpaint to produce photorealistic composites, improving user confidence; how close this gets to real time depends on the inference latency discussed under Limitations.
- Marketing automation – Agencies can quickly produce high‑quality ad creatives that combine influencers or models with a library of product images, cutting turnaround time.
- Dataset creation – The HP‑Image‑40K dataset can serve as a benchmark for future research on reference‑guided generation, encouraging more industry‑focused solutions.
Limitations & Future Work
- Domain shift – The model is trained on synthetic‑plus‑filtered data; performance may degrade on extreme lighting or highly reflective materials not represented in HP‑Image‑40K.
- Computation cost – SEA introduces additional attention calculations, modestly increasing inference latency, which could be a bottleneck for real‑time mobile AR.
- Single‑reference dependence – The current framework assumes one clean product reference; handling multiple or partially occluded references remains an open challenge.
Future directions include extending the attention mechanism to multi‑reference scenarios, optimizing the architecture for edge devices, and enriching the dataset with more diverse real‑world captures to further close the sim‑to‑real gap.
Authors
- Yichen Liu
- Donghao Zhou
- Jie Wang
- Xin Gao
- Guisheng Liu
- Jiatong Li
- Quanwei Zhang
- Qiang Lyu
- Lanqing Guo
- Shilei Wen
- Weiqiang Wang
- Pheng-Ann Heng
Paper Information
- arXiv ID: 2603.02210v1
- Categories: cs.CV
- Published: March 2, 2026