[Paper] HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Source: arXiv - 2603.02210v1
Overview
The paper introduces HiFi‑Inpaint, a reference‑based image inpainting system that can seamlessly insert products into human photos while preserving the product's fine details. By combining a new attention module, a detail‑focused loss, and a 40,000‑image dataset, the authors push the realism of generated human‑product images, an essential capability for advertising, e‑commerce, and virtual try‑on experiences.
Key Contributions
- Shared Enhancement Attention (SEA) – a lightweight attention block that explicitly aligns and sharpens product features from a reference image during the inpainting process.
- Detail‑Aware Loss (DAL) – a training objective that penalizes errors in high‑frequency (edge/texture) components, forcing the network to reproduce crisp product details.
- HP‑Image‑40K dataset – a publicly released collection of 40,000 human‑product pairs generated via self‑synthesis pipelines and automatically filtered for quality, filling a long‑standing data gap.
- State‑of‑the‑art performance – quantitative (higher PSNR/SSIM, lower LPIPS) and qualitative results that outperform prior reference‑based inpainting methods on both synthetic and real‑world benchmarks.
Methodology
- Reference‑guided pipeline – The model receives a target image with a masked region (where the product should appear) and a reference product image.
- Shared Enhancement Attention – SEA extracts multi‑scale feature maps from both inputs, computes cross‑attention scores, and injects the most relevant product details back into the masked region. This shared attention is applied at several decoder stages, ensuring that fine textures (e.g., fabric weave, logo embossing) survive the generation process.
- Detail‑Aware Loss – Instead of supervising only on raw RGB pixels, DAL first runs a high‑pass filter (e.g., Laplacian) on both the generated and ground‑truth images to obtain high‑frequency maps. The loss combines an L1 term on these maps with the usual reconstruction loss, encouraging the network to match edges and textures pixel‑by‑pixel.
- Training on HP‑Image‑40K – The dataset provides paired (masked target, reference, ground‑truth) samples. Automatic filtering removes low‑quality syntheses, allowing the model to learn from diverse poses, lighting, and product categories.
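The cross‑attention at the heart of SEA can be sketched in a few lines. This is a simplified, NumPy‑only illustration: the paper's module operates on multi‑scale decoder features with learned query/key/value projections, which are omitted here, and the function name `sea_cross_attention` is ours, not the authors'.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sea_cross_attention(target_feats, ref_feats):
    """Sketch of SEA's cross-attention (learned projections omitted).

    target_feats: (N_t, d) features from the masked target region (queries).
    ref_feats:    (N_r, d) features from the reference product image (keys/values).
    Returns enhanced target features, shape (N_t, d).
    """
    d = target_feats.shape[-1]
    # relevance of each reference feature to each masked-region location
    scores = target_feats @ ref_feats.T / np.sqrt(d)   # (N_t, N_r)
    attn = softmax(scores, axis=-1)
    injected = attn @ ref_feats                        # pull in matched product detail
    # residual connection: keep the target's own context, add reference detail
    return target_feats + injected
```

In the full model this block would be applied at several decoder stages, so coarse structure and fine texture are both aligned with the reference.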
The overall architecture remains a standard encoder‑decoder with skip connections; the novelty lies in the SEA modules and DAL supervision that together steer the network toward high‑fidelity detail preservation.
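As a concrete illustration of DAL, the sketch below combines an L1 reconstruction term with an L1 term on Laplacian‑filtered high‑frequency maps, following the paper's description. The specific kernel, the weighting `lam`, and the function names are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

# 3x3 Laplacian kernel: a simple high-pass filter that isolates edges/texture
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def high_pass(img):
    """Valid 3x3 Laplacian convolution over a single-channel image."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * LAPLACIAN)
    return out

def detail_aware_loss(pred, gt, lam=0.5):
    """L1 reconstruction plus L1 on high-frequency maps (DAL-style)."""
    recon = np.mean(np.abs(pred - gt))                      # raw-pixel term
    hf = np.mean(np.abs(high_pass(pred) - high_pass(gt)))   # edge/texture term
    return recon + lam * hf
```

Note that a constant brightness offset leaves the high‑frequency term untouched, so the extra penalty fires specifically on mismatched edges and textures rather than on global color shifts.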
Results & Findings
| Metric | HiFi‑Inpaint | Prior Ref‑Inpaint (e.g., RFR‑Inpaint) |
|---|---|---|
| PSNR ↑ | 31.8 dB | 29.4 dB |
| SSIM ↑ | 0.94 | 0.90 |
| LPIPS ↓ | 0.12 | 0.18 |
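For readers unfamiliar with the table's headline metric, PSNR follows directly from mean squared error. A minimal sketch, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to ground truth."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A uniform pixel error of 0.1 gives an MSE of 0.01 and hence 20 dB, which puts the ~2.4 dB gap in the table into perspective: each dB corresponds to a multiplicative reduction in squared error.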
- Visual quality: Side‑by‑side comparisons show HiFi‑Inpaint retaining sharp logos, stitching patterns, and reflective surfaces that other methods blur or distort.
- Robustness to pose & lighting: The model consistently inserts products across varied human poses and complex backgrounds, thanks to SEA’s ability to focus on the most relevant reference features.
- Ablation studies: Removing SEA drops PSNR by ~1.2 dB, while omitting DAL increases LPIPS by ~0.05, confirming that both components are essential for detail fidelity.
Practical Implications
- E‑commerce catalog generation – Retailers can automatically generate model shots for new products without costly photo shoots, ensuring the product’s texture and branding stay intact.
- Virtual try‑on & AR – Apps that overlay clothing, accessories, or gadgets onto users' camera feeds can leverage HiFi‑Inpaint to produce photorealistic composites, improving user confidence; how close this gets to real time depends on the inference latency discussed under Limitations.
- Marketing automation – Agencies can quickly produce high‑quality ad creatives that combine influencers or models with a library of product images, cutting turnaround time.
- Dataset creation – The HP‑Image‑40K dataset can serve as a benchmark for future research on reference‑guided generation, encouraging more industry‑focused solutions.
Limitations & Future Work
- Domain shift – The model is trained on synthetic‑plus‑filtered data; performance may degrade on extreme lighting or highly reflective materials not represented in HP‑Image‑40K.
- Computation cost – SEA introduces additional attention calculations, modestly increasing inference latency, which could be a bottleneck for real‑time mobile AR.
- Single‑reference dependence – The current framework assumes one clean product reference; handling multiple or partially occluded references remains an open challenge.
Future directions include extending the attention mechanism to multi‑reference scenarios, optimizing the architecture for edge devices, and enriching the dataset with more diverse real‑world captures to further close the sim‑to‑real gap.
Authors
- Yichen Liu
- Donghao Zhou
- Jie Wang
- Xin Gao
- Guisheng Liu
- Jiatong Li
- Quanwei Zhang
- Qiang Lyu
- Lanqing Guo
- Shilei Wen
- Weiqiang Wang
- Pheng-Ann Heng
Paper Information
- arXiv ID: 2603.02210v1
- Categories: cs.CV
- Published: March 2, 2026