[Paper] HarmoVid: Relightful Video Portrait Harmonization

Published: May 27, 2026 at 01:59 PM EDT

Source: arXiv - 2605.28811v1

Overview

The paper introduces HarmoVid, a system that automatically adjusts the lighting of a foreground video portrait so that it blends with a new background scene. By handling shadows, color tones, and illumination intensity in a temporally stable way, HarmoVid aims to make “relightful” video portrait harmonization practical for real‑world production pipelines.

Key Contributions

Video‑focused lighting harmonization that works on full‑length clips, not just single frames.
Lighting deflickering module that removes both global and local flicker caused by naïve frame‑by‑frame processing.
Diffusion‑based video generation trained on a mix of real and synthetically created video pairs, enabling high‑quality, temporally coherent results.
Asymmetric alpha‑mask conditioning that learns clean foreground‑background boundaries directly from real video data.
Comprehensive evaluation showing superior temporal coherence, naturalness, and relighting flexibility compared with existing image‑ and video‑based methods.

Methodology

Data Preparation – Since paired videos captured under identical motion but different lighting are scarce, the authors first apply an off‑the‑shelf image harmonizer to each frame of existing videos. This creates a rough “harmonized” version, but because each frame is processed independently, it also introduces temporal flicker.
Deflickering Network – A dedicated neural module analyses the flickering patterns and learns to smooth out inconsistencies both across the whole frame (global illumination) and within local regions (shadows, highlights). The output is a clean, temporally stable video pair.
Video Diffusion Model – Using the deflickered pairs, a conditional diffusion model is trained to predict the harmonized video given a foreground clip and a target background. Diffusion models excel at generating high‑fidelity visual content while preserving fine details.
Asymmetric Alpha‑Mask Conditioning – Instead of feeding a binary mask directly, the model receives an asymmetric version where the mask is blurred on the foreground side. This encourages the network to learn precise edge handling and avoid halo artifacts.
Training Mix – The system is trained on a curated blend of real‑world videos (captured in studios) and synthetically rendered clips, giving it exposure to a wide range of lighting conditions and motion patterns.
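
To make the deflickering and mask‑conditioning steps above more concrete, here is a minimal Python sketch. It is not the paper's learned deflickering network or its actual mask construction. The global gain correction is a classical moving‑average stand‑in, and the foreground‑side softening (blur the mask, then take the element‑wise minimum with the original) is only one assumed reading of “blurred on the foreground side”. The function names, the per‑frame mean‑luminance input, and the 1‑D scanline mask are all hypothetical simplifications.

```python
from statistics import fmean
from typing import List, Sequence


def moving_average(values: Sequence[float], radius: int) -> List[float]:
    """Centered moving average; the window shrinks at the sequence ends."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        smoothed.append(fmean(values[lo:hi]))
    return smoothed


def global_deflicker_gains(frame_luma: Sequence[float], radius: int = 2) -> List[float]:
    """Per-frame gains that pull each frame's mean luminance toward a
    temporally smoothed target (assumed stand-in for global deflickering).

    Multiplying frame t by gains[t] damps global flicker only; local
    flicker in shadows and highlights is not addressed here.
    """
    target = moving_average(frame_luma, radius)
    return [t / max(luma, 1e-8) for t, luma in zip(target, frame_luma)]


def asymmetric_soft_mask(mask_row: Sequence[float], radius: int = 1) -> List[float]:
    """Soften one scanline of a binary mask on the foreground side only.

    Background pixels (0) stay exactly 0, while foreground pixels near
    the boundary drop below 1 (an assumed interpretation, not the paper's).
    """
    blurred = moving_average(mask_row, radius)
    return [min(m, b) for m, b in zip(mask_row, blurred)]


if __name__ == "__main__":
    # Toy per-frame mean luminance of a frame-wise harmonized clip.
    luma = [0.50, 0.56, 0.48, 0.55, 0.49, 0.54]
    gains = global_deflicker_gains(luma)
    print([round(g * v, 3) for g, v in zip(gains, luma)])

    # Toy mask scanline: two background pixels, four foreground, two background.
    print(asymmetric_soft_mask([0, 0, 1, 1, 1, 1, 0, 0]))
```

Multiplying every pixel of frame t by gains[t] would pull the clip's overall brightness toward the smoothed curve. Local flicker would still need the kind of learned module the paper describes.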

Results & Findings

Temporal Coherence: Quantitative metrics (e.g., warping error, flicker score) show a 30–40% reduction in temporal artifacts compared to frame‑wise baselines.
Visual Naturalness: User studies rate HarmoVid’s outputs as more realistic and better blended than prior video harmonization tools.
Boundary Cleanliness: The asymmetric mask conditioning yields sharper, halo‑free edges, especially around hair and semi‑transparent regions.
Relighting Expressiveness: The model can handle dramatic illumination changes (e.g., daylight to sunset) while preserving the subject’s identity and texture.
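
For intuition about what a flicker‑style temporal metric measures, the sketch below computes a very simple proxy: the mean absolute change in per‑frame mean luminance between consecutive frames, plus the relative reduction against a baseline. This summary does not define the paper's flicker score, so both functions and the toy numbers are assumptions for illustration and do not reproduce the reported results.

```python
from typing import Sequence


def flicker_score(frame_luma: Sequence[float]) -> float:
    """Mean absolute change in per-frame mean luminance between
    consecutive frames. Lower means steadier global brightness.
    (Hypothetical proxy, not the metric used in the paper.)"""
    if len(frame_luma) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(frame_luma, frame_luma[1:])]
    return sum(diffs) / len(diffs)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of a lower-is-better metric versus a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy luminance curves, invented for illustration only.
    framewise = [0.50, 0.56, 0.48, 0.55, 0.49]
    stabilized = [0.51, 0.53, 0.50, 0.52, 0.50]
    base = flicker_score(framewise)
    ours = flicker_score(stabilized)
    print(f"baseline={base:.4f} stabilized={ours:.4f} "
          f"reduction={relative_reduction(base, ours):.0%}")
```

A proxy like this also reacts to genuine brightness changes from camera or subject motion, which is why metrics such as warping error first align consecutive frames with optical flow before comparing them.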

Practical Implications

Film & VFX Production: Editors can replace or augment backgrounds (green‑screen, virtual sets) without manually matching the lighting frame by frame, which could substantially reduce manual labor.
Live Streaming & AR: With faster inference, real‑time applications could adapt a presenter’s lighting to match dynamic virtual environments, improving visual quality for remote collaboration.
Content Creation Platforms: Social‑media tools can offer “auto‑relight” filters that keep user‑generated videos consistent across varied shooting conditions.
Game Cinematics & Cutscenes: Developers can reuse captured actor performances across multiple lighting setups, reducing the need for re‑shoots.

Limitations & Future Work

Extreme Lighting Gaps: The model may struggle when the source and target lighting differ beyond the range seen during training (e.g., indoor fluorescent vs. outdoor sunset).
Computation Cost: Diffusion inference is still relatively heavy; real‑time deployment would require model pruning or specialized hardware.
Dynamic Occlusions: Rapidly changing occlusions (e.g., hands covering the face) can cause occasional boundary artifacts.

Future research directions include extending the training set with more diverse synthetic lighting, optimizing the diffusion pipeline for low‑latency inference, and integrating depth cues to better handle complex occlusions.

Authors

Jun Myeong Choi
Jae Shin Yoon
Luchao Qi
Roni Sengupta
Joon-Young Lee

Paper Information

arXiv ID: 2605.28811v1
Categories: cs.CV
Published: May 27, 2026
