[Paper] HarmoVid: Relightful Video Portrait Harmonization
Source: arXiv - 2605.28811v1
Overview
The paper introduces HarmoVid, a novel system that automatically adjusts the lighting of a foreground video so it blends seamlessly with a new background scene. By handling shadows, color tones, and illumination intensity in a temporally stable way, HarmoVid makes “relightful” video portrait harmonization practical for real‑world production pipelines.
Key Contributions
- Video‑focused lighting harmonization that works on full‑length clips, not just single frames.
- Lighting deflickering module that removes both global and local flicker caused by naïve frame‑by‑frame processing.
- Diffusion‑based video generation trained on a mix of real and synthetically created video pairs, enabling high‑quality, temporally coherent results.
- Asymmetric alpha‑mask conditioning that learns clean foreground‑background boundaries directly from real video data.
- Comprehensive evaluation showing superior temporal coherence, naturalness, and relighting flexibility compared with existing image‑ and video‑based methods.
Methodology
- Data Preparation – Since paired videos captured under identical motion but different lighting are scarce, the authors first apply an off‑the‑shelf image harmonizer to each frame of existing videos. This yields a rough “harmonized” version of each clip, but because every frame is processed independently, the result flickers over time.
- Deflickering Network – A dedicated neural module analyzes the flickering patterns and learns to smooth out inconsistencies both across the whole frame (global illumination) and within local regions (shadows, highlights). The output is a clean, temporally stable video pair.
- Video Diffusion Model – Using the deflickered pairs, a conditional diffusion model is trained to predict the harmonized video given a foreground clip and a target background. Diffusion models excel at generating high‑fidelity visual content while preserving fine details.
- Asymmetric Alpha‑Mask Conditioning – Instead of feeding a binary mask directly, the model receives an asymmetric version where the mask is blurred on the foreground side. This encourages the network to learn precise edge handling and avoid halo artifacts.
- Training Mix – The system is trained on a curated blend of real‑world videos (captured in studios) and synthetically rendered clips, giving it exposure to a wide range of lighting conditions and motion patterns.
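The asymmetric mask idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`box_blur`, `asymmetric_alpha`), the box filter, and the `radius` parameter are all assumptions made for illustration. The sketch softens the mask only inside the foreground by taking the element-wise minimum of the binary mask and its blurred copy, so the background side keeps a hard edge at zero.

```python
import numpy as np


def box_blur(img: np.ndarray, radius: int) -> np.ndarray:
    """Separable box blur with edge padding (a stand-in for any smoothing filter)."""
    size = 2 * radius + 1
    kernel = np.ones(size) / size
    padded = np.pad(img, radius, mode="edge")
    # Blur along rows, then along columns; "valid" trims the padding again.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def asymmetric_alpha(mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Hypothetical asymmetric mask: blurred on the foreground side only.

    mask is a 2D array with 1 for foreground and 0 for background.
    """
    mask = mask.astype(np.float64)
    blurred = box_blur(mask, radius)
    # Background pixels stay exactly 0 because min(0, blurred) == 0.
    # Foreground pixels near the boundary drop below 1 because the blur
    # mixes in neighbouring background zeros.
    return np.minimum(mask, blurred)


# Toy example: a 5x5 foreground square inside a 9x9 frame.
mask = np.zeros((9, 9))
mask[2:7, 2:7] = 1.0
alpha = asymmetric_alpha(mask, radius=1)
```

In this toy example, `alpha` is zero everywhere outside the square, close to one at its center, and takes intermediate values along the inner rim of the square. How the paper actually constructs and feeds the mask to the diffusion model is not specified in this summary.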
Results & Findings
- Temporal Coherence: Quantitative metrics (e.g., warping error, flicker score) show a 30‑40% reduction in temporal artifacts compared to frame‑wise baselines.
- Visual Naturalness: User studies rate HarmoVid’s outputs as more realistic and better blended than prior video harmonization tools.
- Boundary Cleanliness: The asymmetric mask conditioning yields sharper, halo‑free edges, especially around hair and semi‑transparent regions.
- Relighting Expressiveness: The model can handle dramatic illumination changes (e.g., daylight to sunset) while preserving the subject’s identity and texture.
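This summary does not define the temporal metrics it cites, so the following is only a rough sketch of how a global flicker measure and a relative reduction could be computed. All names here (`global_flicker_score`, `smooth_global_brightness`, `relative_reduction`) are hypothetical, and the moving-average smoothing is a simple hand-written baseline, not the paper's learned deflickering network.

```python
import numpy as np


def global_flicker_score(frames: np.ndarray) -> float:
    """Mean absolute change in average brightness between consecutive frames.

    frames has shape (T, H, W) or (T, H, W, C) with intensities in [0, 1].
    """
    means = frames.reshape(frames.shape[0], -1).mean(axis=1)
    return float(np.abs(np.diff(means)).mean())


def smooth_global_brightness(frames: np.ndarray, window: int = 3) -> np.ndarray:
    """Rescale each frame so its mean follows a moving average of the per-frame means.

    window is assumed to be odd so the smoothed series keeps length T.
    """
    t = frames.shape[0]
    means = frames.reshape(t, -1).mean(axis=1)
    padded = np.pad(means, window // 2, mode="edge")
    target = np.convolve(padded, np.ones(window) / window, mode="valid")
    gains = target / np.maximum(means, 1e-8)
    # Broadcast one gain per frame over all spatial (and channel) axes.
    return frames * gains.reshape((t,) + (1,) * (frames.ndim - 1))


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which a score dropped relative to the baseline."""
    return (baseline - improved) / baseline


# Toy clip whose brightness alternates between 0.4 and 0.6 on every frame.
levels = np.array([0.4, 0.6, 0.4, 0.6, 0.4, 0.6])
clip = np.ones((6, 2, 2)) * levels[:, None, None]

before = global_flicker_score(clip)
after = global_flicker_score(smooth_global_brightness(clip, window=3))
reduction = relative_reduction(before, after)
```

A score like this captures only global brightness flicker. Local flicker in shadows and highlights, and motion-compensated measures such as warping error, would need per-region or optical-flow-based comparisons that this sketch omits.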
Practical Implications
- Film & VFX Production: Editors can replace or augment backgrounds (green‑screen, virtual sets) without manually matching the subject’s lighting frame by frame, which could save substantial manual effort.
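- Live Streaming & AR: If inference can be made fast enough (it is currently too heavy for real‑time use, as noted under Limitations), applications could adapt a presenter’s lighting to match dynamic virtual environments, improving visual quality for remote collaboration.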
- Content Creation Platforms: Social‑media tools can offer “auto‑relight” filters that keep user‑generated videos consistent across varied shooting conditions.
- Game Cinematics & Cutscenes: Developers can reuse captured actor performances across multiple lighting setups, reducing the need for re‑shoots.
Limitations & Future Work
- Extreme Lighting Gaps: The model may struggle when the source and target lighting differ beyond the range seen during training (e.g., indoor fluorescent vs. outdoor sunset).
- Computation Cost: Diffusion inference is still relatively heavy; real‑time deployment would require model pruning or specialized hardware.
- Dynamic Occlusions: Rapidly changing occlusions (e.g., hands covering the face) can cause occasional boundary artifacts.
Future research directions include extending the training set with more diverse synthetic lighting, optimizing the diffusion pipeline for low‑latency inference, and integrating depth cues to better handle complex occlusions.
Authors
- Jun Myeong Choi
- Jae Shin Yoon
- Luchao Qi
- Roni Sengupta
- Joon-Young Lee
Paper Information
- arXiv ID: 2605.28811v1
- Categories: cs.CV
- Published: May 27, 2026