[Paper] Relit-LiVE: Relight Video by Jointly Learning Environment Video
Source: arXiv - 2605.06658v1
Overview
Relit‑LiVE tackles a long‑standing problem in computer vision: changing the lighting of an existing video while keeping the scene’s appearance physically plausible and temporally stable. By combining raw video frames with a diffusion‑based environment‑map predictor, the authors achieve high‑quality relighting without camera poses or perfect intrinsic decompositions, two requirements that have limited prior methods on real‑world footage.
Key Contributions
- Reference‑guided diffusion rendering – raw input frames are injected into the diffusion process, letting the model recover lost scene cues that intrinsic decompositions usually miss.
- Joint video‑and‑environment‑map prediction – a single diffusion model simultaneously outputs the relit video and per‑frame environment maps aligned to the current view, enforcing geometry‑illumination consistency.
- Pose‑free operation – the framework works without explicit per‑frame camera pose information, handling dynamic lighting and camera motion out‑of‑the‑box.
- Broad downstream utility – beyond relighting, the same pipeline supports material editing, object insertion, and even live streaming relighting.
- State‑of‑the‑art performance – extensive benchmarks on synthetic and real‑world datasets show consistent gains over existing video relighting and neural rendering baselines.
Methodology
- Input preprocessing – the source video is split into frames; each frame is passed through a lightweight intrinsic estimator (albedo, normals, depth) only to provide a coarse guide.
- Reference injection – the original RGB frame is concatenated with the intrinsic maps and fed as a conditioning signal to a video diffusion model. This lets the network “look back” at the true pixel values when needed, preventing the drift that pure intrinsic‑only pipelines suffer from.
- Environment video diffusion – the diffusion model is trained to predict, for every video frame, an environment map (a 2‑D illumination representation) that is spatially aligned with the current camera view. The environment map and the relit frame are generated together by the same model rather than in separate stages.
- Temporal consistency – a temporal attention block ties together neighboring frames inside the diffusion backbone, encouraging smooth lighting transitions and suppressing flicker.
- Training objective – a combination of reconstruction loss (pixel‑wise L2), perceptual loss (VGG‑based), and a physics‑based shading loss that penalizes mismatches between the predicted environment map, geometry, and the rendered appearance.
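The reference‑injection step above can be illustrated with a minimal sketch. It assumes 3‑channel RGB, albedo, and normal maps plus a 1‑channel depth map, stored as nested Python lists. The channel counts, the helper names `concat_channels` and `constant_map`, and the toy 2x3 resolution are illustrative assumptions, not details taken from the paper. A real pipeline would perform the same concatenation on tensors.

```python
from typing import List

Pixel = List[float]
Image = List[List[Pixel]]  # image[row][col] -> list of channel values


def concat_channels(*maps: Image) -> Image:
    """Concatenate same-sized H x W x C_i maps along the channel axis."""
    if not maps:
        raise ValueError("at least one map is required")
    height, width = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all maps must share the same height and width")
    # For each pixel, append the channels of every map in argument order.
    return [
        [[value for m in maps for value in m[r][c]] for c in range(width)]
        for r in range(height)
    ]


def constant_map(height: int, width: int, value: Pixel) -> Image:
    """Hypothetical helper: build a toy map filled with one pixel value."""
    return [[list(value) for _ in range(width)] for _ in range(height)]


H, W = 2, 3
rgb = constant_map(H, W, [0.8, 0.6, 0.4])      # original frame (reference)
albedo = constant_map(H, W, [0.5, 0.5, 0.5])   # coarse intrinsic guides
normals = constant_map(H, W, [0.0, 0.0, 1.0])
depth = constant_map(H, W, [2.0])

# 3 (RGB) + 3 (albedo) + 3 (normals) + 1 (depth) = 10 channels per pixel.
conditioning = concat_channels(rgb, albedo, normals, depth)
print(len(conditioning), len(conditioning[0]), len(conditioning[0][0]))
```

Keeping the raw RGB channels next to the intrinsic maps is what gives the network access to the true pixel values when the intrinsic estimates are unreliable.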
The whole pipeline runs end‑to‑end on a single GPU, requiring only the raw video as input.
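Written out, the three‑term training objective listed above might take the following form. The weights, the symbols, and the choice of a feature‑space L2 distance for the perceptual term are notation assumed here for illustration. This summary does not state the exact formulation or weighting used by the authors.

```latex
\mathcal{L}
  = \lambda_{\mathrm{rec}} \,\lVert \hat{x} - x \rVert_2^2
  + \lambda_{\mathrm{perc}} \sum_{l} \lVert \phi_l(\hat{x}) - \phi_l(x) \rVert_2^2
  + \lambda_{\mathrm{shade}} \,\mathcal{L}_{\mathrm{shade}}(\hat{E}, G, \hat{x})
```

Here x is the ground‑truth relit frame, x̂ the predicted frame, φ_l the activations of VGG layer l, Ê the predicted environment map, and G the estimated geometry (normals and depth). The shading term stands for whatever penalty measures disagreement between the appearance implied by Ê and G and the rendered frame x̂.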
Results & Findings
- Quantitative gains: On the Real‑World Relighting Benchmark (RWRB), Relit‑LiVE raises PSNR by ~2.1 dB and lowers LPIPS by ~0.08 (lower LPIPS is better) relative to the previous best method.
- Temporal stability: A new temporal‑flicker metric shows a 35 % reduction in frame‑to‑frame variance compared to baselines.
- Robustness to pose errors: Experiments where synthetic camera poses are deliberately corrupted demonstrate that Relit‑LiVE’s performance degrades gracefully, while pose‑dependent methods fail dramatically.
- Real‑world demos: The authors showcase relighting of handheld smartphone footage, outdoor street scenes, and indoor talk‑show recordings, all with natural‑looking shadows and specular highlights.
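To make the reported numbers easier to interpret, here is a small sketch of PSNR and of a simple flicker proxy. PSNR follows its standard definition. The `flicker_score` function (mean squared difference between consecutive frames) is only a stand‑in assumed for illustration, since this summary does not define the paper's temporal‑flicker metric. Frames are flattened lists of pixel values in [0, 1].

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flattened frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(pred: Sequence[float], target: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / err)


def flicker_score(frames: Sequence[Sequence[float]]) -> float:
    """Assumed proxy: average MSE between consecutive frames.

    Real scene motion also raises this value, so it is only meaningful
    when comparing methods on the same clip.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    diffs = [mse(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    return sum(diffs) / len(diffs)


target = [0.5] * 4
pred = [0.6] * 4                      # uniform error of 0.1 per pixel
steady = [[0.5] * 4 for _ in range(3)]
flickering = [[0.4] * 4, [0.6] * 4, [0.4] * 4]

print(round(psnr(pred, target), 2))   # about 20 dB for an MSE of 0.01
print(flicker_score(steady), flicker_score(flickering))
```

Because PSNR is logarithmic in MSE, a gain of about 2.1 dB corresponds to roughly 38% lower mean squared error at the same peak value, since 10^(-0.21) ≈ 0.62.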
Practical Implications
- Post‑production lighting – filmmakers and content creators can adjust lighting after shooting, saving time and equipment on set.
- AR/VR asset integration – developers can insert virtual objects into existing video streams and have the lighting automatically match the surrounding environment.
- Live streaming – broadcasters could apply dynamic lighting effects (e.g., day‑to‑night transitions) in real time without pre‑computing scene geometry.
- Game engine pipelines – the joint environment‑map prediction can feed directly into real‑time renderers for consistent illumination across cutscenes and gameplay footage.
- Privacy‑preserving visual effects – because the method does not need explicit camera pose data, it can be deployed on edge devices where pose estimation is undesirable or infeasible.
Limitations & Future Work
- Intrinsic estimator reliance – while the raw frames mitigate errors, extremely noisy or low‑resolution inputs still lead to sub‑optimal relighting.
- Computational cost – diffusion inference remains slower than traditional rasterization; real‑time streaming will need further acceleration (e.g., distillation or specialized hardware).
- Dynamic geometry – the current formulation assumes a static scene geometry per frame; handling deformable objects or large‑scale scene changes remains an open challenge.
- Environment map resolution – the predicted environment maps are limited to a modest resolution, which can affect high‑frequency specular detail; future work may explore hierarchical or neural‑field representations.
Overall, Relit‑LiVE pushes video relighting toward practical, production‑ready use cases, opening the door for more flexible lighting workflows across film, gaming, and AR/VR.
Authors
- Weiqing Xiao
- Hong Li
- Xiuyu Yang
- Houyuan Chen
- Wenyi Li
- Tianqi Liu
- Shaocong Xu
- Chongjie Ye
- Hao Zhao
- Beibei Wang
Paper Information
- arXiv ID: 2605.06658v1
- Categories: cs.CV
- Published: May 7, 2026