[Paper] Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Published: December 4, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.05115v1

Overview

The paper introduces Light‑X, a generative video‑rendering system that lets you simultaneously control the camera path and the lighting conditions of a scene captured in a single monocular video. By disentangling geometry from illumination, Light‑X can produce temporally consistent, photorealistic videos even when the original footage was taken under a single viewpoint and lighting setup—opening the door to dynamic visual effects, virtual cinematography, and interactive content creation.

Key Contributions

  • Joint camera‑and‑illumination control for monocular videos, enabling free‑form viewpoint changes while re‑lighting the scene.
  • Disentangled architecture that separates dynamic geometry (via point‑cloud trajectories) from lighting cues (via relit reference frames).
  • Light‑Syn data pipeline: a degradation‑and‑inverse‑mapping scheme that synthesizes paired multi‑view / multi‑illumination training data from ordinary “in‑the‑wild” videos.
  • Comprehensive dataset covering static, dynamic, and AI‑generated scenes, ensuring robustness across diverse content.
  • State‑of‑the‑art performance on joint control tasks, outperforming existing video‑relighting baselines and handling both text‑driven and background‑conditioned lighting prompts.

Methodology

  1. Dynamic Point‑Cloud Backbone – The input video is first converted into a sequence of point clouds that capture scene geometry and motion. These clouds can be re‑projected from any user‑specified camera trajectory, giving the system a flexible 3‑D representation of the scene.

  2. Illumination Decoder – A separate branch takes a relit reference frame (generated by a conventional image‑relighting model) and projects it onto the same point‑cloud geometry. Because every frame shares that geometry, the lighting information can be transferred consistently from the reference to all frames, preserving temporal coherence.

  3. Light‑Syn Synthetic Pair Generation – Since real multi‑view/multi‑illumination video pairs are scarce, the authors corrupt a clean video (e.g., by applying random camera motions and lighting changes) and then learn an inverse mapping that recovers the original. This yields synthetic training pairs that mimic the desired joint control scenario without manual labeling.

  4. Training Objective – The network is optimized with a combination of reconstruction loss (to keep the output faithful to the target view), illumination consistency loss (to enforce smooth lighting changes), and adversarial loss (to improve realism).
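
Written schematically, the objective combines the three terms above as a weighted sum. The paper's exact loss formulations and weights are not spelled out in this summary, so the decomposition below is only an illustrative reading of step 4, with the λ coefficients as assumed hyperparameters.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}}
  + \lambda_{\text{illum}}\,\mathcal{L}_{\text{illum}}
  + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}
```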

The overall pipeline can be visualized as: Monocular video → Dynamic point cloud → User‑defined camera path + relit frame → Rendered output video.
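
To make the geometry side of this pipeline concrete, here is a minimal sketch of how a colored dynamic point cloud could be splatted into a user‑specified pinhole camera. The camera conventions (intrinsics K, world‑to‑camera extrinsics R, t) and the naive z‑buffer are assumptions for illustration; this is not Light‑X's actual renderer, whose re‑projected frames condition a generative model rather than serving as the final output.

```python
import numpy as np

def reproject_points(points_xyz, colors, K, R, t, height, width):
    """Splat a colored point cloud into a user-specified pinhole camera.

    points_xyz: (N, 3) world-space points for one frame of the dynamic cloud
    colors:     (N, 3) per-point colors, e.g. sampled from a relit reference frame
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation
    Returns an (H, W, 3) float image rendered with a naive z-buffer.
    """
    cam = points_xyz @ R.T + t                  # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6                 # drop points behind the camera
    cam, colors = cam[in_front], colors[in_front]

    proj = cam @ K.T                            # perspective projection
    uv = proj[:, :2] / proj[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    z = cam[:, 2]

    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[valid], v[valid], z[valid], colors[valid]

    image = np.zeros((height, width, 3), dtype=np.float32)
    depth = np.full((height, width), np.inf, dtype=np.float32)
    for ui, vi, zi, ci in zip(u, v, z, colors): # nearest point wins per pixel
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image
```

In this reading, the colors argument could be supplied by a relit reference frame, which is roughly where steps 1 and 2 meet in the same projected space.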

Results & Findings

  • Quantitative gains: Light‑X achieves higher PSNR/SSIM scores than leading video‑relighting baselines when evaluated on both synthetic and real‑world test sets.
  • Temporal stability: Light‑X shows a lower temporal warping error than the baselines, confirming that the disentangled geometry keeps lighting changes consistent across frames and yields flicker‑free results (a sketch of this metric follows the list).
  • User studies: Participants preferred Light‑X outputs over baselines for realism and controllability, especially when asked to follow complex camera trajectories combined with dramatic lighting shifts.
  • Generalization: The Light‑Syn‑trained model works well on unseen content, including AI‑generated scenes, indicating that the synthetic data pipeline successfully bridges the domain gap.
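
Temporal warping error, mentioned in the temporal‑stability result above, is commonly computed by warping each frame onto its successor with dense optical flow and measuring the residual. The sketch below is one standard formulation using OpenCV's Farneback flow; the paper's exact evaluation protocol may differ.

```python
import cv2
import numpy as np

def temporal_warping_error(frames):
    """Mean squared error between each frame and its predecessor after the
    predecessor is warped forward with dense optical flow.

    frames: list of HxWx3 uint8 BGR images; a lower value means less flicker.
    """
    errors = []
    for prev_bgr, curr_bgr in zip(frames[:-1], frames[1:]):
        prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
        curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
        # Flow from current to previous: where each current pixel came from.
        flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = curr_gray.shape
        grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
        map_x = (grid_x + flow[..., 0]).astype(np.float32)
        map_y = (grid_y + flow[..., 1]).astype(np.float32)
        warped_prev = cv2.remap(prev_bgr, map_x, map_y, cv2.INTER_LINEAR)
        diff = warped_prev.astype(np.float32) - curr_bgr.astype(np.float32)
        errors.append(float(np.mean(diff ** 2)))
    return float(np.mean(errors))
```

Visible flicker raises this value even when per‑frame quality metrics such as PSNR look fine.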

Practical Implications

  • Virtual production & VFX: Filmmakers can re‑shoot scenes virtually, moving the camera after the fact and applying cinematic lighting without reshooting on set.
  • Game asset creation: Artists can generate animated cutscenes from a single reference video, instantly exploring different camera angles and mood lighting.
  • AR/VR experiences: Real‑world video streams can be re‑projected into immersive environments where users control viewpoint and illumination in real time.
  • Content personalization: Platforms can offer viewers the ability to “watch a video in daylight vs. night” or from alternative perspectives, enhancing engagement.
  • Rapid prototyping: Designers can iterate on lighting concepts for product demos or architectural walkthroughs without costly multi‑camera rigs.

Limitations & Future Work

  • Complex geometry handling: Extremely fine‑grained details (e.g., hair, translucent objects) can still suffer from artifacts when re‑projected from point clouds.
  • Lighting model scope: The current relighting branch relies on pre‑trained image relighters; extending it to handle indirect illumination or global illumination effects remains an open challenge.
  • Real‑time performance: While the method runs at interactive speeds on high‑end GPUs, achieving mobile‑friendly latency will require further optimization.
  • User‑friendly interfaces: Future work could integrate intuitive UI tools (e.g., natural‑language prompts for lighting) to lower the barrier for non‑technical creators.

Overall, Light‑X marks a significant step toward fully controllable, high‑fidelity video synthesis, and its underlying ideas are poised to influence a wide range of visual‑computing applications.

Authors

  • Tianqi Liu
  • Zhaoxi Chen
  • Zihao Huang
  • Shaocong Xu
  • Saining Zhang
  • Chongjie Ye
  • Bohan Li
  • Zhiguo Cao
  • Wei Li
  • Hao Zhao
  • Ziwei Liu

Paper Information

  • arXiv ID: 2512.05115v1
  • Categories: cs.CV
  • Published: December 4, 2025
