[Paper] WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

Published: January 15, 2026 at 01:59 PM EST
3 min read

Source: arXiv - 2601.10716v1

Overview

WildRayZer introduces a self‑supervised pipeline that can synthesize novel views of scenes even when both the camera and objects are moving. By automatically detecting and masking out transient (moving) elements, it sidesteps the ghosting and geometry errors that plague traditional static‑scene view synthesis models, making high‑quality, feed‑forward novel view synthesis (NVS) feasible for real‑world, dynamic footage.

Key Contributions

  • Self‑supervised transient detection: Uses a static‑only renderer to generate residuals that act as pseudo motion masks, eliminating the need for manual annotations.
  • Motion‑aware token gating: Masks input tokens and gates loss gradients so the network focuses learning on the static background while still handling dynamic foregrounds.
  • Large‑scale dynamic dataset: Curates Dynamic RealEstate10K (D‑RE10K), a corpus of ≈15K casual video sequences, and a paired benchmark, D‑RE10K‑iPhone, for evaluating transient‑aware NVS.
  • Single‑pass feed‑forward inference: Achieves state‑of‑the‑art quality without costly per‑scene optimization, outperforming both optimization‑based and existing feed‑forward baselines.

Methodology

  1. Static‑only rendering: A conventional NeRF‑style static renderer predicts the rigid background from the input views.
  2. Residual analysis: The difference between the rendered background and the original images highlights regions that cannot be explained by static geometry—i.e., moving objects, lighting changes, etc.
  3. Pseudo motion masks: These residuals are thresholded to produce coarse masks of transient content (see the first sketch after this list).
  4. Distilled motion estimator: The pseudo masks train a lightweight motion‑estimation network that predicts per‑pixel motion probabilities for any new view.
  5. Token masking & gradient gating: During training, tokens corresponding to high‑motion areas are masked out and loss gradients are blocked for those regions, forcing the model to learn robust background completion while still preserving the ability to render moving objects when needed (see the second sketch after this list).
  6. End‑to‑end self‑supervision: The whole pipeline is trained without any ground‑truth masks or depth maps, relying solely on the analysis‑by‑synthesis loop.
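Steps 2–3 can be summarized in a few lines. The sketch below is a minimal PyTorch illustration, not the paper's implementation: the function name, the photometric residual, the threshold value, and the smoothing kernel are assumptions chosen for clarity.

```python
import torch
import torch.nn.functional as F

def pseudo_motion_mask(rendered, target, threshold=0.1, blur_kernel=5):
    """Turn static-render residuals into a coarse transient mask.

    rendered, target: (B, 3, H, W) tensors in [0, 1].
    threshold and blur_kernel are illustrative choices, not values
    reported in the paper.
    """
    # Per-pixel photometric residual between the static render and the
    # observed frame; large residuals flag content the static model cannot
    # explain (moving objects, strong lighting changes, ...).
    residual = (rendered - target).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)

    # Light spatial smoothing so isolated noisy pixels do not trigger the mask.
    residual = F.avg_pool2d(residual, blur_kernel, stride=1, padding=blur_kernel // 2)

    # Binary pseudo mask: 1 = likely transient, 0 = static background.
    return (residual > threshold).float()
```

In the full pipeline, masks produced this way serve as pseudo labels for the distilled motion estimator of step 4.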
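Step 5 can likewise be sketched as a masked photometric loss plus patch-level token dropping. Again, this is an illustrative sketch under assumed conventions (ViT-style patch tokens, a 16-pixel patch size, simple zeroing of masked tokens), not the authors' exact mechanism.

```python
import torch
import torch.nn.functional as F

def gate_loss_by_mask(pred, target, motion_mask):
    """Photometric loss that only back-propagates through static pixels.

    pred, target: (B, 3, H, W); motion_mask: (B, 1, H, W), 1 = transient.
    """
    static = 1.0 - motion_mask          # 1 where the scene is static
    per_pixel = (pred - target).abs()   # L1 photometric error
    # Zero-weighting transient pixels blocks their gradients; normalising by
    # the number of static pixels keeps the loss scale comparable across
    # frames with different amounts of motion.
    return (per_pixel * static).sum() / (static.sum() * pred.shape[1] + 1e-6)


def mask_input_tokens(tokens, motion_mask, patch_size=16):
    """Zero out patch tokens whose image patch overlaps detected motion.

    tokens: (B, N, C) patch tokens in row-major order for an H x W frame.
    """
    # A patch is treated as dynamic if any pixel inside it is flagged as
    # transient (max-pool over non-overlapping patch windows).
    patch_motion = F.max_pool2d(motion_mask, patch_size)         # (B, 1, H/ps, W/ps)
    keep = (patch_motion.flatten(1) == 0).float().unsqueeze(-1)  # (B, N, 1)
    return tokens * keep
```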

Results & Findings

  • Quantitative gains: On D‑RE10K‑iPhone, WildRayZer improves PSNR by ~1.5 dB and SSIM by ~0.04 over the strongest baselines, while reducing ghosting artifacts in dynamic regions.
  • Transient removal: The distilled motion masks achieve >85 % IoU with manually annotated motion regions, despite being generated without supervision.
  • Speed: A single forward pass (≈0.12 s per view on an RTX 3080) produces full‑resolution novel views, compared to minutes‑long optimization loops in competing methods.
  • Generalization: The model trained on D‑RE10K transfers well to other dynamic video sources (e.g., handheld smartphone tours), maintaining visual fidelity.

Practical Implications

  • Real‑time AR/VR content creation: Developers can generate immersive 3‑D walkthroughs from casual handheld footage without labor‑intensive cleanup of moving people or pets.
  • Dynamic scene reconstruction for robotics: Robots can build reliable static maps of environments while ignoring moving obstacles, improving navigation and SLAM robustness.
  • Content pipelines for games and film: Artists can repurpose on‑set video captures for background plates, automatically stripping out crew movement and props.
  • Scalable cloud services: Since inference is feed‑forward, cloud‑based view‑synthesis APIs can serve dynamic‑scene requests at scale with modest GPU budgets.

Limitations & Future Work

  • Coarse motion masks: The residual‑based masks may miss subtle motions (e.g., small shadows) or over‑mask semi‑static objects, leading to occasional loss of detail.
  • Assumption of dominant static background: Scenes where the majority of the view is dynamic (e.g., crowded festivals) still challenge the static‑renderer backbone.
  • Dataset bias: D‑RE10K focuses on indoor/outdoor residential spaces; broader domain coverage (industrial sites, aerial footage) remains to be explored.
  • Future directions: The authors suggest integrating temporal consistency losses, refining mask granularity with multi‑scale attention, and extending the framework to handle full‑scene deformation (e.g., cloth simulation).

Authors

  • Xuweiyi Chen
  • Wentao Zhou
  • Zezhou Cheng

Paper Information

  • arXiv ID: 2601.10716v1
  • Categories: cs.CV
  • Published: January 15, 2026