[Paper] AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction

Published: January 2, 2026 at 01:59 PM EST
4 min read

Source: arXiv - 2601.00796v1

Overview

AdaGaR introduces a new way to reconstruct dynamic 3‑D scenes from a single video stream. By marrying an adaptive Gabor‑based primitive with a temporally aware spline model, the authors achieve high‑frequency visual fidelity while keeping motion smooth and free of interpolation artifacts—something that prior Gaussian‑only pipelines struggled with.

Key Contributions

  • Adaptive Gabor Representation (AdaGaR‑G): Extends classic Gaussian blobs with learnable frequency weights and an energy‑compensation term, enabling the model to capture fine textures without destabilizing the rendering.
  • Temporal Continuity via Cubic Hermite Splines: Encodes each primitive’s trajectory with Hermite splines and adds a curvature regularizer, guaranteeing smooth motion across frames.
  • Robust Adaptive Initialization: Combines off‑the‑shelf depth estimation, dense point tracking, and foreground masks to seed a well‑distributed point cloud, accelerating convergence and reducing early‑training artifacts.
  • Unified Training Pipeline: All components are differentiable and optimized end‑to‑end, allowing a single loss to balance appearance (PSNR/SSIM/LPIPS), geometry (depth consistency), and motion smoothness.
  • State‑of‑the‑Art Benchmarks: On the Tap‑Vid and DAVIS dynamic‑scene datasets, AdaGaR achieves PSNR = 35.49, SSIM = 0.9433, and LPIPS = 0.0723, outperforming prior Gaussian‑based and neural‑radiance‑field (NeRF) baselines.

Methodology

  1. Primitive Design – Each scene element is modeled as a Gabor‑like function: a Gaussian envelope multiplied by a sinusoidal carrier. The carrier’s frequency is not fixed; a small neural network predicts a per‑primitive frequency vector that can adapt during training. An energy‑compensation scalar rescales the amplitude so the high‑frequency term does not destabilize rendering (a minimal sketch of this primitive appears after this list).
  2. Temporal Modeling – For every primitive, its 3‑D position over time is expressed with a Cubic Hermite Spline (position + tangent at keyframes). A Temporal Curvature Regularizer penalizes rapid changes in the spline’s second derivative, encouraging physically plausible motion (see the spline sketch below the list).
  3. Adaptive Initialization
    • Depth Estimation: A pre‑trained monocular depth model provides an initial 3‑D point cloud.
    • Point Tracking: Optical‑flow‑based tracking propagates points across frames, giving a rough motion prior.
    • Foreground Masks: Segmentation masks prune background clutter, focusing the primitives on dynamic objects.
      The combined result seeds the Gabor primitives before gradient‑based optimization begins.
  4. Training Objective – A weighted sum of:
    • Photometric loss (L2 + perceptual LPIPS) on rendered frames,
    • Depth consistency loss (align rendered depth with estimated depth),
    • Temporal curvature loss, and
    • Regularizers for frequency magnitude and energy balance.
      All terms are differentiable, so standard Adam optimization suffices; the final sketch below shows one way the terms might be combined.
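
To make the primitive design concrete, here is a minimal PyTorch sketch of a Gabor‑like primitive: a Gaussian envelope modulated by a sinusoidal carrier, with a per‑primitive frequency vector and an energy‑compensation scalar. The parameter names and the isotropic envelope are illustrative assumptions, not the paper’s exact formulation.

```python
# Minimal sketch of a Gabor-like primitive: Gaussian envelope x sinusoidal carrier.
# Parameter names (mu, sigma, freq, phase, energy_comp) are illustrative assumptions;
# the paper may use a full covariance envelope as in Gaussian splatting.
import torch

def gabor_primitive(x, mu, sigma, freq, phase, amplitude, energy_comp):
    """Evaluate one Gabor-like primitive at query points x of shape (N, 3)."""
    d = x - mu                                                 # offsets from the center
    envelope = torch.exp(-0.5 * (d * d).sum(-1) / sigma**2)    # Gaussian envelope
    carrier = torch.cos(2.0 * torch.pi * (d @ freq) + phase)   # sinusoidal carrier
    # The energy-compensation scalar rescales the amplitude so the oscillating
    # carrier does not inflate (or destabilize) the primitive's contribution.
    return energy_comp * amplitude * envelope * carrier

# Usage: evaluate a single primitive at a few random 3-D points.
x = torch.randn(5, 3)
val = gabor_primitive(
    x,
    mu=torch.zeros(3),
    sigma=torch.tensor(0.3),
    freq=torch.tensor([4.0, 0.0, 0.0]),  # carrier oscillates along the x-axis
    phase=torch.tensor(0.0),
    amplitude=torch.tensor(1.0),
    energy_comp=torch.tensor(0.8),
)
print(val.shape)  # torch.Size([5])
```

In the full method the frequency vector would come from a small network and all parameters would be optimized jointly; this sketch only shows the forward evaluation.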
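
The temporal model can be sketched similarly. The cubic Hermite basis below is standard; the regularizer, which penalizes how quickly the second derivative changes along a segment, is one plausible reading of the paper’s temporal curvature term rather than its exact definition.

```python
# Sketch of one cubic Hermite segment for a primitive's trajectory, plus a
# curvature-style regularizer. Names and the exact penalty are assumptions.
import torch

def hermite_position(t, p0, m0, p1, m1):
    """Position on one Hermite segment, t in [0, 1]; p* are keyframe positions,
    m* are keyframe tangents (each of shape (3,))."""
    t2, t3 = t * t, t * t * t
    return ((2 * t3 - 3 * t2 + 1) * p0 + (t3 - 2 * t2 + t) * m0
            + (-2 * t3 + 3 * t2) * p1 + (t3 - t2) * m1)

def hermite_second_derivative(t, p0, m0, p1, m1):
    """Second time derivative (acceleration) of the same segment."""
    return ((12 * t - 6) * p0 + (6 * t - 4) * m0
            + (-12 * t + 6) * p1 + (6 * t - 2) * m1)

def curvature_regularizer(p0, m0, p1, m1, n_samples=8):
    """Penalize rapid changes of the second derivative along the segment."""
    ts = torch.linspace(0.0, 1.0, n_samples)
    acc = torch.stack([hermite_second_derivative(t, p0, m0, p1, m1) for t in ts])
    return (acc[1:] - acc[:-1]).pow(2).sum()

# Usage: one segment between two keyframes.
p0, p1 = torch.zeros(3), torch.ones(3)
m0, m1 = torch.tensor([1.0, 0.0, 0.0]), torch.tensor([0.0, 1.0, 0.0])
pos = hermite_position(torch.tensor(0.5), p0, m0, p1, m1)
reg = curvature_regularizer(p0, m0, p1, m1)
```

Because the keyframe positions and tangents are explicit parameters, both the interpolated positions and the regularizer stay differentiable with respect to them.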
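
Finally, a hypothetical sketch of how the individual loss terms might be combined into one differentiable objective for Adam. The weights, the L1/L2 choices, and the interface to an LPIPS module are placeholders; only the set of terms follows the paper.

```python
# Hypothetical weighted objective; weights and exact term definitions are assumptions.
import torch.nn.functional as F

def total_loss(rendered, target, rendered_depth, est_depth,
               curvature_term, freq_reg, energy_reg, lpips_fn=None,
               w_photo=1.0, w_lpips=0.1, w_depth=0.5,
               w_curv=0.01, w_freq=1e-3, w_energy=1e-3):
    photo = F.mse_loss(rendered, target)                          # L2 photometric term
    perceptual = lpips_fn(rendered, target) if lpips_fn is not None else 0.0
    depth = F.l1_loss(rendered_depth, est_depth)                  # depth consistency
    return (w_photo * photo + w_lpips * perceptual + w_depth * depth
            + w_curv * curvature_term + w_freq * freq_reg + w_energy * energy_reg)

# A standard training loop would call total_loss(...).backward() and step an Adam
# optimizer over the primitive parameters (centers, tangents, frequencies,
# amplitudes, and energy-compensation scalars).
```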

Results & Findings

| Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- |
| Tap‑Vid (dynamic) | 35.49 | 0.9433 | 0.0723 |
| DAVIS (high‑motion) | 34.1 | 0.938 | 0.079 |

  • Detail Preservation: Compared to pure Gaussian models, AdaGaR recovers sharper textures (e.g., hair strands, fabric patterns) thanks to the learned high‑frequency carrier.
  • Motion Smoothness: Interpolated frames show no jitter or ghosting; the curvature regularizer effectively eliminates the “wiggle” artifacts seen in prior works.
  • Generalization: The same trained model can be repurposed for downstream tasks—frame interpolation, depth‑consistent video editing, and even stereo view synthesis—without retraining.

Practical Implications

  • Real‑Time AR/VR Content Creation: Developers can capture a single handheld video and instantly generate a high‑fidelity, animatable 3‑D proxy for immersive experiences.
  • Dynamic Scene Editing: Video editors can manipulate objects (e.g., reposition, recolor) while preserving realistic motion, thanks to the explicit primitive representation.
  • Efficient Storage & Streaming: Because the scene is encoded as a compact set of adaptive Gabor primitives and spline trajectories, bandwidth‑constrained applications (e.g., cloud gaming) can stream a lightweight model instead of full video frames.
  • Robotics & Autonomous Driving: The method’s ability to produce temporally consistent depth maps from monocular footage can improve perception pipelines that need both geometry and motion cues.

Limitations & Future Work

  • Scalability to Large‑Scale Scenes: The current implementation assumes a relatively bounded number of primitives; scaling to city‑scale environments may require hierarchical or sparse representations.
  • Dependency on Pre‑trained Depth/Mask Models: Errors in the initialization stage (e.g., inaccurate depth on reflective surfaces) can propagate into the final reconstruction.
  • Real‑Time Rendering Speed: While more efficient than full NeRFs, rendering still incurs a non‑trivial cost; future work could explore GPU‑accelerated spline evaluation or hybrid rasterization techniques.
  • Extension to Multi‑View Inputs: The authors focus on monocular video; integrating stereo or multi‑camera setups could further boost accuracy and reduce ambiguity in motion estimation.

AdaGaR bridges the gap between high‑frequency visual detail and temporally coherent motion, offering a practical toolkit for developers who need dynamic 3‑D reconstructions without the heavy computational baggage of full neural rendering.

Authors

  • Jiewen Chan
  • Zhenjun Zhao
  • Yu‑Lun Liu

Paper Information

  • arXiv ID: 2601.00796v1
  • Categories: cs.CV
  • Published: January 2, 2026
  • PDF: Download PDF