[Paper] AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Source: arXiv - 2601.00796v1
Overview
AdaGaR introduces a new way to reconstruct dynamic 3‑D scenes from a single video stream. By pairing an adaptive Gabor‑based primitive with a temporally aware spline motion model, the authors achieve high‑frequency visual fidelity while keeping motion smooth and free of the interpolation artifacts that prior Gaussian‑only pipelines struggled with.
Key Contributions
- Adaptive Gabor Representation (AdaGaR‑G): Extends classic Gaussian blobs with learnable frequency weights and an energy‑compensation term, enabling the model to capture fine textures without destabilizing the rendering.
- Temporal Continuity via Cubic Hermite Splines: Encodes each primitive’s trajectory with Hermite splines and adds a curvature regularizer, guaranteeing smooth motion across frames.
- Robust Adaptive Initialization: Combines off‑the‑shelf depth estimation, dense point tracking, and foreground masks to seed a well‑distributed point cloud, accelerating convergence and reducing early‑training artifacts.
- Unified Training Pipeline: All components are differentiable and optimized end‑to‑end, allowing a single loss to balance appearance (PSNR/SSIM/LPIPS), geometry (depth consistency), and motion smoothness.
- State‑of‑the‑Art Benchmarks: On the Tap‑Vid and DAVIS dynamic‑scene datasets, AdaGaR achieves PSNR = 35.49, SSIM = 0.9433, and LPIPS = 0.0723, outperforming prior Gaussian‑based and neural‑radiance‑field (NeRF) baselines.
Methodology
- Primitive Design – Each scene element is modeled as a Gabor‑like function: a Gaussian envelope multiplied by a sinusoidal carrier. The carrier’s frequency is not fixed; a small neural network predicts a per‑primitive frequency vector that adapts during training, and an energy‑compensation scalar rescales the amplitude so the high‑frequency term does not destabilize rendering (see the first sketch after this list).
- Temporal Modeling – For every primitive, its 3‑D position over time is expressed with a Cubic Hermite Spline (position and tangent at each keyframe). A Temporal Curvature Regularizer penalizes large second derivatives along the spline, encouraging physically plausible motion (see the second sketch after this list).
- Adaptive Initialization
  - Depth Estimation: A pre‑trained monocular depth model provides an initial 3‑D point cloud.
  - Point Tracking: Optical‑flow‑based tracking propagates points across frames, giving a rough motion prior.
  - Foreground Masks: Segmentation masks prune background clutter, focusing the primitives on dynamic objects.
  The combined result seeds the Gabor primitives before gradient‑based optimization begins (the third sketch below illustrates the back‑projection step).
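To make the primitive design concrete, here is a minimal sketch of a 2‑D Gabor‑like splat: a Gaussian envelope modulated by a cosine carrier whose frequency vector is a learnable parameter, with a scalar energy‑compensation factor rescaling the amplitude. All names, shapes, and the 2‑D setting are illustrative assumptions, not the authors’ implementation.

```python
import torch

def gabor_splat(coords, center, inv_cov, freq, phase, amplitude, energy_comp):
    """Evaluate one hypothetical Gabor-like primitive at 2-D query points.

    coords:      (N, 2) query positions
    center:      (2,)   primitive mean
    inv_cov:     (2, 2) inverse covariance of the Gaussian envelope
    freq:        (2,)   learnable per-primitive frequency vector (carrier)
    phase:       ()     carrier phase
    amplitude:   ()     base amplitude
    energy_comp: ()     scalar that rescales the amplitude so the carrier's
                        high-frequency term does not inflate the total energy
    """
    d = coords - center                                  # offsets from the mean
    mahal = torch.einsum('ni,ij,nj->n', d, inv_cov, d)   # Mahalanobis distance
    envelope = torch.exp(-0.5 * mahal)                   # smooth Gaussian blob
    carrier = torch.cos((d * freq).sum(-1) + phase)      # sinusoidal detail term
    return energy_comp * amplitude * envelope * carrier

# Toy usage: one primitive queried at a handful of points.
coords = torch.rand(5, 2)
value = gabor_splat(
    coords,
    center=torch.tensor([0.5, 0.5]),
    inv_cov=torch.eye(2) * 40.0,
    freq=torch.nn.Parameter(torch.tensor([30.0, 0.0])),  # adapts during training
    phase=torch.tensor(0.0),
    amplitude=torch.tensor(1.0),
    energy_comp=torch.tensor(0.8),
)
```

A pure Gaussian splat is the special case of a zero frequency vector (constant carrier), which is why this formulation subsumes prior Gaussian‑only pipelines while adding high‑frequency detail.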
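The temporal model can be sketched in the same spirit: each primitive stores a position and tangent at consecutive keyframes, a cubic Hermite basis interpolates between them, and a curvature term penalizes large second derivatives of the sampled trajectory. The finite‑difference penalty below is an illustrative approximation, not the paper’s exact regularizer.

```python
import torch

def hermite_position(p0, p1, m0, m1, t):
    """Cubic Hermite interpolation between two keyframes.

    p0, p1: (3,) positions at the keyframes
    m0, m1: (3,) tangents at the keyframes
    t:      scalar tensor, normalized time in [0, 1]
    """
    t = t.unsqueeze(-1)
    h00 = 2 * t**3 - 3 * t**2 + 1          # standard Hermite basis functions
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

def curvature_penalty(p0, p1, m0, m1, n_samples=16):
    """Finite-difference proxy for a temporal curvature regularizer:
    penalizes the squared second derivative of the sampled trajectory."""
    times = torch.linspace(0.0, 1.0, n_samples)
    pos = torch.stack([hermite_position(p0, p1, m0, m1, t) for t in times])
    second_diff = pos[2:] - 2 * pos[1:-1] + pos[:-2]
    return (second_diff ** 2).sum(-1).mean()

# Toy usage: one primitive moving between two keyframes.
p0, p1 = torch.zeros(3), torch.ones(3)
m0, m1 = torch.zeros(3), torch.zeros(3)
midpoint = hermite_position(p0, p1, m0, m1, torch.tensor(0.5))
smoothness = curvature_penalty(p0, p1, m0, m1)
```

Evaluating the spline at arbitrary times between keyframes is also what enables the frame‑interpolation use mentioned under Results & Findings.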
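For the initialization stage, a rough mental model is: back‑project masked depth pixels into 3‑D seed points, then let the tracked correspondences give those seeds an approximate trajectory. The sketch below covers only the single‑frame back‑projection, with synthetic tensors standing in for the depth model, mask model, and camera intrinsics; all of it is an assumption for illustration.

```python
import torch

def backproject_seeds(depth, mask, K):
    """Back-project foreground pixels into a 3-D seed point cloud.

    depth: (H, W) depth map from a monocular depth estimator
    mask:  (H, W) boolean foreground mask from a segmentation model
    K:     (3, 3) pinhole camera intrinsics
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    z = depth[mask]
    x = (u[mask] - cx) * z / fx             # unproject along the camera rays
    y = (v[mask] - cy) * z / fy
    return torch.stack([x, y, z], dim=-1)   # (N, 3) seed positions

# Toy usage with synthetic stand-ins for the pre-trained depth/mask models.
depth = torch.rand(4, 4) + 1.0
mask = depth > 1.2
K = torch.tensor([[2.0, 0.0, 2.0],
                  [0.0, 2.0, 2.0],
                  [0.0, 0.0, 1.0]])
seeds = backproject_seeds(depth, mask, K)   # one Gabor primitive per seed
```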
- Training Objective – A weighted sum of:
  - Photometric loss (L2 + perceptual LPIPS) on rendered frames,
  - Depth consistency loss (align rendered depth with estimated depth),
  - Temporal curvature loss, and
  - Regularizers for frequency magnitude and energy balance.
  All terms are differentiable, so standard Adam optimization suffices.
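As a rough illustration of how these terms might be combined, here is a hedged sketch of a single weighted objective. The weight values, the placeholder perceptual_fn, and the exact regularizer forms are assumptions for illustration, not the paper’s configuration.

```python
import torch
import torch.nn.functional as F

def total_loss(rendered, target, rendered_depth, est_depth, curvature_term,
               freq, energy_comp, perceptual_fn=None,
               w_photo=1.0, w_percep=0.1, w_depth=0.05, w_curv=0.01, w_reg=1e-4):
    """Weighted sum of the objective terms listed above (illustrative weights)."""
    photo = F.mse_loss(rendered, target)                  # L2 photometric term
    percep = (perceptual_fn(rendered, target)
              if perceptual_fn is not None else rendered.new_zeros(()))
    depth = F.l1_loss(rendered_depth, est_depth)          # depth consistency
    reg = (freq ** 2).mean() + (energy_comp ** 2).mean()  # frequency / energy regularizers
    return (w_photo * photo + w_percep * percep
            + w_depth * depth + w_curv * curvature_term + w_reg * reg)
```

In practice the perceptual term would come from a pre‑trained LPIPS network and the curvature term from a spline penalty like the one sketched earlier.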
Results & Findings
| Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Tap‑Vid (dynamic) | 35.49 | 0.9433 | 0.0723 |
| DAVIS (high‑motion) | 34.1 | 0.938 | 0.079 |
- Detail Preservation: Compared to pure Gaussian models, AdaGaR recovers sharper textures (e.g., hair strands, fabric patterns) thanks to the learned high‑frequency carrier.
- Motion Smoothness: Interpolated frames show no jitter or ghosting; the curvature regularizer effectively eliminates the “wiggle” artifacts seen in prior works.
- Generalization: The same trained model can be repurposed for downstream tasks—frame interpolation, depth‑consistent video editing, and even stereo view synthesis—without retraining.
Practical Implications
- Real‑Time AR/VR Content Creation: Developers can capture a single handheld video and instantly generate a high‑fidelity, animatable 3‑D proxy for immersive experiences.
- Dynamic Scene Editing: Video editors can manipulate objects (e.g., reposition, recolor) while preserving realistic motion, thanks to the explicit primitive representation.
- Efficient Storage & Streaming: Because the scene is encoded as a compact set of adaptive Gabor primitives and spline trajectories, bandwidth‑constrained applications (e.g., cloud gaming) can stream a lightweight model instead of full video frames.
- Robotics & Autonomous Driving: The method’s ability to produce temporally consistent depth maps from monocular footage can improve perception pipelines that need both geometry and motion cues.
Limitations & Future Work
- Scalability to Large‑Scale Scenes: The current implementation assumes a relatively bounded number of primitives; scaling to city‑scale environments may require hierarchical or sparse representations.
- Dependency on Pre‑trained Depth/Mask Models: Errors in the initialization stage (e.g., inaccurate depth on reflective surfaces) can propagate into the final reconstruction.
- Real‑Time Rendering Speed: While more efficient than full NeRFs, rendering still incurs a non‑trivial cost; future work could explore GPU‑accelerated spline evaluation or hybrid rasterization techniques.
- Extension to Multi‑View Inputs: The authors focus on monocular video; integrating stereo or multi‑camera setups could further boost accuracy and reduce ambiguity in motion estimation.
AdaGaR bridges the gap between high‑frequency visual detail and temporally coherent motion, offering a practical toolkit for developers who need dynamic 3‑D reconstructions without the heavy computational baggage of full neural rendering.
Authors
- Jiewen Chan
- Zhenjun Zhao
- Yu‑Lun Liu
Paper Information
- arXiv ID: 2601.00796v1
- Categories: cs.CV
- Published: January 2, 2026
- PDF: https://arxiv.org/pdf/2601.00796v1