[Paper] InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

Published: January 6, 2026 at 01:57 PM EST
4 min read

Source: arXiv - 2601.03252v1

Overview

InfiniDepth tackles a long‑standing bottleneck in monocular depth estimation: the reliance on pixel‑grid outputs that cap resolution and miss fine geometric details. By representing depth as a continuous neural implicit field, the authors enable depth queries at any 2‑D coordinate, opening the door to arbitrarily high‑resolution maps and sharper reconstruction of intricate structures. The paper also introduces a new 4K synthetic benchmark to stress‑test these capabilities.

Key Contributions

  • Neural Implicit Depth Representation – Reformulates depth as a continuous field learned by a lightweight local implicit decoder, allowing depth queries at arbitrary image coordinates.
  • Arbitrary‑Resolution Output – Eliminates the fixed‑grid constraint; developers can request depth at any resolution (e.g., 4K, 8K) without retraining (see the query‑grid sketch after this list).
  • Fine‑Grained Detail Recovery – Demonstrates superior performance on thin structures, edges, and texture‑rich regions compared with grid‑based baselines.
  • High‑Quality 4K Synthetic Benchmark – Curated from five modern video games, covering diverse indoor/outdoor scenes with rich geometry and realistic lighting.
  • Cross‑Task Benefits – Shows that the implicit depth maps improve novel view synthesis, reducing holes and artifacts under large viewpoint changes.
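
A minimal sketch of how arbitrary‑resolution querying can be set up in practice, assuming query coordinates normalized to [-1, 1] and pixel‑center sampling; the function name and conventions below are illustrative assumptions, not the authors' code.

```python
import torch

def make_query_grid(height: int, width: int) -> torch.Tensor:
    """Return an (H*W, 2) tensor of (u, v) query coordinates in [-1, 1]."""
    # Sample at pixel centers so the coordinates stay resolution independent.
    u = (torch.arange(width) + 0.5) / width * 2.0 - 1.0
    v = (torch.arange(height) + 0.5) / height * 2.0 - 1.0
    vv, uu = torch.meshgrid(v, u, indexing="ij")
    return torch.stack([uu, vv], dim=-1).reshape(-1, 2)

# The same trained model can be queried at 1080p, 4K, or 8K simply by
# changing the grid size; no retraining or upsampling step is involved.
coords_4k = make_query_grid(2160, 3840)
coords_8k = make_query_grid(4320, 7680)
```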

Methodology

  1. Local Implicit Decoder – The network takes a standard CNN backbone’s feature map and, for each query coordinate (u, v), extracts a small local patch of features. These features are fed into a tiny MLP that predicts the depth value at that exact coordinate (see the decoder sketch after this list).
  2. Continuous Querying – Because the decoder is a function of continuous coordinates, depth can be sampled at any resolution on‑the‑fly (e.g., bilinear upsampling is replaced by direct queries).
  3. Training Objective – The model is supervised with a combination of a relative depth ranking loss (to preserve scene ordering) and a metric L1 loss (to enforce absolute scale), plus a smoothness regularizer that encourages locally coherent surfaces (a loss sketch follows the list).
  4. Benchmark Construction – Using game engines, the authors rendered RGB‑depth pairs at 4K resolution, ensuring accurate ground‑truth geometry and diverse visual conditions (lighting, materials, motion).
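
To make steps 1–2 concrete, the sketch below shows one plausible shape of a local implicit decoder in PyTorch. The bilinear feature lookup via `grid_sample`, the layer widths, and the coordinate conditioning are our assumptions; the paper's decoder may extract a larger local feature patch and use a different architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalImplicitDepthDecoder(nn.Module):
    """Tiny MLP predicting depth at continuous (u, v) coordinates from
    locally sampled backbone features. Layer sizes are assumptions."""

    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats:  (B, C, Hf, Wf) backbone feature map
        # coords: (B, N, 2) query coordinates in [-1, 1]
        # Bilinearly sample a local feature vector at each continuous query.
        sampled = F.grid_sample(
            feats, coords.unsqueeze(2), mode="bilinear", align_corners=False
        )                                                # (B, C, N, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)   # (B, N, C)
        # Condition the MLP on the coordinate itself so it can recover
        # detail finer than one feature cell.
        x = torch.cat([sampled, coords], dim=-1)
        return self.mlp(x).squeeze(-1)                   # (B, N) depth values
```

Because the decoder is an ordinary function of continuous coordinates, the same weights serve any output resolution: pass it the 4K or 8K query grid from the earlier sketch.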
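
Step 3's combined objective might look roughly like the following; the pairwise sampling used for the ranking term, the jitter‑based smoothness term, and the loss weights are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ranking_loss(pred, gt, num_pairs: int = 4096, margin: float = 0.0):
    """Pairwise ordinal loss: randomly sampled point pairs must keep the
    same depth ordering as the ground truth (pair sampling is an assumption)."""
    n = pred.shape[-1]
    i = torch.randint(0, n, (num_pairs,), device=pred.device)
    j = torch.randint(0, n, (num_pairs,), device=pred.device)
    sign = torch.sign(gt[..., i] - gt[..., j])
    diff = pred[..., i] - pred[..., j]
    return F.relu(margin - sign * diff).mean()

def total_loss(pred, gt, pred_jittered, w_rank=1.0, w_l1=1.0, w_smooth=0.1):
    """Metric L1 anchors absolute scale, the ranking term preserves scene
    ordering, and the smoothness term compares predictions at a query point
    and at a slightly jittered neighbour. Weights are illustrative."""
    l1 = F.l1_loss(pred, gt)
    smooth = (pred - pred_jittered).abs().mean()
    return w_l1 * l1 + w_rank * ranking_loss(pred, gt) + w_smooth * smooth
```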

Results & Findings

  • State‑of‑the‑Art Accuracy – On both the new 4K synthetic suite and established real‑world datasets (e.g., NYU‑Depth V2, KITTI), InfiniDepth outperforms prior methods by 5–12% on standard depth metrics (RMSE, δ<1.25; see the metric sketch after this list).
  • Resolution Scaling – When queried at 8K, the model retains its accuracy, whereas grid‑based baselines degrade sharply because they must upsample from a low‑resolution prediction.
  • Fine‑Detail Gains – Edge‑aware metrics show up to 30% improvement on thin objects (railings, wires) and high‑frequency textures.
  • View Synthesis – Integrated into a neural rendering pipeline, the implicit depth reduces hole‑filling artifacts by 40% and yields smoother novel views under ±30° camera shifts.
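
For reference, the two quoted metrics are standard in monocular depth evaluation and can be computed as below; masking out invalid ground‑truth values is a common convention, not a detail taken from the paper.

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Compute RMSE and the delta < 1.25 threshold accuracy over valid pixels."""
    valid = gt > eps                          # skip missing / invalid depth
    pred, gt = pred[valid], gt[valid]
    rmse = torch.sqrt(torch.mean((pred - gt) ** 2))
    ratio = torch.maximum(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean()
    return {"rmse": rmse.item(), "delta<1.25": delta1.item()}
```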

Practical Implications

  • Game & VR Development – Developers can generate ultra‑high‑resolution depth maps from a single RGB frame for real‑time effects (e.g., depth‑of‑field, occlusion culling) without pre‑computing dense depth buffers.
  • Robotics & AR – On‑device inference can produce fine‑grained depth at the camera’s native resolution, improving obstacle detection around thin objects that traditional sensors miss.
  • Content Creation Pipelines – Artists can upsample depth for post‑production (e.g., compositing, relighting) without introducing interpolation artifacts, saving time on manual depth editing.
  • Neural Rendering – The implicit depth field integrates cleanly with NeRF‑style view synthesis, enabling higher‑quality novel view generation for telepresence or digital twins.

Limitations & Future Work

  • Inference Overhead – Querying each pixel individually through an MLP is slower than a single forward pass of a dense decoder; the authors mitigate this with batched queries (see the chunked‑query sketch after this list), but real‑time 8K inference still strains current GPUs.
  • Generalization to Unseen Domains – While the synthetic benchmark is diverse, performance on highly reflective or transparent surfaces (e.g., glass, water) remains modest, suggesting a need for domain‑adaptive training.
  • Memory Footprint – Storing high‑resolution feature maps for local decoding can strain mobile or embedded devices. Future work could explore feature compression or hybrid grid–implicit representations.
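
The batching mitigation mentioned above usually amounts to evaluating the query set in fixed‑size chunks; a minimal sketch, assuming the decoder interface from the methodology example, with the chunk size as a per‑device tuning knob.

```python
import torch

@torch.no_grad()
def query_in_chunks(decoder, feats, coords, chunk: int = 262_144):
    """Evaluate the implicit decoder over millions of query points in chunks
    to bound peak GPU memory (the chunk size is an assumption)."""
    outputs = []
    for start in range(0, coords.shape[1], chunk):
        outputs.append(decoder(feats, coords[:, start:start + chunk]))
    return torch.cat(outputs, dim=1)
```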

InfiniDepth demonstrates that moving depth estimation from a discrete grid to a continuous implicit field is not just a theoretical exercise—it unlocks practical, high‑resolution depth for the next generation of visual computing applications.

Authors

  • Hao Yu
  • Haotong Lin
  • Jiawei Wang
  • Jiaxin Li
  • Yida Wang
  • Xueyang Zhang
  • Yue Wang
  • Xiaowei Zhou
  • Ruizhen Hu
  • Sida Peng

Paper Information

  • arXiv ID: 2601.03252v1
  • Categories: cs.CV
  • Published: January 6, 2026