[Paper] Endo-G²T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes

Published: November 26, 2025 at 08:12 AM EST
3 min read
Source: arXiv - 2511.21367v1

Overview

The paper introduces Endo‑G²T, a training pipeline for 4‑dimensional Gaussian Splatting (4DGS) that reconstructs high‑fidelity, temporally consistent 3‑D geometry from endoscopic video streams. By distilling geometry‑aware depth priors into a time‑embedded Gaussian field, the method mitigates the drift and specular artifacts that plague existing monocular endoscopic reconstruction techniques.

Key Contributions

  • Geometry‑guided prior distillation: Converts confidence‑gated monocular depth into scale‑invariant depth and gradient losses, injected gradually via a warm‑up schedule to prevent early over‑fitting (see the loss sketch just after this list).
  • Time‑embedded Gaussian field: Extends the 3‑D Gaussian splatting representation to the XYZT space with a rotor‑like rotation parameter, enabling smooth, coherent motion modeling and crisp opacity boundaries.
  • Keyframe‑constrained streaming: Optimizes only a limited set of keyframes under a max‑points budget while updating non‑keyframes with lightweight steps, delivering long‑horizon stability and real‑time performance.
  • State‑of‑the‑art results on challenging endoscopic benchmarks (EndoNeRF, StereoMIS‑P1) compared to existing monocular reconstruction baselines.
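
A minimal PyTorch sketch of how the confidence‑gated, scale‑invariant depth loss and the warm‑up‑to‑cap schedule could fit together. The function names, the linear ramp shape, and the exact gradient‑loss form are illustrative assumptions, not the authors' released code:

```python
import torch


def warmup_to_cap(step: int, warmup_steps: int = 2000, cap: float = 0.5) -> float:
    """Ramp a loss weight from 0 up to `cap` over `warmup_steps` iterations.
    Hypothetical linear schedule; the paper's exact ramp is not specified here."""
    return cap * min(step / max(warmup_steps, 1), 1.0)


def scale_invariant_depth_loss(pred_log_d: torch.Tensor,
                               prior_log_d: torch.Tensor,
                               conf_mask: torch.Tensor) -> torch.Tensor:
    """Eigen-style scale-invariant log-depth loss, gated by a confidence mask.
    pred_log_d / prior_log_d: (H, W) log-depths; conf_mask: (H, W) in {0, 1}."""
    diff = (pred_log_d - prior_log_d) * conf_mask
    n = conf_mask.sum().clamp(min=1.0)
    # Subtracting the squared mean makes the loss invariant to a global
    # scale offset, which monocular depth priors cannot provide anyway.
    return (diff ** 2).sum() / n - (diff.sum() / n) ** 2


def depth_gradient_loss(pred_d: torch.Tensor,
                        prior_d: torch.Tensor,
                        conf_mask: torch.Tensor) -> torch.Tensor:
    """Match depth gradients so rendered depth edges follow the prior's edges,
    again only in confident regions."""
    def grads(d):
        return d[:, 1:] - d[:, :-1], d[1:, :] - d[:-1, :]

    gx_p, gy_p = grads(pred_d)
    gx_t, gy_t = grads(prior_d)
    mx = conf_mask[:, 1:] * conf_mask[:, :-1]  # both pixels must be confident
    my = conf_mask[1:, :] * conf_mask[:-1, :]
    return ((gx_p - gx_t).abs() * mx).mean() + ((gy_p - gy_t).abs() * my).mean()
```

In a training loop these would combine as roughly `loss = photometric + warmup_to_cap(step) * (si_depth + grad_loss)`, so the Gaussian field first learns a rough appearance before geometry anchors it.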

Methodology

  1. Depth Prior Extraction – A pretrained monocular depth network predicts per‑pixel depth and a confidence mask. The confidence mask gates the depth loss so that only reliable regions influence the geometry.
  2. Soft Prior Injection – During the first training epochs, a “warm‑up‑to‑cap” schedule scales the depth‑gradient loss from 0 to its full weight, allowing the Gaussian field to first learn a rough appearance before being anchored by geometry.
  3. 4D Gaussian Representation – Each scene point is stored as a Gaussian with position, covariance, color, opacity, and an additional rotor that encodes rotation over time. This rotor makes the field naturally handle view‑dependent effects (specularities, wet reflections) while keeping the motion smooth. A time‑slicing sketch follows this list.
  4. Streaming Optimization – The video is split into keyframes and non‑keyframes. Keyframes receive full Gaussian updates under a global point budget; non‑keyframes are updated with a cheap, incremental step that only refines existing Gaussians. This keeps memory usage bounded and enables near‑real‑time training on commodity GPUs.
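
The paper's exact rotor parameterization isn't reproduced here, but a common way an XYZT Gaussian is evaluated in 4DGS‑style methods is to condition the 4D Gaussian on the query time t via standard conditional‑Gaussian algebra. The sketch below shows that slice; the names and shapes are mine, and in a full system the rotor would supply the SO(4) rotation used to build `cov4` from per‑axis scales:

```python
import torch


def slice_gaussian_at_time(mu4: torch.Tensor, cov4: torch.Tensor, t: float):
    """Condition a 4D (XYZT) Gaussian on time t, yielding a 3D Gaussian plus a
    temporal opacity factor. mu4: (4,) mean, cov4: (4, 4) covariance.
    A sketch of how a time-embedded Gaussian field can be evaluated, not the
    paper's exact parameterization."""
    mu_x, mu_t = mu4[:3], mu4[3]
    cov_xx = cov4[:3, :3]       # spatial block
    cov_xt = cov4[:3, 3:4]      # space-time coupling: this drives motion
    cov_tt = cov4[3, 3]         # temporal variance: "lifetime" of the Gaussian
    # Conditional mean: the center translates as a function of t.
    mean_t = mu_x + (cov_xt * ((t - mu_t) / cov_tt)).squeeze(-1)
    # Conditional covariance: the spatial footprint after fixing t.
    cov_t = cov_xx - cov_xt @ cov_xt.T / cov_tt
    # Marginal temporal density: fades the primitive away from its active time.
    opacity_scale = torch.exp(-0.5 * (t - mu_t) ** 2 / cov_tt)
    return mean_t, cov_t, opacity_scale
```

The returned `opacity_scale` multiplies the Gaussian's learned opacity, so each primitive is only "on" near its own time window, which is one way crisp opacity boundaries fall out of the representation.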

Results & Findings

| Dataset | Metric | Baseline (Mono‑NeRF) | Endo‑G²T |
|---|---|---|---|
| EndoNeRF | PSNR ↑ | 28.7 | 31.4 |
| StereoMIS‑P1 | SSIM ↑ | 0.71 | 0.84 |
  • Geometric drift is dramatically reduced; reconstructed surfaces stay faithful to the true anatomy even after long video sequences.
  • Temporal coherence improves visual continuity: moving tools and tissue deformations appear smooth without jitter.
  • Computation: Keyframe‑constrained streaming cuts training time by ~35 % compared with full‑frame 4DGS while staying within a 2 M‑point budget; a schematic of the streaming update follows below.
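
A schematic of how the keyframe schedule and point budget could be organized. The keyframe interval, iteration counts, and lowest‑opacity pruning policy are illustrative assumptions, not the paper's released code:

```python
import torch

MAX_POINTS = 2_000_000  # matches the ~2 M-point budget quoted above


def steps_for_frame(frame_idx: int, keyframe_every: int = 8,
                    full_iters: int = 100, light_iters: int = 10) -> int:
    """Keyframes get a full optimization pass; in-between frames get only a
    few cheap refinement steps on existing Gaussians. All numbers here are
    illustrative, not from the paper."""
    return full_iters if frame_idx % keyframe_every == 0 else light_iters


def budget_keep_mask(opacity: torch.Tensor,
                     max_points: int = MAX_POINTS) -> torch.Tensor:
    """Boolean mask keeping at most `max_points` Gaussians, dropping the
    lowest-opacity primitives first (one plausible pruning policy that keeps
    memory bounded during streaming)."""
    n = opacity.numel()
    if n <= max_points:
        return torch.ones(n, dtype=torch.bool)
    keep = torch.zeros(n, dtype=torch.bool)
    keep[torch.topk(opacity, max_points).indices] = True
    return keep
```

After each keyframe's densification step, applying `budget_keep_mask` to the opacity vector keeps the global point count bounded, which is what allows long sequences to be processed with flat memory use.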

Practical Implications

  • Real‑time navigation assistance – Surgeons could receive on‑the‑fly 3‑D reconstructions of the lumen, improving orientation in minimally invasive procedures.
  • Automated tool tracking – A temporally stable geometry map makes it easier to attach downstream pose‑estimation or segmentation modules for robotic assistance.
  • Dataset generation – High‑quality 4D reconstructions can serve as ground truth for training AI models (e.g., polyp detection) without the need for costly intra‑operative CT scans.
  • Hardware friendliness – The streaming approach runs on a single RTX‑3080‑class GPU, lowering the barrier for integration into existing OR imaging stacks.

Limitations & Future Work

  • The method still relies on a pretrained monocular depth estimator; errors in low‑texture or heavily occluded regions can propagate despite the confidence gating.
  • Rotor‑based motion modeling assumes relatively smooth deformations; abrupt tissue tearing or rapid instrument insertion may need more expressive dynamics.
  • Future research directions include self‑supervised depth refinement, adaptive point‑budget allocation, and extending the pipeline to multi‑camera endoscopic rigs for even richer 4D capture.

Authors

  • Yangle Liu
  • Fengze Li
  • Kan Liu
  • Jieming Ma

Paper Information

  • arXiv ID: 2511.21367v1
  • Categories: cs.CV
  • Published: November 26, 2025
  • PDF: https://arxiv.org/pdf/2511.21367v1