[Paper] ImprovedGS+: A High-Performance C++/CUDA Re-Implementation Strategy for 3D Gaussian Splatting

Published: March 9, 2026 at 01:38 PM EDT

Source: arXiv - 2603.08661v1

Overview

The paper introduces ImprovedGS+, a ground‑up C++/CUDA re‑implementation of the popular 3D Gaussian Splatting (3DGS) pipeline. By moving the heavy lifting from Python to native GPU kernels, the authors dramatically cut training time and memory use while still delivering top‑tier visual quality—making real‑time‑ish 3D scene reconstruction a realistic target for developers.

Key Contributions

  • Native C++/CUDA Engine: Rewrites the entire ImprovedGS workflow as low‑level kernels inside the LichtFeld‑Studio framework, eliminating costly Python‑GPU hand‑offs.
  • Long‑Axis‑Split (LAS) Kernel: A custom CUDA routine that partitions Gaussian splats along their longest axis, reducing thread divergence and synchronization overhead.
  • Laplacian‑Based Importance + NMS: Edge‑aware importance weighting with non‑maximum suppression to focus compute on high‑frequency regions.
  • Adaptive Exponential Scale Scheduler: Dynamically adjusts Gaussian scales during training, improving convergence speed and final fidelity.
  • Pareto‑Optimal Performance: Demonstrates a new Pareto front on the Mip‑NeRF360 benchmark, achieving faster training and higher PSNR with fewer Gaussians.
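The long‑axis split in the second bullet can be illustrated with a minimal CPU‑side sketch. The `Gaussian` struct and `longAxisSplit` function here are hypothetical simplifications: the paper's kernel operates on full 3D covariances inside CUDA, whereas this version assumes the principal axes are already available as per‑axis scales.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Simplified splat: world-space mean plus an extent along each of its
// three principal axes (standing in for an eigen-decomposed covariance).
struct Gaussian {
    std::array<float, 3> mean;
    std::array<float, 3> scale;
    float opacity;
};

// Split a splat into two children along its longest axis: each child
// keeps the short axes, halves the long one, and is offset so the pair
// together covers the parent's footprint.
std::pair<Gaussian, Gaussian> longAxisSplit(const Gaussian& g) {
    // Find the index of the longest axis.
    std::size_t axis = 0;
    for (std::size_t i = 1; i < 3; ++i)
        if (g.scale[i] > g.scale[axis]) axis = i;

    Gaussian a = g, b = g;
    const float half = 0.5f * g.scale[axis];
    a.scale[axis] = half;
    b.scale[axis] = half;
    a.mean[axis] -= half; // shift the children apart along the long axis
    b.mean[axis] += half;
    return {a, b};
}
```

For example, splitting a splat with scales {1, 4, 2} halves the longest axis (index 1) and offsets the two children by ±2 along it.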

Methodology

  1. Framework Migration – The original ImprovedGS pipeline (Python + PyTorch) was ported to the LichtFeld‑Studio C++ core. All data structures (Gaussian parameters, feature tensors, etc.) now live directly in GPU memory.
  2. Kernel Design
    • LAS: For each Gaussian, the kernel computes its principal axis, splits the splat into two sub‑splats aligned with the longest dimension, and processes them in parallel. This reduces warp idle time.
    • Importance & NMS: A Laplacian filter extracts edge strength per pixel; a fast NMS pass keeps only the strongest responses, guiding the optimizer to allocate Gaussians where they matter most.
  3. Training Loop – Host‑device synchronization points were collapsed to a single per‑iteration barrier. The optimizer now updates positions, covariances, and colors directly on the device, cutting the “Python‑GPU round‑trip” latency.
  4. Scale Scheduler – An exponential decay schedule with adaptive resets, triggered by loss‑plateau detection, lets the model quickly shrink Gaussians in low‑detail zones while preserving detail where needed.

Results & Findings

| Variant | Training Time (min) | # Gaussians | PSNR (dB) | Δ vs. Baseline |
| --- | --- | --- | --- | --- |
| ImprovedGS+ (1M‑budget) | ≈ 73 (‑26.8 %) | ≈ 1.33 M (‑13.3 %) | 30.2 | Faster & leaner than MCMC |
| ImprovedGS+ (full) | 112 | 2.1 M | 31.5 (+1.28 dB) | 38.4 % fewer params, higher quality than ADC |

  • Speed: The C++/CUDA stack saves ~17 minutes per training session compared to the Python baseline.
  • Quality: Despite using fewer Gaussians, the 1M‑budget version matches or exceeds visual fidelity of state‑of‑the‑art methods.
  • Scalability: The adaptive scheduler keeps memory footprints modest even when scaling to millions of Gaussians, preserving interactivity for larger scenes.

Practical Implications

  • Faster Prototyping – Developers can iterate on scene capture and reconstruction pipelines in under two hours, a huge productivity boost for AR/VR content pipelines.
  • Edge‑Device Feasibility – The reduced parametric load means 3DGS can now run on high‑end mobile GPUs or embedded platforms with limited VRAM, opening doors for on‑device scanning apps.
  • Integration Ready – Because the implementation lives inside LichtFeld‑Studio, existing tools (e.g., real‑time view synthesis, mixed‑reality editors) can plug in ImprovedGS+ with minimal API changes.
  • Cost Savings – Shorter training translates directly to lower cloud GPU bills for studios that render large datasets (e.g., digital twins, game asset generation).

Limitations & Future Work

  • Hardware Specificity – The current kernels are tuned for NVIDIA CUDA; porting to AMD or Apple Silicon will require a separate rewrite or reliance on SYCL/Metal.
  • Dataset Scope – Experiments focus on Mip‑NeRF360; broader validation on outdoor LiDAR scans or highly dynamic scenes is still pending.
  • Usability Layer – While the core engine is fast, the surrounding Python‑level tooling for data preprocessing and post‑processing still lags behind the low‑level speed gains.
  • Future Directions – The authors suggest exploring mixed‑precision kernels, auto‑tuning of the LAS split factor, and integrating learned importance maps to further reduce the Gaussian count without sacrificing detail.

Authors

  • Jordi Muñoz Vicente

Paper Information

  • arXiv ID: 2603.08661v1
  • Categories: cs.CV
  • Published: March 9, 2026