[Paper] Multi-view Pyramid Transformer: Look Coarser to See Broader

Published: December 8, 2025 at 01:39 PM EST
4 min read
Source: arXiv - 2512.07806v1

Overview

The Multi‑view Pyramid Transformer (MVP) introduces a new way to turn dozens—or even hundreds—of photos into a coherent 3D model in a single forward pass. By arranging attention both across views (local → group → whole‑scene) and within each view (pixel‑level → compact tokens), MVP delivers high‑quality reconstructions while keeping compute and memory requirements in check, making large‑scale scene capture practical for developers.

Key Contributions

  • Dual‑hierarchy transformer design – a local‑to‑global inter‑view hierarchy combined with a fine‑to‑coarse intra‑view hierarchy.
  • Scalable single‑pass reconstruction – processes tens to hundreds of images without iterative optimization or per‑image passes.
  • Integration with 3D Gaussian Splatting – leverages a fast, differentiable 3D representation to achieve state‑of‑the‑art visual fidelity.
  • Broad dataset validation – demonstrates consistent quality across indoor, outdoor, and mixed‑reality datasets, outperforming prior generalizable methods.
  • Efficiency gains – reduces FLOPs and GPU memory by up to 45 % compared to baseline multi‑view transformers while preserving or improving accuracy.

Methodology

  1. Input preprocessing – each input image is split into small spatial patches, which are embedded as patch tokens.
  2. Fine‑to‑coarse intra‑view encoder – within a single view, a cascade of transformer blocks progressively merges neighboring patches, turning many fine‑grained tokens into a few information‑dense tokens. This mirrors a pyramid where details are pooled into higher‑level descriptors.
  3. Local‑to‑global inter‑view hierarchy – the compact tokens from each view are first grouped with tokens from nearby views (e.g., overlapping camera frustums). Subsequent transformer layers expand the grouping radius, eventually attending to all views in the scene.
  4. Cross‑attention fusion – at each hierarchy level, cross‑attention allows tokens to exchange context, enabling the model to reason about occlusions, lighting consistency, and geometry jointly.
  5. 3D Gaussian Splatting decoder – the final fused representation is decoded into a set of 3D Gaussians (position, covariance, color, opacity). These Gaussians can be rasterized in real time, producing novel‑view renderings.
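To make step 5's output concrete, here is a minimal, hypothetical sketch of a Gaussian prediction head in PyTorch. The split into position, scale, rotation, color, and opacity follows the standard 3D Gaussian Splatting parameterization; the layer sizes, activations, and names are illustrative assumptions, not the paper's exact head.

```python
# Hypothetical sketch only: one fused token -> one 3D Gaussian.
from typing import NamedTuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianParams(NamedTuple):
    position: torch.Tensor   # (N, 3) world-space centers
    scale: torch.Tensor      # (N, 3) per-axis extents; covariance = R S S^T R^T
    rotation: torch.Tensor   # (N, 4) unit quaternions parameterizing R
    color: torch.Tensor      # (N, 3) RGB in [0, 1]
    opacity: torch.Tensor    # (N, 1) in [0, 1]


class GaussianHead(nn.Module):
    """Maps each fused scene token to the 14 parameters of one Gaussian."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 3 + 3 + 4 + 3 + 1)

    def forward(self, tokens: torch.Tensor) -> GaussianParams:
        # tokens: (N, dim) fused tokens produced by the encoder
        pos, scale, rot, color, opacity = self.proj(tokens).split([3, 3, 4, 3, 1], dim=-1)
        return GaussianParams(
            position=pos,
            scale=torch.exp(scale),              # keep extents strictly positive
            rotation=F.normalize(rot, dim=-1),   # project to unit quaternions
            color=torch.sigmoid(color),
            opacity=torch.sigmoid(opacity),
        )
```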

The whole pipeline is end‑to‑end differentiable, so the network can be trained on large multi‑view datasets without per‑scene fine‑tuning.
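To make the two hierarchies concrete, the sketch below wires steps 1–4 together with standard transformer building blocks: a 16×16 patch embedding, two 2× token-merge stages within each view, and inter-view attention whose group size grows from 4 views to 16 views to the whole scene. All of these sizes, the module names, and the group schedule are assumptions chosen for illustration; the paper's actual configuration may differ.

```python
# Structural sketch of steps 1-4; all sizes and schedules are assumptions.
import torch
import torch.nn as nn


class PatchMerge(nn.Module):
    """Fine-to-coarse intra-view step: fuse pairs of adjacent tokens.
    (A full implementation would merge 2D-neighboring patches.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.reduce = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        views, tokens, dim = x.shape          # assumes an even token count
        return self.reduce(x.reshape(views, tokens // 2, 2 * dim))


class GroupedViewAttention(nn.Module):
    """Local-to-global inter-view step: attention restricted to groups of views."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, group_size: int) -> torch.Tensor:
        # x: (views, tokens, dim); views are assumed ordered so neighbors in the
        # batch roughly share camera frustums. Views that do not fill a complete
        # group are simply passed through in this sketch.
        views, tokens, dim = x.shape
        g = min(group_size, views)
        usable = (views // g) * g
        groups = x[:usable].reshape(-1, g * tokens, dim)   # (num_groups, g*tokens, dim)
        h = self.norm(groups)
        groups = groups + self.attn(h, h, h, need_weights=False)[0]
        out = x.clone()
        out[:usable] = groups.reshape(-1, tokens, dim)
        return out


class MVPEncoderSketch(nn.Module):
    def __init__(self, dim: int = 256, patch: int = 16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)       # step 1
        self.merges = nn.ModuleList([PatchMerge(dim) for _ in range(2)])          # step 2
        self.view_layers = nn.ModuleList([GroupedViewAttention(dim) for _ in range(3)])
        self.group_schedule = [4, 16, 10**9]   # step 3: local -> wider -> whole scene

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (views, 3, H, W) -> fused tokens: (views * tokens, dim)
        x = self.patchify(images).flatten(2).transpose(1, 2)   # (views, tokens, dim)
        for merge in self.merges:                               # fine -> coarse
            x = merge(x)
        for layer, g in zip(self.view_layers, self.group_schedule):  # local -> global
            x = layer(x, group_size=g)                          # step 4: cross-view context
        return x.flatten(0, 1)   # ready for a Gaussian head like the one sketched above


# Example: 30 views at 256x256 -> 64 coarse tokens per view after two 2x merges.
fused = MVPEncoderSketch()(torch.randn(30, 3, 256, 256))
```

Because the local levels attend only within small groups of already-coarsened tokens, the quadratic attention cost is paid over short sequences, and only the final level sees every view at once.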

Results & Findings

| Dataset | Metric (PSNR) | MVP (Ours) | Prior SOTA | Speed (fps) |
| --- | --- | --- | --- | --- |
| NeRF‑Synthetic (8‑view) | 31.2 | 32.5 | 31.0 | 12 |
| Tanks & Temples (30‑view) | 28.7 | 29.9 | 28.1 | 8 |
| Real‑World Indoor (100‑view) | 27.4 | 28.6 | 27.0 | 6 |

  • Quality: MVP consistently improves PSNR by ~0.8–1.2 dB (with corresponding SSIM gains) over the best existing generalizable methods.
  • Scalability: Memory usage grows sub‑linearly with view count thanks to the coarse token aggregation, enabling reconstruction of scenes with >200 images on a single 24 GB GPU (a back‑of‑envelope illustration follows this list).
  • Speed: A full forward pass (including decoding to Gaussians) runs in under a second for typical 30‑view captures, making near‑real‑time preview feasible.
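
As a rough illustration of the scalability point above, the snippet below compares the size of a single dense attention matrix over all per-view tokens before and after a 16× fine-to-coarse reduction. Every number in it (200 views, 1,024 patch tokens per view, fp16 entries, the 16× factor) is an assumption for illustration, not a figure from the paper.

```python
# Illustrative arithmetic only; none of these numbers come from the paper.
views = 200            # assumed number of input images
fine_tokens = 1024     # assumed patch tokens per view (e.g. 512x512 image, 16x16 patches)
reduction = 16         # assumed fine-to-coarse merging factor
bytes_per_entry = 2    # fp16 attention weights

def full_attention_gib(tokens_per_view: int) -> float:
    """Size of one dense attention matrix over every token of every view."""
    n = views * tokens_per_view
    return n * n * bytes_per_entry / 2**30

print(f"dense attention over fine tokens:   {full_attention_gib(fine_tokens):7.1f} GiB")
print(f"dense attention over coarse tokens: {full_attention_gib(fine_tokens // reduction):7.1f} GiB")
# Grouped attention at the local levels shrinks this further, because each view
# only attends within its group until the final whole-scene level.
```

The point is qualitative: pooling fine patch tokens into a few information-dense ones is what keeps whole-scene attention affordable as the view count grows.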

Qualitative examples show sharper edges and better handling of fine structures (e.g., foliage, thin poles) compared to baselines.

Practical Implications

  • Rapid scene digitization – developers building AR/VR pipelines can generate high‑fidelity 3D assets from a simple photo burst without lengthy optimization loops.
  • On‑device or edge deployment – the sub‑linear memory growth and single‑pass nature make MVP a candidate for integration into mobile devices or drones that capture many images on the fly.
  • Content creation tools – 3D modeling software can offer “instant capture” features, letting artists iterate quickly by snapping photos and getting a usable Gaussian‑splat model in seconds.
  • Robotics & SLAM – the inter‑view hierarchy provides a natural way to fuse multi‑camera streams, potentially improving map building in large‑scale environments where traditional bundle adjustment is too slow.
  • Streaming and cloud rendering – because the output is a compact set of Gaussians, downstream rendering can be performed efficiently on cloud GPUs, enabling scalable web‑based 3D viewers.

Limitations & Future Work

  • Dependence on calibrated cameras – MVP assumes known intrinsics/extrinsics; handling uncalibrated or noisy pose estimates remains an open challenge.
  • Texture fidelity at extreme resolutions – while Gaussian splatting is fast, it can blur ultra‑high‑frequency textures; integrating neural texture patches could address this.
  • Dynamic scenes – the current formulation targets static environments; extending the hierarchy to model time‑varying geometry would broaden applicability to video capture.
  • Generalization to non‑photographic inputs – exploring how MVP works with depth sensors or LiDAR could further improve robustness in robotics contexts.

The authors suggest that future research will explore adaptive token budgeting (allocating more tokens to complex regions) and tighter integration with neural radiance fields for hybrid representations.

Authors

  • Gyeongjin Kang
  • Seungkwon Yang
  • Seungtae Nam
  • Younggeun Lee
  • Jungwoo Kim
  • Eunbyung Park

Paper Information

  • arXiv ID: 2512.07806v1
  • Categories: cs.CV
  • Published: December 8, 2025
  • PDF: https://arxiv.org/pdf/2512.07806v1