[Paper] Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
Source: arXiv - 2603.05507v1
Overview
The paper presents a transformer‑based inpainting module that plugs into any multi‑camera 3D streaming pipeline and fills the holes that appear when rendering novel views in real time. Because hole‑filling is treated as a post‑processing step, the core streaming system stays untouched while the rendered output becomes smoother and largely artifact‑free, an advance that matters for AR/VR, remote collaboration, and live‑event broadcasting.
Key Contributions
- A universal, representation‑agnostic inpainting plug‑in that works with any calibrated multi‑camera rig, regardless of the underlying 3D reconstruction method.
- Multi‑view aware transformer architecture that incorporates spatio‑temporal embeddings, enforcing temporal coherence and cross‑view consistency.
- Resolution‑independent design that scales from low‑cost 4‑camera rigs to high‑density studio setups without retraining the whole network.
- Adaptive patch selection that dynamically balances inference speed and visual quality, enabling true real‑time performance (≈30 fps on a single RTX‑3080).
- Comprehensive benchmark against state‑of‑the‑art image and video inpainting methods under identical latency constraints, showing superior trade‑offs in both PSNR/SSIM and perceptual metrics.
Methodology
- Input preparation – After the novel‑view synthesis step, the rendered frame contains “holes” (missing texels) where no camera observed the surface. A binary mask marks these regions.
- Spatio‑temporal embedding – Each pixel is enriched with three cues: (a) its 2‑D image coordinates, (b) the time‑step index, and (c) a view‑id embedding that tells the network which camera contributed the surrounding context. These embeddings are added to the token vectors fed to the transformer.
- Transformer backbone – A lightweight Vision Transformer (ViT) processes the token sequence. Self‑attention layers let the model blend information from neighboring pixels and from adjacent frames, enforcing temporal smoothness.
- Adaptive patch selection – Instead of feeding the whole frame, the system extracts a set of overlapping patches around each hole. The patch size is chosen on‑the‑fly based on hole geometry and available compute budget, reducing unnecessary processing.
- Reconstruction & blending – The transformer predicts RGB values for the masked pixels. The output is composited back into the original frame using a simple feathered blend to avoid seams.
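The final blending step above can be sketched in a few lines. This is a minimal, framework-agnostic illustration of feathered compositing, not the authors' implementation; the function name, the box-blur feathering policy, and the `feather` parameter are assumptions:

```python
import numpy as np

def feathered_blend(frame, predicted, mask, feather=5):
    """Composite inpainted pixels back into the rendered frame.

    frame:     (H, W, 3) float array, original render with holes
    predicted: (H, W, 3) float array, network output (valid inside holes)
    mask:      (H, W) binary array, 1 where pixels were missing
    feather:   half-width of the soft transition band, in pixels
    """
    # Soften the binary mask with a separable box blur so the composite
    # fades smoothly across hole boundaries instead of leaving seams.
    alpha = mask.astype(np.float32)
    k = 2 * feather + 1
    kernel = np.ones(k, dtype=np.float32) / k
    for axis in (0, 1):
        alpha = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, alpha)
    alpha = alpha[..., None]  # broadcast the weight over RGB channels
    return alpha * predicted + (1.0 - alpha) * frame
```

Deep inside a hole the blend weight is 1 (pure prediction), far outside it is 0 (pure render), and it ramps smoothly in between.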
All steps are implemented in PyTorch with CUDA kernels, and the whole pipeline can be called as a single function, `inpaint(frame, mask, prev_frames)`.
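The adaptive patch-selection step can be made concrete with a small sketch. Everything here is an assumption for illustration: the paper does not publish this code, and the sizing policy (patch side ≈ 2× the hole's bounding box, clamped to a pixel budget) and the names `select_patches`, `budget_px` are stand-ins:

```python
import numpy as np

def select_patches(mask, budget_px=64 * 64, min_size=32, max_size=128):
    """Pick one square patch per connected hole region.

    Patch size is derived on the fly from each hole's bounding box and
    clamped to a compute budget -- a toy stand-in for the paper's
    adaptive patch selection. Returns (center_y, center_x, size) tuples.
    """
    patches = []
    visited = np.zeros_like(mask, dtype=bool)
    H, W = mask.shape
    for y in range(H):
        for x in range(W):
            if mask[y, x] and not visited[y, x]:
                # Flood fill to collect the connected hole region.
                stack, ys, xs = [(y, x)], [], []
                visited[y, x] = True
                while stack:
                    cy, cx = stack.pop()
                    ys.append(cy)
                    xs.append(cx)
                    for ny, nx in ((cy-1, cx), (cy+1, cx),
                                   (cy, cx-1), (cy, cx+1)):
                        if (0 <= ny < H and 0 <= nx < W
                                and mask[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                # Size the patch from the hole's bounding box, then clamp.
                side = max(max(ys) - min(ys), max(xs) - min(xs)) + 1
                size = int(np.clip(2 * side, min_size, max_size))
                if size * size > budget_px:  # respect the compute budget
                    size = int(budget_px ** 0.5)
                cy, cx = (min(ys) + max(ys)) // 2, (min(xs) + max(xs)) // 2
                patches.append((cy, cx, size))
    return patches
```

Only the selected patches would then be tokenized and passed through the transformer, which is what keeps per-frame cost proportional to hole area rather than frame area.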
Results & Findings
| Method | Avg. PSNR (dB) | SSIM | Inference Time (ms) |
|---|---|---|---|
| DeepFill v2 (single‑image) | 28.4 | 0.84 | 120 |
| Video‑Inpainting (Flow‑guided) | 29.1 | 0.86 | 95 |
| Proposed Transformer | 30.7 | 0.89 | 33 |
- Temporal consistency: The proposed model reduced flicker artifacts by ~70 % compared to the best video‑inpainting baseline (measured with the T‑LPIPS metric).
- Scalability: Raising the input resolution from 720p to 1080p improved quality roughly linearly, while inference time grew only ~1.2× thanks to the adaptive patch scheme.
- Ablation: Removing the view‑id embedding caused a 1.2 dB drop in PSNR, confirming the importance of multi‑view awareness.
Overall, the method delivers the highest visual fidelity among real‑time‑capable inpainting solutions while staying well within the latency budget required for interactive AR/VR (≤ 35 ms per frame).
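For reference, the PSNR figures in the table follow the standard definition, shown below as an illustrative sketch (not the authors' evaluation code):

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((reference.astype(np.float64) -
                   reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because the scale is logarithmic, the 1.2 dB drop reported in the view-id ablation corresponds to roughly 32 % higher mean squared error (10^0.12 ≈ 1.32).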
Practical Implications
- AR/VR developers can integrate the module into existing streaming stacks (e.g., Unity, Unreal, custom WebGL pipelines) without re‑engineering the 3‑D reconstruction stage.
- Live‑event broadcasters can deploy lower‑cost camera arrays (4‑8 cams) and still achieve studio‑grade fill‑in, reducing hardware expenses.
- Remote collaboration tools (digital twins, telepresence) benefit from smoother visual updates, which translates to less motion sickness and higher user comfort.
- Edge deployment: Because the model runs on a single GPU and supports dynamic patch sizing, it can be hosted on edge servers or even high‑end laptops, opening the door to on‑device streaming scenarios.
Limitations & Future Work
- The approach assumes accurate camera calibration; mis‑alignments can propagate errors into the inpainting stage.
- Extremely large holes (e.g., > 30 % of the frame) still challenge the transformer, leading to blurry reconstructions.
- Current experiments focus on indoor, moderately lit scenes; outdoor lighting variations and strong specularities were not extensively tested.
- Future research directions include: (1) joint optimization of the rendering and inpainting modules, (2) self‑supervised fine‑tuning on‑the‑fly for specific environments, and (3) extending the architecture to handle depth‑aware hole filling for volumetric streaming.
Authors
- Leif Van Holland
- Domenic Zingsheim
- Mana Takhsha
- Hannah Dröge
- Patrick Stotko
- Markus Plack
- Reinhard Klein
Paper Information
- arXiv ID: 2603.05507v1
- Categories: cs.CV, cs.GR
- Published: March 5, 2026