[Paper] Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
Source: arXiv - 2603.05507v1
Overview
The paper presents a transformer‑based inpainting module that plugs into any multi‑camera 3D streaming pipeline and fills the holes that appear when rendering novel views in real time. Because hole‑filling is treated as a post‑processing step, the core streaming system stays untouched while the rendered output becomes smoother and largely artifact‑free, an advance that matters for AR/VR, remote collaboration, and live‑event broadcasting.
Key Contributions
- A universal, representation‑agnostic inpainting plug‑in that works with any calibrated multi‑camera rig, regardless of the underlying 3D reconstruction method.
- Multi‑view aware transformer architecture that incorporates spatio‑temporal embeddings, enforcing temporal coherence and cross‑view consistency.
- Resolution‑independent design that scales from low‑cost 4‑camera rigs to high‑density studio setups without retraining the whole network.
- Adaptive patch selection that dynamically balances inference speed and visual quality, enabling true real‑time performance (≈30 fps on a single RTX‑3080).
- Comprehensive benchmark against state‑of‑the‑art image and video inpainting methods under identical latency constraints, showing superior trade‑offs in both PSNR/SSIM and perceptual metrics.
Methodology
- Input preparation – After the novel‑view synthesis step, the rendered frame contains “holes” (missing texels) where no camera observed the surface. A binary mask marks these regions.
- Spatio‑temporal embedding – Each pixel is enriched with three cues: (a) its 2‑D image coordinates, (b) the time‑step index, and (c) a view‑id embedding that tells the network which camera contributed the surrounding context. These embeddings are added to the token vectors fed to the transformer.
- Transformer backbone – A lightweight Vision Transformer (ViT) processes the token sequence. Self‑attention layers let the model blend information from neighboring pixels and from adjacent frames, enforcing temporal smoothness.
- Adaptive patch selection – Instead of feeding the whole frame, the system extracts a set of overlapping patches around each hole. The patch size is chosen on‑the‑fly based on hole geometry and available compute budget, reducing unnecessary processing.
- Reconstruction & blending – The transformer predicts RGB values for the masked pixels. The output is composited back into the original frame using a simple feathered blend to avoid seams.
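The final blending step above can be sketched in a few lines. This is a minimal, framework-agnostic illustration of feathered compositing, not the authors' implementation; the function name, the box-blur feathering policy, and the `feather` parameter are assumptions:

```python
import numpy as np

def feathered_blend(frame, predicted, mask, feather=5):
    """Composite inpainted pixels back into the rendered frame.

    frame:     (H, W, 3) float array, original render with holes
    predicted: (H, W, 3) float array, network output (valid inside holes)
    mask:      (H, W) binary array, 1 where pixels were missing
    feather:   half-width of the soft transition band, in pixels
    """
    # Soften the binary mask with a separable box blur so the composite
    # fades smoothly across hole boundaries instead of leaving seams.
    alpha = mask.astype(np.float32)
    k = 2 * feather + 1
    kernel = np.ones(k, dtype=np.float32) / k
    for axis in (0, 1):
        alpha = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, alpha)
    alpha = alpha[..., None]  # broadcast the weight over RGB channels
    return alpha * predicted + (1.0 - alpha) * frame
```

Deep inside a hole the blend weight is 1 (pure prediction), far outside it is 0 (pure render), and it ramps smoothly in between.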
All steps are implemented in PyTorch with CUDA kernels, and the whole pipeline can be called as a single function, `inpaint(frame, mask, prev_frames)`.
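The adaptive patch-selection step can be made concrete with a small sketch. Everything here is an assumption for illustration: the paper does not publish this code, and the sizing policy (patch side ≈ 2× the hole's bounding box, clamped to a pixel budget) and the names `select_patches`, `budget_px` are stand-ins:

```python
import numpy as np

def select_patches(mask, budget_px=64 * 64, min_size=32, max_size=128):
    """Pick one square patch per connected hole region.

    Patch size is derived on the fly from each hole's bounding box and
    clamped to a compute budget -- a toy stand-in for the paper's
    adaptive patch selection. Returns (center_y, center_x, size) tuples.
    """
    patches = []
    visited = np.zeros_like(mask, dtype=bool)
    H, W = mask.shape
    for y in range(H):
        for x in range(W):
            if mask[y, x] and not visited[y, x]:
                # Flood fill to collect the connected hole region.
                stack, ys, xs = [(y, x)], [], []
                visited[y, x] = True
                while stack:
                    cy, cx = stack.pop()
                    ys.append(cy)
                    xs.append(cx)
                    for ny, nx in ((cy-1, cx), (cy+1, cx),
                                   (cy, cx-1), (cy, cx+1)):
                        if (0 <= ny < H and 0 <= nx < W
                                and mask[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                # Size the patch from the hole's bounding box, then clamp.
                side = max(max(ys) - min(ys), max(xs) - min(xs)) + 1
                size = int(np.clip(2 * side, min_size, max_size))
                if size * size > budget_px:  # respect the compute budget
                    size = int(budget_px ** 0.5)
                cy, cx = (min(ys) + max(ys)) // 2, (min(xs) + max(xs)) // 2
                patches.append((cy, cx, size))
    return patches
```

Only the selected patches would then be tokenized and passed through the transformer, which is what keeps per-frame cost proportional to hole area rather than frame area.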
Results & Findings
| Method | Avg. PSNR (dB) | SSIM | Inference Time (ms) |
|---|---|---|---|
| DeepFill v2 (single‑image) | 28.4 | 0.84 | 120 |
| Video‑Inpainting (Flow‑guided) | 29.1 | 0.86 | 95 |
| Proposed Transformer | 30.7 | 0.89 | 33 |
- Temporal consistency: The proposed model reduced flicker artifacts by ~70 % compared to the best video‑inpainting baseline (measured with the T‑LPIPS metric).
- Scalability: Raising the input resolution from 720p to 1080p improved quality roughly linearly, while inference time grew only ~1.2× thanks to the adaptive patch scheme.
- Ablation: Removing the view‑id embedding caused a 1.2 dB drop in PSNR, confirming the importance of multi‑view awareness.
Overall, the method delivers the highest visual fidelity among real‑time‑capable inpainting solutions while staying well within the latency budget required for interactive AR/VR (≤ 35 ms per frame).
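For reference, the PSNR figures in the table follow the standard definition, shown below as an illustrative sketch (not the authors' evaluation code):

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((reference.astype(np.float64) -
                   reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because the scale is logarithmic, the 1.2 dB drop reported in the view-id ablation corresponds to roughly 32 % higher mean squared error (10^0.12 ≈ 1.32).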
Practical Implications
- AR/VR developers can integrate the module into existing streaming stacks (e.g., Unity, Unreal, custom WebGL pipelines) without re‑engineering the 3‑D reconstruction stage.
- Live‑event broadcasters can deploy lower‑cost camera arrays (4‑8 cams) and still achieve studio‑grade fill‑in, reducing hardware expenses.
- Remote collaboration tools (digital twins, telepresence) benefit from smoother visual updates, which translates to less motion sickness and higher user comfort.
- Edge deployment: Because the model runs on a single GPU and supports dynamic patch sizing, it can be hosted on edge servers or even high‑end laptops, opening the door to on‑device streaming scenarios.
Limitations & Future Work
- The approach assumes accurate camera calibration; mis‑alignments can propagate errors into the inpainting stage.
- Extremely large holes (e.g., > 30 % of the frame) still challenge the transformer, leading to blurry reconstructions.
- Current experiments focus on indoor, moderately lit scenes; outdoor lighting variations and strong specularities were not extensively tested.
- Future research directions include: (1) joint optimization of the rendering and inpainting modules, (2) self‑supervised fine‑tuning on‑the‑fly for specific environments, and (3) extending the architecture to handle depth‑aware hole filling for volumetric streaming.
Authors
- Leif Van Holland
- Domenic Zingsheim
- Mana Takhsha
- Hannah Dröge
- Patrick Stotko
- Markus Plack
- Reinhard Klein
Paper Information
- arXiv ID: 2603.05507v1
- Categories: cs.CV, cs.GR
- Published: March 5, 2026