[Paper] Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

Published: May 5, 2026 at 01:55 PM EDT

Source: arXiv - 2605.04035v1

Overview

HeadsUp introduces a feed‑forward pipeline that turns thousands of high‑resolution multi‑camera images into a detailed 3‑D head model represented by Gaussian splats. By learning a compact latent code for each subject, the system reconstructs a new head in a single forward pass, with no per‑subject optimization, which makes it practical for large‑scale production pipelines such as avatar creation, virtual production, or AR/VR experiences.

Key Contributions

Scalable encoder‑decoder architecture that compresses arbitrary numbers of input views into a fixed‑size latent vector.
UV‑parameterized 3‑D Gaussian representation anchored to a neutral head template, decoupling the number of Gaussians from image resolution or view count.
Training on a dataset of more than 10,000 subjects (≈10× larger than prior multi‑view head corpora), demonstrating strong generalization to unseen identities.
State‑of‑the‑art reconstruction quality without any test‑time optimization, outperforming existing neural rendering and mesh‑based methods.
Demonstrated downstream utilities: (1) latent‑space interpolation for generating novel 3‑D identities, and (2) driving the reconstructed heads with expression blendshapes for real‑time animation.
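
The latent‑space interpolation mentioned above can be pictured with a minimal sketch. It assumes plain linear interpolation between two subjects' latent codes; the summary does not say which interpolation scheme the authors use, and the function name `lerp_latent` and the toy two‑dimensional codes are hypothetical.

```python
def lerp_latent(z_a, z_b, t):
    """Linearly blend two latent codes: t=0 returns z_a, t=1 returns z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent codes must have the same dimension")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


# Toy latent codes (hypothetical values; real codes would be much longer).
z_subject_a = [0.0, 2.0]
z_subject_b = [1.0, 4.0]

# A code halfway between the two identities, which a decoder could turn into a new head.
z_mid = lerp_latent(z_subject_a, z_subject_b, 0.5)
```

Sweeping `t` from 0 to 1 and decoding each intermediate code would give a sequence of heads that morphs from one identity to the other, assuming the latent space is smooth enough for intermediate codes to decode to plausible heads.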

Methodology

Data Ingestion – Multi‑camera rigs capture dozens of high‑resolution RGB images of a subject’s head from many angles.
Encoder – A lightweight CNN processes each view independently, extracting per‑view features. These are pooled (e.g., max‑pool or attention) into a single latent vector that summarizes the subject’s geometry and appearance.
Decoder – The latent vector is fed into a fully‑connected decoder that predicts parameters for a dense set of 3‑D Gaussians placed on a UV‑mapped neutral head template. Each Gaussian stores position, covariance (shape), color, and opacity.
Rendering – At inference time, the Gaussian cloud is rasterized using splatting (similar to the popular “3‑D Gaussian Splatting” technique), producing photorealistic novel‑view images.
Training Objective – A combination of multi‑view photometric loss, perceptual loss, and regularizers on Gaussian size/overlap ensures both fidelity and stability.
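
The pooling step in the encoder can be illustrated with a minimal sketch. It assumes element‑wise max pooling over per‑view feature vectors, one of the two options named above; the function name `max_pool_views` and the toy feature values are hypothetical and not taken from the paper.

```python
def max_pool_views(view_features):
    """Collapse any number of per-view feature vectors into one fixed-size vector."""
    if not view_features:
        raise ValueError("need at least one view")
    dim = len(view_features[0])
    if any(len(f) != dim for f in view_features):
        raise ValueError("all views must share the same feature dimension")
    # Element-wise max: the result has `dim` entries no matter how many views there are.
    return [max(f[i] for f in view_features) for i in range(dim)]


# Toy per-view features (hypothetical values standing in for CNN outputs).
views_3 = [[0.1, 0.9, 0.3], [0.4, 0.2, 0.8], [0.7, 0.5, 0.6]]
views_2 = views_3[:2]

latent_3 = max_pool_views(views_3)  # pooled over three views
latent_2 = max_pool_views(views_2)  # pooled over two views, same output size
```

Because the maximum is taken per feature dimension, the pooled vector has the same length for two views as for three, and the result does not depend on the order in which views arrive. An attention‑based pooling would share the fixed output size but weight views instead of taking a hard maximum.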

Because the UV layout ties every Gaussian to a fixed location on the template, the number of Gaussians stays constant regardless of how many input images are used, enabling the model to ingest very high‑resolution data without blowing up memory.
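
A small sketch of the bookkeeping makes the fixed Gaussian budget concrete. It assumes one Gaussian per texel of a UV grid and the common 3‑D Gaussian Splatting parameterization of covariance as a 3‑value scale plus a 4‑value rotation quaternion; the summary does not state the grid resolution or how covariance is stored, so the 256×256 grid, the 14 floats per Gaussian, and the class name `UVGaussianLayout` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UVGaussianLayout:
    uv_height: int
    uv_width: int
    # Assumed layout: position(3) + scale(3) + rotation quaternion(4) + color(3) + opacity(1)
    floats_per_gaussian: int = 14

    @property
    def num_gaussians(self):
        # One Gaussian per UV texel; input view count and image resolution do not appear here.
        return self.uv_height * self.uv_width

    def decoder_output_size(self):
        """Number of scalars the decoder must predict for one head."""
        return self.num_gaussians * self.floats_per_gaussian

    def megabytes(self, bytes_per_float=4):
        """Storage for one decoded head, in megabytes (10**6 bytes)."""
        return self.decoder_output_size() * bytes_per_float / 1e6


layout = UVGaussianLayout(uv_height=256, uv_width=256)  # hypothetical resolution
print(layout.num_gaussians, layout.decoder_output_size(), layout.megabytes())
```

Under these assumptions the output size depends only on the UV grid, so feeding in more views or higher‑resolution images changes the encoder's workload but not the size of the decoded Gaussian set.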

Results & Findings

Quantitative: HeadsUp reduces LPIPS (Learned Perceptual Image Patch Similarity) by ~15 % and improves PSNR by ~2 dB compared to the best prior multi‑view head reconstruction baseline.
Qualitative: Reconstructed heads preserve fine details such as hair strands, subtle skin texture, and accurate ear geometry, even when only 8–12 views are supplied at test time.
Scalability: Experiments varying the number of training subjects, input views, and decoder capacity show a predictable trade‑off: doubling the latent dimension yields ~0.5 dB PSNR gain, while adding more views beyond 20 yields diminishing returns.
Generalization: On a held‑out set of 1,000 identities, the model achieves comparable quality to per‑subject optimized methods, confirming that the learned latent space captures a broad distribution of human head shapes.
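
To put the PSNR figures above in perspective, the sketch below uses the standard definition PSNR = 10·log10(MAX² / MSE) to convert a gain in dB into the corresponding reduction in mean squared error. The function names are hypothetical, and the example MSE of 0.01 is an arbitrary illustration, not a number from the paper.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_psnr_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by `gain_db` decibels."""
    return 10.0 ** (gain_db / 10.0)


baseline_mse = 0.01                        # arbitrary example value
ratio_2db = mse_ratio_for_psnr_gain(2.0)   # about 1.58
ratio_half_db = mse_ratio_for_psnr_gain(0.5)  # about 1.12
improved_mse = baseline_mse / ratio_2db
```

Read this way, a ~2 dB improvement corresponds to roughly 37 % lower mean squared error, and the ~0.5 dB gain from doubling the latent dimension corresponds to roughly 11 % lower error.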

Practical Implications

Rapid Avatar Pipelines – Studios can generate high‑fidelity 3‑D head assets on‑the‑fly from a few camera shots, eliminating costly manual retopology or per‑subject optimization loops.
Real‑Time Animation – Because the output is a Gaussian cloud that can be rendered at >30 fps on a modern GPU, developers can drive avatars with live facial capture (e.g., blendshape coefficients) for games or virtual meetings.
Scalable Data Collection – The decoupling of Gaussian count from image resolution means existing multi‑camera rigs can be upgraded to higher‑resolution sensors without redesigning the model.
Latent‑Space Editing – The compact latent vectors enable downstream tasks such as identity interpolation, style transfer, or conditional generation (e.g., “create a head with a specific hair style”) using simple MLPs or diffusion models.
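
Driving a reconstructed head with blendshape coefficients can be sketched with the standard linear blendshape model, in which each point moves by a weighted sum of per‑shape offsets. This is an assumption about how the animation could work, not a description of the authors' implementation: the sketch moves only Gaussian centers (covariances would also need updating in a real system), and the function name `apply_blendshapes` and all numeric values are hypothetical.

```python
def apply_blendshapes(neutral, deltas, weights):
    """Offset neutral 3-D points by a weighted sum of blendshape deltas.

    neutral: list of (x, y, z) Gaussian centers in the neutral pose
    deltas:  one list per blendshape, each holding a (dx, dy, dz) per center
    weights: one coefficient per blendshape (e.g. from live facial capture)
    """
    if len(deltas) != len(weights):
        raise ValueError("need exactly one weight per blendshape")
    posed = []
    for index, (x, y, z) in enumerate(neutral):
        for shape, weight in zip(deltas, weights):
            dx, dy, dz = shape[index]
            x += weight * dx
            y += weight * dy
            z += weight * dz
        posed.append((x, y, z))
    return posed


# Two Gaussian centers and two blendshapes (toy values).
neutral_centers = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
shape_deltas = [
    [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],  # blendshape 0 moves the first center along x
    [(0.0, 2.0, 0.0), (0.0, 0.0, 2.0)],  # blendshape 1 moves both centers
]
posed_centers = apply_blendshapes(neutral_centers, shape_deltas, [0.5, 0.25])
```

Because the update is a weighted sum per center, it is cheap enough to run every frame, which is consistent with the real‑time use case described above.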

Limitations & Future Work

Template Dependency – The UV‑parameterized approach assumes a neutral head template; extreme hairstyles or accessories that deviate far from the template may be under‑represented.
Expression Modeling – While blendshapes can animate the Gaussian cloud, the system does not yet learn a fully disentangled expression latent space, limiting nuanced facial dynamics.
Hardware Footprint – Training on more than 10,000 subjects still requires multi‑GPU clusters. Inference is lightweight, but the decoder’s fully‑connected layers can become memory‑intensive for very high‑resolution Gaussian clouds.
Future Directions – The authors suggest extending the framework to full‑body reconstruction, integrating neural texture fields for richer material capture, and exploring self‑supervised scaling to billions of subjects.

Authors

Evangelos Ntavelis
Sean Wu
Mohamad Shahbazi
Fabio Maninchedda
Dmitry Kostiaev
Artem Sevastopolsky
Vittorio Megaro
Trevor Phillips
Alejandro Blumentals
Shridhar Ravikumar
Mehak Gupta
Reinhard Knothe
Jeronimo Bayer
Matthias Vestner
Simon Schaefer
Thomas Etterlin
Christian Zimmermann
Mathias Deschler
Peter Kaufmann
Stefan Brugger
Sebastian Martin
Brian Amberg
Tom Runia

Paper Information

arXiv ID: 2605.04035v1
Categories: cs.CV, cs.LG
Published: May 5, 2026
