[Paper] Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
Source: arXiv - 2605.04035v1
Overview
HeadsUp introduces a feed‑forward pipeline that turns thousands of high‑resolution multi‑camera images into a detailed 3‑D head model represented by Gaussian splats. By learning a compact latent code for each subject, the system reconstructs a new head in a single forward pass, with no per‑subject optimization, making it practical for large‑scale production pipelines such as avatar creation, virtual production, and AR/VR experiences.
Key Contributions
- Scalable encoder‑decoder architecture that compresses arbitrary numbers of input views into a fixed‑size latent vector.
- UV‑parameterized 3‑D Gaussian representation anchored to a neutral head template, decoupling the number of Gaussians from image resolution or view count.
- Training on an unprecedented dataset of >10 k subjects (≈10× larger than prior multi‑view head corpora), demonstrating strong generalization to unseen identities.
- State‑of‑the‑art reconstruction quality without any test‑time optimization, outperforming existing neural rendering and mesh‑based methods.
- Demonstrated downstream utilities: (1) latent‑space interpolation for generating novel 3‑D identities, and (2) driving the reconstructed heads with expression blendshapes for real‑time animation.
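The second utility, driving a reconstructed head with expression blendshapes, can be illustrated with the standard linear blendshape model, in which each point is offset from its neutral position by a weighted sum of per-expression deltas. The sketch below applies that model to Gaussian centers. Whether HeadsUp deforms its Gaussians in exactly this way is an assumption, and `apply_blendshapes`, the data layout, and the toy values are hypothetical.

```python
from typing import List, Sequence

Vec3 = List[float]

def apply_blendshapes(
    neutral: Sequence[Vec3],
    deltas: Sequence[Sequence[Vec3]],
    weights: Sequence[float],
) -> List[Vec3]:
    """Linear blendshape model: p_i = neutral_i + sum_k w_k * delta_k_i.

    neutral: N Gaussian centers in the neutral pose (assumed layout).
    deltas:  K blendshapes, each holding N per-Gaussian offsets.
    weights: K expression coefficients, e.g. from a face tracker.
    """
    if len(deltas) != len(weights):
        raise ValueError("need one weight per blendshape")
    posed = [list(p) for p in neutral]  # copy so the neutral pose is untouched
    for w, shape in zip(weights, deltas):
        if len(shape) != len(neutral):
            raise ValueError("blendshape size must match Gaussian count")
        for i, d in enumerate(shape):
            for axis in range(3):
                posed[i][axis] += w * d[axis]
    return posed

# Toy example: two Gaussians, two blendshapes (illustrative values only).
neutral = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
jaw_open = [[0.0, -1.0, 0.0], [0.0, -1.0, 0.0]]
smile = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
posed = apply_blendshapes(neutral, [jaw_open, smile], [0.5, 1.0])
print(posed)
```

Because the update is a per-frame weighted sum over a fixed number of Gaussians, it is cheap enough to evaluate for every frame of a live capture stream.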
Methodology
- Data Ingestion – Multi‑camera rigs capture dozens of high‑resolution RGB images of a subject’s head from many angles.
- Encoder – A lightweight CNN processes each view independently, extracting per‑view features. These are pooled (e.g., max‑pool or attention) into a single latent vector that summarizes the subject’s geometry and appearance.
- Decoder – The latent vector is fed into a fully‑connected decoder that predicts parameters for a dense set of 3‑D Gaussians placed on a UV‑mapped neutral head template. Each Gaussian stores position, covariance (shape), color, and opacity.
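- Rendering – The Gaussian cloud is rasterized by splatting, as in 3‑D Gaussian Splatting, to produce photorealistic novel‑view images. Rendering is needed during training as well as at inference, since the photometric and perceptual losses compare rendered views against the captured images.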
- Training Objective – A combination of multi‑view photometric loss, perceptual loss, and regularizers on Gaussian size/overlap ensures both fidelity and stability.
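The training objective above can be read as a weighted sum of its terms. The following is a minimal sketch under stated assumptions: the L1 photometric term, the hinge-style size regularizer, the weights, and all function names are hypothetical stand-ins, the overlap regularizer is omitted, and the perceptual term is passed in as a precomputed number because it would normally come from a pretrained feature network.

```python
from typing import Sequence

def l1_photometric(rendered: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between rendered and captured pixel values."""
    if len(rendered) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(r - t) for r, t in zip(rendered, target)) / len(rendered)

def scale_regularizer(scales: Sequence[float], max_scale: float) -> float:
    """Penalize only Gaussians larger than max_scale (assumed form)."""
    return sum(max(0.0, s - max_scale) for s in scales) / len(scales)

def total_loss(
    rendered: Sequence[float],
    target: Sequence[float],
    perceptual: float,
    scales: Sequence[float],
    w_photo: float = 1.0,   # assumed weight
    w_perc: float = 0.5,    # assumed weight
    w_reg: float = 0.25,    # assumed weight
    max_scale: float = 0.25,
) -> float:
    """Weighted sum of photometric, perceptual, and size-regularization terms."""
    photo = l1_photometric(rendered, target)
    reg = scale_regularizer(scales, max_scale)
    return w_photo * photo + w_perc * perceptual + w_reg * reg

# Two-pixel "image" and two Gaussians, purely for illustration.
loss = total_loss(
    rendered=[0.5, 0.25],
    target=[0.25, 0.25],
    perceptual=0.5,
    scales=[0.5, 0.125],
)
print(loss)
```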
Because the UV layout ties every Gaussian to a fixed location on the template, the number of Gaussians stays constant regardless of how many input images are used, enabling the model to ingest very high‑resolution data without blowing up memory.
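The fixed-size property described above follows from two design choices: an order-independent pooling step and a decoder whose output is laid out on a fixed UV grid. The sketch below illustrates both with element-wise max pooling, one of the pooling options mentioned for the encoder. The feature generator, the toy decoder, and every name in it are hypothetical placeholders, not the paper's networks.

```python
from typing import Dict, List, Sequence

def pool_views(view_features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over per-view features; output size ignores view count."""
    if not view_features:
        raise ValueError("need at least one view")
    return [max(channel) for channel in zip(*view_features)]

def decode_to_uv_gaussians(latent: Sequence[float], uv_res: int) -> List[Dict]:
    """Toy stand-in for the decoder: one Gaussian per texel of a UV grid."""
    # A real decoder would be a learned network; here the latent only sets
    # a shared opacity so the example stays dependency-free.
    opacity = min(1.0, max(0.0, sum(latent) / len(latent)))
    gaussians = []
    for row in range(uv_res):
        for col in range(uv_res):
            uv = ((col + 0.5) / uv_res, (row + 0.5) / uv_res)  # texel center
            gaussians.append({"uv": uv, "opacity": opacity})
    return gaussians

def fake_view_features(num_views: int, dim: int = 8) -> List[List[float]]:
    """Deterministic placeholder for per-view CNN features."""
    return [[((view * 3 + d) % 7) / 7.0 for d in range(dim)]
            for view in range(num_views)]

latent_few = pool_views(fake_view_features(4))
latent_many = pool_views(fake_view_features(16))
heads = [decode_to_uv_gaussians(z, uv_res=4) for z in (latent_few, latent_many)]
print(len(latent_few), len(latent_many))  # both 8
print([len(h) for h in heads])            # both 16
```

With 4 or 16 input views the latent has the same length and the decoder emits the same number of Gaussians; only `uv_res` changes the size of the output.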
Results & Findings
- Quantitative: HeadsUp reduces LPIPS (Learned Perceptual Image Patch Similarity) by ~15 % and improves PSNR by ~2 dB compared to the best prior multi‑view head reconstruction baseline.
- Qualitative: Reconstructed heads preserve fine details such as hair strands, subtle skin texture, and accurate ear geometry, even when only 8–12 views are supplied at test time.
- Scalability: Experiments varying the number of training subjects, input views, and decoder capacity show a predictable trade‑off: doubling the latent dimension yields ~0.5 dB PSNR gain, while adding more views beyond 20 yields diminishing returns.
- Generalization: On a held‑out set of 1 k identities, the model achieves quality comparable to that of per‑subject optimized methods, indicating that the learned latent space captures a broad distribution of human head shapes.
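To put the PSNR figures above in context, PSNR is defined as 10·log10(MAX² / MSE), so a gain in dB corresponds to a fixed ratio of mean squared errors. The short sketch below works through that arithmetic for the ~2 dB and ~0.5 dB gains quoted above. It relies only on the standard definition, and the function names are illustrative.

```python
import math

def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (gain_db / 10.0)

for gain in (2.0, 0.5):
    ratio = mse_ratio_for_gain(gain)
    reduction = 1.0 - 1.0 / ratio
    print(f"+{gain} dB PSNR -> MSE divided by {ratio:.3f} "
          f"({reduction:.0%} lower)")
```

Under this definition a 2 dB gain means the mean squared error drops by roughly 37 %, while a 0.5 dB gain corresponds to roughly 11 %.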
Practical Implications
- Rapid Avatar Pipelines – Studios can generate high‑fidelity 3‑D head assets on‑the‑fly from a few camera shots, eliminating costly manual retopology or per‑subject optimization loops.
- Real‑Time Animation – Because the output is a Gaussian cloud that can be rendered at >30 fps on a modern GPU, developers can drive avatars with live facial capture (e.g., blendshape coefficients) for games or virtual meetings.
- Scalable Data Collection – The decoupling of Gaussian count from image resolution means existing multi‑camera rigs can be upgraded to higher‑resolution sensors without redesigning the model.
- Latent‑Space Editing – The compact latent vectors enable downstream tasks such as identity interpolation, style transfer, or conditional generation (e.g., “create a head with a specific hair style”) using simple MLPs or diffusion models.
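Identity interpolation in the latent space can be sketched as a blend between two subjects' latent vectors, with each blended vector then passed to the decoder. The example below uses plain linear interpolation, which is an assumption (the paper may interpolate differently), and `lerp_latents` and the three-dimensional toy latents are hypothetical.

```python
from typing import List, Sequence

def lerp_latents(z_a: Sequence[float], z_b: Sequence[float], t: float) -> List[float]:
    """Linear interpolation: t=0 returns z_a, t=1 returns z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimension")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Toy latents for two subjects (real latents would be much larger).
z_subject_a = [0.0, 2.0, -1.0]
z_subject_b = [1.0, 0.0, 1.0]

# Five evenly spaced identities from subject A to subject B.
path = [lerp_latents(z_subject_a, z_subject_b, step / 4) for step in range(5)]
for z in path:
    print(z)
```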
Limitations & Future Work
- Template Dependency – The UV‑parameterized approach assumes a neutral head template; extreme hairstyles or accessories that deviate far from the template may be under‑represented.
- Expression Modeling – While blendshapes can animate the Gaussian cloud, the system does not yet learn a fully disentangled expression latent space, limiting nuanced facial dynamics.
- Hardware Footprint – Training on >10 k subjects still requires multi‑GPU clusters; inference is lightweight, but the decoder’s fully‑connected layers can become memory‑intensive for very high‑resolution Gaussian clouds.
- Future Directions – The authors suggest extending the framework to full‑body reconstruction, integrating neural texture fields for richer material capture, and exploring self‑supervised scaling to billions of subjects.
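The hardware-footprint point above can be made concrete with a back-of-the-envelope estimate of a single fully connected output layer that maps the latent vector to every Gaussian parameter. All numbers here are assumptions chosen for illustration: a 512-dimensional latent, 14 values per Gaussian (3 position, 3 scale, 4 rotation, 3 color, 1 opacity, following the common 3‑D Gaussian Splatting parameterization with plain RGB), and 32-bit weights.

```python
def gaussian_output_size(uv_res: int, params_per_gaussian: int) -> int:
    """Number of values the decoder must emit for a uv_res x uv_res grid."""
    return uv_res * uv_res * params_per_gaussian

def dense_layer_bytes(in_features: int, out_features: int,
                      bytes_per_param: int = 4) -> int:
    """Weights plus biases of one fully connected layer, in bytes."""
    return (in_features * out_features + out_features) * bytes_per_param

LATENT_DIM = 512          # assumed latent size
PARAMS_PER_GAUSSIAN = 14  # assumed: 3 pos + 3 scale + 4 rot + 3 rgb + 1 opacity

for uv_res in (128, 256):
    outputs = gaussian_output_size(uv_res, PARAMS_PER_GAUSSIAN)
    size_gib = dense_layer_bytes(LATENT_DIM, outputs) / 2 ** 30
    print(f"{uv_res}x{uv_res} UV grid: {outputs} outputs, "
          f"{size_gib:.2f} GiB for the final layer")
```

Doubling the UV resolution quadruples the layer size, which is why a purely fully connected decoder becomes the memory bottleneck as the Gaussian count grows.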
Authors
- Evangelos Ntavelis
- Sean Wu
- Mohamad Shahbazi
- Fabio Maninchedda
- Dmitry Kostiaev
- Artem Sevastopolsky
- Vittorio Megaro
- Trevor Phillips
- Alejandro Blumentals
- Shridhar Ravikumar
- Mehak Gupta
- Reinhard Knothe
- Jeronimo Bayer
- Matthias Vestner
- Simon Schaefer
- Thomas Etterlin
- Christian Zimmermann
- Mathias Deschler
- Peter Kaufmann
- Stefan Brugger
- Sebastian Martin
- Brian Amberg
- Tom Runia
Paper Information
- arXiv ID: 2605.04035v1
- Categories: cs.CV, cs.LG
- Published: May 5, 2026