[Paper] Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization
Source: arXiv - 2603.08645v1
Overview
The paper introduces Retrieval‑Augmented Faces (RAF), a training‑time data‑augmentation technique that substantially improves the ability of template‑free neural head avatars to reproduce a wide variety of facial expressions. By pulling in “nearest‑neighbor” expressions from a large, unlabeled expression bank, RAF teaches the model to disentangle identity from expression, making avatars far more robust when driven by unseen or out‑of‑distribution motions.
Key Contributions
- Retrieval‑augmented training pipeline that swaps a subset of a subject’s expression features with those of visually similar expressions from a large, unlabeled bank.
- Expression‑diversity boost without requiring any extra annotations, cross‑identity paired data, or changes to the underlying avatar architecture.
- Quantitative and user‑study validation showing that retrieved neighbors are perceptually closer in pose and expression, and that RAF yields consistent gains on the NeRSemble benchmark (both self‑driving and cross‑driving).
- Analysis of identity‑expression decoupling, demonstrating that the augmentation forces the deformation field to generalize beyond the limited expression set seen during standard training.
Methodology
- Collect an expression bank – a large repository of facial frames captured from many subjects, but without any labels (e.g., “smile”, “frown”).
- Feature extraction – each frame is encoded into a compact expression descriptor using the same encoder that the avatar model already employs.
- Nearest‑neighbor retrieval – for each training frame of the target subject, a small random subset of its expression descriptors is replaced by the descriptors of the closest matches from the bank (based on Euclidean distance in feature space).
- Reconstruction loss – the model still tries to reconstruct the original target frame, even though part of the input now comes from a different identity. This forces the deformation network to learn a mapping that works for a broader set of expression conditions while preserving the subject’s identity.
- Training continues as usual – no architectural changes, no extra supervision, and the retrieval step is lightweight (can be pre‑computed or performed on‑the‑fly with an approximate nearest‑neighbor index).
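The retrieve‑and‑swap step described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code: the bank contents, descriptor dimension, and the `swap_ratio` parameter are all assumptions, and `raf_augment` stands in for whatever hook the real pipeline uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression bank: N descriptors of dimension D, assumed to be
# pre-extracted with the avatar model's own expression encoder.
N, D = 1000, 32
bank = rng.normal(size=(N, D)).astype(np.float32)

def raf_augment(descriptors, bank, swap_ratio=0.25, rng=rng):
    """Replace a random subset of a subject's expression descriptors with
    their nearest neighbors (Euclidean distance) from the unlabeled bank."""
    T = descriptors.shape[0]
    n_swap = max(1, int(T * swap_ratio))
    idx = rng.choice(T, size=n_swap, replace=False)
    # Pairwise Euclidean distances between the selected descriptors and the bank.
    diffs = descriptors[idx][:, None, :] - bank[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # shape (n_swap, N)
    nearest = dists.argmin(axis=1)           # index of the closest bank entry
    augmented = descriptors.copy()
    augmented[idx] = bank[nearest]
    return augmented, idx

# Example: 16 training-frame descriptors for one target subject.
frames = rng.normal(size=(16, D)).astype(np.float32)
aug, swapped = raf_augment(frames, bank)
```

The model would then be trained to reconstruct the *original* target frames from `aug`; that mismatch between input descriptors and reconstruction target is what drives the identity‑expression disentanglement described above.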
Results & Findings
- Expression fidelity improves by ~10‑15 % on the NeRSemble benchmark when measuring landmark error and perceptual similarity, both for self‑driving (same subject drives) and cross‑driving (different subject drives).
- Robustness to distribution shift – avatars trained with RAF maintain visual quality when driven by extreme or rare expressions that were absent from the original subject’s capture set.
- User study (N = 30) confirms that participants perceive the retrieved expressions as more similar to the target expression than random baselines, validating the retrieval quality.
- Identity preservation remains stable; the model does not “leak” the donor’s facial traits into the target avatar, thanks to the reconstruction loss that anchors the output to the original identity.
Practical Implications
- Game & VR developers can generate high‑fidelity, animatable head avatars from a modest capture session and still support a rich repertoire of player‑driven expressions without re‑capturing every nuance.
- Live‑streaming & virtual‑influencer pipelines benefit from more reliable facial reenactment when the source performer makes spontaneous, out‑of‑distribution gestures.
- AR/VR telepresence systems can maintain expressive fidelity even when network constraints force the use of low‑bitrate or compressed expression descriptors; RAF‑trained models are more tolerant to such noise.
- Tooling integration – because RAF is a data‑augmentation layer, it can be dropped into existing avatar training scripts (e.g., PyTorch, TensorFlow) with minimal code changes, accelerating adoption.
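Because RAF requires no architectural changes, the integration pattern can be as simple as wrapping an existing batch loader. The sketch below assumes a hypothetical `load_batch` function and a fixed `swap_prob`; neither name comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for an existing pipeline: a frozen expression bank and a batch
# loader yielding (image, expression-descriptor) pairs. Both are hypothetical.
bank = rng.normal(size=(500, 16)).astype(np.float32)

def load_batch(batch_size=8, dim=16):
    # Hypothetical loader; a real pipeline would read captured frames here.
    images = rng.normal(size=(batch_size, 4, 4, 3)).astype(np.float32)
    descriptors = rng.normal(size=(batch_size, dim)).astype(np.float32)
    return images, descriptors

def with_raf(loader, bank, swap_prob=0.3, rng=rng):
    """Wrap a batch loader so each descriptor is independently swapped for
    its nearest bank neighbor with probability `swap_prob`. The images
    (reconstruction targets) are left untouched."""
    def augmented_loader(*args, **kwargs):
        images, desc = loader(*args, **kwargs)
        desc = desc.copy()
        swap = rng.random(desc.shape[0]) < swap_prob
        if swap.any():
            d = np.linalg.norm(desc[swap][:, None, :] - bank[None, :, :], axis=-1)
            desc[swap] = bank[d.argmin(axis=1)]
        return images, desc
    return augmented_loader

loader = with_raf(load_batch, bank)
images, desc = loader()
```

Note that the reconstruction targets (`images`) pass through unchanged; as described in the Methodology section, that is what anchors the output to the subject's identity.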
Limitations & Future Work
- Bank quality dependence – the augmentation’s effectiveness hinges on the diversity and coverage of the expression bank; a poorly populated bank may yield limited gains.
- Computational overhead – nearest‑neighbor retrieval adds a modest cost during training (especially for very large banks), though inference remains unchanged.
- No explicit pose handling – while expression descriptors capture pose implicitly, extreme head rotations may still challenge the model; future work could incorporate separate pose augmentation.
- Cross‑identity generalization – the current setup does not train the model to directly transfer expressions across identities; extending RAF to a fully cross‑identity regime is an open research direction.
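On the retrieval overhead: as the summary notes, neighbors can be pre‑computed. One plausible way (an assumption, not the paper's implementation) is a one‑time brute‑force pass that stores the top‑k bank indices per training frame, reducing per‑step retrieval to a table lookup:

```python
import numpy as np

rng = np.random.default_rng(2)

bank = rng.normal(size=(2000, 24)).astype(np.float32)    # expression bank
frames = rng.normal(size=(100, 24)).astype(np.float32)   # one subject's descriptors

def precompute_neighbors(frames, bank, k=5):
    """One-time brute-force pass: for each training descriptor, store the
    indices of its k nearest bank entries (squared Euclidean distance)."""
    # (T, N) squared distances via |f - b|^2 = |f|^2 - 2 f.b + |b|^2
    d2 = (np.sum(frames**2, axis=1, keepdims=True)
          - 2.0 * frames @ bank.T
          + np.sum(bank**2, axis=1))
    # argpartition yields the k smallest per row without a full sort.
    return np.argpartition(d2, kth=k, axis=1)[:, :k]

table = precompute_neighbors(frames, bank, k=5)

# During training, retrieval is then a lookup: sample one of the k
# precomputed neighbors per frame selected for swapping.
choice = table[np.arange(len(frames)), rng.integers(0, 5, size=len(frames))]
donors = bank[choice]
```

For banks too large for a dense distance matrix, the same table could be built with an approximate nearest‑neighbor index, at the cost of occasionally missing the true neighbor.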
Authors
- Matan Levy
- Gavriel Habib
- Issar Tzachor
- Dvir Samuel
- Rami Ben‑Ari
- Nir Darshan
- Or Litany
- Dani Lischinski
Paper Information
- arXiv ID: 2603.08645v1
- Categories: cs.CV, cs.GR, cs.LG
- Published: March 9, 2026