[Paper] Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization

Published: March 9, 2026
4 min read
Source: arXiv

Overview

The paper introduces Retrieval‑Augmented Faces (RAF), a training‑time data‑augmentation technique that dramatically improves the ability of template‑free, neural head avatars to reproduce a wide variety of facial expressions. By pulling in “nearest‑neighbor” expressions from a massive, unlabeled expression bank, RAF teaches the model to disentangle identity from expression, making avatars far more robust when driven by unseen or out‑of‑distribution motions.

Key Contributions

  • Retrieval‑augmented training pipeline that swaps a subset of a subject’s expression features with those of visually similar expressions from a large, unlabeled bank.
  • Expression‑diversity boost without requiring any extra annotations, cross‑identity paired data, or changes to the underlying avatar architecture.
  • Quantitative and user‑study validation showing that retrieved neighbors are perceptually closer in pose and expression, and that RAF yields consistent gains on the NeRSemble benchmark (both self‑driving and cross‑driving).
  • Analysis of identity‑expression decoupling, demonstrating that the augmentation forces the deformation field to generalize beyond the limited expression set seen during standard training.

Methodology

  1. Collect an expression bank – a large repository of facial frames captured from many subjects, but without any labels (e.g., “smile”, “frown”).
  2. Feature extraction – each frame is encoded into a compact expression descriptor using the same encoder that the avatar model already employs.
  3. Nearest‑neighbor retrieval – for each training frame of the target subject, a small random subset of its expression descriptors is replaced by the descriptors of the closest matches from the bank (based on Euclidean distance in feature space).
  4. Reconstruction loss – the model still tries to reconstruct the original target frame, even though part of the input now comes from a different identity. This forces the deformation network to learn a mapping that works for a broader set of expression conditions while preserving the subject’s identity.
  5. Training continues as usual – no architectural changes, no extra supervision, and the retrieval step is lightweight (can be pre‑computed or performed on‑the‑fly with an approximate nearest‑neighbor index).
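The retrieval‑and‑swap step (items 3–4 above) can be sketched in a few lines of NumPy. The function name, the swap probability, and the exact swap granularity are illustrative assumptions, not details from the paper:

```python
import numpy as np

def raf_augment(expr_feats, bank_feats, swap_prob=0.3, rng=None):
    """Replace a random subset of expression descriptors with their
    nearest neighbors from an unlabeled expression bank.

    expr_feats: (N, D) descriptors for the target subject's frames.
    bank_feats: (B, D) descriptors in the expression bank.
    Returns the augmented descriptors and the boolean swap mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise squared Euclidean distances in feature space: (N, B).
    d2 = ((expr_feats[:, None, :] - bank_feats[None, :, :]) ** 2).sum(-1)
    nn_idx = d2.argmin(axis=1)                 # closest bank entry per frame
    swap = rng.random(expr_feats.shape[0]) < swap_prob
    out = expr_feats.copy()
    out[swap] = bank_feats[nn_idx[swap]]       # swap in the neighbor's descriptor
    return out, swap
```

The reconstruction loss is still computed against the original target frame, so the swapped descriptors act purely as harder input conditions. For large banks, the brute‑force distance matrix here would be replaced by a pre‑computed or approximate nearest‑neighbor index, as the paper notes.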

Results & Findings

  • Expression fidelity improves by ~10‑15 % on the NeRSemble benchmark when measuring landmark error and perceptual similarity, both for self‑driving (same subject drives) and cross‑driving (different subject drives).
  • Robustness to distribution shift – avatars trained with RAF maintain visual quality when driven by extreme or rare expressions that were absent from the original subject’s capture set.
  • User study (N = 30) confirms that participants perceive the retrieved expressions as more similar to the target expression than random baselines, validating the retrieval quality.
  • Identity preservation remains stable; the model does not “leak” the donor’s facial traits into the target avatar, thanks to the reconstruction loss that anchors the output to the original identity.
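Landmark error, one of the metrics cited above, is conventionally the mean Euclidean distance between predicted and ground‑truth facial landmarks; the exact evaluation protocol used in the paper is an assumption here, but a minimal version looks like:

```python
import numpy as np

def mean_landmark_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    2D facial landmarks, averaged over landmarks and frames.

    pred, gt: arrays of shape (frames, landmarks, 2).
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```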

Practical Implications

  • Game & VR developers can generate high‑fidelity, animatable head avatars from a modest capture session and still support a rich repertoire of player‑driven expressions without re‑capturing every nuance.
  • Live‑streaming & virtual‑influencer pipelines benefit from more reliable facial reenactment when the source performer makes spontaneous, out‑of‑distribution gestures.
  • AR/VR telepresence systems can maintain expressive fidelity even when network constraints force the use of low‑bitrate or compressed expression descriptors; RAF‑trained models are more tolerant to such noise.
  • Tooling integration – because RAF is a data‑augmentation layer, it can be dropped into existing avatar training scripts (e.g., PyTorch, TensorFlow) with minimal code changes, accelerating adoption.

Limitations & Future Work

  • Bank quality dependence – the augmentation’s effectiveness hinges on the diversity and coverage of the expression bank; a poorly populated bank may yield limited gains.
  • Computational overhead – nearest‑neighbor retrieval adds a modest cost during training (especially for very large banks), though inference remains unchanged.
  • No explicit pose handling – while expression descriptors capture pose implicitly, extreme head rotations may still challenge the model; future work could incorporate separate pose augmentation.
  • Cross‑identity generalization – the current setup does not train the model to directly transfer expressions across identities; extending RAF to a fully cross‑identity regime is an open research direction.

Authors

  • Matan Levy
  • Gavriel Habib
  • Issar Tzachor
  • Dvir Samuel
  • Rami Ben‑Ari
  • Nir Darshan
  • Or Litany
  • Dani Lischinski

Paper Information

  • arXiv ID: 2603.08645v1
  • Categories: cs.CV, cs.GR, cs.LG
  • Published: March 9, 2026