[Paper] Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization
Source: arXiv - 2603.08645v1
Overview
The paper introduces Retrieval‑Augmented Faces (RAF), a training‑time data‑augmentation technique that substantially improves the ability of template‑free neural head avatars to reproduce a wide variety of facial expressions. By pulling in “nearest‑neighbor” expressions from a large, unlabeled expression bank, RAF teaches the model to disentangle identity from expression, making avatars far more robust when driven by unseen or out‑of‑distribution motions.
Key Contributions
- Retrieval‑augmented training pipeline that swaps a subset of a subject’s expression features with those of visually similar expressions from a large, unlabeled bank.
- Expression‑diversity boost without requiring any extra annotations, cross‑identity paired data, or changes to the underlying avatar architecture.
- Quantitative and user‑study validation showing that retrieved neighbors are perceptually closer in pose and expression, and that RAF yields consistent gains on the NeRSemble benchmark (both self‑driving and cross‑driving).
- Analysis of identity‑expression decoupling, demonstrating that the augmentation forces the deformation field to generalize beyond the limited expression set seen during standard training.
Methodology
- Collect an expression bank – a large repository of facial frames captured from many subjects, but without any labels (e.g., “smile”, “frown”).
- Feature extraction – each frame is encoded into a compact expression descriptor using the same encoder that the avatar model already employs.
- Nearest‑neighbor retrieval – for each training frame of the target subject, a small random subset of its expression descriptors is replaced by the descriptors of the closest matches from the bank (based on Euclidean distance in feature space).
- Reconstruction loss – the model still tries to reconstruct the original target frame, even though part of the input now comes from a different identity. This forces the deformation network to learn a mapping that works for a broader set of expression conditions while preserving the subject’s identity.
- Training continues as usual – no architectural changes, no extra supervision, and the retrieval step is lightweight (can be pre‑computed or performed on‑the‑fly with an approximate nearest‑neighbor index).
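The retrieve‑and‑swap step described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code: the bank contents, descriptor dimension, and the `swap_ratio` parameter are all assumptions, and `raf_augment` stands in for whatever hook the real pipeline uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression bank: N descriptors of dimension D, assumed to be
# pre-extracted with the avatar model's own expression encoder.
N, D = 1000, 32
bank = rng.normal(size=(N, D)).astype(np.float32)

def raf_augment(descriptors, bank, swap_ratio=0.25, rng=rng):
    """Replace a random subset of a subject's expression descriptors with
    their nearest neighbors (Euclidean distance) from the unlabeled bank."""
    T = descriptors.shape[0]
    n_swap = max(1, int(T * swap_ratio))
    idx = rng.choice(T, size=n_swap, replace=False)
    # Pairwise Euclidean distances between the selected descriptors and the bank.
    diffs = descriptors[idx][:, None, :] - bank[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # shape (n_swap, N)
    nearest = dists.argmin(axis=1)           # index of the closest bank entry
    augmented = descriptors.copy()
    augmented[idx] = bank[nearest]
    return augmented, idx

# Example: 16 training-frame descriptors for one target subject.
frames = rng.normal(size=(16, D)).astype(np.float32)
aug, swapped = raf_augment(frames, bank)
```

The model would then be trained to reconstruct the *original* target frames from `aug`; that mismatch between input descriptors and reconstruction target is what drives the identity‑expression disentanglement described above.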
Results & Findings
- Expression fidelity improves by ~10‑15 % on the NeRSemble benchmark when measuring landmark error and perceptual similarity, both for self‑driving (same subject drives) and cross‑driving (different subject drives).
- Robustness to distribution shift – avatars trained with RAF maintain visual quality when driven by extreme or rare expressions that were absent from the original subject’s capture set.
- User study (N = 30) confirms that participants perceive the retrieved expressions as more similar to the target expression than random baselines, validating the retrieval quality.
- Identity preservation remains stable; the model does not “leak” the donor’s facial traits into the target avatar, thanks to the reconstruction loss that anchors the output to the original identity.
Practical Implications
- Game & VR developers can generate high‑fidelity, animatable head avatars from a modest capture session and still support a rich repertoire of player‑driven expressions without re‑capturing every nuance.
- Live‑streaming & virtual‑influencer pipelines benefit from more reliable facial reenactment when the source performer makes spontaneous, out‑of‑distribution gestures.
- AR/VR telepresence systems can maintain expressive fidelity even when network constraints force the use of low‑bitrate or compressed expression descriptors; RAF‑trained models are more tolerant to such noise.
- Tooling integration – because RAF is a data‑augmentation layer, it can be dropped into existing avatar training scripts (e.g., PyTorch, TensorFlow) with minimal code changes, accelerating adoption.
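Because RAF requires no architectural changes, the integration pattern can be as simple as wrapping an existing batch loader. The sketch below assumes a hypothetical `load_batch` function and a fixed `swap_prob`; neither name comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for an existing pipeline: a frozen expression bank and a batch
# loader yielding (image, expression-descriptor) pairs. Both are hypothetical.
bank = rng.normal(size=(500, 16)).astype(np.float32)

def load_batch(batch_size=8, dim=16):
    # Hypothetical loader; a real pipeline would read captured frames here.
    images = rng.normal(size=(batch_size, 4, 4, 3)).astype(np.float32)
    descriptors = rng.normal(size=(batch_size, dim)).astype(np.float32)
    return images, descriptors

def with_raf(loader, bank, swap_prob=0.3, rng=rng):
    """Wrap a batch loader so each descriptor is independently swapped for
    its nearest bank neighbor with probability `swap_prob`. The images
    (reconstruction targets) are left untouched."""
    def augmented_loader(*args, **kwargs):
        images, desc = loader(*args, **kwargs)
        desc = desc.copy()
        swap = rng.random(desc.shape[0]) < swap_prob
        if swap.any():
            d = np.linalg.norm(desc[swap][:, None, :] - bank[None, :, :], axis=-1)
            desc[swap] = bank[d.argmin(axis=1)]
        return images, desc
    return augmented_loader

loader = with_raf(load_batch, bank)
images, desc = loader()
```

Note that the reconstruction targets (`images`) pass through unchanged; as described in the Methodology section, that is what anchors the output to the subject's identity.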
Limitations & Future Work
- Bank quality dependence – the augmentation’s effectiveness hinges on the diversity and coverage of the expression bank; a poorly populated bank may yield limited gains.
- Computational overhead – nearest‑neighbor retrieval adds a modest cost during training (especially for very large banks), though inference remains unchanged.
- No explicit pose handling – while expression descriptors capture pose implicitly, extreme head rotations may still challenge the model; future work could incorporate separate pose augmentation.
- Cross‑identity generalization – the current setup does not train the model to directly transfer expressions across identities; extending RAF to a fully cross‑identity regime is an open research direction.
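On the retrieval overhead: as the summary notes, neighbors can be pre‑computed. One plausible way (an assumption, not the paper's implementation) is a one‑time brute‑force pass that stores the top‑k bank indices per training frame, reducing per‑step retrieval to a table lookup:

```python
import numpy as np

rng = np.random.default_rng(2)

bank = rng.normal(size=(2000, 24)).astype(np.float32)    # expression bank
frames = rng.normal(size=(100, 24)).astype(np.float32)   # one subject's descriptors

def precompute_neighbors(frames, bank, k=5):
    """One-time brute-force pass: for each training descriptor, store the
    indices of its k nearest bank entries (squared Euclidean distance)."""
    # (T, N) squared distances via |f - b|^2 = |f|^2 - 2 f.b + |b|^2
    d2 = (np.sum(frames**2, axis=1, keepdims=True)
          - 2.0 * frames @ bank.T
          + np.sum(bank**2, axis=1))
    # argpartition yields the k smallest per row without a full sort.
    return np.argpartition(d2, kth=k, axis=1)[:, :k]

table = precompute_neighbors(frames, bank, k=5)

# During training, retrieval is then a lookup: sample one of the k
# precomputed neighbors per frame selected for swapping.
choice = table[np.arange(len(frames)), rng.integers(0, 5, size=len(frames))]
donors = bank[choice]
```

For banks too large for a dense distance matrix, the same table could be built with an approximate nearest‑neighbor index, at the cost of occasionally missing the true neighbor.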
Authors
- Matan Levy
- Gavriel Habib
- Issar Tzachor
- Dvir Samuel
- Rami Ben‑Ari
- Nir Darshan
- Or Litany
- Dani Lischinski
Paper Information
- arXiv ID: 2603.08645v1
- Categories: cs.CV, cs.GR, cs.LG
- Published: March 9, 2026