[Paper] From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation
Source: arXiv - 2606.13630v1
Overview
The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: self-supervised learning (SSL) features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We find that encoding phonetic classes benefits accurate facial animation prediction for both semantic and label-based representations, which reach comparable facial animation quality. Building on the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space from which both speech and 3D facial motion are decoded.
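The probing analyses mentioned above relate discrete tokens to phonetic units. The following is a minimal sketch of one simple probe of that kind, assuming frame-aligned token IDs and phone labels are available. The function name `token_phone_purity` and the toy data are hypothetical and not taken from the paper, which may use a different probing method.

```python
from collections import Counter, defaultdict


def token_phone_purity(tokens, phones):
    """Map each discrete token to its most frequent phone label and
    return (mapping, purity), where purity is the fraction of frames
    whose phone label equals the majority phone of their token.

    Hypothetical helper for illustration; not the paper's method.
    """
    if len(tokens) != len(phones):
        raise ValueError("tokens and phones must be frame-aligned")
    if not tokens:
        raise ValueError("at least one frame is required")

    # Count how often each phone co-occurs with each token.
    counts = defaultdict(Counter)
    for token, phone in zip(tokens, phones):
        counts[token][phone] += 1

    # Majority phone per token, and the number of frames it explains.
    majority = {t: c.most_common(1)[0][0] for t, c in counts.items()}
    correct = sum(c[majority[t]] for t, c in counts.items())
    return majority, correct / len(tokens)


# Toy frame-level data (invented for illustration only).
tokens = [3, 3, 7, 7, 7, 1, 1, 3]
phones = ["p", "p", "a", "a", "o", "m", "m", "b"]

majority, purity = token_phone_purity(tokens, phones)
print(majority)  # {3: 'p', 7: 'a', 1: 'm'}
print(purity)    # 0.75
```

A purity close to 1 would suggest that each token corresponds mostly to a single phone class, whereas a low value would suggest the tokens encode other information. This is only one coarse indicator under the stated assumptions.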
Key Contributions
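According to the abstract, the paper contributes the following:
- A comparison of four speech representation families for speech-driven 3D facial synthesis, evaluated across two facial decoders with objective metrics and a perceptual evaluation.
- Probing analyses that relate tokenized representations to phonetic units and to articulatory deformations.
- Evidence that encoding phonetic classes benefits facial animation prediction for both semantic and label-based representations, with comparable animation quality.
- An Audio Visual Text-to-Speech (AVTTS) pipeline that decodes both speech and 3D facial motion from a shared discrete representation.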
Methodology
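The abstract describes an evaluation of four speech representation families across two facial decoders, using objective metrics and a perceptual evaluation, complemented by probing analyses of the tokenized representations. Please refer to the full paper for the detailed methodology.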
Practical Implications
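The results suggest that representations encoding phonetic classes are a sound choice for speech-driven 3D facial animation, and that a shared discrete representation can drive both speech and facial motion decoding in an AVTTS pipeline.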
Authors
- Pedro Correa
- Olivier Perrotin
- Samir Sadok
- Paula Costa
- Thomas Hueber
Paper Information
- arXiv ID: 2606.13630v1
- Categories: cs.CL
- Published: June 11, 2026