[Paper] From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

Published: 3 days ago (June 11, 2026 at 01:41 PM EDT)

2 min read

Source: arXiv

Source: arXiv - 2606.13630v1

Overview

The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.

Key Contributions

This paper presents research in the following areas:

cs.CL

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.CL.

Authors

Pedro Correa
Olivier Perrotin
Samir Sadok
Paula Costa
Thomas Hueber

Paper Information

arXiv ID: 2606.13630v1
Categories: cs.CL
Published: June 11, 2026
PDF: Download PDF

[Paper] From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

[Paper] Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

[Paper] Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

[Paper] HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents