[Paper] From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

Published: June 11, 2026 at 01:41 PM EDT

Source: arXiv - 2606.13630v1

Overview

The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We find that encoding phonetic classes benefits accurate facial animation prediction for both semantic and label-based representations, which reach comparable facial animation quality. Building on the label-based representations, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space from which both speech and 3D facial motion are decoded.
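As an illustration of what probing tokenized representations against phonetic units can look like, here is a minimal sketch. It is not the paper's procedure, which this summary does not describe. It assumes frame-aligned discrete token IDs and phone labels, and computes phone purity, a common measure of how consistently each token maps to a single phone. The function name `phone_purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def phone_purity(tokens, phones):
    """Fraction of frames whose phone equals the majority phone of their token.

    Assumes `tokens` and `phones` are frame-aligned sequences of equal length.
    A value of 1.0 means every token always co-occurs with the same phone.
    """
    if len(tokens) != len(phones):
        raise ValueError("tokens and phones must be frame-aligned")
    if not tokens:
        raise ValueError("need at least one frame")

    # Count, for each discrete token, how often each phone label co-occurs.
    by_token = defaultdict(Counter)
    for tok, ph in zip(tokens, phones):
        by_token[tok][ph] += 1

    # Sum the count of the most frequent phone for every token.
    majority = sum(c.most_common(1)[0][1] for c in by_token.values())
    return majority / len(tokens)


# Toy, made-up data (not from the paper): 8 frames, 3 distinct tokens.
tokens = [7, 7, 7, 3, 3, 12, 12, 12]
phones = ["p", "p", "b", "a", "a", "m", "m", "a"]
purity = phone_purity(tokens, phones)
print(purity)
```

Higher purity suggests that the token inventory separates phonetic classes more cleanly, the property the overview associates with accurate facial animation prediction.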

Key Contributions

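Based on the abstract, the paper contributes:

  • A comparison of four speech representation families for speech-driven 3D facial animation, evaluated across two facial decoders with objective metrics and a perceptual evaluation.
  • Probing analyses that relate tokenized representations to phonetic units and to articulatory deformations.
  • An Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space for decoding speech and 3D facial motion.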

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

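This research contributes to computation and language (cs.CL), specifically to speech-driven 3D facial animation and to audio-visual speech synthesis from a shared discrete representation.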

Authors

  • Pedro Correa
  • Olivier Perrotin
  • Samir Sadok
  • Paula Costa
  • Thomas Hueber

Paper Information

  • arXiv ID: 2606.13630v1
  • Categories: cs.CL
  • Published: June 11, 2026