[Paper] Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
Source: arXiv - 2601.05911v1
Overview
The Pantagruel project introduces a family of self‑supervised encoder models that handle both French text and French speech. By training the encoders to predict continuous feature‑space representations rather than modality‑specific discrete tokens, the authors obtain a unified architecture that captures linguistic patterns and acoustic cues more efficiently than traditional modality‑specific models.
Key Contributions
- Unified encoder design – a single architecture that can ingest either raw audio waveforms or tokenized text without any structural changes (a minimal sketch follows this list).
- Feature‑space self‑supervision – predicts continuous target embeddings rather than discrete tokens, enabling richer cross‑modal learning.
- Large‑scale French pre‑training corpora:
  - Text: French Wikipedia, OSCAR, and the CroissantLLM corpus (hundreds of millions of sentences).
  - Speech: Multilingual LibriSpeech, LeBenchmark, and the newly released INA‑100k corpus (100,000 hours of French broadcast audio).
- Strong empirical results on a wide spectrum of French NLP and speech tasks (FLUE, LeBenchmark, etc.), often surpassing state‑of‑the‑art French models such as CamemBERT, FlauBERT, and LeBenchmark 2.0.
- Open‑source release of the pretrained models and the INA‑100k dataset, lowering the barrier for French multimodal research and product development.
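To make the unified design concrete, here is a minimal PyTorch sketch of a shared Transformer trunk fed by thin modality‑specific front‑ends. All class names, layer sizes, and hyperparameters are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    """Sketch of a single trunk serving both modalities (names/sizes assumed)."""

    def __init__(self, vocab_size=32000, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        # Modality-specific front-ends project each input type to d_model.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.speech_frontend = nn.Sequential(  # strided convs downsample raw audio
            nn.Conv1d(1, d_model, kernel_size=10, stride=5),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=8, stride=4),
        )
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.trunk = nn.TransformerEncoder(layer, n_layers)  # shared across modalities

    def forward(self, x, modality):
        if modality == "text":    # x: (batch, seq_len) token ids
            h = self.text_embed(x)
        else:                     # x: (batch, n_samples) raw waveform
            h = self.speech_frontend(x.unsqueeze(1)).transpose(1, 2)
        return self.trunk(h)      # (batch, frames, d_model) for either modality

encoder = UnifiedEncoder()
text_features = encoder(torch.randint(0, 32000, (2, 16)), modality="text")
speech_features = encoder(torch.randn(2, 16000), modality="speech")
```

Both calls return features of the same dimensionality, which is what makes a single downstream head reusable across text and speech inputs.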
Methodology
- Separate modality encoders – a text encoder (based on a Transformer language model) and a speech encoder (based on a convolution‑augmented Transformer). Both share the same high‑level architecture and output dimensionality.
- Self‑supervised objective – instead of classic masked‑language‑modeling (MLM) or contrastive audio‑text alignment, Pantagruel masks portions of the input and asks the encoder to reconstruct continuous target vectors pre‑computed by a teacher network. This “feature‑space prediction” encourages the model to learn contextualized embeddings that are directly comparable across modalities (sketched in the first code block after this list).
- Large‑scale pre‑training – each encoder is trained on its respective corpus for several weeks on multi‑GPU clusters, using mixed‑precision training and gradient accumulation to fit large effective batch sizes within memory (see the training‑step snippet after this list).
- Fine‑tuning – downstream tasks receive a lightweight classification or regression head on top of the frozen encoder, following the standard “pre‑train → fine‑tune” paradigm (see the second sketch after this list).
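A minimal sketch of the feature‑space objective, reusing the UnifiedEncoder above and assuming a data2vec‑style EMA teacher; the masking scheme, loss, and decay value are illustrative choices, not the paper's exact recipe.

```python
import copy
import torch
import torch.nn.functional as F

# Student = trainable encoder; teacher = EMA copy that produces the
# continuous regression targets (a data2vec-style assumption, not
# necessarily the paper's exact target network).
student = UnifiedEncoder()                     # defined in the sketch above
teacher = copy.deepcopy(student).requires_grad_(False)

@torch.no_grad()
def ema_update(decay=0.999):
    # teacher <- decay * teacher + (1 - decay) * student
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1.0 - decay)

def feature_prediction_loss(x, modality, mask_prob=0.15):
    # The teacher encodes the clean input into continuous targets.
    with torch.no_grad():
        targets = teacher(x, modality)         # (batch, frames, d_model)
    # Random per-position mask. Simplification: real systems mask spans
    # and replace masked inputs with a learned embedding before encoding;
    # here the student sees the clean input for brevity.
    mask = torch.rand(targets.shape[:2], device=targets.device) < mask_prob
    preds = student(x, modality)
    return F.smooth_l1_loss(preds[mask], targets[mask])
```

A pre‑training step could then wrap this loss in mixed precision with gradient accumulation; the settings are illustrative and `batches` stands in for a real data loader:

```python
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 8                                # effective batch = 8 x micro-batch

for step, (x, modality) in enumerate(batches):
    with torch.autocast(device_type="cuda"):   # mixed-precision forward pass
        loss = feature_prediction_loss(x, modality) / accum_steps
    scaler.scale(loss).backward()              # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                 # unscale gradients + optimizer step
        scaler.update()
        optimizer.zero_grad()
        ema_update()                           # refresh the teacher after each step
```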
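Finally, a sketch of the downstream recipe: freeze the encoder and train only a small head. The mean‑pooling and head width are assumptions; the paper specifies only a lightweight classification or regression head.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Lightweight task head trained on top of the frozen encoder."""

    def __init__(self, d_model=768, n_classes=3):
        super().__init__()
        self.proj = nn.Linear(d_model, n_classes)

    def forward(self, features):                # features: (batch, frames, d_model)
        return self.proj(features.mean(dim=1))  # mean-pool over time, then classify

encoder = UnifiedEncoder().eval()               # from the sketch above
encoder.requires_grad_(False)                   # frozen: only the head is trained
head = ClassificationHead(n_classes=3)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

with torch.no_grad():
    feats = encoder(torch.randint(0, 32000, (4, 32)), modality="text")
logits = head(feats)                            # (4, 3) class scores
```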
Results & Findings
| Task (modality) | Baseline(s) | Pantagruel Score | Gain over Baseline |
|---|---|---|---|
| French GLUE (FLUE) – sentiment | CamemBERT | 92.1% | +1.8 pts |
| Speech intent classification (LeBenchmark) | LeBenchmark 2.0 | 94.5% | +2.3 pts |
| Named‑entity recognition (text) | FlauBERT | 96.7% | +0.9 pts |
| Speech‑to‑text keyword spotting | Multilingual LibriSpeech model | 89.4% | +3.1 pts |
- Across all evaluated tasks, Pantagruel matches or exceeds the best French‑only baselines while using a single shared architecture.
- The feature‑space objective yields smoother convergence and better generalisation, especially on low‑resource speech domains (e.g., regional accents in INA‑100k).
- Ablation studies show that removing the continuous target prediction drops performance by 2–4 percentage points, confirming its central role.
Practical Implications
- Rapid prototyping of multimodal French AI – developers can plug the same encoder into a chatbot, a voice assistant, or a transcription pipeline without swapping models.
- Cost‑effective deployment – a unified model reduces memory footprint and simplifies serving infrastructure (one Docker image, one set of inference APIs).
- Better handling of noisy broadcast audio – thanks to the diverse INA‑100k pre‑training data, the speech encoder is robust to background music, overlapping speakers, and varied recording conditions typical of radio/TV archives.
- Transfer learning for niche domains – fine‑tuning on a small labeled set (e.g., legal transcripts or medical dictations) is expected to be more data‑efficient because the encoder already captures cross‑modal linguistic regularities.
- Open‑source ecosystem – the released checkpoints and dataset enable the community to build French‑centric multimodal products faster, from automated subtitling tools to multimodal sentiment analytics (a loading sketch follows this list).
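As an illustration of the intended workflow, a released text checkpoint might be loaded through the standard transformers interface. The repository ID below is a placeholder, not the actual released name, and the real checkpoints may require a different loading path:

```python
from transformers import AutoModel, AutoTokenizer

repo_id = "pantagruel/french-text-encoder"      # placeholder ID, not the real checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo_id)
encoder = AutoModel.from_pretrained(repo_id)

inputs = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt")
features = encoder(**inputs).last_hidden_state  # contextual embeddings (1, seq, dim)
```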
Limitations & Future Work
- Language scope – Pantagruel is currently French‑only; extending the approach to truly multilingual settings would require additional cross‑lingual alignment work.
- Compute requirements – pre‑training on 100 k‑hour audio still demands substantial GPU resources, which may be prohibitive for smaller labs.
- Downstream adaptation – while the encoder is universal, task‑specific heads still need careful design for complex generation tasks (e.g., end‑to‑end speech‑to‑text).
- Future directions suggested by the authors include:
  - Integrating a joint text‑speech encoder that can process mixed inputs (e.g., audio with embedded subtitles).
  - Exploring cross‑modal contrastive losses to further tighten the alignment between modalities.
  - Scaling the approach to other high‑resource languages to validate its generality.
Authors
- Phuong-Hang Le
- Valentin Pelloin
- Arnault Chatelain
- Maryem Bouziane
- Mohammed Ghennai
- Qianwen Guan
- Kirill Milintsevich
- Salima Mdhaffar
- Aidan Mannion
- Nils Defauw
- Shuyue Gu
- Alexandre Audibert
- Marco Dinarelli
- Yannick Estève
- Lorraine Goeuriot
- Steffen Lalande
- Nicolas Hervé
- Maximin Coavoux
- François Portet
- Étienne Ollion
- Marie Candito
- Maxime Peyrard
- Solange Rossato
- Benjamin Lecouteux
- Aurélie Nardy
- Gilles Sérasset
- Vincent Segonne
- Solène Evain
- Diandra Fabre
- Didier Schwab
Paper Information
- arXiv ID: 2601.05911v1
- Categories: cs.CL
- Published: January 9, 2026