[Paper] Stable Signer: Hierarchical Sign Language Generative Model
Source: arXiv - 2512.04048v1
Overview
The paper introduces Stable Signer, an end‑to‑end generative model that turns written text directly into high‑quality, multi‑style sign‑language videos. By collapsing the traditional, error‑prone multi‑stage pipeline into two stages, text understanding and pose‑to‑video rendering, the authors report a 48.6 % improvement over previous state‑of‑the‑art (SOTA) methods.
Key Contributions
- Hierarchical, end‑to‑end architecture that eliminates the intermediate Gloss‑2‑Pose step, reducing error accumulation.
- Sign Language Understanding Linker (SLUL): a novel text‑to‑gloss module trained with a Semantic‑Aware Gloss Masking (SAGM) loss, which better preserves gloss semantics during learning.
- SLP‑MoE hand‑gesture rendering block: a mixture‑of‑experts (MoE) network specialized for realistic hand‑gesture synthesis across multiple signing styles.
- 48.6 % performance gain over the previous best generative approaches on standard sign‑language benchmarks.
- Multi‑style video output that can adapt to different signer avatars or regional signing variations without retraining the whole model.
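The contributions above describe a two‑stage design (text understanding, then pose‑to‑video rendering). The sketch below shows one way such an interface could be wired together in PyTorch; the class names, layer choices, and tensor sizes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical two-stage interface for a Stable Signer-style pipeline.
# Class and method names are illustrative, not the authors' API.
import torch
import torch.nn as nn


class TextUnderstanding(nn.Module):
    """Stage 1: map a tokenized sentence to a gloss-token sequence (the SLUL role)."""

    def __init__(self, vocab_size: int, gloss_vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.to_gloss = nn.Linear(dim, gloss_vocab_size)

    def forward(self, text_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(text_ids))
        return self.to_gloss(h)  # per-position gloss logits


class PoseToVideo(nn.Module):
    """Stage 2: map gloss embeddings to pose trajectories, then to video frames."""

    def __init__(self, gloss_vocab_size: int, dim: int = 256, n_joints: int = 54):
        super().__init__()
        self.gloss_embed = nn.Embedding(gloss_vocab_size, dim)
        self.pose_head = nn.Linear(dim, n_joints * 3)  # 3-D joint coordinates
        # A real system would render frames with a diffusion-based video model;
        # a linear layer stands in for that renderer in this sketch.
        self.renderer = nn.Linear(n_joints * 3, 64 * 64 * 3)

    def forward(self, gloss_ids: torch.Tensor) -> torch.Tensor:
        poses = self.pose_head(self.gloss_embed(gloss_ids))
        frames = self.renderer(poses)
        return frames.view(*gloss_ids.shape, 3, 64, 64)


# Smoke test with random token ids.
text = torch.randint(0, 1000, (1, 12))
stage1 = TextUnderstanding(vocab_size=1000, gloss_vocab_size=500)
stage2 = PoseToVideo(gloss_vocab_size=500)
gloss_ids = stage1(text).argmax(dim=-1)
video = stage2(gloss_ids)
print(video.shape)  # torch.Size([1, 12, 3, 64, 64])
```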
Methodology
Text Understanding (Prompt2Gloss & Text2Gloss)
- The input sentence is first tokenized and passed through the SLUL, which predicts a gloss sequence (the linguistic representation of signs).
- Instead of a plain cross‑entropy loss, the authors mask gloss tokens based on semantic similarity and apply the SAGM loss, encouraging the model to focus on meaning rather than exact token matching.
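The paper's exact SAGM formulation is not reproduced in this summary; the sketch below illustrates the underlying idea of weighting gloss supervision by semantic similarity rather than exact token matching. The anchor construction, the weighting scheme, and the embedding table are assumptions made for illustration.

```python
# A minimal sketch of a semantic-aware masking loss in the spirit of SAGM.
# Gloss tokens that are semantically central to the sentence receive a larger
# training weight, pushing the model toward meaning-preserving predictions.
import torch
import torch.nn.functional as F


def semantic_aware_gloss_loss(
    logits: torch.Tensor,            # (batch, seq, gloss_vocab) predicted gloss logits
    targets: torch.Tensor,           # (batch, seq) ground-truth gloss ids
    gloss_embeddings: torch.Tensor,  # (gloss_vocab, dim) gloss embedding table
) -> torch.Tensor:
    # Sentence-level semantic anchor: mean of the target gloss embeddings.
    target_vecs = gloss_embeddings[targets]                  # (batch, seq, dim)
    anchor = target_vecs.mean(dim=1, keepdim=True)           # (batch, 1, dim)

    # Per-token weight: cosine similarity to the anchor, rescaled to [0, 1].
    sim = F.cosine_similarity(target_vecs, anchor, dim=-1)   # (batch, seq)
    weights = (sim + 1.0) / 2.0

    # Weighted cross-entropy over the gloss vocabulary.
    ce = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(targets)
    return (weights * ce).mean()


# Usage with random tensors.
logits = torch.randn(2, 8, 500)
targets = torch.randint(0, 500, (2, 8))
table = torch.randn(500, 64)
print(semantic_aware_gloss_loss(logits, targets, table))
```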
Pose‑to‑Video Generation (Pose2Vid)
- The predicted gloss sequence drives a Mixture‑of‑Experts (MoE) decoder that produces 3‑D hand and body pose trajectories.
- Each expert specializes in a particular signing style (e.g., smooth vs. expressive), and a gating network selects the appropriate blend per frame.
- The pose stream is then fed to a neural renderer that synthesizes photorealistic video frames, leveraging recent advances in diffusion‑based video synthesis for stability and detail.
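A minimal sketch of a per‑frame mixture‑of‑experts pose decoder, in the spirit of the SLP‑MoE block, is shown below. The expert count, layer sizes, and gating design are assumptions rather than the paper's architecture.

```python
# A minimal sketch of a mixture-of-experts pose decoder, assuming each expert
# specializes in one signing style and a gating network blends them per frame.
import torch
import torch.nn as nn


class StyleMoEPoseDecoder(nn.Module):
    def __init__(self, dim: int = 256, n_experts: int = 3, n_joints: int = 54):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, n_joints * 3))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)  # per-frame gating logits

    def forward(self, gloss_features: torch.Tensor) -> torch.Tensor:
        # gloss_features: (batch, frames, dim)
        gate_weights = self.gate(gloss_features).softmax(dim=-1)       # (B, T, E)
        expert_out = torch.stack(
            [expert(gloss_features) for expert in self.experts], dim=-2
        )                                                               # (B, T, E, J*3)
        # Blend expert predictions with the per-frame gate.
        return (gate_weights.unsqueeze(-1) * expert_out).sum(dim=-2)   # (B, T, J*3)


decoder = StyleMoEPoseDecoder()
poses = decoder(torch.randn(1, 16, 256))
print(poses.shape)  # torch.Size([1, 16, 162])
```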
Training Pipeline
- The whole system is trained end‑to‑end with a combination of:
  - SAGM loss for gloss prediction,
  - pose reconstruction loss (L2 on joint coordinates),
  - video adversarial loss (GAN‑style) to improve realism, and
  - style consistency regularization to keep the output coherent across frames.
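The sketch below shows one plausible way to combine these four terms into a single objective. The weighting coefficients and the concrete forms of the adversarial and style‑consistency terms are assumptions; the paper's exact definitions are not reproduced here.

```python
# A minimal sketch of combining the four training terms into one objective.
# The lambda_* weights and the term definitions are hypothetical.
import torch
import torch.nn.functional as F


def total_loss(
    gloss_loss: torch.Tensor,        # SAGM term from the gloss predictor
    pred_poses: torch.Tensor,        # (B, T, J*3) predicted joint coordinates
    gt_poses: torch.Tensor,          # (B, T, J*3) ground-truth joint coordinates
    disc_fake_scores: torch.Tensor,  # discriminator logits on generated frames
    frame_features: torch.Tensor,    # (B, T, D) per-frame style features
    lambda_pose: float = 1.0,
    lambda_adv: float = 0.1,
    lambda_style: float = 0.05,
) -> torch.Tensor:
    # L2 pose reconstruction.
    pose_loss = F.mse_loss(pred_poses, gt_poses)
    # GAN-style generator loss: push discriminator scores toward "real".
    adv_loss = F.binary_cross_entropy_with_logits(
        disc_fake_scores, torch.ones_like(disc_fake_scores)
    )
    # Style consistency: penalize feature drift between consecutive frames.
    style_loss = (frame_features[:, 1:] - frame_features[:, :-1]).pow(2).mean()
    return gloss_loss + lambda_pose * pose_loss + lambda_adv * adv_loss + lambda_style * style_loss


# Demo with random tensors.
loss = total_loss(
    gloss_loss=torch.tensor(1.2),
    pred_poses=torch.randn(2, 16, 162),
    gt_poses=torch.randn(2, 16, 162),
    disc_fake_scores=torch.randn(2, 16),
    frame_features=torch.randn(2, 16, 256),
)
print(loss)
```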
Results & Findings
| Metric | Stable Signer | Prior SOTA |
|---|---|---|
| BLEU‑4 (gloss accuracy) | 0.71 | 0.48 |
| SSIM (video quality) | 0.84 | 0.73 |
| FRE (Fidelity‑to‑real‑sign) | 0.78 | 0.55 |
| Overall composite score | 1.48× improvement | — |
- The model reduces the average per‑frame error in hand pose by ~30 %, leading to smoother, more natural gestures.
- Human evaluation with deaf participants reported a sign‑language intelligibility increase from 62 % to 89 %.
- Multi‑style generation works out‑of‑the‑box: a single model can produce videos in three distinct signing styles with only a style token change.
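As a usage illustration of the style‑token claim, the snippet below reuses the hypothetical StyleMoEPoseDecoder from the Methodology sketch and conditions it on three different learned style embeddings; this embedding‑addition mechanism is an assumption standing in for the paper's style token.

```python
# Illustration of single-model multi-style output via a style token: the same
# gloss features are decoded three times with different style embeddings added.
# StyleMoEPoseDecoder is the hypothetical sketch from the Methodology section.
import torch
import torch.nn as nn

style_table = nn.Embedding(3, 256)       # one learned embedding per signing style
decoder = StyleMoEPoseDecoder()          # defined in the earlier sketch
gloss_features = torch.randn(1, 16, 256)

for style_id in range(3):
    conditioned = gloss_features + style_table(torch.tensor([style_id]))
    poses = decoder(conditioned)
    print(f"style {style_id}: poses {tuple(poses.shape)}")
```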
Practical Implications
- Real‑time captioning & translation services: Developers can integrate Stable Signer into video‑conferencing tools to provide on‑the‑fly sign‑language output without a heavy multi‑stage pipeline.
- Education & accessibility platforms: E‑learning sites can auto‑generate sign‑language videos for any textual content, dramatically lowering production costs.
- Avatar‑based communication: Game engines or VR environments can use the MoE block to animate avatars that sign in a style matching the user’s cultural background.
- Low‑resource sign languages: Because the model learns a compact gloss representation, it can be fine‑tuned on small datasets, enabling rapid deployment for under‑represented sign languages.
Limitations & Future Work
- Dataset bias: The training data primarily covers a few widely used sign languages (e.g., ASL, CSL); performance on less‑documented languages remains untested.
- Computational cost: The MoE rendering block, while flexible, adds GPU memory overhead, making deployment on edge devices challenging.
- Fine‑grained facial expressions: Current video synthesis focuses on hand and body motion; nuanced facial cues—critical for grammar in many sign languages—are still under‑represented.
Future research directions suggested by the authors include extending the model to incorporate facial expression generators, optimizing the MoE architecture for mobile inference, and curating multilingual gloss corpora to broaden language coverage.
Authors
- Sen Fang
- Yalin Feng
- Hongbin Zhong
- Yanxin Zhang
- Dimitris N. Metaxas
Paper Information
- arXiv ID: 2512.04048v1
- Categories: cs.CV, cs.CL, cs.CY
- Published: December 3, 2025