[Paper] Sentiment Analysis of German Sign Language Fairy Tales
Source: arXiv - 2604.16138v1
Overview
The paper introduces the first publicly‑available dataset and a machine‑learning pipeline for sentiment analysis of German Sign Language (DGS) fairy‑tale videos. By combining textual sentiment labels (derived from the original German stories) with pose‑and‑facial motion features extracted from the videos, the authors train an explainable model that can tell whether a signed segment conveys a negative, neutral, or positive emotion. The work bridges a gap between natural‑language sentiment research and the visual‑gesture domain, opening doors for more inclusive language‑technology tools.
Key Contributions
- A new multimodal dataset: 1,200+ aligned pairs of German fairy‑tale text snippets and their DGS video renditions, each annotated with a three‑level sentiment label.
- Robust text‑based sentiment labeling: Leveraging four large language models (LLMs) and majority voting to achieve a high inter‑annotator agreement (Krippendorff’s α = 0.781).
- Feature extraction pipeline: Use of MediaPipe to capture 33 facial landmarks and 33 body‑pose keypoints per frame, turning raw video into structured motion descriptors.
- Explainable classification model: An XGBoost classifier that predicts sentiment from the extracted motion features, achieving a balanced accuracy of 0.631 across the three classes.
- Insightful feature importance analysis: Demonstrates that both facial cues (eyebrow and mouth movement) and body cues (hips, elbows, shoulders) are critical for sentiment discrimination in sign language.
Methodology
Textual Sentiment Ground Truth
- The original German fairy‑tale passages were fed to four state‑of‑the‑art LLMs (e.g., GPT‑4, LLaMA‑2).
- Each model produced a sentiment label (negative/neutral/positive).
- A majority‑vote scheme resolved disagreements, yielding a high‑quality label set.
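The majority-vote step can be sketched as follows. This is a minimal illustration, not the authors' released code; the tie-breaking fallback to "neutral" is an assumption, since the summary does not state how ties among the four models were resolved.

```python
from collections import Counter

def majority_vote(labels):
    """Resolve disagreement among per-model sentiment labels by majority vote.

    `labels` is a list like ["negative", "neutral", "negative", "negative"].
    The fallback to "neutral" on a tie is an assumption for this sketch.
    """
    counts = Counter(labels)
    top, top_count = counts.most_common(1)[0]
    # If two or more labels share the top count, there is no majority.
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return "neutral"
    return top

print(majority_vote(["negative", "neutral", "negative", "negative"]))  # negative
```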
Video Feature Extraction
- Each DGS video segment was processed frame‑by‑frame with MediaPipe.
- The pipeline outputs 33 facial landmarks (e.g., eyebrow height, mouth opening) and 33 body‑pose landmarks (e.g., shoulder rotation, hip displacement).
- Temporal statistics (mean, variance, velocity) were computed over the segment to form a fixed‑length feature vector.
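A minimal sketch of turning per-frame landmarks into a fixed-length segment vector, assuming a `(n_frames, n_points, 2)` array of x/y coordinates; the paper's exact descriptor layout may differ.

```python
import numpy as np

def segment_features(landmarks):
    """Collapse per-frame landmark coordinates into one fixed-length vector.

    `landmarks`: array of shape (n_frames, n_points, 2) holding x/y positions.
    Returns mean, variance, and mean frame-to-frame speed per coordinate.
    """
    flat = landmarks.reshape(landmarks.shape[0], -1)  # (n_frames, n_points*2)
    velocity = np.diff(flat, axis=0)                  # frame-to-frame displacement
    return np.concatenate([
        flat.mean(axis=0),              # mean position per coordinate
        flat.var(axis=0),               # variance per coordinate
        np.abs(velocity).mean(axis=0),  # mean speed per coordinate
    ])

# 30 frames, 66 landmarks (33 face + 33 body), 2D coordinates
demo = np.random.rand(30, 66, 2)
print(segment_features(demo).shape)  # (396,)
```

Because each segment yields the same vector length regardless of duration, the output can be fed directly to a standard tabular classifier.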
Model Training & Explainability
- The feature vectors and their corresponding sentiment labels were fed to an XGBoost gradient‑boosted tree classifier.
- Hyperparameters were tuned via cross‑validation.
- SHAP (SHapley Additive exPlanations) values were used to rank feature importance and provide human‑readable explanations.
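The training-and-ranking step can be sketched with scikit-learn's gradient-boosted trees standing in for XGBoost, and impurity-based feature importances standing in for SHAP values (SHAP additionally provides per-sample attributions). The random feature matrix below is a placeholder for the real motion descriptors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))    # placeholder for motion-feature vectors
y = rng.integers(0, 3, size=300)  # 0=negative, 1=neutral, 2=positive

# Stand-in for the paper's XGBoost classifier.
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X, y)

# Rank features by global importance (SHAP would refine this per sample).
ranking = np.argsort(clf.feature_importances_)[::-1]
print(ranking[:5])
```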
Evaluation
- Balanced accuracy (average of per‑class recall) was the primary metric, mitigating class‑imbalance effects.
- A 5‑fold cross‑validation scheme ensured robust performance estimates.
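Balanced accuracy, as the average of per-class recalls, can be computed with scikit-learn on a toy imbalanced example:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0, 0, 0, 1, 1, 2])  # imbalanced toy labels
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Per-class recalls: class 0 -> 2/3, class 1 -> 2/2, class 2 -> 1/1.
# Balanced accuracy = (2/3 + 1 + 1) / 3 ≈ 0.889.
print(round(balanced_accuracy_score(y_true, y_pred), 3))  # 0.889
```

Passing `scoring="balanced_accuracy"` to scikit-learn's `cross_val_score` with `cv=5` reproduces the summary's evaluation setup.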
Results & Findings
| Metric | Value |
|---|---|
| Balanced Accuracy (overall) | 0.631 |
| Per‑class recall (neg / neu / pos) | 0.62 / 0.64 / 0.63 |
| Krippendorff’s α (text labels) | 0.781 |
- Feature importance: The top‑10 contributors include eyebrow raise amplitude, mouth width, hip lateral movement, elbow flexion speed, and shoulder rotation.
- Face vs. Body: Contrary to the common assumption that facial expression dominates sentiment in sign language, body motion accounts for roughly 45 % of the predictive power.
- Error patterns: Misclassifications often occur on subtle neutral passages where both facial and body cues are minimal, suggesting a need for richer contextual modeling (e.g., handshape semantics).
Practical Implications
- Inclusive sentiment‑aware applications: Chatbots, virtual assistants, or content‑moderation tools can now interpret emotional tone directly from sign‑language video streams, making them accessible to Deaf users.
- Automatic captioning & summarization: Sentiment tags can enrich sign‑language video transcripts, enabling emotion‑aware search and recommendation engines for educational or entertainment content.
- Human‑computer interaction (HCI): Developers of AR/VR avatars that communicate via sign language can embed the model to adjust avatar expressiveness in real time, improving user experience.
- Cross‑modal research: The dataset and pipeline provide a benchmark for multimodal sentiment analysis, encouraging further work on other sign languages or gesture‑rich domains (e.g., dance, sports).
Limitations & Future Work
- Dataset scope: The current collection is limited to fairy‑tale narratives in German; broader domains (news, casual conversation) and other sign languages remain unexplored.
- Temporal modeling: The XGBoost approach treats each segment as a static feature vector; incorporating sequence models (e.g., Transformers or LSTMs) could capture longer‑range dynamics.
- Label granularity: A three‑level valence scheme may be too coarse for nuanced emotions (e.g., surprise, disgust). Future work could adopt a richer affective taxonomy or continuous valence‑arousal scales.
- Real‑time feasibility: While MediaPipe runs efficiently, the full pipeline (feature extraction + XGBoost inference) still needs profiling for low‑latency deployment on edge devices.
Bottom line: By showing that both facial and bodily movements are essential for sentiment detection in sign language, this research paves the way for more emotionally intelligent, Deaf‑friendly AI systems. Developers interested in building inclusive media platforms or multimodal AI can start experimenting with the released dataset and codebase right away.
Authors
- Fabrizio Nunnari
- Siddhant Jain
- Patrick Gebhard
Paper Information
- arXiv ID: 2604.16138v1
- Categories: cs.CL, cs.LG
- Published: April 17, 2026