[Paper] X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework
Source: arXiv - 2601.03194v1
Overview
The paper introduces X‑MuTeST, a new multilingual benchmark and training framework that tackles two persistent problems in hate‑speech detection:
- Achieving high accuracy on low‑resource Indic languages (Hindi and Telugu)
- Providing human‑readable explanations for each prediction
By combining large‑language‑model (LLM) reasoning with rationale‑guided attention supervision, the authors show that models can become both more accurate and more transparent.
Key Contributions
- Multilingual rationale dataset – token‑level human‑annotated rationales for 6,004 Hindi, 4,492 Telugu, and 6,334 English posts, the first such resource for Indic hate‑speech detection.
- X‑MuTeST explainability framework – computes the impact of unigrams, bigrams, and trigrams on model confidence and merges these “perturbation‑based” explanations with LLM‑generated rationales.
- Explainability‑guided training – incorporates human rationales directly into the loss function, nudging the model’s attention toward the words humans deem important.
- Comprehensive evaluation – reports both plausibility (Token‑F1, IOU‑F1) and faithfulness (Comprehensiveness, Sufficiency) metrics, demonstrating gains over baseline classifiers.
- Open‑source release – dataset, code, and trained checkpoints are publicly available, encouraging reproducibility and downstream research.
Methodology
- Data collection & annotation – Social‑media posts in English, Hindi, and Telugu were labeled for hate speech. Annotators also highlighted the exact tokens that justified each label, creating a token‑level rationale set.
- Baseline classifier – A standard transformer (e.g., BERT‑base) fine‑tuned on the three languages serves as the starting point.
- Perturbation‑based X‑MuTeST explanations – For every input, the model's prediction probability is recomputed after masking each unigram, bigram, and trigram; the drop in confidence indicates how "important" that n‑gram is (see the perturbation sketch after this list).
- LLM‑consulted rationales – An external LLM (e.g., GPT‑4) is prompted to generate a textual justification for the prediction, and the tokens it highlights are extracted (see the rationale‑extraction sketch after this list).
- Union of explanations – The final explanation set is the union of the perturbation‑based tokens and the LLM‑derived tokens.
- Explainability‑guided training – A secondary loss term penalizes divergence between the model's attention distribution and the union explanation, effectively teaching the model to "look" at the right words (a minimal loss sketch follows the list).
- Evaluation – Plausibility metrics compare model explanations against human rationales; faithfulness metrics assess whether removing the highlighted tokens actually changes the prediction (see the metrics sketch after this list).
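As a concrete illustration of the masking step, here is a minimal sketch (not the authors' code). It assumes a hypothetical `predict_proba(tokens)` callable that returns the model's confidence in the predicted class, and a generic `[MASK]` placeholder token; the actual tokenizer and masking strategy in the paper may differ.

```python
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"  # assumed placeholder; the real mask token depends on the tokenizer


def ngram_importance(
    tokens: List[str],
    predict_proba: Callable[[List[str]], float],
    max_n: int = 3,
) -> Dict[Tuple[int, int], float]:
    """Score each n-gram (n <= max_n) by the confidence drop observed
    when its tokens are replaced with the mask token."""
    base = predict_proba(tokens)
    scores: Dict[Tuple[int, int], float] = {}
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            masked = tokens.copy()
            masked[start:start + n] = [MASK] * n
            # Importance = how much confidence the model loses without this span.
            scores[(start, start + n)] = base - predict_proba(masked)
    return scores


def top_spans(scores: Dict[Tuple[int, int], float], k: int = 5) -> List[Tuple[int, int]]:
    """Return the k highest-impact spans, e.g. to build the explanation token set."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Replacing spans with a mask token (rather than deleting them) keeps the sequence length fixed, which is one common design choice for perturbation-based attribution.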
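The LLM‑consulted step and the union of the two explanation sources can be sketched as follows. The prompt wording, the `ask_llm` callable, and the comma‑separated reply format are assumptions made for illustration; the paper's actual prompting setup is not reproduced here.

```python
from typing import Callable, Set

# Hypothetical prompt template; the paper's prompt design may differ.
PROMPT = (
    "The following post was classified as '{label}'.\n"
    "Post: {post}\n"
    "List, comma-separated, the exact words from the post that justify this label."
)


def llm_rationale_tokens(post: str, label: str, ask_llm: Callable[[str], str]) -> Set[str]:
    """Ask an external LLM for a justification and keep only words that actually
    occur in the post, guarding against hallucinated tokens."""
    reply = ask_llm(PROMPT.format(label=label, post=post))
    candidates = {w.strip().lower() for w in reply.split(",")}
    return {w for w in post.lower().split() if w in candidates}


def union_explanation(perturbation_tokens: Set[str], llm_tokens: Set[str]) -> Set[str]:
    """Final explanation = union of perturbation-based and LLM-derived tokens."""
    return perturbation_tokens | llm_tokens
```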
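The explainability‑guided loss can be sketched in PyTorch as below. This is a minimal sketch under stated assumptions: the exact divergence, the attention layer it is applied to, and the weighting factor `lambda_expl` are not specified in this summary, so a KL term between a normalized attention distribution and the normalized rationale mask is used purely for illustration.

```python
import torch
import torch.nn.functional as F


def explanation_loss(attention: torch.Tensor, rationale_mask: torch.Tensor) -> torch.Tensor:
    """KL-style penalty pushing the token-attention distribution toward the
    distribution induced by the rationale mask (1 for highlighted tokens).

    attention:      (batch, seq_len) non-negative attention weights
    rationale_mask: (batch, seq_len) binary mask of explanation tokens
    """
    attn_dist = attention / attention.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    target = rationale_mask.float() + 1e-12           # avoid all-zero rows / log(0)
    target = target / target.sum(dim=-1, keepdim=True)
    return F.kl_div(attn_dist.clamp_min(1e-12).log(), target, reduction="batchmean")


def total_loss(logits, labels, attention, rationale_mask, lambda_expl: float = 0.5):
    """Standard classification loss plus the attention-alignment term."""
    return F.cross_entropy(logits, labels) + lambda_expl * explanation_loss(
        attention, rationale_mask
    )
```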
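The evaluation metrics can be sketched as below, reusing the hypothetical `predict_proba` from the perturbation sketch. Token‑F1 follows its standard definition; comprehensiveness and sufficiency follow the common ERASER‑style formulations (confidence drop when rationale tokens are removed vs. kept), which may differ in detail from the paper's exact implementation.

```python
from typing import Callable, List, Set


def token_f1(predicted: Set[int], gold: Set[int]) -> float:
    """Plausibility: token-level F1 between predicted and human rationale indices."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def comprehensiveness(tokens: List[str], rationale: Set[int],
                      predict_proba: Callable[[List[str]], float]) -> float:
    """Faithfulness: confidence drop when the rationale tokens are removed."""
    without = [t for i, t in enumerate(tokens) if i not in rationale]
    return predict_proba(tokens) - predict_proba(without)


def sufficiency(tokens: List[str], rationale: Set[int],
                predict_proba: Callable[[List[str]], float]) -> float:
    """Faithfulness: confidence drop when only the rationale tokens are kept."""
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return predict_proba(tokens) - predict_proba(only)
```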
Results & Findings
| Language | Baseline F1 | X‑MuTeST‑enhanced F1 | Token‑F1 (plausibility, baseline → enhanced) | Comprehensiveness (faithfulness, baseline → enhanced; lower = better) |
|---|---|---|---|---|
| English | 84.2% | 87.6% | 68.4% → 74.9% | 0.42 → 0.31 |
| Hindi | 78.9% | 82.3% | 61.1% → 68.2% | 0.48 → 0.35 |
| Telugu | 76.5% | 80.1% | 59.3% → 66.7% | 0.51 → 0.36 |
- Accuracy boost: Adding human rationales and the X‑MuTeST explanation loss consistently improves macro‑F1 across all three languages (≈3–4 points).
- Better explanations: Token‑F1 and IOU‑F1 rise by 5–7 points, indicating that the model’s highlighted words align more closely with human judgment.
- Higher faithfulness: Lower Comprehensiveness and Sufficiency scores show that the explanations are not just plausible but actually drive the model’s decisions.
Practical Implications
- Content‑moderation pipelines can adopt X‑MuTeST‑trained models to flag hate speech and surface the exact words responsible, giving moderators a quick sanity check and reducing false positives.
- Regulatory compliance (e.g., GDPR “right to explanation”) becomes easier when the system can point to token‑level rationales that are both human‑validated and LLM‑backed.
- Cross‑lingual deployment: Since the framework works out‑of‑the‑box for Hindi and Telugu, platforms targeting emerging markets can roll out more reliable moderation without building language‑specific models from scratch.
- Developer tooling: The open‑source code includes utilities to generate explanations on‑the‑fly, enabling integration into IDE plugins, chatbot safety layers, or real‑time comment filters.
- Transfer learning: The rationale‑aware loss can be grafted onto other text‑classification tasks (e.g., toxic comment detection, misinformation labeling) to improve interpretability without sacrificing performance.
Limitations & Future Work
- Rationale quality variance: Human annotators sometimes disagreed on which tokens were “responsible,” leading to noisy supervision; the paper reports an inter‑annotator agreement of ~0.71 (Cohen’s κ).
- Scalability of perturbations: Masking every unigram, bigram, and trigram requires a separate forward pass per masked variant, so total cost grows roughly as O(N²) in sequence length (O(N) variants, each a full pass over the post) and becomes expensive for long posts; approximate sampling strategies are suggested but not fully explored (a simple sampling sketch follows this list).
- LLM dependency: The quality of LLM‑generated rationales hinges on prompt design and model size; cheaper LLMs may produce weaker explanations.
- Domain shift: The dataset focuses on social‑media comments; performance on news articles, forums, or code‑review comments remains untested.
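The sampling idea mentioned above can be illustrated with a short sketch; the paper does not commit to a specific strategy, so the fixed-budget random sampling below is purely a generic example, again assuming a hypothetical `predict_proba` callable.

```python
import random
from typing import Callable, Dict, List, Tuple


def sampled_ngram_importance(
    tokens: List[str],
    predict_proba: Callable[[List[str]], float],
    max_n: int = 3,
    budget: int = 64,
    seed: int = 0,
) -> Dict[Tuple[int, int], float]:
    """Estimate n-gram importance with a fixed budget of forward passes."""
    spans = [(s, s + n) for n in range(1, max_n + 1)
             for s in range(len(tokens) - n + 1)]
    random.Random(seed).shuffle(spans)
    base = predict_proba(tokens)
    scores: Dict[Tuple[int, int], float] = {}
    for start, end in spans[:budget]:
        masked = tokens[:start] + ["[MASK]"] * (end - start) + tokens[end:]
        scores[(start, end)] = base - predict_proba(masked)
    return scores
```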
Future directions include:
- Leveraging lightweight attribution methods (e.g., Integrated Gradients) to replace exhaustive n‑gram masking.
- Expanding the benchmark to more low‑resource languages.
- Investigating active‑learning loops where model explanations solicit further human feedback.
Authors
- Mohammad Zia Ur Rehman
- Sai Kartheek Reddy Kasu
- Shashivardhan Reddy Koppula
- Sai Rithwik Reddy Chirra
- Shwetank Shekhar Singh
- Nagendra Kumar
Paper Information
- arXiv ID: 2601.03194v1
- Categories: cs.CL
- Published: January 6, 2026