[Paper] X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

Published: January 6, 2026 at 12:16 PM EST
4 min read

Source: arXiv - 2601.03194v1

Overview

The paper introduces X‑MuTeST, a new multilingual benchmark and training framework that tackles two persistent problems in hate‑speech detection:

  1. Achieving high accuracy on low‑resource Indic languages (Hindi and Telugu)
  2. Providing human‑readable explanations for each prediction

By combining large‑language‑model (LLM) reasoning with perturbation‑based explanations and rationale‑guided training, the authors show that models can become both more accurate and more transparent.

Key Contributions

  • Multilingual rationale dataset – token‑level human‑annotated rationales for 6,004 Hindi, 4,492 Telugu, and 6,334 English posts, the first such resource for Indic hate‑speech detection.
  • X‑MuTeST explainability framework – computes the impact of unigrams, bigrams, and trigrams on model confidence and merges these “perturbation‑based” explanations with LLM‑generated rationales.
  • Explainability‑guided training – incorporates human rationales directly into the loss function, nudging the model’s attention toward the words humans deem important.
  • Comprehensive evaluation – reports both plausibility (Token‑F1, IOU‑F1) and faithfulness (Comprehensiveness, Sufficiency) metrics, demonstrating gains over baseline classifiers.
  • Open‑source release – dataset, code, and trained checkpoints are publicly available, encouraging reproducibility and downstream research.

Methodology

  1. Data collection & annotation – Social‑media posts in English, Hindi, and Telugu were labeled for hate speech. Annotators also highlighted the exact tokens that justified each label, creating a token‑level rationale set.
  2. Baseline classifier – A standard transformer (e.g., BERT‑base) fine‑tuned on the three languages serves as the starting point.
  3. Perturbation‑based X‑MuTeST explanations – For every input, the model’s prediction probability is recomputed after masking each unigram, bigram, and trigram. The drop in confidence indicates how “important” that n‑gram is (a minimal sketch follows this list).
  4. LLM‑consulted rationales – An external LLM (e.g., GPT‑4) is prompted to generate a textual justification for the prediction, and the tokens the LLM highlights are extracted (an illustrative prompt sketch follows this list).
  5. Union of explanations – The final explanation set is the union of the perturbation‑based tokens and the LLM‑derived tokens.
  6. Explainability‑guided training – A secondary loss term penalizes divergence between the model’s attention distribution and the union explanation, effectively teaching the model to “look” at the right words (see the loss sketch after this list).
  7. Evaluation – Plausibility metrics compare model explanations against human rationales; faithfulness metrics assess whether removing highlighted tokens truly changes the prediction.
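
To make steps 3 and 5 concrete, here is a minimal sketch of perturbation‑based n‑gram scoring, assuming a HuggingFace‑style sequence classifier. The checkpoint name, the hate‑class index, whitespace word splitting, and masking via the tokenizer’s mask token are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch of step 3 (perturbation-based n-gram importance).
# Assumptions: a HuggingFace-style classifier, label index 1 = "hate",
# and simple whitespace word splitting; the paper's exact setup may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder baseline checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def hate_probability(text: str) -> float:
    """Return the model's probability for the (assumed) hate class."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def ngram_importance(text: str, max_n: int = 3) -> dict:
    """Score each n-gram by the confidence drop observed when it is masked."""
    words = text.split()
    base = hate_probability(text)
    scores = {}
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            masked = words[:i] + [tokenizer.mask_token] * n + words[i + n:]
            scores[" ".join(words[i:i + n])] = base - hate_probability(" ".join(masked))
    return scores
```

The tokens of the highest‑scoring n‑grams form the perturbation side of the explanation, which step 5 merges with the LLM‑derived tokens.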
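
Step 4 can be implemented with any chat‑style LLM API. The prompt wording and the model name below are assumptions for illustration; the authors’ exact prompt may differ.

```python
# Hedged sketch of step 4 (LLM-consulted rationales). The prompt wording and
# the model name are illustrative assumptions, not the paper's exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_rationale_tokens(post: str, label: str) -> list:
    """Ask an external LLM which words in the post justify the predicted label."""
    prompt = (
        f"The following post was classified as '{label}'.\n"
        f"Post: {post}\n"
        "Return, as a comma-separated list, the exact words from the post that "
        "most justify this classification. Return only the words."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content
    return [w.strip() for w in answer.split(",") if w.strip()]

# Step 5: the final explanation is simply the union of both token sets, e.g.
# explanation = {w for ng in top_ngrams for w in ng.split()} | set(llm_tokens)
```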
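
For step 6, one plausible realization of the explainability‑guided loss is a KL penalty that pulls the model’s token‑level attention toward a distribution built from the union explanation; the specific divergence, the uniform target, and the weighting are assumptions here, not necessarily the paper’s choice.

```python
# One plausible realization of step 6 (explainability-guided training):
# a KL penalty pulling the model's attention toward the union explanation.
import torch
import torch.nn.functional as F

def rationale_attention_loss(attention, rationale_mask, token_mask):
    """
    attention:      (batch, seq_len) attention weights over tokens
                    (e.g., [CLS] attention averaged over heads).
    rationale_mask: (batch, seq_len) 1 where a token is in the union explanation.
    token_mask:     (batch, seq_len) 1 for real tokens, 0 for padding.
    """
    eps = 1e-8
    # Target: uniform distribution over explanation tokens.
    target = rationale_mask * token_mask + eps
    target = target / target.sum(dim=-1, keepdim=True)
    # Renormalize attention over real (non-padding) tokens.
    attn = attention * token_mask + eps
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return F.kl_div(attn.log(), target, reduction="batchmean")

# Training objective (lambda_expl is a tunable weight):
# loss = cross_entropy(logits, labels) + lambda_expl * rationale_attention_loss(...)
```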

Results & Findings

| Language | Baseline F1 | X‑MuTeST‑enhanced F1 | Token‑F1 (plausibility) | Comprehensiveness (faithfulness; lower = better) |
|----------|-------------|----------------------|-------------------------|--------------------------------------------------|
| English  | 84.2%       | 87.6%                | 68.4% → 74.9%           | 0.42 → 0.31                                      |
| Hindi    | 78.9%       | 82.3%                | 61.1% → 68.2%           | 0.48 → 0.35                                      |
| Telugu   | 76.5%       | 80.1%                | 59.3% → 66.7%           | 0.51 → 0.36                                      |

  • Accuracy boost: Adding human rationales and the X‑MuTeST explanation loss consistently improves macro‑F1 across all three languages (≈3–4 points).
  • Better explanations: Token‑F1 and IOU‑F1 rise by 5–7 points, indicating that the model’s highlighted words align more closely with human judgment.
  • Higher faithfulness: Lower Comprehensiveness and Sufficiency scores show that the explanations are not just plausible but actually drive the model’s decisions.

Practical Implications

  • Content‑moderation pipelines can adopt X‑MuTeST‑trained models to flag hate speech and surface the exact words responsible, giving moderators a quick sanity check and reducing false positives.
  • Regulatory compliance (e.g., GDPR “right to explanation”) becomes easier when the system can point to token‑level rationales that are both human‑validated and LLM‑backed.
  • Cross‑lingual deployment: Since the framework works out‑of‑the‑box for Hindi and Telugu, platforms targeting emerging markets can roll out more reliable moderation without building language‑specific models from scratch.
  • Developer tooling: The open‑source code includes utilities to generate explanations on‑the‑fly, enabling integration into IDE plugins, chatbot safety layers, or real‑time comment filters.
  • Transfer learning: The rationale‑aware loss can be grafted onto other text‑classification tasks (e.g., toxic comment detection, misinformation labeling) to improve interpretability without sacrificing performance.

Limitations & Future Work

  • Rationale quality variance: Human annotators sometimes disagreed on which tokens were “responsible,” leading to noisy supervision; the paper reports an inter‑annotator agreement of ~0.71 (Cohen’s κ).
  • Scalability of perturbations: every masked unigram, bigram, and trigram requires its own forward pass, so explanation cost grows with post length and can be expensive for long posts; approximate sampling strategies are suggested but not fully explored.
  • LLM dependency: The quality of LLM‑generated rationales hinges on prompt design and model size; cheaper LLMs may produce weaker explanations.
  • Domain shift: The dataset focuses on social‑media comments; performance on news articles, forums, or code‑review comments remains untested.

Future directions include:

  1. Leveraging lightweight attribution methods (e.g., Integrated Gradients) to replace exhaustive n‑gram masking (see the sketch after this list).
  2. Expanding the benchmark to more low‑resource languages.
  3. Investigating active‑learning loops where model explanations solicit further human feedback.
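
As a pointer for the first direction, a gradient‑based method such as Integrated Gradients replaces the many masked forward passes with a fixed number of interpolation steps per post. The Captum‑based sketch below assumes a BERT‑style baseline like the one in the methodology; the checkpoint name, the `model.bert.embeddings` layer attribute, the all‑padding baseline, and the hate‑class index are assumptions for illustration.

```python
# Sketch of future direction 1: Integrated Gradients via Captum instead of
# exhaustive n-gram masking. Checkpoint, layer name, and class index assumed.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder baseline checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def hate_prob_fn(input_ids, attention_mask):
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return torch.softmax(logits, dim=-1)[:, 1]  # assumed hate-class index

lig = LayerIntegratedGradients(hate_prob_fn, model.bert.embeddings)

def token_attributions(text: str):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    # All-padding baseline (a common simplification; keeping [CLS]/[SEP] is a refinement).
    baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)
    attributions = lig.attribute(
        inputs=enc["input_ids"],
        baselines=baseline,
        additional_forward_args=(enc["attention_mask"],),
        n_steps=50,
    )
    scores = attributions.sum(dim=-1).squeeze(0)  # one score per subword token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, scores.tolist()))
```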

Authors

  • Mohammad Zia Ur Rehman
  • Sai Kartheek Reddy Kasu
  • Shashivardhan Reddy Koppula
  • Sai Rithwik Reddy Chirra
  • Shwetank Shekhar Singh
  • Nagendra Kumar

Paper Information

  • arXiv ID: 2601.03194v1
  • Categories: cs.CL
  • Published: January 6, 2026