[Paper] Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection
Source: arXiv - 2605.31563v1
Overview
The paper Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection examines an often overlooked source of noise in NLP datasets: annotators disagree not only on labels but also on the token‑level explanations (rationales) they provide. By systematically re‑implementing a suite of classification and explanation models, the authors show that traditional evaluation pipelines (majority‑vote labels plus hard rationales) discard the annotator disagreement that is inherent to subjective tasks such as hate‑speech detection. Their unified framework indicates that softer, probabilistic representations of both labels and rationales lead to better classification performance and more informative explanations.
Key Contributions
- Unified evaluation protocol that brings together diverse model architectures, loss functions, and existing metrics under a single, reproducible pipeline.
- Three‑dimensional explainability taxonomy (plausibility, faithfulness, complexity) applied consistently across models.
- Systematic comparison of label and rationale representations: hard (binary), intermediate (thresholded), and soft (probabilistic) formats.
- Empirical evidence that soft representations improve both classification accuracy and explanation quality on hate‑speech detection benchmarks.
- Open‑source code and reproducibility package enabling developers to plug in their own models or datasets.
Methodology
- Data & Rationales – The authors use publicly available hate‑speech corpora that include token‑level human rationales (highlighted words that justify the label).
- Representation Spaces
  - Hard: binary (e.g., “hate” vs. “not hate”; a token is either part of the rationale or not).
  - Intermediate: thresholded scores (e.g., “likely rationale”).
  - Soft: full probability distributions over labels and rationales (capturing annotator uncertainty).
- Model Families – They re‑implement several state‑of‑the‑art classifiers (BERT‑based, CNN, LSTM) and explanation generators (attention‑based, gradient‑based, rationale‑extraction models).
- Loss Functions – Both standard cross‑entropy for hard labels and KL‑divergence‑based losses for soft targets are employed, sometimes jointly optimizing classification and rationale prediction.
- Metrics
  - Classification: predictive (accuracy, F1) and distributional (expected calibration error, KL divergence to label distribution).
  - Explainability:
    - Plausibility – overlap with human rationales (e.g., token‑level F1).
    - Faithfulness – how much the model’s prediction changes when rationales are perturbed.
    - Complexity – length/size of the generated rationale (shorter is often preferred).
- Evaluation Protocol – Every model is trained and tested across all nine combinations of label/rationale representation (hard‑hard, hard‑soft, … soft‑soft) and evaluated on the full metric suite.
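To make the representation spaces and the soft-target loss described above more concrete, here is a minimal sketch in plain Python. It is not the authors' code. The function names, the `threshold` value, and the toy annotations are all hypothetical, and the reading of "intermediate" as a looser cut-off than majority vote is an assumption, since this summary does not spell out the exact rule. Only rationales get an intermediate form here for the same reason.

```python
import math
from collections import Counter

def label_representations(votes, num_classes):
    """Build hard and soft label targets from per-annotator class votes."""
    counts = Counter(votes)
    # Soft: empirical distribution of annotator votes over classes.
    soft = [counts.get(c, 0) / len(votes) for c in range(num_classes)]
    # Hard: one-hot majority vote (ties go to the lowest class index).
    majority = max(range(num_classes), key=lambda c: counts.get(c, 0))
    hard = [1.0 if c == majority else 0.0 for c in range(num_classes)]
    return hard, soft

def rationale_representations(masks, threshold=0.3):
    """Build rationale targets from binary highlight masks (one row per annotator)."""
    num_annotators = len(masks)
    # Soft: fraction of annotators who highlighted each token.
    soft = [sum(column) / num_annotators for column in zip(*masks)]
    # Hard: keep a token only if a strict majority highlighted it.
    hard = [1.0 if s > 0.5 else 0.0 for s in soft]
    # Intermediate (assumed rule): keep a token if at least `threshold` of annotators did.
    intermediate = [1.0 if s >= threshold else 0.0 for s in soft]
    return hard, intermediate, soft

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping classes where the target mass is zero."""
    return sum(t * math.log(t / (p + eps)) for t, p in zip(target, predicted) if t > 0)

# Toy annotations (made up): three annotators, two classes, four tokens.
hard_label, soft_label = label_representations([0, 0, 1], num_classes=2)
hard_rat, mid_rat, soft_rat = rationale_representations(
    [[1, 0, 0, 1],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
)

# An overconfident prediction is penalized more against the soft target
# than against the one-hot majority label.
model_probs = [0.9, 0.1]
loss_vs_soft = kl_divergence(soft_label, model_probs)
loss_vs_hard = kl_divergence(hard_label, model_probs)
```

With a one-hot target the KL term reduces to standard cross-entropy, which is why the two losses can share one training loop. Crossing the label and rationale formats produced this way yields the representation combinations listed in the evaluation protocol.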
Results & Findings
| Representation (label‑rationale) | Classification (F1) | Plausibility (Token‑F1) | Faithfulness (Drop‑Score) |
|---|---|---|---|
| Hard‑Hard | 71.2 | 45.8 | 12.3% |
| Hard‑Soft | 73.5 | 52.1 | 15.6% |
| Soft‑Soft | 78.9 | 61.4 | 22.8% |
- Soft label and rationale representations outperform hard ones on classification, plausibility, and faithfulness in the table above, suggesting they capture annotator uncertainty better.
- Plausibility improves when models are trained to predict soft rationales, suggesting that learning a distribution over explanations aligns more closely with human reasoning.
- Faithfulness gains (larger performance drop when rationales are removed) show that soft rationales are more integral to the model’s decision process.
- Complexity remains comparable; soft rationales are not significantly longer, which suggests that probabilistic explanations need not be more verbose.
Overall, the study demonstrates that evaluation pipelines that ignore rationale variability can misjudge both model quality and fairness, especially for subjective tasks.
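The three explainability dimensions reported above can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's implementation: the summary does not give the exact formula behind the Drop‑Score column, so `drop_score` below simply measures how much a predicted probability falls when rationale tokens are deleted. The classifier `toy_predict_proba`, the `TRIGGER` placeholder token, and all masks are hypothetical.

```python
def token_f1(predicted_mask, human_mask):
    """Plausibility: token-level F1 between a model rationale and a human rationale."""
    predicted = {i for i, m in enumerate(predicted_mask) if m}
    human = {i for i, m in enumerate(human_mask) if m}
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)

def drop_score(predict_proba, tokens, rationale_mask):
    """Faithfulness (assumed form): probability drop after removing rationale tokens."""
    kept = [t for t, m in zip(tokens, rationale_mask) if not m]
    return predict_proba(tokens) - predict_proba(kept)

def rationale_length(rationale_mask):
    """Complexity: fraction of tokens selected as rationale (lower means shorter)."""
    return sum(1 for m in rationale_mask if m) / len(rationale_mask)

def toy_predict_proba(tokens):
    """Hypothetical stand-in classifier: confident only when a placeholder token appears."""
    return 0.75 if "TRIGGER" in tokens else 0.25

tokens = ["this", "is", "a", "TRIGGER"]
model_rationale = [0, 0, 1, 1]   # tokens the model marks as its explanation
human_rationale = [0, 0, 0, 1]   # tokens a human annotator highlighted

plausibility = token_f1(model_rationale, human_rationale)
faithfulness = drop_score(toy_predict_proba, tokens, model_rationale)
complexity = rationale_length(model_rationale)
```

A larger drop means the highlighted tokens mattered more to the prediction, which is the sense in which the findings above read a higher Drop‑Score as better faithfulness.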
Practical Implications
- Better moderation tools: Platforms can deploy classifiers that output calibrated probabilities and soft rationales, giving moderators a confidence score plus a nuanced explanation (e.g., “0.73 probability of hate speech, with 0.6 weight on the word ‘kill’ and 0.4 on the surrounding context”).
- Human‑in‑the‑loop workflows: Soft rationales enable annotators to see where the model is uncertain, facilitating quicker verification or correction.
- Bias detection: By examining the distribution of rationales across demographic groups, engineers can spot systematic over‑reliance on certain tokens that may reflect hidden biases.
- Model selection: The unified metric suite lets teams compare models not just on accuracy but also on how trustworthy and interpretable their explanations are, which matters for compliance (e.g., GDPR “right to explanation”).
- Dataset design: The findings encourage dataset creators to collect soft rationales (e.g., multiple annotator highlights with confidence scores) rather than a single binary mask, enriching downstream training.
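As one possible way to act on the bias-detection idea above, the sketch below averages soft rationale weights per token within each group, so that tokens receiving unusually high weight in one group stand out. Everything here is hypothetical: the function name, the group labels, the placeholder tokens, and the weights are made up for illustration, and the paper summary does not describe such a procedure.

```python
from collections import defaultdict

def mean_token_weight_by_group(examples):
    """Average soft rationale weight per token, computed separately for each group.

    `examples` is an iterable of (group, tokens, weights) triples, where
    `weights` holds one soft rationale score per token.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for group, tokens, weights in examples:
        for token, weight in zip(tokens, weights):
            sums[group][token] += weight
            counts[group][token] += 1
    return {
        group: {token: sums[group][token] / counts[group][token] for token in sums[group]}
        for group in sums
    }

# Made-up data: the same placeholder token gets very different weight per group.
examples = [
    ("group_a", ["tok_x", "tok_y"], [0.5, 0.5]),
    ("group_a", ["tok_x"], [1.0]),
    ("group_b", ["tok_x", "tok_z"], [0.25, 0.75]),
]
weights_by_group = mean_token_weight_by_group(examples)

# Gap in average weight for a token shared by both groups.
gap_tok_x = weights_by_group["group_a"]["tok_x"] - weights_by_group["group_b"]["tok_x"]
```

A large gap is only a signal for manual review, not proof of bias, since token frequency and context differ across groups.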
Limitations & Future Work
- Domain focus: Experiments are limited to English hate‑speech datasets; results may differ for other languages or domains (e.g., misinformation).
- Rationale granularity: Token‑level rationales ignore higher‑level discourse cues (sentence or paragraph importance). Extending the framework to hierarchical explanations is an open avenue.
- Scalability: Soft rationale training incurs higher computational cost due to additional loss terms and larger output spaces. Optimizing for efficiency remains a challenge.
- User studies: The paper evaluates plausibility and faithfulness automatically; real‑world user studies with moderators would solidify claims about practical usefulness.
Bottom line: By embracing the natural disagreement in both labels and explanations, developers can build hate‑speech detectors that are not only more accurate but also more transparent and trustworthy. The paper’s unified framework offers a ready‑to‑use blueprint for anyone looking to upgrade their NLP pipelines with richer, probabilistic reasoning.
Authors
- Benedetta Muscato
- Beiduo Chen
- Gizem Gezici
- Barbara Plank
- Fosca Giannotti
Paper Information
- arXiv ID: 2605.31563v1
- Categories: cs.CL
- Published: May 29, 2026