[Paper] Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

Published: May 29, 2026 at 01:29 PM EDT

Source: arXiv - 2605.31563v1

Overview

The paper Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection examines an often-overlooked source of noise in NLP datasets: disagreement not only in labels but also in the token‑level explanations (rationales) that human annotators provide. By systematically re‑implementing a suite of classification and explanation models, the authors show that traditional evaluation pipelines (majority‑vote labels plus hard rationales) discard the annotator disagreement that arises in subjective tasks such as hate‑speech detection. Their unified framework indicates that softer, probabilistic representations of both labels and rationales improve both classification accuracy and explanation quality.

Key Contributions

Unified evaluation protocol that brings together diverse model architectures, loss functions, and existing metrics under a single, reproducible pipeline.
Three‑dimensional explainability taxonomy (plausibility, faithfulness, complexity) applied consistently across models.
Systematic comparison of label and rationale representations: hard (binary), intermediate (thresholded), and soft (probabilistic) formats.
Empirical evidence that soft representations improve both classification accuracy and explanation quality on hate‑speech detection benchmarks.
Open‑source code and reproducibility package enabling developers to plug in their own models or datasets.

Methodology

Data & Rationales – The authors use publicly available hate‑speech corpora that include token‑level human rationales (highlighted words that justify the label).
Representation Spaces
- Hard: binary (e.g., “hate” vs. “not hate”, token is either part of the rationale or not).
- Intermediate: thresholded scores (e.g., “likely rationale”).
- Soft: full probability distributions over labels and rationales (capturing annotator uncertainty).
Model Families – They re‑implement several state‑of‑the‑art classifiers (BERT‑based, CNN, LSTM) and explanation generators (attention‑based, gradient‑based, rationale‑extraction models).
Loss Functions – Both standard cross‑entropy for hard labels and KL‑divergence‑based losses for soft targets are employed, sometimes jointly optimizing classification and rationale prediction.
Metrics
- Classification: predictive (accuracy, F1) and distributional (expected calibration error, KL divergence to label distribution).
- Explainability:
  - Plausibility – overlap with human rationales (e.g., token‑level F1).
  - Faithfulness – how much the model’s prediction changes when rationales are perturbed.
  - Complexity – length/size of the generated rationale (shorter is often preferred).
Evaluation Protocol – Every model is trained and tested across all nine combinations of label/rationale representation (hard‑hard, hard‑soft, … soft‑soft) and evaluated on the full metric suite.
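
The representation spaces and the KL‑divergence loss described above can be illustrated with a minimal sketch. It assumes each example comes with one label and one binary token mask per annotator; the function names, the 0.5 threshold, and the toy data are hypothetical choices for illustration and are not taken from the paper's code. This summary does not specify how the intermediate format is thresholded, so the threshold is left as a parameter.

```python
import math
from collections import Counter


def soft_labels(votes, classes):
    """Soft label: relative frequency of each class among annotator votes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def hard_label(votes, classes):
    """Hard label: majority vote (ties go to the class listed first)."""
    dist = soft_labels(votes, classes)
    return classes[dist.index(max(dist))]


def soft_rationale(masks):
    """Soft rationale: per-token fraction of annotators who highlighted it."""
    n_annotators = len(masks)
    return [sum(column) / n_annotators for column in zip(*masks)]


def threshold_rationale(soft, threshold=0.5):
    """Binarize soft token scores. The threshold is an assumed value."""
    return [1 if score >= threshold else 0 for score in soft]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0  # terms with zero target mass contribute nothing
    )


# Toy example: three annotators, three tokens (hypothetical data).
classes = ["hate", "not_hate"]
votes = ["hate", "hate", "not_hate"]
masks = [[0, 1, 1], [0, 1, 0], [0, 0, 1]]

label_distribution = soft_labels(votes, classes)
majority = hard_label(votes, classes)
token_scores = soft_rationale(masks)
token_mask = threshold_rationale(token_scores)

# Loss of a model that predicts a uniform label distribution.
loss = kl_divergence(label_distribution, [0.5, 0.5])
```

With hard targets the loss reduces to standard cross‑entropy against the majority class; with soft targets the KL term penalizes a model for being confident where annotators were split.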

Results & Findings

Representation (label‑rationale)	Classification (F1)	Plausibility (Token‑F1)	Faithfulness (Drop‑Score, %)
Hard‑Hard	71.2	45.8	12.3
Hard‑Soft	73.5	52.1	15.6
Soft‑Soft	78.9	61.4	22.8
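
To make the plausibility and faithfulness columns concrete, here is a rough sketch of how a token‑level F1 and a removal‑based drop score could be computed. It is an illustration under stated assumptions, not the paper's implementation: `toy_predict_proba` is a hypothetical stand‑in for a trained classifier, and the drop score follows the common "remove the rationale tokens and re‑score" pattern, which may differ in detail from the metric the authors use.

```python
def token_f1(predicted_mask, human_mask):
    """Plausibility: F1 overlap between predicted and human rationale tokens."""
    tp = sum(1 for p, h in zip(predicted_mask, human_mask) if p == 1 and h == 1)
    fp = sum(1 for p, h in zip(predicted_mask, human_mask) if p == 1 and h == 0)
    fn = sum(1 for p, h in zip(predicted_mask, human_mask) if p == 0 and h == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def drop_score(predict_proba, tokens, rationale_mask):
    """Faithfulness: fall in predicted probability once rationale tokens are removed."""
    full = predict_proba(tokens)
    remaining = [t for t, m in zip(tokens, rationale_mask) if m == 0]
    return full - predict_proba(remaining)


def toy_predict_proba(tokens):
    """Hypothetical classifier: probability rises with flagged-word count."""
    flagged = {"awful"}
    hits = sum(1 for t in tokens if t in flagged)
    return min(1.0, 0.1 + 0.8 * hits)


# Toy example (hypothetical data).
tokens = ["those", "people", "are", "awful"]
model_mask = [0, 1, 0, 1]   # tokens the model marks as its rationale
human_mask = [0, 0, 0, 1]   # tokens a human annotator highlighted

plausibility = token_f1(model_mask, human_mask)
faithfulness = drop_score(toy_predict_proba, tokens, model_mask)
```

A larger drop means the prediction depended more heavily on the tokens the model named as its rationale, which is the sense in which the table's higher Drop‑Score values are read as better.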

Soft label & rationale representations outperform hard ones on classification F1, plausibility, and faithfulness in the reported results, indicating they capture annotator uncertainty better.
Plausibility improves when models are trained to predict soft rationales, suggesting that learning a distribution over explanations aligns more closely with human reasoning.
Faithfulness gains (larger performance drop when rationales are removed) show that soft rationales are more integral to the model’s decision process.
Complexity remains comparable; soft rationales are not significantly longer, which suggests that probabilistic explanations need not be more verbose.

Overall, the study demonstrates that evaluation pipelines that ignore rationale variability can misjudge both model quality and fairness, especially for subjective tasks.

Practical Implications

Better moderation tools: Platforms can deploy classifiers that output calibrated probabilities and soft rationales, giving moderators a confidence score plus a nuanced explanation (e.g., “0.73 probability of hate speech, with 0.6 weight on the word ‘kill’ and 0.4 on the surrounding context”).
Human‑in‑the‑loop workflows: Soft rationales enable annotators to see where the model is uncertain, facilitating quicker verification or correction.
Bias detection: By examining the distribution of rationales across demographic groups, engineers can spot systematic over‑reliance on certain tokens that may reflect hidden biases.
Model selection: The unified metric suite lets teams compare models not just on accuracy but also on the plausibility and faithfulness of their explanations, which matters for compliance (e.g., GDPR “right to explanation”).
Dataset design: The findings encourage dataset creators to collect soft rationales (e.g., multiple annotator highlights with confidence scores) rather than a single binary mask, enriching downstream training.
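
Several of the points above rely on the classifier's probabilities being calibrated. One way to check this is the expected calibration error listed among the classification metrics. The sketch below is the standard binned formulation against hard labels, with an assumed bin count of 10; the paper may compute it differently, for instance against soft label distributions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: booleans, True where the chosen class matched the label.
    n_bins: number of equal-width confidence bins (assumed value).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example (hypothetical predictions): confident but right only half the time.
overconfident = expected_calibration_error([0.9, 0.9], [True, False])
```

A value near zero means that, for example, predictions made with 0.7 confidence are correct about 70% of the time, which is what makes a score like "0.73 probability of hate speech" meaningful to a moderator.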

Limitations & Future Work

Domain focus: Experiments are limited to English hate‑speech datasets; results may differ for other languages or domains (e.g., misinformation).
Rationale granularity: Token‑level rationales ignore higher‑level discourse cues (sentence or paragraph importance). Extending the framework to hierarchical explanations is an open avenue.
Scalability: Soft rationale training incurs higher computational cost due to additional loss terms and larger output spaces. Optimizing for efficiency remains a challenge.
User studies: The paper evaluates plausibility and faithfulness automatically; real‑world user studies with moderators would solidify claims about practical usefulness.

Bottom line: By embracing the natural disagreement in both labels and explanations, developers can build hate‑speech detectors that are not only more accurate but also more transparent and trustworthy. The paper’s unified framework offers a ready‑to‑use blueprint for anyone looking to upgrade their NLP pipelines with richer, probabilistic reasoning.

Authors

Benedetta Muscato
Beiduo Chen
Gizem Gezici
Barbara Plank
Fosca Giannotti

Paper Information

arXiv ID: 2605.31563v1
Categories: cs.CL
Published: May 29, 2026
