[Paper] Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling
Source: arXiv - 2601.02337v1
Overview
Detecting toxic language is notoriously subjective: what one group finds offensive, another may not. This paper tackles that challenge by systematically evaluating how large language models (LLMs) judge toxicity when their prompts are conditioned on different demographic personas. The authors show that no single prompting recipe works best for every model‑persona pair, and they introduce a lightweight learned ensemble that consistently boosts performance across the board.
Key Contributions
- First systematic comparison of persona‑conditioned prompting strategies for toxicity detection across multiple LLMs.
- Automated prompt‑optimization framework that searches for prompts tailored to a given persona‑model combination.
- Meta‑ensemble technique: a simple linear SVM that takes a 4‑bit vector of predictions from four distinct prompting variants and learns to combine them.
- Empirical evidence that the SVM meta‑ensemble outperforms each individual prompt and classic majority‑vote ensembling on a diverse set of personas.
- Open‑source evaluation pipeline that can be reused for other subjective NLP tasks (e.g., hate‑speech, bias detection).
Methodology
- Persona Definition – The authors define a set of demographic personas (e.g., “young Black woman”, “older white male”) that encode social priors influencing toxicity perception.
- Prompt Variants – Four prompting styles are explored (sketched in code after this list):
- Base prompt (plain toxicity query)
- Persona‑injected prompt (explicitly mentions the persona)
- Optimized prompt (generated via an automated search over prompt templates)
- Hybrid prompt (combines persona and optimization cues)
- Model Suite – Experiments run on several open‑source LLMs (e.g., LLaMA‑2, Falcon, Mistral) to capture variability across architectures.
- Ensembling – Each prompt variant yields a binary toxicity label, so the four labels form a 4‑bit vector per example. An SVM is trained on these vectors (using a small validation set) to predict the final label, learning which prompt combinations are most reliable; a minimal sketch follows below.
- Evaluation – Standard metrics (F1, precision, recall) are computed per persona and aggregated to assess overall robustness.
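To make the four variants concrete, here is a minimal sketch of how the prompt templates might look. The paper's exact wording is not reproduced in this summary, so every template below is an illustrative stand‑in, and `build_prompts` is a hypothetical helper.

```python
# Hypothetical templates for the four prompt variants; the paper's exact
# wording is not given here, so these strings are illustrative stand-ins.
def build_prompts(text: str, persona: str) -> dict[str, str]:
    base = (
        "Is the following message toxic? Answer YES or NO.\n"
        f"Message: {text}"
    )
    persona_injected = (
        f"You are a {persona}. From your perspective, is the following "
        "message toxic? Answer YES or NO.\n"
        f"Message: {text}"
    )
    # In the paper this template comes from an automated search; here it
    # is a fixed stand-in with extra task-specific cues.
    optimized = (
        "Consider slurs, threats, and demeaning language carefully. "
        "Is the following message toxic? Answer YES or NO.\n"
        f"Message: {text}"
    )
    # Hybrid: persona cue plus the optimization-style cues.
    hybrid = (
        f"You are a {persona}. Consider slurs, threats, and demeaning "
        "language carefully. Is the following message toxic? "
        "Answer YES or NO.\n"
        f"Message: {text}"
    )
    return {"base": base, "persona": persona_injected,
            "optimized": optimized, "hybrid": hybrid}
```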
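The summary does not detail the automated prompt‑optimization procedure, but one simple realization is a search over a pool of candidate templates scored on labeled validation data. The sketch below assumes a `classify(template, text)` wrapper around the LLM that returns a 0/1 label; both that wrapper and the greedy search strategy are assumptions, not the authors' method.

```python
from sklearn.metrics import f1_score

# A simple stand-in for automated prompt optimization: score each candidate
# template on a labeled validation set and keep the best-scoring one.
def search_prompt(candidates, val_texts, val_labels, classify):
    """classify(template, text) -> 0/1 is an assumed wrapper around an LLM."""
    best_template, best_f1 = None, -1.0
    for template in candidates:
        preds = [classify(template, text) for text in val_texts]
        score = f1_score(val_labels, preds)
        if score > best_f1:
            best_template, best_f1 = template, score
    return best_template
```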
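The meta‑ensemble itself is straightforward to reproduce. Below is a minimal sketch with scikit‑learn, assuming the four variants have already produced binary labels; the toy data is fabricated purely for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Each row is one example's 4-bit vector of binary labels from the four
# variants, ordered [base, persona_injected, optimized, hybrid].
# Toy validation data; in practice these come from the LLM's answers.
X_val = np.array([
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
])
y_val = np.array([1, 1, 0, 1, 0])  # gold toxicity labels

# The meta-ensemble: a linear SVM learns which label combinations to trust.
meta = LinearSVC()
meta.fit(X_val, y_val)

# At inference time, four fresh binary predictions yield the final label.
print(meta.predict(np.array([[0, 1, 1, 1]])))  # e.g., [1]

# Per-persona evaluation then reduces to F1 on each persona's subset:
# f1_score(y_persona, meta.predict(X_persona))
```

Because the learned combiner only sees a four‑dimensional binary input, its training and inference cost is negligible next to the LLM calls, which is what makes it easy to bolt onto an existing moderation pipeline.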
Results & Findings
| Method | Avg. F1 (across personas) | Majority‑Vote F1 | SVM Meta‑Ensemble F1 |
|---|---|---|---|
| Base prompt | 0.71 | — | 0.78 |
| Persona‑injected prompt | 0.73 | — | 0.79 |
| Optimized prompt | 0.74 | — | 0.80 |
| Hybrid prompt | 0.75 | — | 0.82 |
| All four prompts (4‑bit vector) | — | 0.77 | 0.82 |
- No single prompt dominates; performance varies noticeably across model‑persona pairs.
- The SVM meta‑ensemble consistently beats the best individual prompt and the naïve majority‑vote baseline.
- Gains are most pronounced for personas that historically suffer higher false‑negative rates (e.g., marginalized groups).
Practical Implications
- More equitable moderation tools – Deploying the SVM meta‑ensemble can reduce bias against under‑represented groups without sacrificing overall detection quality.
- Plug‑and‑play safety layer – Since the ensemble only needs four binary predictions, it can be added on top of existing LLM‑based moderation pipelines with minimal latency overhead.
- Rapid persona adaptation – The automated prompt optimizer can be rerun when new demographic personas need to be supported, making the system future‑proof.
- Generalizable framework – The same ensemble logic can be applied to other subjective classification tasks (e.g., political bias detection, sentiment analysis) where multiple viewpoints matter.
Limitations & Future Work
- Persona granularity – The study uses a limited set of handcrafted personas; real‑world users may identify with more nuanced intersectional identities.
- Scalability of optimization – The prompt‑search procedure can be computationally expensive for very large models, though the final ensemble remains lightweight.
- Dataset bias – Evaluation relies on existing toxicity benchmarks that may not fully capture the diversity of real‑world online discourse.
- Future directions – expanding to multilingual LLMs, exploring richer ensemble learners (e.g., neural meta‑models), and integrating user‑feedback loops to continuously refine persona representations.
Authors
- Berk Atil
- Rebecca J. Passonneau
- Ninareh Mehrabi
Paper Information
- arXiv ID: 2601.02337v1
- Categories: cs.CL
- Published: January 5, 2026
- PDF: https://arxiv.org/pdf/2601.02337v1