[Paper] Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling
Source: arXiv - 2601.02337v1
Overview
Detecting toxic language is notoriously subjective: what one group finds offensive, another may not. This paper tackles that challenge by systematically evaluating how large language models (LLMs) judge toxicity when their prompts are conditioned on different demographic personas. The authors show that no single prompting recipe works best for every model‑persona pair, and they introduce a lightweight learned ensemble that consistently boosts performance across the board.
Key Contributions
- First systematic comparison of persona‑conditioned prompting strategies for toxicity detection across multiple LLMs.
- Automated prompt‑optimization framework that searches for prompts tailored to a given persona‑model combination.
- Meta‑ensemble technique: a simple linear SVM that takes a 4‑bit vector of predictions from four distinct prompting variants and learns to combine them.
- Empirical evidence that the SVM meta‑ensemble outperforms each individual prompt and classic majority‑vote ensembling on a diverse set of personas.
- Open‑source evaluation pipeline that can be reused for other subjective NLP tasks (e.g., hate‑speech, bias detection).
Methodology
- Persona Definition – The authors define a set of demographic personas (e.g., “young Black woman”, “older white male”) that encode social priors influencing toxicity perception.
- Prompt Variants – Four prompting styles are explored (sketched in code after this list):
- Base prompt (plain toxicity query)
- Persona‑injected prompt (explicitly mentions the persona)
- Optimized prompt (generated via an automated search over prompt templates)
- Hybrid prompt (combines persona and optimization cues)
- Model Suite – Experiments run on several open‑source LLMs (e.g., LLaMA‑2, Falcon, Mistral) to capture variability across architectures.
- Ensembling – Each prompt variant yields a binary toxicity label, so the four labels form a 4‑bit vector per example. An SVM is trained on these vectors (using a small validation set) to predict the final label, learning which prompt combinations are most reliable; a minimal sketch follows below.
- Evaluation – Standard metrics (F1, precision, recall) are computed per persona and aggregated to assess overall robustness.
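To make the four variants concrete, here is a minimal sketch of how the prompt templates might look. The paper's exact wording is not reproduced in this summary, so every template below is an illustrative stand‑in, and `build_prompts` is a hypothetical helper.

```python
# Hypothetical templates for the four prompt variants; the paper's exact
# wording is not given here, so these strings are illustrative stand-ins.
def build_prompts(text: str, persona: str) -> dict[str, str]:
    base = (
        "Is the following message toxic? Answer YES or NO.\n"
        f"Message: {text}"
    )
    persona_injected = (
        f"You are a {persona}. From your perspective, is the following "
        "message toxic? Answer YES or NO.\n"
        f"Message: {text}"
    )
    # In the paper this template comes from an automated search; here it
    # is a fixed stand-in with extra task-specific cues.
    optimized = (
        "Consider slurs, threats, and demeaning language carefully. "
        "Is the following message toxic? Answer YES or NO.\n"
        f"Message: {text}"
    )
    # Hybrid: persona cue plus the optimization-style cues.
    hybrid = (
        f"You are a {persona}. Consider slurs, threats, and demeaning "
        "language carefully. Is the following message toxic? "
        "Answer YES or NO.\n"
        f"Message: {text}"
    )
    return {"base": base, "persona": persona_injected,
            "optimized": optimized, "hybrid": hybrid}
```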
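The summary does not detail the automated prompt‑optimization procedure, but one simple realization is a search over a pool of candidate templates scored on labeled validation data. The sketch below assumes a `classify(template, text)` wrapper around the LLM that returns a 0/1 label; both that wrapper and the greedy search strategy are assumptions, not the authors' method.

```python
from sklearn.metrics import f1_score

# A simple stand-in for automated prompt optimization: score each candidate
# template on a labeled validation set and keep the best-scoring one.
def search_prompt(candidates, val_texts, val_labels, classify):
    """classify(template, text) -> 0/1 is an assumed wrapper around an LLM."""
    best_template, best_f1 = None, -1.0
    for template in candidates:
        preds = [classify(template, text) for text in val_texts]
        score = f1_score(val_labels, preds)
        if score > best_f1:
            best_template, best_f1 = template, score
    return best_template
```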
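The meta‑ensemble itself is straightforward to reproduce. Below is a minimal sketch with scikit‑learn, assuming the four variants have already produced binary labels; the toy data is fabricated purely for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Each row is one example's 4-bit vector of binary labels from the four
# variants, ordered [base, persona_injected, optimized, hybrid].
# Toy validation data; in practice these come from the LLM's answers.
X_val = np.array([
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
])
y_val = np.array([1, 1, 0, 1, 0])  # gold toxicity labels

# The meta-ensemble: a linear SVM learns which label combinations to trust.
meta = LinearSVC()
meta.fit(X_val, y_val)

# At inference time, four fresh binary predictions yield the final label.
print(meta.predict(np.array([[0, 1, 1, 1]])))  # e.g., [1]

# Per-persona evaluation then reduces to F1 on each persona's subset:
# f1_score(y_persona, meta.predict(X_persona))
```

Because the learned combiner only sees a four‑dimensional binary input, its training and inference cost is negligible next to the LLM calls, which is what makes it easy to bolt onto an existing moderation pipeline.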
Results & Findings
| Method | Avg. F1 (across personas) | Majority‑Vote F1 | SVM Meta‑Ensemble F1 |
|---|---|---|---|
| Base prompt | 0.71 | — | 0.78 |
| Persona‑injected prompt | 0.73 | — | 0.79 |
| Optimized prompt | 0.74 | — | 0.80 |
| Hybrid prompt | 0.75 | — | 0.82 |
| All four prompts (4‑bit vector) | — | 0.77 | 0.82 |
- No single prompt dominates; performance varies noticeably across model‑persona pairs.
- The SVM meta‑ensemble consistently beats the best individual prompt and the naïve majority‑vote baseline.
- Gains are most pronounced for personas that historically suffer higher false‑negative rates (e.g., marginalized groups).
Practical Implications
- More equitable moderation tools – Deploying the SVM meta‑ensemble can reduce bias against under‑represented groups without sacrificing overall detection quality.
- Plug‑and‑play safety layer – Since the ensemble only needs four binary predictions, it can be added on top of existing LLM‑based moderation pipelines with minimal latency overhead.
- Rapid persona adaptation – The automated prompt optimizer can be rerun when new demographic personas need to be supported, making the system future‑proof.
- Generalizable framework – The same ensemble logic can be applied to other subjective classification tasks (e.g., political bias detection, sentiment analysis) where multiple viewpoints matter.
Limitations & Future Work
- Persona granularity – The study uses a limited set of handcrafted personas; real‑world users may identify with more nuanced intersectional identities.
- Scalability of optimization – The prompt‑search procedure can be computationally expensive for very large models, though the final ensemble remains lightweight.
- Dataset bias – Evaluation relies on existing toxicity benchmarks that may not fully capture the diversity of real‑world online discourse.
- Future directions – expanding to multilingual LLMs, exploring richer ensemble learners (e.g., neural meta‑models), and integrating user‑feedback loops to continuously refine persona representations.
Authors
- Berk Atil
- Rebecca J. Passonneau
- Ninareh Mehrabi
Paper Information
- arXiv ID: 2601.02337v1
- Categories: cs.CL
- Published: January 5, 2026
- PDF: https://arxiv.org/pdf/2601.02337v1