[Paper] Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

Published: February 16, 2026 at 12:56 PM EST
5 min read
Source: arXiv


Overview

Large Language Models (LLMs) are now being used to automatically grade contact‑center agents and generate coaching feedback. This paper investigates whether those models treat all agents fairly, or whether hidden biases cause systematic score swings when demographic or contextual cues change. By applying a counterfactual fairness audit to 18 LLMs on 3,000 real transcripts, the authors expose measurable unfairness and propose a lightweight prompting fix.

Key Contributions

  • Comprehensive fairness benchmark – 13 bias dimensions (identity, context, behavioral style) organized into three categories, each paired with a counterfactual transcript.
  • Two quantitative fairness metrics – Counterfactual Flip Rate (CFR) for binary judgment changes and Mean Absolute Score Difference (MASD) for continuous coaching scores.
  • Large‑scale empirical study – Evaluation of 18 open‑source and commercial LLMs on 3,000 real‑world contact‑center interactions.
  • Insightful correlation analysis – Larger, strongly instruction‑aligned models tend to be less unfair, but fairness does not align with raw accuracy.
  • Prompt‑level mitigation experiment – Testing “fairness‑aware” prompts that explicitly ask the model to ignore identity cues, showing only modest gains.
  • Practical audit pipeline – A reproducible workflow that can be integrated into CI/CD for any LLM‑driven QA system.

Methodology

  1. Data collection – 3,000 anonymized contact‑center call transcripts, each annotated with the agent’s performance scores (confidence, positivity, improvement).
  2. Counterfactual generation – For every transcript, the authors create variants that swap identity markers (e.g., gendered names, accent cues), alter contextual information (e.g., prior performance history), or modify behavioral style (e.g., politeness level) while keeping the core conversation unchanged.
  3. Model suite – 18 LLMs ranging from 2 B‑parameter open‑source models to commercial 175 B‑parameter systems, each queried with the same prompt template used in production QA pipelines.
  4. Metrics
    • CFR = % of cases where the binary pass/fail decision flips between the original and its counterfactual.
    • MASD = average absolute difference in the numeric coaching scores (0‑100) across the pair.
  5. Prompt‑based mitigation – Two variants: (a) “fairness‑aware” prompt that explicitly instructs the model to ignore identity cues, and (b) a control prompt with no extra instruction.
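The two metrics above are simple to compute once each original/counterfactual pair of model outputs is collected. A minimal sketch (the record fields and sample values are illustrative, not taken from the paper's code):

```python
# Each pair holds the model's output on an original transcript and on its
# counterfactual variant: a binary pass/fail plus 0-100 coaching scores.

def counterfactual_flip_rate(pairs):
    """CFR: percent of pairs whose binary pass/fail decision flips."""
    flips = sum(1 for orig, cf in pairs if orig["pass"] != cf["pass"])
    return 100.0 * flips / len(pairs)

def mean_absolute_score_difference(pairs, dimension="confidence"):
    """MASD: mean |original score - counterfactual score| on a 0-100 scale."""
    diffs = [abs(orig[dimension] - cf[dimension]) for orig, cf in pairs]
    return sum(diffs) / len(diffs)

# Toy data: one decision flip out of four pairs.
pairs = [
    ({"pass": True, "confidence": 82}, {"pass": True, "confidence": 79}),
    ({"pass": True, "confidence": 90}, {"pass": False, "confidence": 71}),
    ({"pass": False, "confidence": 40}, {"pass": False, "confidence": 44}),
    ({"pass": True, "confidence": 66}, {"pass": True, "confidence": 66}),
]

print(counterfactual_flip_rate(pairs))           # 25.0 (% of pairs flipped)
print(mean_absolute_score_difference(pairs))     # 6.5 (score points)
```

A fair model drives both numbers toward zero: identical conversations should yield identical judgments regardless of which variant the model sees.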

The whole process is scripted in Python, using the OpenAI API or Hugging Face inference endpoints, and the results are logged to a shared dashboard for easy inspection.
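The counterfactual-generation step (step 2 above) can be approximated with a marker-substitution pass over the transcript. The marker table and regex approach below are a hedged sketch, not the authors' implementation:

```python
import re

# Hypothetical identity-marker swaps; a real audit would cover all
# 13 bias dimensions (identity, context, behavioral style).
IDENTITY_SWAPS = {
    "James": "Jasmine",
    "he": "she",
    "his": "her",
}

def make_counterfactual(transcript: str, swaps: dict) -> str:
    """Replace whole-word identity markers, leaving the rest of the
    conversation byte-for-byte unchanged."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(0)], transcript)

original = "James said he would escalate his ticket."
print(make_counterfactual(original, IDENTITY_SWAPS))
# Jasmine said she would escalate her ticket.
```

Because only the markers change, any difference in the model's score for the two variants is attributable to the swapped cue rather than to call quality.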

Results & Findings

| Model size / alignment | CFR (overall) | MASD (confidence) | Notable bias source |
| --- | --- | --- | --- |
| Small open‑source (2 B) | 13.0 % | 4.8 points | Implicit name gender cues |
| Mid‑size (7 B) | 9.2 % | 3.6 points | Historical performance priming |
| Large commercial (175 B) | 5.4 % | 2.1 points | Minimal but present linguistic cues |

  • Systematic disparities – All models exhibit non‑zero CFR; the worst cases exceed 16 % when the prompt includes prior performance context.
  • Score drift – MASD shows consistent upward or downward shifts (up to 5 points) for certain identity groups, even when the underlying call quality is identical.
  • Size vs. fairness trade‑off – Bigger, instruction‑tuned models are generally fairer, but fairness does not correlate with traditional accuracy (e.g., F1 on a separate QA benchmark).
  • Prompt mitigation – Adding a fairness instruction reduces CFR by an average of 1.2 % and MASD by ~0.4 points—statistically significant but insufficient for high‑stakes deployment.

Practical Implications

  • Audit before rollout – Companies should embed a counterfactual fairness test into their model‑validation CI pipelines, especially for any HR‑related scoring system.
  • Model selection – When fairness is a priority, favor larger, instruction‑tuned LLMs, but still verify with domain‑specific audits; size alone isn’t a guarantee.
  • Prompt engineering limits – Simple “ignore identity” prompts help but cannot replace systematic data‑level or architectural interventions.
  • Regulatory compliance – The metrics (CFR, MASD) provide concrete evidence that can be reported to auditors or used to satisfy emerging AI‑fairness regulations.
  • Product design – Consider decoupling the raw LLM judgment from the final score (e.g., using a calibrated post‑processor) to dampen bias amplification.
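The "audit before rollout" recommendation amounts to a simple gate in the validation pipeline: compute CFR and MASD on a held-out counterfactual set and block deployment when either exceeds a team-chosen threshold. The threshold values below are hypothetical, not prescribed by the paper:

```python
# Illustrative CI fairness gate. Thresholds are policy decisions the
# deploying team must set; these numbers are placeholders.
CFR_MAX = 5.0    # max tolerated counterfactual flip rate, in percent
MASD_MAX = 2.0   # max tolerated mean absolute score difference, in points

def fairness_gate(cfr: float, masd: float) -> bool:
    """Return True when the candidate model passes the fairness audit."""
    return cfr <= CFR_MAX and masd <= MASD_MAX

# Under these thresholds, even the paper's best-performing tier
# (CFR 5.4 %, MASD 2.1 points) would be held back for review.
assert not fairness_gate(5.4, 2.1)
assert fairness_gate(3.2, 1.5)
```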

Limitations & Future Work

  • Scope of counterfactuals – Only 13 dimensions were explored; real‑world bias can be more nuanced (e.g., intersectional identities).
  • Static prompts – The study used a single prompt template; dynamic or multi‑turn prompting could behave differently.
  • Generalizability – Experiments were limited to English‑language contact‑center data from a single industry; results may vary across languages or sectors.
  • Mitigation depth – The paper only tested prompt‑level fixes; future work should explore fine‑tuning with fairness‑aware loss functions, data augmentation, or model‑level debiasing techniques.

By highlighting both the promise and the hidden pitfalls of LLM‑driven QA, this research gives developers a concrete roadmap for building more equitable AI tools in the contact‑center ecosystem.

Authors

  • Kawin Mayilvaghanan
  • Siddhant Gupta
  • Ayush Kumar

Paper Information

  • arXiv ID: 2602.14970v1
  • Categories: cs.CL
  • Published: February 16, 2026
  • PDF: Download PDF