[Paper] Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

Published: February 16, 2026 at 12:56 PM EST
5 min read
Source: arXiv


Overview

Large Language Models (LLMs) are now being used to automatically grade contact‑center agents and generate coaching feedback. This paper investigates whether those models treat all agents fairly, or whether hidden biases cause systematic score swings when demographic or contextual cues change. By applying a counterfactual fairness audit to 18 LLMs on 3,000 real transcripts, the authors expose measurable unfairness and propose a lightweight prompting fix.

Key Contributions

  • Comprehensive fairness benchmark – 13 bias dimensions (identity, context, behavioral style) organized into three categories, each paired with a counterfactual transcript.
  • Two quantitative fairness metrics – Counterfactual Flip Rate (CFR) for binary judgment changes and Mean Absolute Score Difference (MASD) for continuous coaching scores.
  • Large‑scale empirical study – Evaluation of 18 open‑source and commercial LLMs on 3,000 real‑world contact‑center interactions.
  • Insightful correlation analysis – Larger, strongly instruction‑aligned models tend to be less unfair, but fairness does not align with raw accuracy.
  • Prompt‑level mitigation experiment – Testing “fairness‑aware” prompts that explicitly ask the model to ignore identity cues, showing only modest gains.
  • Practical audit pipeline – A reproducible workflow that can be integrated into CI/CD for any LLM‑driven QA system.

Methodology

  1. Data collection – 3,000 anonymized contact‑center call transcripts, each annotated with the agent’s performance scores (confidence, positivity, improvement).
  2. Counterfactual generation – For every transcript, the authors create variants that swap identity markers (e.g., gendered names, accent cues), alter contextual information (e.g., prior performance history), or modify behavioral style (e.g., politeness level) while keeping the core conversation unchanged.
  3. Model suite – 18 LLMs ranging from 2 B‑parameter open‑source models to commercial 175 B‑parameter systems, each queried with the same prompt template used in production QA pipelines.
  4. Metrics
    • CFR = % of cases where the binary pass/fail decision flips between the original and its counterfactual.
    • MASD = average absolute difference in the numeric coaching scores (0‑100) across the pair.
  5. Prompt‑based mitigation – Two variants: (a) “fairness‑aware” prompt that explicitly instructs the model to ignore identity cues, and (b) a control prompt with no extra instruction.
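The two metrics above are simple to compute once each original/counterfactual pair of model outputs is collected. A minimal sketch (the record fields and sample values are illustrative, not taken from the paper's code):

```python
# Each pair holds the model's output on an original transcript and on its
# counterfactual variant: a binary pass/fail plus 0-100 coaching scores.

def counterfactual_flip_rate(pairs):
    """CFR: percent of pairs whose binary pass/fail decision flips."""
    flips = sum(1 for orig, cf in pairs if orig["pass"] != cf["pass"])
    return 100.0 * flips / len(pairs)

def mean_absolute_score_difference(pairs, dimension="confidence"):
    """MASD: mean |original score - counterfactual score| on a 0-100 scale."""
    diffs = [abs(orig[dimension] - cf[dimension]) for orig, cf in pairs]
    return sum(diffs) / len(diffs)

# Toy data: one decision flip out of four pairs.
pairs = [
    ({"pass": True, "confidence": 82}, {"pass": True, "confidence": 79}),
    ({"pass": True, "confidence": 90}, {"pass": False, "confidence": 71}),
    ({"pass": False, "confidence": 40}, {"pass": False, "confidence": 44}),
    ({"pass": True, "confidence": 66}, {"pass": True, "confidence": 66}),
]

print(counterfactual_flip_rate(pairs))           # 25.0 (% of pairs flipped)
print(mean_absolute_score_difference(pairs))     # 6.5 (score points)
```

A fair model drives both numbers toward zero: identical conversations should yield identical judgments regardless of which variant the model sees.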

The whole process is scripted in Python, using the OpenAI API or Hugging Face inference endpoints, and the results are logged to a shared dashboard for easy inspection.
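The counterfactual-generation step (step 2 above) can be approximated with a marker-substitution pass over the transcript. The marker table and regex approach below are a hedged sketch, not the authors' implementation:

```python
import re

# Hypothetical identity-marker swaps; a real audit would cover all
# 13 bias dimensions (identity, context, behavioral style).
IDENTITY_SWAPS = {
    "James": "Jasmine",
    "he": "she",
    "his": "her",
}

def make_counterfactual(transcript: str, swaps: dict) -> str:
    """Replace whole-word identity markers, leaving the rest of the
    conversation byte-for-byte unchanged."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(0)], transcript)

original = "James said he would escalate his ticket."
print(make_counterfactual(original, IDENTITY_SWAPS))
# Jasmine said she would escalate her ticket.
```

Because only the markers change, any difference in the model's score for the two variants is attributable to the swapped cue rather than to call quality.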

Results & Findings

| Model size / alignment | CFR (overall) | MASD (confidence) | Notable bias source |
| --- | --- | --- | --- |
| Small open‑source (2 B) | 13.0 % | 4.8 points | Implicit name gender cues |
| Mid‑size (7 B) | 9.2 % | 3.6 points | Historical performance priming |
| Large commercial (175 B) | 5.4 % | 2.1 points | Minimal but present linguistic cues |

  • Systematic disparities – All models exhibit non‑zero CFR; the worst cases exceed 16 % when the prompt includes prior performance context.
  • Score drift – MASD shows consistent upward or downward shifts (up to 5 points) for certain identity groups, even when the underlying call quality is identical.
  • Size vs. fairness trade‑off – Bigger, instruction‑tuned models are generally fairer, but fairness does not correlate with traditional accuracy (e.g., F1 on a separate QA benchmark).
  • Prompt mitigation – Adding a fairness instruction reduces CFR by an average of 1.2 % and MASD by ~0.4 points—statistically significant but insufficient for high‑stakes deployment.

Practical Implications

  • Audit before rollout – Companies should embed a counterfactual fairness test into their model‑validation CI pipelines, especially for any HR‑related scoring system.
  • Model selection – When fairness is a priority, favor larger, instruction‑tuned LLMs, but still verify with domain‑specific audits; size alone isn’t a guarantee.
  • Prompt engineering limits – Simple “ignore identity” prompts help but cannot replace systematic data‑level or architectural interventions.
  • Regulatory compliance – The metrics (CFR, MASD) provide concrete evidence that can be reported to auditors or used to satisfy emerging AI‑fairness regulations.
  • Product design – Consider decoupling the raw LLM judgment from the final score (e.g., using a calibrated post‑processor) to dampen bias amplification.
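The "audit before rollout" recommendation amounts to a simple gate in the validation pipeline: compute CFR and MASD on a held-out counterfactual set and block deployment when either exceeds a team-chosen threshold. The threshold values below are hypothetical, not prescribed by the paper:

```python
# Illustrative CI fairness gate. Thresholds are policy decisions the
# deploying team must set; these numbers are placeholders.
CFR_MAX = 5.0    # max tolerated counterfactual flip rate, in percent
MASD_MAX = 2.0   # max tolerated mean absolute score difference, in points

def fairness_gate(cfr: float, masd: float) -> bool:
    """Return True when the candidate model passes the fairness audit."""
    return cfr <= CFR_MAX and masd <= MASD_MAX

# Under these thresholds, even the paper's best-performing tier
# (CFR 5.4 %, MASD 2.1 points) would be held back for review.
assert not fairness_gate(5.4, 2.1)
assert fairness_gate(3.2, 1.5)
```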

Limitations & Future Work

  • Scope of counterfactuals – Only 13 dimensions were explored; real‑world bias can be more nuanced (e.g., intersectional identities).
  • Static prompts – The study used a single prompt template; dynamic or multi‑turn prompting could behave differently.
  • Generalizability – Experiments were limited to English‑language contact‑center data from a single industry; results may vary across languages or sectors.
  • Mitigation depth – The paper only tested prompt‑level fixes; future work should explore fine‑tuning with fairness‑aware loss functions, data augmentation, or model‑level debiasing techniques.

By highlighting both the promise and the hidden pitfalls of LLM‑driven QA, this research gives developers a concrete roadmap for building more equitable AI tools in the contact‑center ecosystem.

Authors

  • Kawin Mayilvaghanan
  • Siddhant Gupta
  • Ayush Kumar

Paper Information

  • arXiv ID: 2602.14970v1
  • Categories: cs.CL
  • Published: February 16, 2026
  • PDF: Download PDF