[Paper] Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Published: February 10, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2602.10117v1

Overview

Large language models (LLMs) are increasingly used for high‑stakes decisions—hiring, loan approvals, university admissions—by generating chain‑of‑thought (CoT) explanations that look sensible. However, these explanations can hide unverbalized biases: systematic preferences that the model never mentions. The paper introduces a fully automated, black‑box pipeline that discovers such hidden, task‑specific biases without needing pre‑defined bias categories or hand‑crafted test sets.

Key Contributions

  • Black‑box bias discovery pipeline that works with any LLM given only a task dataset.
  • Automatic generation of candidate bias concepts using the LLM itself as an “autorater.”
  • Statistical testing framework (multiple‑testing correction, early stopping) to flag concepts that cause performance gaps yet are absent from the model’s CoT justifications.
  • Empirical validation on six popular LLMs across three decision‑making tasks (hiring, loan approval, university admissions).
  • Discovery of both known and novel biases, e.g., preferences for Spanish fluency, English proficiency, and writing formality, in addition to classic gender, race, religion, and ethnicity biases.

Methodology

  1. Input – A labeled dataset for a downstream decision task (e.g., applicant resumes with hire/reject labels).
  2. Concept generation – The LLM is prompted to list plausible “bias concepts” that could affect the decision (e.g., “native language,” “writing style”). This step is completely automated; no human‑curated list is required.
  3. Perturbation creation – For each candidate concept, the pipeline synthesizes positive (concept present) and negative (concept absent) variants of the original inputs by minimally editing the text (e.g., swapping a Spanish‑language sentence with an English one).
  4. Model evaluation – The target LLM processes both variants, producing predictions and CoT explanations.
  5. Statistical analysis – Using hypothesis testing across many samples, the pipeline checks whether the concept leads to a significant performance difference (e.g., higher acceptance rate for English‑fluent applicants). Simultaneously, it verifies that the concept never appears in the model’s CoT.
  6. Multiple‑testing correction & early stopping – To keep false‑positive rates low, the method applies Bonferroni‑type corrections and stops testing a concept once enough evidence is gathered.
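Step 3 can be illustrated with a toy sketch. In the paper the minimal edits are generated by the LLM itself; the string substitution below is only a stand-in for that rewriting step, and the template, concept phrases, and function name are illustrative, not from the paper.

```python
def make_perturbation_pair(original: str, concept_present: str, concept_absent: str):
    """Return (positive, negative) variants that differ only in the concept phrase."""
    positive = original.replace("{CONCEPT}", concept_present)
    negative = original.replace("{CONCEPT}", concept_absent)
    return positive, negative

# Toy resume template with a slot for the concept under test.
resume_template = "Motivated analyst. {CONCEPT}. Five years of experience."
pos, neg = make_perturbation_pair(
    resume_template,
    concept_present="Fluent in Spanish",
    concept_absent="Fluent in German",
)
```

Because the two variants are identical except for the concept phrase, any systematic gap in the model's decisions between them can be attributed to that concept.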

If a concept satisfies both conditions—performance impact and absence from the reasoning trace—it is flagged as an unverbalized bias.
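The flagging rule can be sketched in a few lines. The two-proportion z-test below is a standard choice and not necessarily the exact statistic the paper uses, and the keyword match against the chain-of-thought is a simplified stand-in for the paper's LLM-based verbalization check.

```python
import math

def two_proportion_p(accept_pos, n_pos, accept_neg, n_neg):
    """Two-sided p-value for a difference in acceptance rates (normal approximation)."""
    p1, p2 = accept_pos / n_pos, accept_neg / n_neg
    p = (accept_pos + accept_neg) / (n_pos + n_neg)   # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_pos + 1 / n_neg))
    if se == 0:
        return 1.0
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def is_unverbalized_bias(accept_pos, n_pos, accept_neg, n_neg,
                         cot_traces, concept_terms, alpha=0.05):
    """Flag a concept iff it shifts acceptance rates AND never appears in the CoT."""
    significant = two_proportion_p(accept_pos, n_pos, accept_neg, n_neg) < alpha
    verbalized = any(term.lower() in trace.lower()
                     for trace in cot_traces for term in concept_terms)
    return significant and not verbalized
```

For example, 90/100 acceptances on concept-present variants versus 60/100 on concept-absent ones is a highly significant gap; if no trace mentions the concept, the concept is flagged.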

Results & Findings

  • Across the three tasks, the pipeline identified all previously reported biases (gender, race, religion, ethnicity) and uncovered new ones such as:
    • Preference for applicants who are fluent in Spanish (in a hiring dataset).
    • Higher acceptance rates for candidates with stronger English proficiency in loan‑approval scenarios.
    • Favoring formal writing style over informal in university admissions.
  • The discovered biases were consistent across multiple LLM families (e.g., GPT‑3.5, Claude, LLaMA) but varied in magnitude, highlighting model‑specific risk profiles.
  • Early stopping reduced the number of required model queries by ~40% without sacrificing detection power, demonstrating the pipeline’s scalability.
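The query savings come from testing sequentially rather than at a fixed sample size. The sketch below is a generic interim-analysis loop with a conservative threshold, shown only to illustrate the idea; the paper's actual stopping rule, batch size, and thresholds may differ.

```python
import math

def interim_p(accepts_pos, accepts_neg, n):
    """Two-sided two-proportion p-value with n samples per arm."""
    p1, p2 = accepts_pos / n, accepts_neg / n
    p = (accepts_pos + accepts_neg) / (2 * n)
    se = math.sqrt(2 * p * (1 - p) / n)
    if se == 0:
        return 1.0
    return math.erfc(abs(p1 - p2) / se / math.sqrt(2))

def sequential_test(query_pair, max_n=1000, batch=50, strict_alpha=0.001):
    """Query in batches; stop early once an interim test is decisive.

    query_pair() issues one probe per variant and returns the two
    accept/reject outcomes as (0 or 1, 0 or 1).
    """
    accepts_pos = accepts_neg = n = 0
    while n < max_n:
        for _ in range(batch):
            a_pos, a_neg = query_pair()
            accepts_pos += a_pos
            accepts_neg += a_neg
            n += 1
        if interim_p(accepts_pos, accepts_neg, n) < strict_alpha:
            return True, n  # decisive early: no further queries needed
    return False, n
```

A concept with a large effect is resolved after the first batch, while a null concept runs to the full budget, which is why the average query count drops.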

Practical Implications

  • Automated audit tools: Companies can integrate this auditing step into their model‑deployment CI/CD workflows to surface hidden biases before a system reaches production.
  • Regulatory compliance: The method offers evidence‑based bias detection that aligns with emerging AI‑fairness regulations, which often require “explainability” beyond surface‑level reasoning.
  • Model selection & fine‑tuning: Developers can compare candidate LLMs on the same task and pick the one with the fewest unverbalized biases, or use the identified concepts to guide targeted fine‑tuning or data augmentation.
  • User‑trust diagnostics: By surfacing biases that the model itself does not acknowledge, product teams can design better user‑facing disclosures (e.g., “Our system may favor applicants with certain language characteristics”).

Limitations & Future Work

  • Reliance on LLM‑generated concepts: If the model fails to propose a relevant bias concept, the pipeline cannot test it, potentially missing subtle biases.
  • Perturbation realism: Synthetic edits (e.g., swapping language fluency) may not fully capture real‑world variations, which could affect the external validity of findings.
  • Scope to text‑centric tasks: The current design assumes textual inputs; extending to multimodal or structured data (images, tables) remains an open challenge.
  • Statistical power vs. cost trade‑off: While early stopping reduces queries, very low‑frequency biases may still require large sample sizes to detect. Future work could explore adaptive sampling or active learning to focus effort on the most promising concepts.

Bottom line: This paper delivers a practical, black‑box solution for surfacing hidden, task‑specific biases in LLMs—something that static fairness checklists and CoT explanations alone cannot guarantee. For developers building AI‑driven decision systems, the pipeline offers a scalable way to uncover and mitigate unfair behavior before it reaches end users.

Authors

  • Iván Arcuschin
  • David Chanin
  • Adrià Garriga-Alonso
  • Oana-Maria Camburu

Paper Information

  • arXiv ID: 2602.10117v1
  • Categories: cs.LG, cs.AI
  • Published: February 10, 2026