[Paper] Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Published: February 10, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2602.10117v1

Overview

Large language models (LLMs) are increasingly used for high‑stakes decisions—hiring, loan approvals, university admissions—by generating chain‑of‑thought (CoT) explanations that look sensible. However, these explanations can hide unverbalized biases: systematic preferences that the model never mentions. The paper introduces a fully automated, black‑box pipeline that discovers such hidden, task‑specific biases without needing pre‑defined bias categories or hand‑crafted test sets.

Key Contributions

  • Black‑box bias discovery pipeline that works with any LLM given only a task dataset.
  • Automatic generation of candidate bias concepts using the LLM itself as an “autorater.”
  • Statistical testing framework (multiple‑testing correction, early stopping) to flag concepts that cause performance gaps yet are absent from the model’s CoT justifications.
  • Empirical validation on six popular LLMs across three decision‑making tasks (hiring, loan approval, university admissions).
  • Discovery of both known and novel biases, e.g., preferences for Spanish fluency, English proficiency, and writing formality, in addition to classic gender, race, religion, and ethnicity biases.

Methodology

  1. Input – A labeled dataset for a downstream decision task (e.g., applicant resumes with hire/reject labels).
  2. Concept generation – The LLM is prompted to list plausible “bias concepts” that could affect the decision (e.g., “native language,” “writing style”). This step is completely automated; no human‑curated list is required.
  3. Perturbation creation – For each candidate concept, the pipeline synthesizes positive (concept present) and negative (concept absent) variants of the original inputs by minimally editing the text (e.g., swapping a Spanish‑language sentence with an English one).
  4. Model evaluation – The target LLM processes both variants, producing predictions and CoT explanations.
  5. Statistical analysis – Using hypothesis testing across many samples, the pipeline checks whether the concept leads to a significant performance difference (e.g., higher acceptance rate for English‑fluent applicants). Simultaneously, it verifies that the concept never appears in the model’s CoT.
  6. Multiple‑testing correction & early stopping – To keep false‑positive rates low, the method applies Bonferroni‑type corrections and stops testing a concept once enough evidence is gathered.
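Step 3 can be illustrated with a toy sketch. In the paper the minimal edits are generated by the LLM itself; the string substitution below is only a stand-in for that rewriting step, and the template, concept phrases, and function name are illustrative, not from the paper.

```python
def make_perturbation_pair(original: str, concept_present: str, concept_absent: str):
    """Return (positive, negative) variants that differ only in the concept phrase."""
    positive = original.replace("{CONCEPT}", concept_present)
    negative = original.replace("{CONCEPT}", concept_absent)
    return positive, negative

# Toy resume template with a slot for the concept under test.
resume_template = "Motivated analyst. {CONCEPT}. Five years of experience."
pos, neg = make_perturbation_pair(
    resume_template,
    concept_present="Fluent in Spanish",
    concept_absent="Fluent in German",
)
```

Because the two variants are identical except for the concept phrase, any systematic gap in the model's decisions between them can be attributed to that concept.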

If a concept satisfies both conditions—performance impact and absence from the reasoning trace—it is flagged as an unverbalized bias.
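The flagging rule can be sketched in a few lines. The two-proportion z-test below is a standard choice and not necessarily the exact statistic the paper uses, and the keyword match against the chain-of-thought is a simplified stand-in for the paper's LLM-based verbalization check.

```python
import math

def two_proportion_p(accept_pos, n_pos, accept_neg, n_neg):
    """Two-sided p-value for a difference in acceptance rates (normal approximation)."""
    p1, p2 = accept_pos / n_pos, accept_neg / n_neg
    p = (accept_pos + accept_neg) / (n_pos + n_neg)   # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_pos + 1 / n_neg))
    if se == 0:
        return 1.0
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def is_unverbalized_bias(accept_pos, n_pos, accept_neg, n_neg,
                         cot_traces, concept_terms, alpha=0.05):
    """Flag a concept iff it shifts acceptance rates AND never appears in the CoT."""
    significant = two_proportion_p(accept_pos, n_pos, accept_neg, n_neg) < alpha
    verbalized = any(term.lower() in trace.lower()
                     for trace in cot_traces for term in concept_terms)
    return significant and not verbalized
```

For example, 90/100 acceptances on concept-present variants versus 60/100 on concept-absent ones is a highly significant gap; if no trace mentions the concept, the concept is flagged.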

Results & Findings

  • Across the three tasks, the pipeline identified all previously reported biases (gender, race, religion, ethnicity) and uncovered new ones such as:
    • Preference for applicants who are fluent in Spanish (in a hiring dataset).
    • Higher acceptance rates for candidates with stronger English proficiency in loan‑approval scenarios.
    • Favoring formal writing style over informal in university admissions.
  • The discovered biases were consistent across multiple LLM families (e.g., GPT‑3.5, Claude, LLaMA) but varied in magnitude, highlighting model‑specific risk profiles.
  • Early stopping reduced the number of required model queries by ~40% without sacrificing detection power, demonstrating the pipeline’s scalability.
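The query savings come from testing sequentially rather than at a fixed sample size. The sketch below is a generic interim-analysis loop with a conservative threshold, shown only to illustrate the idea; the paper's actual stopping rule, batch size, and thresholds may differ.

```python
import math

def interim_p(accepts_pos, accepts_neg, n):
    """Two-sided two-proportion p-value with n samples per arm."""
    p1, p2 = accepts_pos / n, accepts_neg / n
    p = (accepts_pos + accepts_neg) / (2 * n)
    se = math.sqrt(2 * p * (1 - p) / n)
    if se == 0:
        return 1.0
    return math.erfc(abs(p1 - p2) / se / math.sqrt(2))

def sequential_test(query_pair, max_n=1000, batch=50, strict_alpha=0.001):
    """Query in batches; stop early once an interim test is decisive.

    query_pair() issues one probe per variant and returns the two
    accept/reject outcomes as (0 or 1, 0 or 1).
    """
    accepts_pos = accepts_neg = n = 0
    while n < max_n:
        for _ in range(batch):
            a_pos, a_neg = query_pair()
            accepts_pos += a_pos
            accepts_neg += a_neg
            n += 1
        if interim_p(accepts_pos, accepts_neg, n) < strict_alpha:
            return True, n  # decisive early: no further queries needed
    return False, n
```

A concept with a large effect is resolved after the first batch, while a null concept runs to the full budget, which is why the average query count drops.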

Practical Implications

  • Automated audit tools: Companies can integrate this auditing step into their model‑deployment CI/CD workflows to surface hidden biases before a system reaches production.
  • Regulatory compliance: The method offers evidence‑based bias detection that aligns with emerging AI‑fairness regulations, which often require “explainability” beyond surface‑level reasoning.
  • Model selection & fine‑tuning: Developers can compare candidate LLMs on the same task and pick the one with the fewest unverbalized biases, or use the identified concepts to guide targeted fine‑tuning or data augmentation.
  • User‑trust diagnostics: By surfacing biases that the model itself does not acknowledge, product teams can design better user‑facing disclosures (e.g., “Our system may favor applicants with certain language characteristics”).

Limitations & Future Work

  • Reliance on LLM‑generated concepts: If the model fails to propose a relevant bias concept, the pipeline cannot test it, potentially missing subtle biases.
  • Perturbation realism: Synthetic edits (e.g., swapping language fluency) may not fully capture real‑world variations, which could affect the external validity of findings.
  • Scope to text‑centric tasks: The current design assumes textual inputs; extending to multimodal or structured data (images, tables) remains an open challenge.
  • Statistical power vs. cost trade‑off: While early stopping reduces queries, very low‑frequency biases may still require large sample sizes to detect. Future work could explore adaptive sampling or active learning to focus effort on the most promising concepts.

Bottom line: This paper delivers a practical, black‑box solution for surfacing hidden, task‑specific biases in LLMs—something that static fairness checklists and CoT explanations alone cannot guarantee. For developers building AI‑driven decision systems, the pipeline offers a scalable way to uncover and mitigate unfair behavior before it reaches end users.

Authors

  • Iván Arcuschin
  • David Chanin
  • Adrià Garriga-Alonso
  • Oana-Maria Camburu

Paper Information

  • arXiv ID: 2602.10117v1
  • Categories: cs.LG, cs.AI
  • Published: February 10, 2026