[Paper] EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

Published: May 5, 2026 at 01:20 PM EDT

Source: arXiv - 2605.03998v1

Overview

The paper EQUITRIAGE investigates whether large language models (LLMs) used for emergency department (ED) triage inherit the gender bias long documented among human clinicians. By auditing five popular LLMs on more than 18 k real‑world ED case vignettes (MIMIC‑IV‑ED) and their gender‑swapped counterparts, the authors measure systematic “flips”: cases where a model changes a patient’s acuity score solely because the patient’s gender changes. The findings show that fairness is not a one‑size‑fits‑all property: each model behaves differently, and the choice of prompting strategy can sharply raise or lower the measured bias.

Key Contributions

Large‑scale fairness audit covering 374 k model evaluations on 18 714 clinical vignettes with gender‑counterfactual pairs.
Quantitative flip‑rate metric (percentage of cases where the predicted Emergency Severity Index changes after gender swap) and a pre‑registered 5 % fairness threshold.
Discovery of divergent bias patterns: two models show strong female under‑triage, two are near parity, and one shows high overall sensitivity with only a mild male‑direction bias.
Demonstration that fairness dimensions differ: group parity, counterfactual invariance, and calibration to downstream outcomes (e.g., admission) are not interchangeable.
Prompt engineering insights: demographic blinding (removing name/gender cues) cuts Gemini’s flip rate to 0.5 % but leaves a residual 1.25 : 1 directional bias in DeepSeek, while chain‑of‑thought prompting lowers triage accuracy for every model.
Mechanistic ablation showing that the same directional bias can arise from different internal signals (e.g., name + gender token vs. gender token alone).

Methodology

Dataset – 9 368 original ED triage notes from the MIMIC‑IV‑ED database were each paired with a gender‑swapped version (e.g., “he” → “she”, name changes), yielding 9 346 counterfactual pairs.
Models evaluated – Gemini‑3‑Flash, Nemotron‑3‑Super, DeepSeek‑V3.1, Mistral‑Small‑3.2, and GPT‑4.1‑Nano.
Prompt strategies – four variants:
- (a) baseline prompt,
- (b) demographic‑blinded prompt (removing name/gender),
- (c) age‑preserving blind prompt, and
- (d) chain‑of‑thought (CoT) prompting that asks the model to “think step‑by‑step”.
Fairness metrics –
- Flip rate: proportion of counterfactual pairs where the predicted Emergency Severity Index (ESI) differs.
- Directional bias ratio (F/M): ratio of female‑under‑triage flips to male‑under‑triage flips.
- Calibration gap: difference between predicted ESI and actual admission outcome in the original MIMIC‑IV data.
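
As a rough illustration of the first two metrics, the sketch below computes a flip rate and an F/M directional ratio from paired predictions and compares the flip rate with the 5 % threshold. The `PairResult` type and the function names are hypothetical, not the paper's code. It assumes ESI runs from 1 (most urgent) to 5 (least urgent), so a higher predicted ESI for one version of a vignette is counted as under‑triage of that gender.

```python
from dataclasses import dataclass
from typing import Sequence

FAIRNESS_THRESHOLD = 0.05  # the pre-registered 5 % flip-rate threshold


@dataclass(frozen=True)
class PairResult:
    """Predicted ESI for the female and male versions of one vignette (hypothetical type)."""
    esi_female: int
    esi_male: int


def flip_rate(pairs: Sequence[PairResult]) -> float:
    """Proportion of counterfactual pairs whose predicted ESI differs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    flips = sum(1 for p in pairs if p.esi_female != p.esi_male)
    return flips / len(pairs)


def directional_ratio(pairs: Sequence[PairResult]) -> float:
    """Female-under-triage flips divided by male-under-triage flips.

    Assumption: a higher ESI number means lower urgency, i.e. under-triage.
    """
    female_under = sum(1 for p in pairs if p.esi_female > p.esi_male)
    male_under = sum(1 for p in pairs if p.esi_male > p.esi_female)
    if male_under == 0:
        return float("inf") if female_under else float("nan")
    return female_under / male_under


def exceeds_threshold(pairs: Sequence[PairResult]) -> bool:
    """True when the flip rate is above the 5 % threshold."""
    return flip_rate(pairs) > FAIRNESS_THRESHOLD
```

Because any flip changes the score in one direction or the other, every flipped pair contributes to exactly one side of the ratio in this sketch.
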
Statistical analysis – pre‑registered 5 % flip‑rate threshold; Chouldechova‑style dissociation analysis to separate within‑group calibration from between‑pair invariance.
Ablation study – swapping only the gender token vs. swapping both name and gender to isolate the source of bias for Gemini and DeepSeek.
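
The two ablation conditions can be sketched as a single helper that always swaps gender tokens and optionally swaps the name as well. This is a toy, male‑to‑female‑only illustration under stated assumptions: the word list, the example note, and the names `swap_gender_tokens` and `make_counterfactual` are hypothetical and far simpler than whatever rewriting the authors actually used.

```python
import re

# Hypothetical, deliberately small male-to-female word map.
MALE_TO_FEMALE = {
    "he": "she",
    "him": "her",
    "his": "her",
    "male": "female",
    "man": "woman",
}

_TOKEN_RE = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in MALE_TO_FEMALE) + r")\b",
    re.IGNORECASE,
)


def swap_gender_tokens(text: str) -> str:
    """Replace whole-word male tokens with female ones, keeping initial capitals."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        new = MALE_TO_FEMALE[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return _TOKEN_RE.sub(repl, text)


def make_counterfactual(text: str, name: str, new_name: str, swap_name: bool) -> str:
    """Build one counterfactual note.

    swap_name=False -> gender-token-only condition
    swap_name=True  -> name + gender condition
    """
    out = swap_gender_tokens(text)
    if swap_name:
        out = re.sub(r"\b" + re.escape(name) + r"\b", new_name, out)
    return out


note = "John is a 54 year old male. He reports chest pain; his symptoms began at rest."
gender_only = make_counterfactual(note, "John", "Joan", swap_name=False)
name_and_gender = make_counterfactual(note, "John", "Joan", swap_name=True)
```

Comparing a model's predictions on `note`, `gender_only`, and `name_and_gender` separates the effect of the gender token from the effect of the name, which is the distinction the ablation draws.
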

Results & Findings

Model	Overall Flip Rate	Directional F/M Ratio	Calibration Gap (vs. admission)
DeepSeek‑V3.1	43.8 % (highest)	2.15 : 1 (female under‑triage)	0.013 (very low)
Gemini‑3‑Flash	9.9 %	1.34 : 1 (female under‑triage)	–
Nemotron‑3‑Super	Near‑parity (≈5 %)	≈1 : 1	–
Mistral‑Small‑3.2	Near‑parity (≈5 %)	≈1 : 1	–
GPT‑4.1‑Nano	High sensitivity, slight male‑direction bias	<1 : 1	–

All models exceed the 5 % flip‑rate threshold, meaning none can be declared “fair” by that simple metric.
DeepSeek’s strong bias coexists with excellent calibration, indicating that a model can be accurate overall yet still treat genders unequally.
Demographic blinding reduces Gemini’s flip rate to 0.5 %, essentially eliminating its bias, while DeepSeek still shows a residual 1.25 : 1 bias, suggesting age information leaks gender signals.
Chain‑of‑thought prompting uniformly degrades triage accuracy, showing that more “explainable” prompts are not automatically beneficial in high‑stakes clinical settings.
Ablation results reveal that Gemini’s bias emerges only when both name and gender are swapped together, whereas DeepSeek’s bias is driven solely by the gender token.
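
As a reading aid for the directional ratios, a ratio of r : 1 can be converted into the share of directional flips that under‑triage female patients, r / (r + 1). The helper below is a hypothetical convenience, and it assumes the ratio counts only flips that have a direction.

```python
def female_under_triage_share(ratio_f_to_m: float) -> float:
    """Convert an F/M ratio of r : 1 into the female share r / (r + 1)."""
    if ratio_f_to_m < 0:
        raise ValueError("ratio must be non-negative")
    return ratio_f_to_m / (ratio_f_to_m + 1.0)


# Ratios as reported in the summary above.
reported = [
    ("DeepSeek-V3.1", 2.15),
    ("Gemini-3-Flash", 1.34),
    ("DeepSeek-V3.1 after blinding", 1.25),
]
for label, ratio in reported:
    print(f"{label}: {female_under_triage_share(ratio):.1%} of directional flips under-triage women")
```

Under that assumption, a 2.15 : 1 ratio means roughly 68 % of directional flips go against the female version of the vignette, while a 1 : 1 ratio corresponds to an even 50 % split.
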

Practical Implications

Model‑specific audits are mandatory before deploying LLM‑based triage tools; a “one‑size‑fits‑all” fairness checklist will miss hidden biases.
Prompt engineering can be a low‑cost mitigation: stripping explicit demographic cues may neutralize bias for some models (e.g., Gemini) but not all, so developers must test each combination.
Calibration alone is insufficient: a model that predicts admissions well can still systematically under‑triage female patients, potentially leading to delayed care and worse outcomes.
Regulatory and compliance teams should consider flip‑rate thresholds and directional bias ratios as part of AI‑medical device certification.
Healthcare IT platforms can integrate a “fairness layer” that automatically runs gender‑counterfactual checks on new LLM updates, flagging regressions before they reach clinicians.
Developers of future LLMs may need to embed fairness constraints at the pre‑training stage (e.g., balanced gender token representations) rather than relying solely on post‑hoc prompting.
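
The “fairness layer” idea can be sketched as a small gate that re‑runs gender‑counterfactual pairs whenever a model is updated and flags the update if the flip rate is too high. Everything here is an assumption for illustration: `predict` stands in for whatever function returns an ESI for a note, the function and field names are hypothetical, and treating a rate at or below 5 % as passing is one possible reading of the threshold.

```python
from typing import Callable, Dict, Iterable, List, Tuple

TextPair = Tuple[str, str]  # (original note, gender-swapped note)


def counterfactual_gate(
    predict: Callable[[str], int],
    pairs: Iterable[TextPair],
    max_flip_rate: float = 0.05,
) -> Dict[str, object]:
    """Run a model over counterfactual pairs and report whether it passes.

    predict       -- hypothetical callable mapping a triage note to a predicted ESI
    pairs         -- (original, gender-swapped) note pairs
    max_flip_rate -- highest flip rate still treated as passing (assumed boundary)
    """
    pair_list: List[TextPair] = list(pairs)
    if not pair_list:
        raise ValueError("at least one counterfactual pair is required")

    # Keep the flipped pairs so reviewers can inspect concrete failures.
    flipped = [(a, b) for a, b in pair_list if predict(a) != predict(b)]
    rate = len(flipped) / len(pair_list)
    return {
        "flip_rate": rate,
        "passed": rate <= max_flip_rate,
        "flipped_pairs": flipped,
    }


# Stub models used only to exercise the gate.
def biased_stub(text: str) -> int:
    return 3 if "female" in text else 2


def invariant_stub(text: str) -> int:
    return 3


demo_pairs = [("54 year old male, chest pain", "54 year old female, chest pain")]
biased_report = counterfactual_gate(biased_stub, demo_pairs)
invariant_report = counterfactual_gate(invariant_stub, demo_pairs)
```

In a deployment pipeline, a failed report would block the model update or route it to manual review; returning the flipped pairs rather than only a rate keeps the check auditable.
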

Limitations & Future Work

The audit is limited to gender; other protected attributes (race, socioeconomic status) remain unexamined.
MIMIC‑IV‑ED reflects a single health system and historical data; real‑world deployment may encounter different documentation styles and patient demographics.
The study focuses on ESI assignment; downstream clinical decisions (e.g., resource allocation, physician ordering) were not evaluated.
Prompt variations explored are a small subset of possible designs; more sophisticated context‑preserving or multimodal prompts could behave differently.
Future research should expand to multilingual settings, incorporate continuous monitoring post‑deployment, and explore training‑time interventions (e.g., bias‑aware fine‑tuning) to reduce reliance on prompt‑level fixes.

Authors

Richard J. Young
Alice M. Matthews

Paper Information

arXiv ID: 2605.03998v1
Categories: cs.CL, cs.CY
Published: May 5, 2026
