[Paper] Safety and accuracy follow different scaling laws in clinical large language models

Published: May 5, 2026 at 01:57 PM EDT

Source: arXiv - 2605.04039v1

Overview

The paper investigates how safety and accuracy evolve as clinical large language models (LLMs) are scaled up. By introducing a systematic evaluation framework (SaFE‑Scale) and a radiology‑focused benchmark (RadSaFE‑200), the authors show that bigger models or more compute do not automatically yield safer behavior; deployment choices such as evidence quality and retrieval strategy matter far more.

Key Contributions

SaFE‑Scale framework: a reproducible methodology to assess safety across model size, context length, retrieval complexity, and inference‑time compute.
RadSaFE‑200 benchmark: 200 radiology multiple‑choice questions annotated with clean evidence, conflicting evidence, and fine‑grained safety labels (high‑risk error, unsafe answer, evidence contradiction).
Comprehensive empirical study: evaluation of 34 locally‑deployed LLMs under six deployment conditions (zero‑shot, clean evidence, conflict evidence, standard RAG, agentic RAG, max‑context prompting).
Empirical finding: relative to closed‑book zero‑shot prompting, high‑quality (clean) evidence dramatically improves both accuracy (↑ 20.6 pp) and safety metrics (high‑risk errors ↓ 9.4 pp, contradictions ↓ 10.4 pp, dangerous overconfidence ↓ 6.4 pp).
Insight on retrieval designs: standard RAG and agentic RAG do not inherit the safety gains of clean evidence; agentic RAG reduces contradictions but leaves high‑risk errors high.
Latency vs. safety trade‑off: max‑context prompting inflates inference latency without closing the safety gap.
Worst‑case analysis: clinically consequential failures cluster in a small subset of questions, highlighting the need for targeted safeguards.

Methodology

Benchmark construction (RadSaFE‑200)
- Curated 200 radiology MCQs from board‑style exams.
- For each question, clinicians supplied:
  - Clean evidence – unambiguous, high‑quality references.
  - Conflict evidence – sources that deliberately contradict the correct answer.
- Each answer option was labeled for three safety dimensions:
  - High‑risk error (potentially harmful misdiagnosis).
  - Unsafe answer (overconfident but wrong).
  - Evidence contradiction (answer conflicts with supplied evidence).
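The item structure described above can be sketched as a small data model. This is a minimal illustration only: the class and field names (`BenchmarkItem`, `AnswerOption`, `SafetyLabel`) are hypothetical and are not taken from the paper or its released data format.

```python
from dataclasses import dataclass, field
from enum import Enum


class SafetyLabel(Enum):
    """The three safety dimensions named in the text (names are assumptions)."""
    HIGH_RISK_ERROR = "high_risk_error"
    UNSAFE_ANSWER = "unsafe_answer"
    EVIDENCE_CONTRADICTION = "evidence_contradiction"


@dataclass
class AnswerOption:
    key: str                      # option letter, e.g. "A"
    text: str
    is_correct: bool = False
    labels: set[SafetyLabel] = field(default_factory=set)


@dataclass
class BenchmarkItem:
    question: str
    options: list[AnswerOption]
    clean_evidence: str           # unambiguous supporting reference
    conflict_evidence: str        # reference that contradicts the correct answer

    def correct_key(self) -> str:
        """Return the key of the single correct option, or raise if malformed."""
        keys = [o.key for o in self.options if o.is_correct]
        if len(keys) != 1:
            raise ValueError("each item needs exactly one correct option")
        return keys[0]

    def labels_for(self, key: str) -> set[SafetyLabel]:
        """Safety labels attached to the option a model selected."""
        for o in self.options:
            if o.key == key:
                return o.labels
        raise KeyError(key)
```

Attaching labels to options rather than to questions means a model's selected letter can be scored for safety with a single lookup.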
Model pool
- 34 LLMs (parameter counts from ~300 M to >10 B) fine‑tuned on medical text and deployed locally.
Deployment conditions
- Closed‑book zero‑shot: plain prompt, no external context.
- Clean evidence: prompt includes clean references.
- Conflict evidence: prompt includes contradictory references.
- Standard RAG: retrieve top‑k passages from a generic medical corpus.
- Agentic RAG: retrieval guided by a “reasoning agent” that selects evidence iteratively.
- Max‑context prompting: feed the model the entire retrieved set (longest possible context).
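As a rough sketch of how the first three conditions differ only in what is placed in the prompt, the helper below builds a multiple‑choice prompt with no evidence, clean evidence, or conflict evidence. The function name `build_prompt`, the condition strings, and the prompt wording are all assumptions for illustration; the paper's actual templates are not given in this summary, and the retrieval‑based conditions are left out.

```python
from typing import Optional


def build_prompt(
    question: str,
    options: dict[str, str],
    condition: str,
    clean_evidence: Optional[str] = None,
    conflict_evidence: Optional[str] = None,
) -> str:
    """Assemble a multiple-choice prompt for one of three evidence conditions."""
    evidence_by_condition = {
        "zero_shot": None,                       # closed-book: no context
        "clean_evidence": clean_evidence,
        "conflict_evidence": conflict_evidence,
    }
    if condition not in evidence_by_condition:
        raise ValueError(f"unknown condition: {condition}")
    evidence = evidence_by_condition[condition]
    if condition != "zero_shot" and not evidence:
        raise ValueError(f"{condition} requires evidence text")

    parts = []
    if evidence:
        parts.append(f"Evidence:\n{evidence}")
    parts.append(f"Question: {question}")
    # Sort by option letter so the layout is identical across conditions.
    parts.extend(f"{key}. {text}" for key, text in sorted(options.items()))
    parts.append("Answer with a single option letter.")
    return "\n".join(parts)
```

Keeping the question and option layout fixed across conditions means any change in model behavior can be attributed to the evidence block alone.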
Metrics
- Accuracy (correct answer selection).
- High‑risk error rate.
- Unsafe overconfidence (model expresses high confidence in a wrong answer).
- Evidence contradiction rate.
- Latency (time per inference).
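The metrics above can be computed from per‑question records along the following lines. This is a sketch under stated assumptions: the `Prediction` fields are hypothetical, and the 0.8 confidence threshold for "high confidence" is an arbitrary placeholder, since the summary does not say how the paper operationalizes overconfidence.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    correct: bool
    high_risk: bool              # selected option carries the high-risk label
    contradicts_evidence: bool   # selected option conflicts with supplied evidence
    confidence: float            # model-reported confidence in [0, 1]
    latency_s: float             # wall-clock seconds for this inference


def summarize(preds: list[Prediction], confidence_threshold: float = 0.8) -> dict[str, float]:
    """Aggregate per-question predictions into percentage rates and mean latency."""
    n = len(preds)
    if n == 0:
        raise ValueError("no predictions to summarize")

    def rate(flags) -> float:
        return 100.0 * sum(flags) / n

    return {
        "accuracy": rate(p.correct for p in preds),
        "high_risk_error": rate(p.high_risk for p in preds),
        "contradiction": rate(p.contradicts_evidence for p in preds),
        # Overconfidence: wrong answer given with confidence at or above the threshold.
        "overconfidence": rate(
            (not p.correct) and p.confidence >= confidence_threshold for p in preds
        ),
        "mean_latency_s": sum(p.latency_s for p in preds) / n,
    }
```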
Analysis
- Correlated model size and compute with each metric.
- Conducted worst‑case analysis to identify questions that consistently trigger safety failures.
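One plausible way to relate model size to a metric is a rank correlation, which does not assume a linear relationship. The summary does not state which correlation measure the authors used, so the Spearman implementation below is an assumption, written in plain Python with tie handling by average ranks.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(param_counts, high_risk_rates)` with one entry per model would give a value near zero if safety does not track size, even when the same call on accuracy gives a clearly positive value.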

Results & Findings

Deployment	Accuracy	High‑risk error	Contradiction	Dangerous overconfidence
Closed‑book (zero‑shot)	73.5 %	12.0 %	12.7 %	8.0 %
Clean evidence	94.1 %	2.6 %	2.3 %	1.6 %
Conflict evidence	78.3 %	10.5 %	11.9 %	6.9 %
Standard RAG	84.2 %	9.8 %	9.1 %	5.4 %
Agentic RAG	88.7 %	8.9 %	4.2 %	5.1 %
Max‑context	86.5 %	9.2 %	8.5 %	5.0 %
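To make the comparison against the closed‑book baseline explicit, the short script below recomputes percentage‑point differences directly from the table. The numbers are copied from the rows above; the dictionary layout and the function name `deltas_vs_baseline` are just illustrative choices.

```python
# Columns: accuracy, high-risk error, contradiction, dangerous overconfidence (all in %).
RESULTS = {
    "Closed-book (zero-shot)": (73.5, 12.0, 12.7, 8.0),
    "Clean evidence": (94.1, 2.6, 2.3, 1.6),
    "Conflict evidence": (78.3, 10.5, 11.9, 6.9),
    "Standard RAG": (84.2, 9.8, 9.1, 5.4),
    "Agentic RAG": (88.7, 8.9, 4.2, 5.1),
    "Max-context": (86.5, 9.2, 8.5, 5.0),
}


def deltas_vs_baseline(results, baseline="Closed-book (zero-shot)"):
    """Percentage-point change of every deployment relative to the baseline row."""
    base = results[baseline]
    return {
        name: tuple(round(v - b, 1) for v, b in zip(vals, base))
        for name, vals in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, delta in deltas_vs_baseline(RESULTS).items():
        print(f"{name:20s} " + "  ".join(f"{d:+.1f}" for d in delta))
```

For the clean‑evidence row this yields +20.6 pp accuracy and −9.4, −10.4, and −6.4 pp on the three safety metrics; for agentic RAG it yields +15.2, −3.1, −8.5, and −2.9 pp.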

Key takeaways

Evidence quality trumps scale – clean, curated references yield the biggest safety jump, even for the smallest models.
Scaling alone yields diminishing returns – larger models improve accuracy modestly but do not close safety gaps.
Agentic RAG improves accuracy and cuts contradictions relative to standard RAG (4.2 % vs. 9.1 %) but barely moves high‑risk errors (8.9 % vs. 9.8 %), suggesting that reasoning agents need better risk‑awareness.
Latency grows linearly with context length, yet safety does not improve proportionally.
Failure clustering – ~15 % of the questions account for >70 % of high‑risk errors, indicating that failures concentrate in a small set of hard cases rather than spreading evenly across the benchmark.
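A concentration figure of this kind can be checked with a few lines of code: count high‑risk errors per question (summed over models), sort, and measure what share the worst fraction of questions contributes. The function name `top_share` and the toy counts in the usage note are illustrative assumptions, not data from the paper.

```python
import math


def top_share(errors_per_question: list[int], top_fraction: float = 0.15) -> float:
    """Share of all errors contributed by the worst `top_fraction` of questions."""
    if not errors_per_question:
        raise ValueError("need at least one question")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total = sum(errors_per_question)
    if total == 0:
        return 0.0                      # no errors at all, so nothing to concentrate
    # Number of questions in the "worst" group, at least one.
    k = max(1, math.ceil(top_fraction * len(errors_per_question)))
    worst = sorted(errors_per_question, reverse=True)[:k]
    return sum(worst) / total
```

For example, with toy counts `[10, 8, 1, 1, 0, 0, 0, 0, 0, 0]` and `top_fraction=0.2`, the two worst questions hold 18 of 20 errors, so the function returns 0.9.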

Practical Implications

Design‑by‑evidence: Deployments should prioritize feeding clean, clinician‑validated evidence rather than relying on raw model size or longer contexts.
RAG pipelines need safety filters: Simple retrieval augmentation is insufficient; post‑retrieval verification (e.g., contradiction detection) is essential.
Risk‑aware agents: When using agentic RAG, incorporate safety‑oriented reward signals (penalize high‑risk errors) to align the agent’s selection strategy with clinical safety.
Monitoring & targeted testing: Since failures concentrate on a small subset of queries, continuous monitoring of those “high‑risk” question patterns can catch regressions early.
Latency budgeting: Max‑context prompting is not a silver bullet; developers should balance response time against marginal safety gains, possibly using adaptive context windows.
Regulatory readiness: The SaFE‑Scale methodology offers a concrete, auditable safety benchmark that could satisfy emerging medical‑AI regulatory requirements.

Limitations & Future Work

Domain scope: The benchmark focuses on radiology; safety dynamics may differ in other specialties (e.g., pathology, primary care).
Static evidence: Clean evidence was pre‑selected by clinicians; real‑world systems must retrieve or generate such evidence on‑the‑fly, which introduces additional error sources.
Model diversity: All evaluated models were locally hosted; cloud‑based APIs with proprietary safety layers were not examined.
Safety dimensions: The study concentrates on high‑risk errors, contradictions, and overconfidence; other harms (e.g., privacy leakage, bias) remain unaddressed.
Future directions: Extending SaFE‑Scale to multi‑modal inputs (images + text), integrating automated evidence‑quality scoring, and exploring reinforcement‑learning‑from‑human‑feedback (RLHF) specifically tuned for clinical safety.

Authors

Sebastian Wind
Tri‑Thien Nguyen
Jeta Sopa
Mahshad Lotfinia
Sebastian Bickelhaupt
Michael Uder
Harald Köstler
Gerhard Wellein
Sven Nebelung
Daniel Truhn
Andreas Maier
Soroosh Tayebi Arasteh

Paper Information

arXiv ID: 2605.04039v1
Categories: cs.CL, cs.AI, cs.LG
Published: May 5, 2026
