[Paper] Safety and accuracy follow different scaling laws in clinical large language models
Source: arXiv - 2605.04039v1
Overview
The paper investigates how safety and accuracy evolve as clinical large language models (LLMs) are scaled up. By introducing a systematic evaluation framework (SaFE‑Scale) and a radiology‑focused benchmark (RadSaFE‑200), the authors show that bigger models or more compute do not automatically yield safer behavior—deployment choices such as evidence quality and retrieval strategy matter far more.
Key Contributions
- SaFE‑Scale framework: a reproducible methodology to assess safety across model size, context length, retrieval complexity, and inference‑time compute.
- RadSaFE‑200 benchmark: 200 radiology multiple‑choice questions annotated with clean evidence, conflicting evidence, and fine‑grained safety labels (high‑risk error, unsafe answer, evidence contradiction).
- Comprehensive empirical study: evaluation of 34 locally‑deployed LLMs under six deployment conditions (zero‑shot, clean evidence, conflict evidence, standard RAG, agentic RAG, max‑context prompting).
- Empirical finding: relative to closed‑book zero‑shot, high‑quality (clean) evidence dramatically improves both accuracy (↑ 20.6 pp, from 73.5 % to 94.1 %) and safety metrics (high‑risk errors ↓ 9.4 pp, contradictions ↓ 10.4 pp, dangerous overconfidence ↓ 6.4 pp).
- Insight on retrieval designs: standard RAG and agentic RAG do not inherit the safety gains of clean evidence; agentic RAG reduces contradictions but leaves high‑risk errors high.
- Latency vs. safety trade‑off: max‑context prompting inflates inference latency without closing the safety gap.
- Worst‑case analysis: clinically consequential failures cluster in a small subset of questions, highlighting the need for targeted safeguards.
Methodology
- Benchmark construction (RadSaFE‑200)
  - Curated 200 radiology MCQs from board‑style exams.
  - For each question, clinicians supplied:
    - Clean evidence – unambiguous, high‑quality references.
    - Conflict evidence – sources that deliberately contradict the correct answer.
  - Each answer option was labeled for three safety dimensions:
    - High‑risk error (potentially harmful misdiagnosis).
    - Unsafe answer (overconfident but wrong).
    - Evidence contradiction (answer conflicts with supplied evidence).
- Model pool
  - 34 LLMs (parameter counts from ~300 M to >10 B) fine‑tuned on medical text and deployed locally.
- Deployment conditions
  - Closed‑book zero‑shot: plain prompt, no external context.
  - Clean evidence: prompt includes clean references.
  - Conflict evidence: prompt includes contradictory references.
  - Standard RAG: retrieve top‑k passages from a generic medical corpus.
  - Agentic RAG: retrieval guided by a “reasoning agent” that selects evidence iteratively.
  - Max‑context prompting: feed the model the entire retrieved set (longest possible context).
- Metrics
  - Accuracy (correct answer selection).
  - High‑risk error rate.
  - Unsafe overconfidence (model expresses high confidence in a wrong answer).
  - Evidence contradiction rate.
  - Latency (time per inference).
- Analysis
  - Correlated model size and compute with each metric.
  - Conducted worst‑case analysis to identify questions that consistently trigger safety failures.
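The four answer‑level metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scoring code: the `Prediction` schema, the field names, and the 0.8 confidence threshold are all assumptions made for the example, since the summary does not specify how confidence is elicited or thresholded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One model answer to one benchmark question (hypothetical schema)."""
    question_id: str
    chosen: str                      # option the model selected, e.g. "B"
    correct: str                     # gold option
    confidence: float                # model-reported confidence in [0, 1]
    high_risk_options: frozenset     # options annotated as high-risk errors
    contradicted_options: frozenset  # options that conflict with supplied evidence


def safety_metrics(preds, confidence_threshold=0.8):
    """Return each metric as a fraction of all predictions.

    The threshold is an assumed value, not one taken from the paper.
    """
    n = len(preds)
    if n == 0:
        raise ValueError("no predictions to score")
    correct = sum(p.chosen == p.correct for p in preds)
    high_risk = sum(p.chosen in p.high_risk_options for p in preds)
    contradiction = sum(p.chosen in p.contradicted_options for p in preds)
    # Overconfidence counts only answers that are both wrong and confident.
    overconfident = sum(
        p.chosen != p.correct and p.confidence >= confidence_threshold
        for p in preds
    )
    return {
        "accuracy": correct / n,
        "high_risk_error_rate": high_risk / n,
        "contradiction_rate": contradiction / n,
        "overconfidence_rate": overconfident / n,
    }
```

Running `safety_metrics` once per deployment condition would yield one row of a results table; latency would have to be measured separately around the inference call.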
Results & Findings
| Deployment | Accuracy | High‑risk error | Contradiction | Dangerous overconfidence |
|---|---|---|---|---|
| Closed‑book (zero‑shot) | 73.5 % | 12.0 % | 12.7 % | 8.0 % |
| Clean evidence | 94.1 % | 2.6 % | 2.3 % | 1.6 % |
| Conflict evidence | 78.3 % | 10.5 % | 11.9 % | 6.9 % |
| Standard RAG | 84.2 % | 9.8 % | 9.1 % | 5.4 % |
| Agentic RAG | 88.7 % | 8.9 % | 4.2 % | 5.1 % |
| Max‑context | 86.5 % | 9.2 % | 8.5 % | 5.0 % |
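To make the comparisons in the table easier to check, the short sketch below recomputes percentage‑point differences between any two deployment conditions. The numbers are copied from the table; the dictionary keys and the `delta_pp` helper are names chosen for this example only.

```python
# Values in percent, transcribed from the results table above.
RESULTS = {
    "closed_book":    {"accuracy": 73.5, "high_risk": 12.0, "contradiction": 12.7, "overconfidence": 8.0},
    "clean_evidence": {"accuracy": 94.1, "high_risk": 2.6,  "contradiction": 2.3,  "overconfidence": 1.6},
    "conflict":       {"accuracy": 78.3, "high_risk": 10.5, "contradiction": 11.9, "overconfidence": 6.9},
    "standard_rag":   {"accuracy": 84.2, "high_risk": 9.8,  "contradiction": 9.1,  "overconfidence": 5.4},
    "agentic_rag":    {"accuracy": 88.7, "high_risk": 8.9,  "contradiction": 4.2,  "overconfidence": 5.1},
    "max_context":    {"accuracy": 86.5, "high_risk": 9.2,  "contradiction": 8.5,  "overconfidence": 5.0},
}


def delta_pp(condition, baseline="closed_book"):
    """Percentage-point change of each metric relative to a baseline condition."""
    return {
        metric: round(RESULTS[condition][metric] - RESULTS[baseline][metric], 1)
        for metric in RESULTS[condition]
    }
```

For example, `delta_pp("clean_evidence")` gives +20.6 pp accuracy and −9.4, −10.4, and −6.4 pp on the three safety metrics, while `delta_pp("agentic_rag", baseline="standard_rag")` shows a −4.9 pp change in contradictions but only −0.9 pp in high‑risk errors.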
Key takeaways
- Evidence quality trumps scale – clean, curated references yield the biggest safety jump, even for the smallest models.
- Scaling alone yields diminishing returns – larger models improve accuracy modestly but do not close safety gaps.
- Agentic RAG improves accuracy and sharply reduces contradictions (4.2 % vs. 9.1 % for standard RAG) but barely moves high‑risk errors (8.9 % vs. 9.8 %), suggesting that reasoning agents need better risk‑awareness.
- Latency grows linearly with context length, yet safety does not improve proportionally.
- Failure clustering – ~15 % of the questions account for >70 % of high‑risk errors, indicating a “long‑tail” of hard cases.
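The failure‑clustering takeaway is a concentration statistic: what share of all high‑risk errors comes from the most error‑prone fraction of questions. One way to compute it from a log of errors is sketched below. The function name, its arguments, and the input format (one question ID per observed high‑risk error, pooled across models and conditions) are assumptions for illustration; the paper's exact procedure is not described in this summary.

```python
from collections import Counter


def error_concentration(error_question_ids, n_questions, top_fraction=0.15):
    """Share of errors attributable to the top `top_fraction` of questions.

    error_question_ids: iterable with one question ID per observed error.
    n_questions: total number of questions in the benchmark.
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    counts = Counter(error_question_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Number of questions in the "top" group, at least one.
    k = max(1, round(n_questions * top_fraction))
    top = counts.most_common(k)
    return sum(count for _, count in top) / total
```

A value above 0.7 at `top_fraction=0.15` on a 200‑question benchmark would correspond to the pattern reported above. The same counter also identifies which questions to prioritize for targeted safeguards.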
Practical Implications
- Design‑by‑evidence: Deployments should prioritize feeding clean, clinician‑validated evidence rather than relying on raw model size or longer contexts.
- RAG pipelines need safety filters: Simple retrieval augmentation is insufficient; post‑retrieval verification (e.g., contradiction detection) is essential.
- Risk‑aware agents: When using agentic RAG, incorporate safety‑oriented reward signals (penalize high‑risk errors) to align the agent’s selection strategy with clinical safety.
- Monitoring & targeted testing: Since failures concentrate on a small subset of queries, continuous monitoring of those “high‑risk” question patterns can catch regressions early.
- Latency budgeting: Max‑context prompting is not a silver bullet; developers should balance response time against marginal safety gains, possibly using adaptive context windows.
- Regulatory readiness: The SaFE‑Scale methodology offers a concrete, auditable safety benchmark that could satisfy emerging medical‑AI regulatory requirements.
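As a rough illustration of the post‑retrieval verification idea, the sketch below gates answering on whether any two retrieved passages contradict each other. It is a design sketch under stated assumptions, not a method from the paper: `evidence_gate` is a hypothetical helper, and the `contradicts` callable stands in for whatever contradiction detector a deployment actually uses (for example an NLI model), which is not specified here.

```python
from itertools import combinations
from typing import Callable, Sequence


def evidence_gate(
    passages: Sequence[str],
    contradicts: Callable[[str, str], bool],
) -> dict:
    """Check every pair of retrieved passages for conflicts.

    Returns the index pairs that conflict and a flag indicating whether
    the evidence set is internally consistent. `contradicts` is a
    caller-supplied, hypothetical detector.
    """
    conflicts = [
        (i, j)
        for i, j in combinations(range(len(passages)), 2)
        if contradicts(passages[i], passages[j])
    ]
    return {"safe_to_answer": not conflicts, "conflicts": conflicts}
```

When `safe_to_answer` is false, a pipeline could drop the conflicting passages, abstain, or escalate to a clinician. The pairwise check is quadratic in the number of passages, which is acceptable for small top‑k sets but would need pruning for max‑context style prompts.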
Limitations & Future Work
- Domain scope: The benchmark focuses on radiology; safety dynamics may differ in other specialties (e.g., pathology, primary care).
- Static evidence: Clean evidence was pre‑selected by clinicians; real‑world systems must retrieve or generate such evidence on‑the‑fly, which introduces additional error sources.
- Model diversity: All evaluated models were locally hosted; cloud‑based APIs with proprietary safety layers were not examined.
- Safety dimensions: The study concentrates on high‑risk errors, contradictions, and overconfidence; other harms (e.g., privacy leakage, bias) remain unaddressed.
- Future directions: Extending SaFE‑Scale to multi‑modal inputs (images + text), integrating automated evidence‑quality scoring, and exploring reinforcement‑learning‑from‑human‑feedback (RLHF) specifically tuned for clinical safety.
Authors
- Sebastian Wind
- Tri‑Thien Nguyen
- Jeta Sopa
- Mahshad Lotfinia
- Sebastian Bickelhaupt
- Michael Uder
- Harald Köstler
- Gerhard Wellein
- Sven Nebelung
- Daniel Truhn
- Andreas Maier
- Soroosh Tayebi Arasteh
Paper Information
- arXiv ID: 2605.04039v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: May 5, 2026