[Paper] Training and Evaluation of Guideline-Based Medical Reasoning in LLMs
Source: arXiv - 2512.03838v1
Overview
This paper tackles a gap in medical AI: while large language models (LLMs) can predict outcomes like sepsis, they often do so without transparent, guideline‑driven reasoning that clinicians trust. The authors show how to fine‑tune LLMs on verbalized consensus guidelines (e.g., the Sepsis‑3 definition) so the models can explain each step of their decision‑making and be automatically evaluated for both logical correctness and prediction accuracy.
Key Contributions
- Guideline‑based fine‑tuning: Introduced a pipeline that converts clinical consensus rules into natural‑language “reasoning traces” and uses them to fine‑tune LLMs (a minimal sketch follows this list).
- Dual‑level evaluation: Defined two metrics—derivation correctness (does the model follow the rule logic?) and value correctness (how close are the predicted clinical values to reality?).
- Empirical advantage of small models: Demonstrated that modest‑sized, fine‑tuned models outperform much larger, one‑shot prompted LLMs on guideline adherence.
- Multimodal integration: Combined LLM reasoning with a time‑series forecasting model to improve predictions for sparsely sampled clinical variables.
- Generalization insight: Showed that once a model learns a guideline, the main challenge shifts from out‑of‑distribution reasoning to forecasting future clinical measurements.
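To make the first contribution concrete, here is a minimal sketch of rule verbalization. It is illustrative only: the `Snapshot` fields, the threshold, and the prompt/target format are assumptions rather than the authors' exact schema, but it shows how a single Sepsis‑3‑style conditional can be turned into a premise‑conclusion reasoning trace that doubles as a fine‑tuning example.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """Hypothetical EHR snapshot with the fields a Sepsis-3-style rule reads."""
    lactate_mmol_l: float      # serum lactate
    on_vasopressors: bool      # vasopressor requirement despite fluid resuscitation

def verbalize_septic_shock_rule(s: Snapshot) -> dict:
    """Turn one rule application into a natural-language reasoning trace.

    The trace states the premises (observed values), the rule, and the
    conclusion, so a model fine-tuned on it learns to reproduce each step.
    """
    premise = (
        f"Lactate is {s.lactate_mmol_l:.1f} mmol/L and vasopressor use is "
        f"{'present' if s.on_vasopressors else 'absent'}."
    )
    rule = ("Rule: if lactate > 2 mmol/L and vasopressors are required, "
            "then suspect septic shock.")
    label = s.lactate_mmol_l > 2.0 and s.on_vasopressors
    conclusion = ("Conclusion: septic shock is suspected." if label
                  else "Conclusion: septic shock is not suspected.")
    # Prompt/target pair in the style of supervised fine-tuning data.
    return {
        "prompt": f"{premise}\nApply the septic shock criterion step by step.",
        "target": f"{rule}\n{conclusion}",
        "label": label,
    }

example = verbalize_septic_shock_rule(Snapshot(lactate_mmol_l=3.4, on_vasopressors=True))
print(example["target"])
```

Instantiating a function like this over many EHR snapshots yields the kind of reasoning‑trace dataset described in the Methodology section below.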
Methodology
- Rule verbalization: The authors took the Sepsis‑3 consensus definition, a set of conditional statements about vital signs, lab results, and organ dysfunction, and rewrote each rule as a natural‑language premise‑conclusion pair (e.g., “If lactate > 2 mmol/L and vasopressors are required, then suspect septic shock”).
- Dataset creation: They instantiated these verbalized rules on real electronic health record (EHR) snapshots, generating thousands of reasoning traces that include both the rule application and the resulting clinical label.
- Fine‑tuning: A base LLM (e.g., LLaMA‑7B) was fine‑tuned on this synthetic‑plus‑real data, teaching it to reproduce the step‑by‑step reasoning and output the final diagnosis.
- Evaluation framework (a toy scoring sketch follows this list):
  - Derivation correctness is measured by checking whether the model’s intermediate steps match the ground‑truth rule chain.
  - Value correctness compares the model’s predicted numeric values (e.g., the SOFA score) against the actual measurements in the EHR.
- Multimodal extension: A separate time‑series forecaster predicts missing future vitals; its hidden representations are fed into the LLM, allowing the language model to reason with both current and forecasted data.
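The two‑level evaluation can be illustrated with a toy scorer. The exact‑match step comparison and the step names below are assumptions made for illustration; the paper specifies only that intermediate steps are checked against the ground‑truth rule chain and that predicted values are scored against EHR measurements (reported as AUROC in the results).

```python
from sklearn.metrics import roc_auc_score

def derivation_correctness(predicted_steps, gold_steps):
    """Fraction of traces whose intermediate steps exactly match the gold rule chain.

    predicted_steps / gold_steps: lists of step sequences, one per patient.
    Exact string matching is an assumption; a looser step parser could be used instead.
    """
    matches = sum(pred == gold for pred, gold in zip(predicted_steps, gold_steps))
    return matches / len(gold_steps)

def value_correctness_auroc(true_labels, predicted_scores):
    """AUROC of the model's predicted values/probabilities against EHR ground truth."""
    return roc_auc_score(true_labels, predicted_scores)

# Toy usage with three hypothetical patients.
gold = [["sofa_delta>=2", "suspected_infection"], ["sofa_delta<2"], ["sofa_delta>=2"]]
pred = [["sofa_delta>=2", "suspected_infection"], ["sofa_delta<2"], ["sofa_delta<2"]]
print(derivation_correctness(pred, gold))                 # 2/3 of rule chains reproduced
print(value_correctness_auroc([1, 0, 1], [0.9, 0.2, 0.4]))  # 1.0 on this toy example
```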
Results & Findings
| Model | Size | Derivation Correctness | Value Correctness (AUROC) |
|---|---|---|---|
| Fine‑tuned LLaMA‑7B (rule data) | 7 B | ≈ 99 % on unseen patients | 0.88 |
| Prompted GPT‑4 (one‑shot) | undisclosed | 71 % | 0.81 |
| Baseline fine‑tuned on medical text only | 7 B | 84 % | 0.79 |
- Rule adherence: Small fine‑tuned models nearly perfectly replicate the Sepsis‑3 logic on patients they have never seen.
- Prediction quality: Despite being smaller, they achieve higher AUROC than massive, prompt‑only models.
- Forecasting boost: Adding the time‑series forecaster lifts AUROC by ~0.03 and reduces the number of missed early sepsis cases.
- Bottleneck shift: Once the reasoning is trustworthy, the limiting factor becomes accurate forecasting of irregularly sampled clinical variables, not the model’s ability to apply guidelines.
Practical Implications
- Explainable AI for clinicians: Deployable LLMs can now output a human‑readable chain of guideline‑based reasoning, making it easier for doctors to validate and act on AI suggestions.
- Cost‑effective deployment: Organizations can achieve high‑quality, trustworthy predictions with relatively small models, reducing compute costs and latency compared to using giant LLM APIs.
- Rapid adaptation to new guidelines: By simply verbalizing updated consensus statements, the same fine‑tuning pipeline can keep AI systems in sync with evolving medical standards.
- Multimodal pipelines: Integrating a lightweight forecasting model (e.g., a Temporal Convolutional Network) with an LLM provides a practical architecture for real‑time monitoring systems in ICUs or emergency departments (see the sketch after this list).
- Regulatory friendliness: Transparent derivation correctness aligns with emerging AI‑in‑health regulations that demand traceable decision logic.
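As a rough illustration of such a multimodal pipeline, the sketch below projects a forecaster’s hidden states into the LLM’s embedding space as extra “virtual tokens”. The dimensions and the linear‑projection fusion are assumptions chosen for the example; the paper states only that the forecaster’s hidden representations are fed into the LLM.

```python
import torch
import torch.nn as nn

class ForecasterToLLMAdapter(nn.Module):
    """Project a time-series forecaster's hidden states into the LLM embedding space.

    Assumed sizes: the forecaster (e.g., a small TCN or GRU) emits 128-dim states,
    the LLM uses 4096-dim token embeddings. The projected states are prepended to
    the token embeddings as "virtual tokens" carrying forecasted vitals.
    """
    def __init__(self, forecaster_dim: int = 128, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(forecaster_dim, llm_dim)

    def forward(self, forecaster_hidden: torch.Tensor,
                token_embeddings: torch.Tensor) -> torch.Tensor:
        # forecaster_hidden: (batch, n_forecast_steps, forecaster_dim)
        # token_embeddings:  (batch, seq_len, llm_dim) from the LLM's embedding layer
        virtual_tokens = self.proj(forecaster_hidden)
        return torch.cat([virtual_tokens, token_embeddings], dim=1)

# Toy shapes: 2 patients, 6 forecast steps, a 10-token prompt.
adapter = ForecasterToLLMAdapter()
fused = adapter(torch.randn(2, 6, 128), torch.randn(2, 10, 4096))
print(fused.shape)  # torch.Size([2, 16, 4096])
```

The fused sequence would then be passed to the LLM in place of its ordinary input embeddings, letting it reason over both current and forecasted measurements.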
Limitations & Future Work
- Guideline coverage: The study focuses on Sepsis‑3; extending to other specialties will require substantial rule verbalization effort.
- Data quality: Verbalized rule instances depend on accurate EHR extraction; noisy or missing fields can degrade fine‑tuning.
- Temporal generalization: Forecasting irregular clinical variables remains a challenge; more sophisticated time‑series models or data imputation strategies are needed.
- Human‑in‑the‑loop validation: The paper evaluates logical correctness automatically, but real‑world adoption will require clinician studies to confirm trust and usability.
- Scalability to multimodal data: Future work could explore richer modalities (e.g., imaging, waveforms) alongside textual guidelines for a truly holistic clinical AI.
Authors
- Michael Staniek
- Artem Sokolov
- Stefan Riezler
Paper Information
- arXiv ID: 2512.03838v1
- Categories: cs.CL
- Published: December 3, 2025