[Paper] Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent

Published: December 23, 2025
4 min read

Source: arXiv - 2512.20586v1

Overview

The paper introduces SAGE (Secure Agent for Generative Dose Expertise), a large‑language‑model (LLM)‑driven system that automatically creates stereotactic radiosurgery (SRS) treatment plans for brain metastases. By embedding chain‑of‑thought (CoT) reasoning into the model, the authors show that the system can match human planners in dosimetric quality while producing a transparent, auditable “thinking trace” that addresses clinicians’ trust concerns about black‑box AI.

Key Contributions

  • Human‑in‑the‑loop LLM agent: First SRS planning tool that couples an LLM with explicit reasoning steps (constraint checks, trade‑off deliberations).
  • Two model variants: A baseline “non‑reasoning” LLM and a CoT‑enabled “reasoning” LLM, allowing a direct head‑to‑head performance comparison.
  • Dosimetric parity: The reasoning variant achieved statistically indistinguishable coverage, maximum dose, conformity index, and gradient index compared with expert human plans.
  • Improved organ‑at‑risk (OAR) sparing: Notably reduced cochlear dose (p = 0.022) relative to human baselines.
  • Auditable planning logs: The reasoning model generated 457 constraint‑verification events and 609 trade‑off deliberations per case, providing a traceable decision‑making record.
  • Retrospective validation: Tested on a cohort of 41 patients with single‑fraction 18 Gy SRS, demonstrating feasibility in a realistic clinical dataset.

Methodology

  1. Data collection – 41 de‑identified brain‑metastasis cases previously treated with manual SRS plans were assembled, including target volumes (PTV) and critical‑structure contours.
  2. Prompt engineering – Two prompt templates were crafted:
    • Non‑reasoning: Directly asks the LLM to output dose‑distribution parameters.
    • Reasoning: Inserts a chain‑of‑thought scaffold that forces the model to (a) list all relevant constraints, (b) verify each against the current draft plan, and (c) explicitly discuss trade‑offs before finalizing values; a minimal prompt sketch follows this list.
  3. LLM backbone – Both variants used the same underlying large language model (e.g., GPT‑4‑style), differing only in the reasoning prompt.
  4. Plan synthesis – The LLM outputs a set of optimization objectives (e.g., dose limits, weighting factors) that are fed into a conventional treatment‑planning optimizer (the same engine used by human planners).
  5. Evaluation metrics – Standard SRS dosimetric endpoints were measured: PTV coverage (V100%), maximum dose (Dmax), conformity index (CI), gradient index (GI), and OAR doses (e.g., cochlea, optic apparatus). Statistical significance was assessed with paired t‑tests.
  6. Content analysis – Generated logs were parsed to count occurrences of constraint verification and causal explanations, comparing the two model variants.
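
The paper does not publish its prompt templates, so the sketch below is only a rough illustration of steps 2–4, assuming a chat-style LLM client: the prompt wording, the `call_llm` helper, and the JSON objective schema are hypothetical, not the authors' actual interface.

```python
import json

# Hypothetical prompt templates mirroring the two variants described above.
NON_REASONING_PROMPT = """You are an SRS planning assistant.
Given the case summary below, output a JSON object of optimization
objectives (structure name, dose limit in Gy, priority weight).
Case: {case_summary}
"""

REASONING_PROMPT = """You are an SRS planning assistant.
Given the case summary below, reason step by step:
1. List every dosimetric constraint relevant to this case
   (PTV coverage, Dmax, conformity, gradient, each OAR limit).
2. Verify each constraint against your current draft objectives.
3. Discuss trade-offs explicitly (e.g., conformity vs. OAR sparing)
   before committing to final values.
Then output a JSON object of optimization objectives.
Case: {case_summary}
"""

def propose_objectives(case_summary: str, reasoning: bool, call_llm) -> dict:
    """Ask the LLM for optimizer objectives; `call_llm` stands in for
    whatever chat-completion client the clinic uses."""
    template = REASONING_PROMPT if reasoning else NON_REASONING_PROMPT
    raw = call_llm(template.format(case_summary=case_summary))
    # The parsed objectives (e.g., {"PTV": {"min_dose_gy": 18.0, "weight": 100}})
    # would then be handed to the conventional optimizer in step 4.
    return json.loads(raw)
```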

Results & Findings

| Metric | Human Planner | Reasoning LLM | Non‑Reasoning LLM |
| --- | --- | --- | --- |
| PTV coverage (V100%) | 99.2 % | 99.1 % (p > 0.21) | 97.8 % (p < 0.05) |
| Maximum dose (Dmax) | 20.5 Gy | 20.4 Gy (p > 0.21) | 21.1 Gy (p < 0.05) |
| Conformity index (CI) | 1.12 | 1.13 (p > 0.21) | 1.18 (p < 0.05) |
| Gradient index (GI) | 3.4 | 3.5 (p > 0.21) | 3.8 (p < 0.05) |
| Cochlear dose | 4.2 Gy | 3.5 Gy (p = 0.022) | 4.3 Gy (ns) |

  • The reasoning LLM matched human planners on all primary endpoints; the non‑reasoning LLM fell short on several.
  • When asked to “improve conformity,” the reasoning model systematically performed 457 constraint‑verification steps and 609 trade‑off deliberations, whereas the baseline model showed virtually none (0 and 7, respectively).
  • Qualitative analysis revealed that the reasoning trace contained explicit causal explanations (e.g., “Increasing the gradient weight will reduce dose spill to the optic chiasm but may lower PTV coverage”), which are absent in the baseline output.
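
To make the endpoints in the table concrete: one common way to compute them from a 3D dose grid uses the RTOG-style conformity index (prescription isodose volume over target volume) and the Paddick gradient index (50% isodose volume over prescription isodose volume), with paired t-tests across cases. The paper does not state which formulations it used, so the sketch below is an assumption for illustration only.

```python
import numpy as np
from scipy import stats

def dosimetric_endpoints(dose, ptv_mask, voxel_cc, rx_gy=18.0):
    """Compute common SRS endpoints from a dose grid (Gy) and a boolean PTV mask."""
    v100 = float((dose[ptv_mask] >= rx_gy).mean())          # PTV coverage (V100%)
    dmax = float(dose.max())                                 # maximum dose
    piv = float((dose >= rx_gy).sum() * voxel_cc)            # prescription isodose volume
    piv50 = float((dose >= 0.5 * rx_gy).sum() * voxel_cc)    # 50% isodose volume
    tv = float(ptv_mask.sum() * voxel_cc)                    # target (PTV) volume
    return {"V100": v100, "Dmax": dmax, "CI": piv / tv, "GI": piv50 / piv}

def paired_comparison(human_values, llm_values):
    """Paired t-test over per-case values of one metric (one value per patient)."""
    t, p = stats.ttest_rel(human_values, llm_values)
    return {"t": float(t), "p": float(p)}
```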

Practical Implications

  • Accelerated planning workflow: Clinics could generate high‑quality SRS plans in minutes, freeing physicists to focus on verification and patient‑specific nuances.
  • Transparency for regulatory compliance: The auditable reasoning log satisfies a key hurdle for AI adoption in radiation oncology—providing a human‑readable justification for every optimization decision.
  • Scalable expertise: Smaller centers lacking seasoned dosimetrists could leverage SAGE to achieve plan quality comparable to high‑volume academic sites.
  • Integration path: Since SAGE outputs standard optimizer parameters, it can be dropped into existing treatment‑planning systems (e.g., Eclipse, RayStation) without major software overhauls.
  • Potential for continuous learning: The reasoning traces can be harvested to fine‑tune the LLM or to train downstream supervised models that predict optimal constraint hierarchies for new cases.
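
If reasoning traces were harvested as suggested in the last bullet, a first pass could simply label each sentence of the trace by type. The marker phrases and record format below are illustrative assumptions; the paper does not describe the structure of its thinking traces.

```python
import re

# Illustrative marker phrases; the real trace format is not published.
CONSTRAINT_RE = re.compile(r"\b(verif\w+|constraint|dose limit)\b", re.I)
TRADEOFF_RE = re.compile(r"\b(trade-?off|at the cost of|but may (lower|raise|reduce))\b", re.I)

def harvest_trace(case_id: str, trace: str) -> list:
    """Split a reasoning trace into sentences and label each one, producing
    records that could seed a fine-tuning or constraint-prediction dataset."""
    records = []
    for sentence in re.split(r"(?<=[.!?])\s+", trace.strip()):
        if CONSTRAINT_RE.search(sentence):
            label = "constraint_verification"
        elif TRADEOFF_RE.search(sentence):
            label = "tradeoff_deliberation"
        else:
            label = "other"
        records.append({"case": case_id, "text": sentence, "label": label})
    return records
```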

Limitations & Future Work

  • Retrospective, single‑institution dataset: Validation on 41 cases limits generalizability; multi‑center prospective trials are needed.
  • LLM hallucination risk: Although reasoning reduces errors, the model can still fabricate constraints or misinterpret anatomy; robust guardrails (e.g., rule‑based verification) are required, and a sketch of such a check follows this list.
  • Hardware and latency: Real‑time inference with large LLMs may demand GPU clusters, which could be a barrier for some clinics.
  • Extension to multi‑fraction or non‑brain sites: The current study focuses on single‑fraction brain SRS; adapting the framework to other anatomical sites or fractionation schemes remains an open challenge.
  • User‑interface design: Translating the reasoning trace into an intuitive UI for clinicians is essential for adoption but was not explored in this work.
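
One concrete form of the rule‑based guardrail mentioned in the hallucination‑risk bullet is a hard check of LLM‑proposed objectives against a fixed table of institutional OAR limits before anything reaches the optimizer. The structure names, schema, and limit values below are placeholders, not clinical guidance.

```python
# Placeholder single-fraction OAR limits (Gy); an institution would set its own.
OAR_LIMITS_GY = {"brainstem": 15.0, "optic_chiasm": 10.0, "cochlea": 9.0}

def guardrail_check(objectives: dict, known_structures: set) -> list:
    """Flag objectives that reference unknown structures (possible hallucinations)
    or exceed the institution's dose limits; an empty list means 'proceed'."""
    violations = []
    for structure, params in objectives.items():
        if structure not in known_structures:
            violations.append(f"unknown structure: {structure}")
            continue
        limit = OAR_LIMITS_GY.get(structure)
        max_dose = params.get("max_dose_gy")
        if limit is not None and max_dose is not None and max_dose > limit:
            violations.append(f"{structure}: {max_dose} Gy exceeds {limit} Gy limit")
    return violations
```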

Bottom line: By marrying chain‑of‑thought prompting with a conventional dose‑optimization engine, SAGE demonstrates that LLMs can be both effective and transparent in a high‑stakes medical domain, opening the door for broader AI‑assisted treatment planning in radiation oncology.

Authors

  • Humza Nusrat
  • Luke Francisco
  • Bing Luo
  • Hassan Bagher‑Ebadian
  • Joshua Kim
  • Karen Chin‑Snyder
  • Salim Siddiqui
  • Mira Shah
  • Eric Mellon
  • Mohammad Ghassemi
  • Anthony Doemer
  • Benjamin Movsas
  • Kundan Thind

Paper Information

  • arXiv ID: 2512.20586v1
  • Categories: cs.AI, cs.CL, cs.HC
  • Published: December 23, 2025