[Paper] AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization
Source: arXiv - 2602.20040v1
Overview
The paper introduces AgenticSum, a novel inference‑time framework that turns a vanilla large language model (LLM) into a multi‑step “agent” for summarizing clinical notes. By breaking the summarization process into distinct, coordinated stages—context selection, draft generation, fact‑checking, and targeted correction—the authors dramatically reduce hallucinations while preserving the richness of medical information.
Key Contributions
- Agentic inference architecture that decomposes summarization into four interacting modules (selection, generation, verification, correction).
- Attention‑grounding signals to automatically flag low‑confidence spans in the draft without external knowledge bases.
- Supervisory correction loop that revises only the flagged content, keeping the rest of the draft intact.
- Comprehensive evaluation on two public clinical summarization datasets using automatic metrics, LLM‑as‑judge scoring, and human expert assessment.
- Demonstrated gains over standard LLM prompting and strong baselines, showing that structured inference can improve factual consistency without retraining the model.
Methodology
- Context Selection – The framework first compresses the raw clinical note by retrieving the most task‑relevant sentences using a lightweight similarity scorer (e.g., BM25 or a small embedding model). This reduces noise and keeps the LLM’s prompt within token limits.
- Draft Generation – The selected context is fed to a pre‑trained LLM (e.g., GPT‑4 or LLaMA‑2) with a prompt that asks for a concise clinical summary. The model produces an initial draft.
- Verification (Grounding) – While generating, the LLM’s internal attention weights are inspected to compute a grounding score for each output token. Low‑grounding spans are likely unsupported by the input and are automatically flagged.
- Targeted Correction – A second pass asks the LLM to rewrite only the flagged spans, providing the original context, the draft, and a supervisory instruction (“fix the unsupported statements”). The rest of the summary is left untouched, preserving fluency.
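The context-selection step can be illustrated with a minimal lexical scorer. This is a sketch, not the paper's implementation: the `select_context` helper and its TF-IDF-style scoring are assumptions standing in for the lightweight scorer (BM25 or a small embedding model) the authors describe.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer; punctuation is discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())

def select_context(sentences, query, k=5):
    """Keep the k sentences most relevant to the task query.

    Hypothetical stand-in for the paper's similarity scorer: each
    sentence is scored by the IDF-weighted overlap of its terms with
    the query, then the top-k are returned in original note order.
    """
    query_terms = set(tokenize(query))
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)

    # Document frequency of each term across the note's sentences.
    df = Counter()
    for toks in tokenized:
        for t in set(toks):
            df[t] += 1

    def score(toks):
        return sum(math.log(1 + n / df[t])
                   for t in set(toks) & query_terms)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]),
                    reverse=True)
    keep = sorted(ranked[:k])  # restore the note's original order
    return [sentences[i] for i in keep]
```

In practice the query could be the summarization instruction itself, so the most task-relevant sentences survive the pruning while boilerplate is dropped.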
The pipeline runs entirely at inference time; no fine‑tuning or external fact‑checking databases are required.
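The verification step can be sketched as follows. Assume access to an attention matrix where each generated token attends over the full context: source tokens first, then previously generated tokens. The grounding score is then the attention mass a token places on the source; the exact scoring rule and threshold in the paper may differ, so treat this as an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def grounding_scores(attn, source_len, threshold=0.5):
    """Score each draft token by its attention mass on the source.

    attn: (draft_len, ctx_len) attention weights, where the first
    `source_len` context positions are the input note and the rest
    are previously generated draft tokens. Tokens whose mass on the
    source falls below `threshold` are flagged as likely unsupported.
    """
    scores = attn[:, :source_len].sum(axis=1)
    return scores, scores < threshold

def flagged_spans(flags):
    """Merge consecutive flagged token indices into (start, end) spans,
    end-exclusive, ready to hand to the correction pass."""
    spans, start = [], None
    for i, f in enumerate(list(flags) + [False]):
        if f and start is None:
            start = i
        elif not f and start is not None:
            spans.append((start, i))
            start = None
    return spans
```

The resulting spans, together with the original context and draft, are what the targeted-correction pass would rewrite, leaving well-grounded tokens untouched.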
Results & Findings
| Dataset | Metric (higher = better) | Vanilla LLM | AgenticSum |
|---|---|---|---|
| MIMIC‑III Summ | ROUGE‑L | 32.1 | 36.8 |
| MIMIC‑III Summ | Fact‑Score (LLM‑as‑judge) | 0.71 | 0.84 |
| MIMIC‑III Summ | Human‑rated factuality (1‑5) | 3.2 | 4.1 |
| i2b2‑2010 | ROUGE‑L | 28.7 | 33.4 |
| i2b2‑2010 | Fact‑Score (LLM‑as‑judge) | 0.68 | 0.80 |
- Consistent improvement across both datasets and all evaluation lenses.
- The verification‑correction loop cut hallucinated statements by ~45% compared to the vanilla LLM.
- Human reviewers noted that the corrected summaries retained clinical nuance while being more trustworthy.
Practical Implications
- Safer clinical decision support: Summaries generated by AgenticSum can be fed directly into EHR dashboards or hand‑off tools, reducing the risk of misinformation that could affect patient care.
- Plug‑and‑play integration: Since the framework works at inference time, existing LLM APIs (OpenAI, Anthropic, etc.) can be wrapped with AgenticSum without retraining, making adoption straightforward for health‑tech startups.
- Token‑efficiency: By pruning irrelevant context early, the approach stays within model token limits, lowering API costs for large‑scale deployments.
- Generalizable pattern: The agentic decomposition (select‑generate‑verify‑correct) can be reused for other high‑stakes domains such as legal document summarization or scientific literature review.
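The plug‑and‑play point above can be made concrete with a thin wrapper around any text-in, text-out LLM callable. Everything here is a hypothetical sketch: the `agentic_summarize` function, its prompts, and the LLM-based verification step are assumptions, since hosted APIs do not expose the attention weights the paper's grounding signal relies on, so verification is approximated by a second LLM call.

```python
def agentic_summarize(note, llm, k=8):
    """Select-generate-verify-correct wrapper for a generic `llm`
    callable (prompt in, text out), e.g. a hosted API client.

    Simplifications vs. the paper: selection is a naive truncation
    rather than similarity scoring, and verification asks the LLM to
    self-report unsupported statements instead of inspecting
    attention-based grounding scores.
    """
    # 1. Context selection (stand-in for similarity-based pruning).
    sentences = note.split(". ")
    context = ". ".join(sentences[:k])

    # 2. Draft generation.
    draft = llm(f"Summarize the following clinical note "
                f"concisely:\n{context}")

    # 3. Verification (LLM-based proxy for attention grounding).
    issues = llm(f"Note:\n{context}\n\nSummary:\n{draft}\n\n"
                 "List any summary statements not supported by the "
                 "note, or reply NONE.")

    # 4. Targeted correction of flagged content only.
    if issues.strip() != "NONE":
        draft = llm(f"Note:\n{context}\n\nSummary:\n{draft}\n\n"
                    f"Rewrite only these unsupported statements, "
                    f"keeping the rest unchanged:\n{issues}")
    return draft
```

Because the wrapper only needs a `prompt -> text` function, swapping in a different provider (or a locally hosted model) requires no change to the pipeline itself.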
Limitations & Future Work
- Reliance on attention grounding: While effective, attention scores are an indirect proxy for factual support and may miss subtle errors.
- Domain‑specific prompts: The current implementation uses handcrafted prompts; automating prompt synthesis could improve robustness across specialties.
- Scalability to multi‑document inputs: The paper focuses on single‑note summarization; extending the pipeline to aggregate information from multiple encounters remains an open challenge.
- Human-in-the-loop: Future work could explore semi‑automated correction where clinicians review flagged spans before final release, further tightening safety guarantees.
Authors
- Fahmida Liza Piya
- Rahmatollah Beheshti
Paper Information
- arXiv ID: 2602.20040v1
- Categories: cs.CL, cs.AI
- Published: February 23, 2026