[Paper] AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization
Source: arXiv - 2602.20040v1
Overview
The paper introduces AgenticSum, a novel inference‑time framework that turns a vanilla large language model (LLM) into a multi‑step “agent” for summarizing clinical notes. By breaking the summarization process into distinct, coordinated stages—context selection, draft generation, fact‑checking, and targeted correction—the authors dramatically reduce hallucinations while preserving the richness of medical information.
Key Contributions
- Agentic inference architecture that decomposes summarization into four interacting modules (selection, generation, verification, correction).
- Attention‑grounding signals to automatically flag low‑confidence spans in the draft without external knowledge bases.
- Supervisory correction loop that revises only the flagged content, keeping the rest of the draft intact.
- Comprehensive evaluation on two public clinical summarization datasets using automatic metrics, LLM‑as‑judge scoring, and human expert assessment.
- Demonstrated gains over standard LLM prompting and strong baselines, showing that structured inference can improve factual consistency without retraining the model.
Methodology
- Context Selection – The framework first compresses the raw clinical note by retrieving the most task‑relevant sentences using a lightweight similarity scorer (e.g., BM25 or a small embedding model). This reduces noise and keeps the LLM’s prompt within token limits.
- Draft Generation – The selected context is fed to a pre‑trained LLM (e.g., GPT‑4 or LLaMA‑2) with a prompt that asks for a concise clinical summary. The model produces an initial draft.
- Verification (Grounding) – While generating, the LLM’s internal attention weights are inspected to compute a grounding score for each output token. Low‑grounding spans are likely unsupported by the input and are automatically flagged.
- Targeted Correction – A second pass asks the LLM to rewrite only the flagged spans, providing the original context, the draft, and a supervisory instruction (“fix the unsupported statements”). The rest of the summary is left untouched, preserving fluency.
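The context-selection step can be illustrated with a minimal lexical scorer. This is a sketch, not the paper's implementation: the `select_context` helper and its TF-IDF-style scoring are assumptions standing in for the lightweight scorer (BM25 or a small embedding model) the authors describe.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer; punctuation is discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())

def select_context(sentences, query, k=5):
    """Keep the k sentences most relevant to the task query.

    Hypothetical stand-in for the paper's similarity scorer: each
    sentence is scored by the IDF-weighted overlap of its terms with
    the query, then the top-k are returned in original note order.
    """
    query_terms = set(tokenize(query))
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)

    # Document frequency of each term across the note's sentences.
    df = Counter()
    for toks in tokenized:
        for t in set(toks):
            df[t] += 1

    def score(toks):
        return sum(math.log(1 + n / df[t])
                   for t in set(toks) & query_terms)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]),
                    reverse=True)
    keep = sorted(ranked[:k])  # restore the note's original order
    return [sentences[i] for i in keep]
```

In practice the query could be the summarization instruction itself, so the most task-relevant sentences survive the pruning while boilerplate is dropped.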
The pipeline runs entirely at inference time; no fine‑tuning or external fact‑checking databases are required.
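The verification step can be sketched as follows. Assume access to an attention matrix where each generated token attends over the full context: source tokens first, then previously generated tokens. The grounding score is then the attention mass a token places on the source; the exact scoring rule and threshold in the paper may differ, so treat this as an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def grounding_scores(attn, source_len, threshold=0.5):
    """Score each draft token by its attention mass on the source.

    attn: (draft_len, ctx_len) attention weights, where the first
    `source_len` context positions are the input note and the rest
    are previously generated draft tokens. Tokens whose mass on the
    source falls below `threshold` are flagged as likely unsupported.
    """
    scores = attn[:, :source_len].sum(axis=1)
    return scores, scores < threshold

def flagged_spans(flags):
    """Merge consecutive flagged token indices into (start, end) spans,
    end-exclusive, ready to hand to the correction pass."""
    spans, start = [], None
    for i, f in enumerate(list(flags) + [False]):
        if f and start is None:
            start = i
        elif not f and start is not None:
            spans.append((start, i))
            start = None
    return spans
```

The resulting spans, together with the original context and draft, are what the targeted-correction pass would rewrite, leaving well-grounded tokens untouched.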
Results & Findings
| Dataset | Metric (higher = better) | Vanilla LLM | AgenticSum |
|---|---|---|---|
| MIMIC‑III Summ | ROUGE‑L | 32.1 | 36.8 |
| MIMIC‑III Summ | Fact‑Score (LLM‑as‑judge) | 0.71 | 0.84 |
| MIMIC‑III Summ | Human‑rated factuality (1‑5) | 3.2 | 4.1 |
| i2b2‑2010 | ROUGE‑L | 28.7 | 33.4 |
| i2b2‑2010 | Fact‑Score (LLM‑as‑judge) | 0.68 | 0.80 |
- Consistent improvement across both datasets and all evaluation lenses.
- The verification‑correction loop cut hallucinated statements by ~45% compared to the vanilla LLM.
- Human reviewers noted that the corrected summaries retained clinical nuance while being more trustworthy.
Practical Implications
- Safer clinical decision support: Summaries generated by AgenticSum can be fed directly into EHR dashboards or hand‑off tools, reducing the risk of misinformation that could affect patient care.
- Plug‑and‑play integration: Since the framework works at inference time, existing LLM APIs (OpenAI, Anthropic, etc.) can be wrapped with AgenticSum without retraining, making adoption straightforward for health‑tech startups.
- Token‑efficiency: By pruning irrelevant context early, the approach stays within model token limits, lowering API costs for large‑scale deployments.
- Generalizable pattern: The agentic decomposition (select‑generate‑verify‑correct) can be reused for other high‑stakes domains such as legal document summarization or scientific literature review.
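The plug‑and‑play point above can be made concrete with a thin wrapper around any text-in, text-out LLM callable. Everything here is a hypothetical sketch: the `agentic_summarize` function, its prompts, and the LLM-based verification step are assumptions, since hosted APIs do not expose the attention weights the paper's grounding signal relies on, so verification is approximated by a second LLM call.

```python
def agentic_summarize(note, llm, k=8):
    """Select-generate-verify-correct wrapper for a generic `llm`
    callable (prompt in, text out), e.g. a hosted API client.

    Simplifications vs. the paper: selection is a naive truncation
    rather than similarity scoring, and verification asks the LLM to
    self-report unsupported statements instead of inspecting
    attention-based grounding scores.
    """
    # 1. Context selection (stand-in for similarity-based pruning).
    sentences = note.split(". ")
    context = ". ".join(sentences[:k])

    # 2. Draft generation.
    draft = llm(f"Summarize the following clinical note "
                f"concisely:\n{context}")

    # 3. Verification (LLM-based proxy for attention grounding).
    issues = llm(f"Note:\n{context}\n\nSummary:\n{draft}\n\n"
                 "List any summary statements not supported by the "
                 "note, or reply NONE.")

    # 4. Targeted correction of flagged content only.
    if issues.strip() != "NONE":
        draft = llm(f"Note:\n{context}\n\nSummary:\n{draft}\n\n"
                    f"Rewrite only these unsupported statements, "
                    f"keeping the rest unchanged:\n{issues}")
    return draft
```

Because the wrapper only needs a `prompt -> text` function, swapping in a different provider (or a locally hosted model) requires no change to the pipeline itself.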
Limitations & Future Work
- Reliance on attention grounding: While effective, attention scores are an indirect proxy for factual support and may miss subtle errors.
- Domain‑specific prompts: The current implementation uses handcrafted prompts; automating prompt synthesis could improve robustness across specialties.
- Scalability to multi‑document inputs: The paper focuses on single‑note summarization; extending the pipeline to aggregate information from multiple encounters remains an open challenge.
- Human-in-the-loop: Future work could explore semi‑automated correction where clinicians review flagged spans before final release, further tightening safety guarantees.
Authors
- Fahmida Liza Piya
- Rahmatollah Beheshti
Paper Information
- arXiv ID: 2602.20040v1
- Categories: cs.CL, cs.AI
- Published: February 23, 2026