[Paper] Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs
Source: arXiv - 2605.08045v1
Overview
The paper introduces CMR‑EXTR, a lightweight system that converts free‑text cardiac magnetic resonance (CMR) radiology reports into a structured dataset and attaches a confidence score to each extracted field. By coupling a teacher‑student distillation pipeline with uncertainty modeling, the authors report 99.65 % variable‑level accuracy and give clinicians a practical way to spot‑check only the doubtful entries.
Key Contributions
- CMR‑specific extraction engine that converts narrative CMR reports into a predefined schema (e.g., ventricular volumes, ejection fractions, tissue characterization).
- Uncertainty‑aware scoring per field, derived from three complementary signals: distribution plausibility, sampling stability, and cross‑field consistency.
- Teacher‑student distillation workflow that leverages a large language model (LLM) as a “teacher” to generate high‑quality pseudo‑labels, then trains a compact “student” model for fast, offline inference.
- Empirical validation showing 99.65 % variable‑level accuracy on a real‑world CMR report corpus, with confidence scores that reliably separate correct from erroneous extractions.
- Open‑source release (GitHub) enabling easy adoption and extension to other imaging domains.
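To make the idea of a predefined schema with per‑field confidence concrete, here is a minimal Python sketch. The class names, the field keys (`lv_ef_pct`, `lv_edv_ml`), and the 0.8 review threshold are illustrative assumptions, not the schema or API of the released code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExtractedField:
    """One structured variable plus its confidence (hypothetical layout)."""
    value: Optional[float]  # None when the report does not state the variable
    confidence: float       # per-field confidence, assumed to lie in [0, 1]


@dataclass
class CMRRecord:
    """Structured view of a single report (hypothetical layout)."""
    report_id: str
    fields: Dict[str, ExtractedField] = field(default_factory=dict)

    def needs_review(self, threshold: float = 0.8) -> List[str]:
        """Return the names of fields whose confidence is below the threshold."""
        return sorted(
            name for name, f in self.fields.items() if f.confidence < threshold
        )


# Example values are made up for illustration.
record = CMRRecord(
    report_id="example-001",
    fields={
        "lv_ef_pct": ExtractedField(value=55.0, confidence=0.97),
        "lv_edv_ml": ExtractedField(value=150.0, confidence=0.42),
    },
)
print(record.needs_review())
```

Keeping the confidence next to each value, rather than in a separate table, means a reviewer can be pointed at the specific doubtful fields of a record instead of the whole report.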
Methodology
- Data Preparation – A modest set of manually annotated CMR reports (≈1 k) is used to define the target schema and seed the system.
- Teacher Model – A powerful LLM (e.g., GPT‑4‑style) is prompted to extract each variable from the raw report, producing high‑quality pseudo‑labels without exhaustive human annotation.
- Student Model – A lightweight transformer (≈30 M parameters) is trained on the teacher‑generated pseudo‑labels, learning to mimic the extraction behavior while being fast enough for on‑premise deployment.
- Uncertainty Modeling – For every extracted field, three scores are computed:
  - Distribution plausibility: how likely the value is under the empirical distribution of that variable (e.g., a left‑ventricular ejection fraction of 200 % is implausible).
  - Sampling stability: variance across multiple stochastic forward passes (Monte‑Carlo dropout) indicating model confidence.
  - Cross‑field consistency: logical checks between related fields (e.g., end‑diastolic volume should be ≥ end‑systolic volume).
  These scores are fused into a single confidence metric that can be thresholded to route uncertain entries to a human reviewer.
- Evaluation – Extraction accuracy is measured at the variable level, and the confidence scores are assessed for their ability to separate correct from incorrect predictions (ROC‑AUC).
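The three uncertainty signals and their fusion can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The Gaussian z‑score used for plausibility, the agreement fraction used as a stand‑in for variance across stochastic passes, the 5‑point ejection‑fraction tolerance, the field names, and the equal‑weight average are all assumptions made for the example.

```python
import math
import statistics


def plausibility_score(value, reference_values):
    """Score in [0, 1]: 1.0 at the reference mean, near 0 for extreme outliers."""
    mean = statistics.fmean(reference_values)
    std = statistics.pstdev(reference_values)
    if std == 0:
        return 1.0 if value == mean else 0.0
    z = abs(value - mean) / std
    return math.exp(-0.5 * z * z)


def stability_score(sampled_values):
    """Fraction of repeated stochastic extractions that agree with the most common one."""
    most_common = statistics.mode(sampled_values)
    return sum(v == most_common for v in sampled_values) / len(sampled_values)


def consistency_score(fields, ef_tolerance=5.0):
    """Fraction of logical checks that pass (1.0 when no check applies)."""
    edv = fields.get("lv_edv_ml")   # hypothetical field names
    esv = fields.get("lv_esv_ml")
    ef = fields.get("lv_ef_pct")
    checks = []
    if edv is not None and esv is not None:
        checks.append(edv >= esv)
        if ef is not None and edv > 0:
            # Ejection fraction implied by the two volumes.
            derived_ef = 100.0 * (edv - esv) / edv
            checks.append(abs(derived_ef - ef) <= ef_tolerance)
    return sum(checks) / len(checks) if checks else 1.0


def fused_confidence(plausibility, stability, consistency,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the three signals (equal weights are an assumption)."""
    w_p, w_s, w_c = weights
    return w_p * plausibility + w_s * stability + w_c * consistency


# Made-up example: an ejection fraction of 60 % sampled four times.
p = plausibility_score(60, [55, 60, 65])
s = stability_score([60, 60, 60, 58])
c = consistency_score({"lv_edv_ml": 150, "lv_esv_ml": 60, "lv_ef_pct": 60})
print(round(fused_confidence(p, s, c), 3))
```

A field whose fused score falls below a chosen threshold would be routed to a reviewer; how the paper actually weights or combines the signals is not specified in this summary.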
Results & Findings
- Variable‑level accuracy: 99.65 % across 45 structured CMR variables, essentially matching manual abstraction quality.
- Confidence effectiveness: The combined uncertainty score yields an AUC of 0.97 for distinguishing correct vs. erroneous extractions, enabling a triage workflow that reduces manual review workload by >80 % while preserving >99 % overall data quality.
- Speed & footprint: The student model processes a report in <200 ms on a commodity CPU, making it suitable for batch processing of large hospital archives without cloud dependencies.
- Ablation studies: Removing any of the three uncertainty components degrades triage performance, confirming that distribution plausibility, stability, and consistency each contribute uniquely.
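For readers who want to see how numbers like the AUC and the review workload are derived from per‑field confidences, the sketch below computes both on a toy example. The data are invented for illustration and are not the paper's results. The function names, the 0.5 threshold, and the assumption that a reviewer corrects every routed field are also hypothetical.

```python
def roc_auc(confidences, is_correct):
    """Probability that a random correct extraction gets a higher confidence
    than a random incorrect one (ties count as half)."""
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect extraction")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def triage(confidences, is_correct, threshold):
    """Route fields below the threshold to manual review.

    Returns (fraction of fields reviewed, final fraction correct), assuming
    reviewers fix every field they are shown.
    """
    n = len(confidences)
    reviewed = [c < threshold for c in confidences]
    review_fraction = sum(reviewed) / n
    # Unreviewed fields keep the model's answer; reviewed ones are assumed corrected.
    final_correct = sum(ok or r for ok, r in zip(is_correct, reviewed))
    return review_fraction, final_correct / n


# Toy data: six extracted fields with made-up confidences and correctness labels.
conf = [0.95, 0.90, 0.85, 0.80, 0.30, 0.20]
correct = [True, True, True, False, True, False]

print(roc_auc(conf, correct))
print(triage(conf, correct, threshold=0.5))
```

Sweeping the threshold trades review workload against final data quality, which is the trade‑off behind the workload and quality figures reported above.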
Practical Implications
- Rapid cohort building: Researchers can pull structured CMR phenotypes from legacy reports at scale, accelerating retrospective studies and multi‑center trials.
- Clinical decision support: Real‑time extraction pipelines can feed structured measurements into risk calculators or AI‑based treatment recommendation engines, with confidence flags ensuring clinicians only intervene when needed.
- Data governance: The per‑field confidence scores provide an auditable trail, satisfying regulatory requirements for data provenance and quality control in health systems.
- Cost‑effective deployment: Because the inference model is lightweight and runs offline, hospitals can integrate CMR‑EXTR into existing PACS/RIS workflows without incurring cloud compute fees or exposing PHI.
- Extensibility: The teacher‑student framework can be re‑trained on other imaging modalities (e.g., CT, non‑cardiac MRI) or report types (e.g., echocardiography) with minimal additional annotation effort.
Limitations & Future Work
- Domain specificity: The current schema is tightly coupled to CMR reporting conventions; adapting to institutions with divergent terminology may require schema re‑definition and additional fine‑tuning.
- Reliance on pseudo‑labels: While teacher distillation reduces manual labeling, any systematic bias in the teacher’s outputs propagates to the student. Future work could add human‑in‑the‑loop correction to mitigate this.
- Uncertainty calibration: Confidence scores are empirically effective but not formally calibrated; exploring Bayesian deep learning or conformal prediction could yield more theoretically grounded uncertainty estimates.
- Longitudinal consistency: The system processes reports independently; integrating temporal information across serial studies could improve detection of subtle measurement drifts or reporting errors.
CMR‑EXTR demonstrates that, by combining LLM distillation with uncertainty modeling, extracting high‑quality structured data from free‑text radiology reports is no longer a research‑only problem: it is ready for production pipelines that serve both clinicians and data scientists.
Authors
- Yi Yu
- Parker Martin
- Zhenyu Bu
- Yixuan Liu
- Yi‑Yu Zheng
- Orlando Simonetti
- Yuchi Han
- Yuan Xue
Paper Information
- arXiv ID: 2605.08045v1
- Categories: cs.CL
- Published: May 8, 2026