[Paper] Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

Published: May 8, 2026 at 01:35 PM EDT

Source: arXiv - 2605.08045v1

Overview

The paper introduces CMR‑EXTR, a lightweight system that automatically transforms free‑text cardiac magnetic resonance (CMR) radiology reports into a clean, structured dataset and reports how confident it is about each extracted field. By coupling a teacher‑student distillation pipeline with uncertainty modeling, the authors achieve near‑perfect extraction accuracy and give clinicians a practical way to spot‑check only the doubtful entries.

Key Contributions

CMR‑specific extraction engine that converts narrative CMR reports into a predefined schema (e.g., ventricular volumes, ejection fractions, tissue characterization).
Uncertainty‑aware scoring per field, derived from three complementary signals: distribution plausibility, sampling stability, and cross‑field consistency.
Teacher‑student distillation workflow that leverages a large language model (LLM) as a “teacher” to generate high‑quality pseudo‑labels, then trains a compact “student” model for fast, offline inference.
Empirical validation showing 99.65 % variable‑level accuracy on a real‑world CMR report corpus, with confidence scores that reliably separate correct from erroneous extractions.
Open‑source release (GitHub) enabling easy adoption and extension to other imaging domains.

Methodology

Data Preparation – A modest set of manually annotated CMR reports (≈1 k) is used to define the target schema and seed the system.
Teacher Model – A powerful LLM (e.g., GPT‑4‑style) is prompted to extract each variable from the raw report, producing high‑quality pseudo‑labels without exhaustive human effort.
Student Model – A lightweight transformer (≈30 M parameters) is trained on the teacher‑generated pseudo‑labels, learning to mimic the extraction behavior while being fast enough for on‑premise deployment.
Uncertainty Modeling – For every extracted field, three scores are computed:
- Distribution plausibility: how likely the value is under the empirical distribution of that variable (e.g., a left‑ventricular ejection fraction of 200 % is implausible).
- Sampling stability: variance across multiple stochastic forward passes (Monte‑Carlo dropout) indicating model confidence.
- Cross‑field consistency: logical checks between related fields (e.g., end‑diastolic volume should be ≥ end‑systolic volume).
  These scores are fused into a single confidence metric that can be thresholded to route uncertain entries to a human reviewer.
Evaluation – Extraction accuracy is measured at the variable level, and the confidence scores are assessed for their ability to separate correct from incorrect predictions (ROC‑AUC).
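The three uncertainty signals described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not state how each score is computed or how the scores are fused, so the Gaussian plausibility score, the spread-based stability score, the 5-point ejection-fraction tolerance, the equal-weight geometric mean, and the field names (`lv_edv_ml`, `lv_esv_ml`, `lvef_pct`) are all assumptions chosen for illustration. The sample values are made up.

```python
import math
import statistics


def plausibility(value, reference_values):
    """Score between 0 and 1: near 1 close to the reference mean, near 0 far from it.

    Assumption: a Gaussian-shaped score on the z-score of the value.
    """
    mu = statistics.mean(reference_values)
    sigma = statistics.stdev(reference_values)
    z = (value - mu) / sigma
    return math.exp(-0.5 * z * z)


def stability(samples):
    """Score between 0 and 1: exactly 1.0 when repeated stochastic passes agree.

    Assumption: relative spread of the sampled values is mapped to 1 / (1 + spread).
    """
    spread = statistics.pstdev(samples)
    scale = abs(statistics.mean(samples)) or 1.0
    return 1.0 / (1.0 + spread / scale)


def consistency(fields, ef_tolerance_pct=5.0):
    """Fraction of hand-written cross-field rules that hold (hypothetical rules)."""
    edv, esv, ef = fields["lv_edv_ml"], fields["lv_esv_ml"], fields["lvef_pct"]
    rules = [edv >= esv]  # end-diastolic volume should not be below end-systolic
    if edv > 0:
        # Ejection fraction implied by the volumes should match the reported one.
        implied_ef = 100.0 * (edv - esv) / edv
        rules.append(abs(implied_ef - ef) <= ef_tolerance_pct)
    return sum(rules) / len(rules)


def fuse(scores):
    """Equal-weight geometric mean (assumption): one low score pulls the result down."""
    return math.prod(scores) ** (1.0 / len(scores))


# Made-up example values, for illustration only.
reference_lvef = [55, 60, 58, 62, 50, 65, 57]  # reference LVEF values (%)
fields = {"lv_edv_ml": 150.0, "lv_esv_ml": 60.0, "lvef_pct": 60.0}
mc_samples = [60.0, 59.0, 61.0]  # LVEF from repeated stochastic passes

confidence = fuse([
    plausibility(fields["lvef_pct"], reference_lvef),
    stability(mc_samples),
    consistency(fields),
])
print(f"LVEF confidence: {confidence:.2f}")
```

A geometric mean is used here so that a field failing any single check (for example, an ejection fraction of 200 %) receives a low fused score even when the other two signals look fine; a weighted sum or a learned combiner would be equally plausible choices.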

Results & Findings

Variable‑level accuracy: 99.65 % across 45 structured CMR variables, essentially matching manual abstraction quality.
Confidence effectiveness: The combined uncertainty score yields an AUC of 0.97 for distinguishing correct vs. erroneous extractions, enabling a triage workflow that reduces manual review workload by >80 % while preserving >99 % overall data quality.
Speed & footprint: The student model processes a report in <200 ms on a commodity CPU, making it suitable for batch processing of large hospital archives without cloud dependencies.
Ablation studies: Removing any of the three uncertainty components degrades triage performance, confirming that distribution plausibility, stability, and consistency each contribute uniquely.
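To make the two triage numbers above concrete, the following sketch shows how an ROC‑AUC and a review workload could be computed from per‑field confidence scores. It is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, the confidences and correctness flags are made‑up toy data, and the triage estimate assumes a human reviewer fixes every field routed for review.

```python
def roc_auc(confidences, is_correct):
    """Probability that a correct extraction outscores an incorrect one.

    Computed by direct pairwise comparison; ties count as half a win.
    """
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect extraction")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def triage(confidences, is_correct, threshold):
    """Route fields with confidence below `threshold` to manual review.

    Returns (fraction of fields reviewed, accuracy after review), assuming
    the reviewer corrects every routed field.
    """
    total = len(confidences)
    reviewed = [c < threshold for c in confidences]
    residual_errors = sum(
        1 for sent, ok in zip(reviewed, is_correct) if not sent and not ok
    )
    return sum(reviewed) / total, 1.0 - residual_errors / total


# Toy data, invented for illustration only.
conf = [0.99, 0.95, 0.90, 0.85, 0.40, 0.30, 0.97, 0.20]
correct = [True, True, True, False, True, False, True, False]

auc = roc_auc(conf, correct)
review_fraction, final_accuracy = triage(conf, correct, threshold=0.5)
print(f"AUC={auc:.3f} reviewed={review_fraction:.1%} accuracy={final_accuracy:.1%}")
```

Sweeping `threshold` over a held‑out set trades review workload against residual error, which is how an operating point like the one reported above would typically be chosen.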

Practical Implications

Rapid cohort building: Researchers can pull structured CMR phenotypes from legacy reports at scale, accelerating retrospective studies and multi‑center trials.
Clinical decision support: Real‑time extraction pipelines can feed structured measurements into risk calculators or AI‑based treatment recommendation engines, with confidence flags ensuring clinicians only intervene when needed.
Data governance: The per‑field confidence scores provide an auditable trail that can support regulatory requirements for data provenance and quality control in health systems.
Cost‑effective deployment: Because the inference model is lightweight and runs offline, hospitals can integrate CMR‑EXTR into existing PACS/RIS workflows without incurring cloud compute fees or exposing PHI.
Extensibility: The teacher‑student framework can be re‑trained on other imaging modalities (e.g., CT, non‑cardiac MRI) or report types (e.g., echocardiography) with minimal additional annotation effort.

Limitations & Future Work

Domain specificity: The current schema is tightly coupled to CMR reporting conventions; adapting to institutions with divergent terminology may require schema re‑definition and additional fine‑tuning.
Reliance on pseudo‑labels: While teacher distillation reduces manual labeling, any systematic bias in the teacher’s outputs propagates to the student. Future work could incorporate human‑in‑the‑loop correction to mitigate this.
Uncertainty calibration: Confidence scores are empirically effective but not formally calibrated; exploring Bayesian deep learning or conformal prediction could yield more theoretically grounded uncertainty estimates.
Longitudinal consistency: The system processes reports independently; integrating temporal information across serial studies could improve detection of subtle measurement drifts or reporting errors.

CMR‑EXTR demonstrates that, with careful use of LLMs and uncertainty modeling, extracting high‑quality structured data from free‑text radiology reports is no longer a research‑only problem: it is ready for production pipelines that serve both clinicians and data scientists.

Authors

Yi Yu
Parker Martin
Zhenyu Bu
Yixuan Liu
Yi‑Yu Zheng
Orlando Simonetti
Yuchi Han
Yuan Xue

Paper Information

arXiv ID: 2605.08045v1
Categories: cs.CL
Published: May 8, 2026