[Paper] CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation

Published: January 16, 2026 at 01:09 PM EST
4 min read
Source: arXiv - 2601.11488v1

Overview

The paper introduces CTest‑Metric, the first unified framework for evaluating how well automatic metrics reflect the clinical quality of CT radiology report generation (RRG) systems. By testing metric robustness to style changes and synthetic errors, and measuring alignment with real‑world expert judgments, the authors provide a practical toolkit that helps developers choose (or design) metrics that truly matter in a medical setting.

Key Contributions

  • Unified assessment pipeline with three complementary modules:
    1. Writing Style Generalizability (WSG) – measures metric stability when reports are re‑phrased by large language models (LLMs).
    2. Synthetic Error Injection (SEI) – injects graded factual errors (e.g., wrong anatomy, missed findings) to test metric sensitivity.
    3. Metrics‑vs‑Expert correlation (MvE) – compares metric scores against radiologist ratings on 175 “disagreement” cases.
  • Comprehensive benchmark of eight popular NLG and clinical metrics (BLEU, ROUGE, METEOR, BERTScore‑F1, F1‑RadGraph, RaTEScore, GREEN Score, CRG) using seven LLMs built on a CT‑CLIP encoder.
  • Empirical insights: lexical metrics (BLEU, ROUGE) are brittle to stylistic variation; GREEN Score shows the strongest alignment with expert opinion (Spearman ≈ 0.70); CRG surprisingly correlates negatively; BERTScore‑F1 is the most tolerant to injected factual errors.
  • Open‑source release of the framework, code, and a curated subset of re‑phrased / error‑injected reports to enable reproducible benchmarking.

Methodology

  1. Dataset preparation – The authors start from a collection of CT reports and generate three derived corpora:
    • Re‑phrased versions created by prompting seven different LLMs (e.g., GPT‑4, LLaMA‑2) to rewrite the same content while preserving meaning.
    • Error‑injected reports where controlled mistakes (e.g., swapping “no fracture” → “fracture present”) are introduced at low, medium, and high severity levels.
    • Expert‑rated pairs where board‑certified radiologists assign a clinical quality score, focusing on cases where automatic metrics and human judgments diverge.
  2. Metric evaluation – Each of the eight candidate metrics is run on the original vs. transformed reports, producing a similarity score.
  3. Three‑module analysis (a minimal code sketch of these computations follows the list):
    • WSG computes the variance of metric scores across the different LLM re‑phrasings. Low variance → high style robustness.
    • SEI measures how quickly a metric’s score drops as error severity rises, indicating factual sensitivity.
    • MvE calculates Spearman correlation between metric scores and radiologist ratings on the disagreement set.
  4. Statistical aggregation – Results are averaged across LLMs and error levels, and significance testing is performed to rank metrics.
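
To make the three modules concrete, here is a minimal Python sketch of the computations they describe. It assumes each metric is exposed as a callable metric(reference, candidate) -> float; the function names and data layouts are illustrative assumptions, not taken from the paper's code release.

```python
# Minimal sketch of the three CTest-Metric modules (illustrative names only).
import statistics

from scipy.stats import spearmanr


def wsg_variance(metric, reference, rephrasings):
    """Writing Style Generalizability: variance of a metric's score across
    LLM re-phrasings of the same report (lower variance = more style-robust)."""
    scores = [metric(reference, rephrased) for rephrased in rephrasings]
    return statistics.variance(scores)


def sei_sensitivity(metric, reference, generated, corrupted_by_level):
    """Synthetic Error Injection: score drop from the clean generated report
    to each error-severity level (larger drop = higher factual sensitivity)."""
    clean_score = metric(reference, generated)
    return {
        level: clean_score - metric(reference, corrupted)
        for level, corrupted in corrupted_by_level.items()
    }


def mve_correlation(metric_scores, radiologist_ratings):
    """Metrics-vs-Expert: Spearman correlation between metric scores and
    radiologist ratings on the disagreement cases."""
    rho, p_value = spearmanr(metric_scores, radiologist_ratings)
    return rho, p_value
```

In the paper, these per‑report statistics are then averaged across the LLM re‑phrasers and error‑severity levels before the metrics are ranked.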

Results & Findings

  • WSG – Lexical overlap metrics (BLEU, ROUGE, METEOR) exhibit >30 % score fluctuation across LLM styles, making them unreliable when report phrasing varies. Embedding‑based scores (BERTScore‑F1, GREEN) are far more stable.
  • SEI – BERTScore‑F1 shows the shallowest decline, suggesting it tolerates minor factual slips—a double‑edged sword for safety‑critical use. GREEN Score's decline is proportional to error severity, indicating good factual awareness.
  • MvE – GREEN Score achieves the highest Spearman correlation (≈ 0.70) with radiologist judgments, outperforming traditional NLG metrics by a large margin. CRG, despite being a clinical‑specific metric, correlates negatively (≈ ‑0.25), hinting at design flaws or mis‑alignment with radiologist priorities.
  • Overall ranking – GREEN > BERTScore‑F1 > F1‑RadGraph > RaTEScore > BLEU/ROUGE/METEOR > CRG.

These findings suggest that semantic‑aware, clinically‑grounded metrics are far more trustworthy for CT report generation than pure surface‑form similarity measures.

Practical Implications

  • Metric selection for product teams – Developers building RRG pipelines can replace BLEU/ROUGE with GREEN Score to obtain a more clinically meaningful performance signal, reducing the risk of “optimizing the wrong metric.”
  • Model debugging – The SEI module can be used as a stress test: inject synthetic errors into model outputs and see whether your chosen metric flags them, helping catch subtle factual regressions before deployment (see the sketch after this list).
  • Continuous evaluation pipelines – By integrating the WSG test, teams can ensure their evaluation remains robust when downstream LLMs (e.g., for report post‑processing) are swapped or fine‑tuned, avoiding metric drift.
  • Regulatory & safety compliance – Since GREEN Score aligns closely with radiologist assessments, it can serve as an objective artifact in documentation for FDA or CE submissions, demonstrating that the AI system’s outputs meet clinical quality standards.
  • Benchmarking community – The open‑source framework gives startups and research labs a common yardstick, fostering fair competition and accelerating the emergence of truly clinically useful RRG models.
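
As a toy illustration of the stress‑test idea in the "Model debugging" bullet above, the sketch below flips a single negation in a generated report and checks whether a candidate metric penalizes the corrupted version. The flip_negation helper and the 0.05 score‑drop threshold are hypothetical choices for illustration, not part of the paper's SEI module.

```python
# Hypothetical SEI-style stress test for a single report.
# `metric` is assumed to be a callable metric(reference, candidate) -> float.
def flip_negation(report: str) -> str:
    """Toy error injection: turn 'no fracture' into 'fracture present'.
    A real test suite would cover graded, anatomy-aware error types."""
    return report.replace("no fracture", "fracture present")


def metric_flags_error(metric, reference: str, generated: str,
                       min_drop: float = 0.05) -> bool:
    """Return True if the metric's score drops by at least min_drop
    after the synthetic error is injected into the generated report."""
    clean_score = metric(reference, generated)
    corrupted_score = metric(reference, flip_negation(generated))
    return (clean_score - corrupted_score) >= min_drop
```

A check like this can also run in a continuous‑evaluation pipeline alongside a WSG‑style variance check, so that swapping or fine‑tuning a downstream LLM does not silently degrade the evaluation signal.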

Limitations & Future Work

  • Scope limited to CT reports – The framework is built around CT‑specific language and imaging findings; extending to MRI, X‑ray, or multimodal reports will require additional domain adaptation.
  • Synthetic errors may not capture all real‑world failure modes – While SEI covers common factual mistakes, uncommon edge cases (e.g., rare pathologies) remain untested.
  • Expert rating size – The MvE analysis relies on 175 disagreement cases; a larger, more diverse radiologist panel could improve correlation reliability.
  • Metric diversity – Only eight metrics were evaluated; future work could incorporate newer foundation‑model‑based evaluators (e.g., Med-PaLM‑2 scoring) and assess their alignment.

The authors plan to broaden the dataset, incorporate additional imaging modalities, and explore automated error‑generation techniques that better mimic real clinical mistakes.

Authors

  • Vanshali Sharma
  • Andrea Mia Bejar
  • Gorkem Durak
  • Ulas Bagci

Paper Information

  • arXiv ID: 2601.11488v1
  • Categories: cs.CL, cs.CV
  • Published: January 16, 2026