[Paper] CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography

Published: February 16, 2026
Source: arXiv - 2602.14879v1

Overview

The paper introduces CT‑Bench, the first publicly released benchmark that pairs CT images of lesions with rich, lesion‑level metadata (bounding boxes, textual descriptions, size measurements) and a multimodal visual‑question‑answering (VQA) suite. By filling the long‑standing gap of annotated CT data, the authors enable systematic evaluation and fine‑tuning of AI models that must both “see” a lesion and “talk” about it—an essential step toward trustworthy, assistive radiology tools.

Key Contributions

  • A large, curated lesion dataset – 20,335 lesions from 7,795 CT studies, each annotated with a bounding box, free‑form description, and quantitative size information.
  • A multimodal VQA benchmark – 2,850 question‑answer pairs covering four core tasks: lesion localization, textual description, size estimation, and attribute categorization (e.g., solid vs. cystic).
  • Hard‑negative examples – deliberately confusing non‑lesion regions and ambiguous cases to mimic real diagnostic difficulty.
  • Comprehensive evaluation – state‑of‑the‑art vision‑language models (including medical CLIP variants) are benchmarked against radiologist performance on both the image‑metadata and VQA tasks.
  • Demonstrated fine‑tuning gains – models pre‑trained on generic image‑text data achieve sizable improvements after fine‑tuning on CT‑Bench, confirming the dataset’s clinical relevance.
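To make the annotation structure above concrete, here is a minimal sketch of what a single lesion record might look like. The field names and values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class LesionRecord:
    study_id: str      # de-identified CT study identifier
    bbox: tuple        # axis-aligned box: (x_min, y_min, x_max, y_max)
    description: str   # radiologist free-text description
    long_axis_mm: float  # longest axial diameter in millimeters
    attributes: dict   # categorical labels, e.g. {"texture": "solid"}

# One hypothetical record, mirroring the annotation types listed above.
record = LesionRecord(
    study_id="study_0001",
    bbox=(120, 84, 168, 131),
    description="spiculated mass in the right upper lobe",
    long_axis_mm=14.2,
    attributes={"texture": "solid"},
)
```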

Methodology

  1. Data collection & annotation
    • Retrospective CT scans were sourced from multiple institutions, de‑identified, and split into training/validation/test sets.
    • Board‑certified radiologists drew axis‑aligned bounding boxes around each lesion, wrote concise natural‑language descriptions (e.g., “spiculated mass in the right upper lobe”), and recorded the longest axial diameter.
  2. VQA construction
    • For every lesion, four question templates were instantiated (e.g., “Where is the lesion located?”, “What is its size?”).
    • Answers were either coordinates (for localization), free‑text (for description), numeric values (for size), or categorical labels (for attributes).
    • Hard negatives were generated by pairing lesions with mismatched questions or by selecting nearby non‑lesion patches.
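The template-instantiation step above can be sketched as follows. The exact question wording and answer encodings here are assumptions for illustration; the paper's actual prompts may differ.

```python
# Four question templates, one per task category (illustrative wording).
TEMPLATES = {
    "localization": "Where is the lesion located?",
    "description": "Describe the lesion.",
    "size": "What is its size?",
    "attribute": "Is the lesion solid or cystic?",
}

def build_vqa_pairs(lesion):
    """Instantiate all four templates for one annotated lesion dict,
    pairing each question with its answer in the appropriate format."""
    return [
        (TEMPLATES["localization"], lesion["bbox"]),        # coordinates
        (TEMPLATES["description"], lesion["description"]),  # free text
        (TEMPLATES["size"], lesion["long_axis_mm"]),        # numeric value
        (TEMPLATES["attribute"], lesion["texture"]),        # categorical label
    ]

lesion = {
    "bbox": (120, 84, 168, 131),
    "description": "spiculated mass in the right upper lobe",
    "long_axis_mm": 14.2,
    "texture": "solid",
}
pairs = build_vqa_pairs(lesion)
```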
  3. Model evaluation
    • Baseline vision‑language architectures (ViLT, BLIP, MedCLIP) were trained on the image‑metadata task (bounding‑box regression + description generation) and on the VQA task (question encoding + answer prediction).
    • Performance was measured against a panel of radiologists using standard metrics: IoU for localization, BLEU/ROUGE for description, mean absolute error for size, and accuracy for attribute classification.

Results & Findings

| Task | Radiologist Avg. | Best Model (MedCLIP‑FT) | Gap |
| --- | --- | --- | --- |
| Lesion Localization (IoU) | 0.78 | 0.71 | 0.07 |
| Description Generation (BLEU‑4) | 0.62 | 0.55 | 0.07 |
| Size Estimation (MAE, mm) | 2.1 | 3.4 | +1.3 |
| Attribute Categorization (Acc.) | 0.88 | 0.81 | 0.07 |
  • Fine‑tuning on CT‑Bench consistently narrowed the gap across all tasks, with an average relative improvement of 12% over off‑the‑shelf models.
  • Hard‑negative cases caused the biggest performance drops, highlighting that current models still struggle with subtle visual cues that radiologists handle intuitively.
  • Cross‑task transfer: models trained on the image‑metadata task transferred well to VQA, suggesting that learning lesion‑level visual representations benefits downstream reasoning.

Practical Implications

  • Accelerated development of AI‑assisted reporting – developers can now train and benchmark models that automatically generate lesion descriptions and size estimates, reducing radiologists’ dictation workload.
  • Improved triage systems – accurate lesion localization and attribute classification enable automated flagging of high‑risk findings (e.g., spiculated nodules) for priority review.
  • Foundation for multimodal clinical decision support – the VQA format mirrors real‑world queries (“Is this lesion larger than 1 cm?”), paving the way for conversational AI assistants that can answer radiology‑specific questions on the fly.
  • Benchmark‑driven research – open access to CT‑Bench encourages reproducibility and fair comparison, fostering community‑wide progress on multimodal medical vision models.

Limitations & Future Work

  • Dataset diversity – while the collection spans several hospitals, it is still biased toward adult thoracic CTs; abdominal or pediatric lesions are under‑represented.
  • Annotation granularity – bounding boxes are coarse; pixel‑level segmentation could enable more precise size and shape analysis.
  • Answer scope – the VQA set focuses on four task categories; expanding to differential diagnosis or treatment recommendation questions would broaden clinical relevance.
  • Model generalization – current experiments show performance drops when models are tested on external institutions not seen during fine‑tuning, indicating a need for domain‑robust training strategies.

Overall, CT‑Bench marks a pivotal step toward truly multimodal AI that can both see and describe lesions in CT scans, offering a practical platform for developers to build the next generation of radiology assistants.

Authors

  • Qingqing Zhu
  • Qiao Jin
  • Tejas S. Mathai
  • Yin Fang
  • Zhizheng Wang
  • Yifan Yang
  • Maame Sarfo-Gyamfi
  • Benjamin Hou
  • Ran Gu
  • Praveen T. S. Balamuralikrishna
  • Kenneth C. Wang
  • Ronald M. Summers
  • Zhiyong Lu

Paper Information

  • arXiv ID: 2602.14879v1
  • Categories: cs.CV, cs.AI
  • Published: February 16, 2026