[Paper] CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography

Published: February 16, 2026
Source: arXiv - 2602.14879v1

Overview

The paper introduces CT‑Bench, the first publicly released benchmark that pairs CT images of lesions with rich, lesion‑level metadata (bounding boxes, textual descriptions, size measurements) and a multimodal visual‑question‑answering (VQA) suite. By filling the long‑standing gap of annotated CT data, the authors enable systematic evaluation and fine‑tuning of AI models that must both “see” a lesion and “talk” about it—an essential step toward trustworthy, assistive radiology tools.

Key Contributions

  • A large, curated lesion dataset – 20,335 lesions from 7,795 CT studies, each annotated with a bounding box, free‑form description, and quantitative size information.
  • A multimodal VQA benchmark – 2,850 question‑answer pairs covering four core tasks: lesion localization, textual description, size estimation, and attribute categorization (e.g., solid vs. cystic).
  • Hard‑negative examples – deliberately confusing non‑lesion regions and ambiguous cases to mimic real diagnostic difficulty.
  • Comprehensive evaluation – state‑of‑the‑art vision‑language models (including medical CLIP variants) are benchmarked against radiologist performance on both the image‑metadata and VQA tasks.
  • Demonstrated fine‑tuning gains – models pre‑trained on generic image‑text data achieve sizable improvements after fine‑tuning on CT‑Bench, confirming the dataset’s clinical relevance.
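To make the annotation structure above concrete, here is a minimal sketch of what a single lesion record might look like. The field names and values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class LesionRecord:
    study_id: str      # de-identified CT study identifier
    bbox: tuple        # axis-aligned box: (x_min, y_min, x_max, y_max)
    description: str   # radiologist free-text description
    long_axis_mm: float  # longest axial diameter in millimeters
    attributes: dict   # categorical labels, e.g. {"texture": "solid"}

# One hypothetical record, mirroring the annotation types listed above.
record = LesionRecord(
    study_id="study_0001",
    bbox=(120, 84, 168, 131),
    description="spiculated mass in the right upper lobe",
    long_axis_mm=14.2,
    attributes={"texture": "solid"},
)
```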

Methodology

  1. Data collection & annotation
    • Retrospective CT scans were sourced from multiple institutions, de‑identified, and split into training/validation/test sets.
    • Board‑certified radiologists drew axis‑aligned bounding boxes around each lesion, wrote concise natural‑language descriptions (e.g., “spiculated mass in the right upper lobe”), and recorded the longest axial diameter.
  2. VQA construction
    • For every lesion, four question templates were instantiated (e.g., “Where is the lesion located?”, “What is its size?”).
    • Answers were either coordinates (for localization), free‑text (for description), numeric values (for size), or categorical labels (for attributes).
    • Hard negatives were generated by pairing lesions with mismatched questions or by selecting nearby non‑lesion patches.
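The template-instantiation step above can be sketched as follows. The exact question wording and answer encodings here are assumptions for illustration; the paper's actual prompts may differ.

```python
# Four question templates, one per task category (illustrative wording).
TEMPLATES = {
    "localization": "Where is the lesion located?",
    "description": "Describe the lesion.",
    "size": "What is its size?",
    "attribute": "Is the lesion solid or cystic?",
}

def build_vqa_pairs(lesion):
    """Instantiate all four templates for one annotated lesion dict,
    pairing each question with its answer in the appropriate format."""
    return [
        (TEMPLATES["localization"], lesion["bbox"]),        # coordinates
        (TEMPLATES["description"], lesion["description"]),  # free text
        (TEMPLATES["size"], lesion["long_axis_mm"]),        # numeric value
        (TEMPLATES["attribute"], lesion["texture"]),        # categorical label
    ]

lesion = {
    "bbox": (120, 84, 168, 131),
    "description": "spiculated mass in the right upper lobe",
    "long_axis_mm": 14.2,
    "texture": "solid",
}
pairs = build_vqa_pairs(lesion)
```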
  3. Model evaluation
    • Baseline vision‑language architectures (ViLT, BLIP, MedCLIP) were trained on the image‑metadata task (bounding‑box regression + description generation) and on the VQA task (question encoding + answer prediction).
    • Performance was measured against a panel of radiologists using standard metrics: IoU for localization, BLEU/ROUGE for description, mean absolute error for size, and accuracy for attribute classification.

Results & Findings

| Task | Radiologist Avg. | Best Model (MedCLIP‑FT) | Gap |
| --- | --- | --- | --- |
| Lesion Localization (IoU) | 0.78 | 0.71 | 0.07 |
| Description Generation (BLEU‑4) | 0.62 | 0.55 | 0.07 |
| Size Estimation (MAE, mm) | 2.1 | 3.4 | +1.3 |
| Attribute Categorization (Acc.) | 0.88 | 0.81 | 0.07 |
  • Fine‑tuning on CT‑Bench consistently narrowed the gap across all tasks, with an average relative improvement of 12% over off‑the‑shelf models.
  • Hard‑negative cases caused the biggest performance drops, highlighting that current models still struggle with subtle visual cues that radiologists handle intuitively.
  • Cross‑task transfer: models trained on the image‑metadata task transferred well to VQA, suggesting that learning lesion‑level visual representations benefits downstream reasoning.

Practical Implications

  • Accelerated development of AI‑assisted reporting – developers can now train and benchmark models that automatically generate lesion descriptions and size estimates, reducing radiologists’ dictation workload.
  • Improved triage systems – accurate lesion localization and attribute classification enable automated flagging of high‑risk findings (e.g., spiculated nodules) for priority review.
  • Foundation for multimodal clinical decision support – the VQA format mirrors real‑world queries (“Is this lesion larger than 1 cm?”), paving the way for conversational AI assistants that can answer radiology‑specific questions on the fly.
  • Benchmark‑driven research – open access to CT‑Bench encourages reproducibility and fair comparison, fostering community‑wide progress on multimodal medical vision models.

Limitations & Future Work

  • Dataset diversity – while the collection spans several hospitals, it is still biased toward adult thoracic CTs; abdominal or pediatric lesions are under‑represented.
  • Annotation granularity – bounding boxes are coarse; pixel‑level segmentation could enable more precise size and shape analysis.
  • Answer scope – the VQA set focuses on four task categories; expanding to differential diagnosis or treatment recommendation questions would broaden clinical relevance.
  • Model generalization – current experiments show performance drops when models are tested on external institutions not seen during fine‑tuning, indicating a need for domain‑robust training strategies.

Overall, CT‑Bench marks a pivotal step toward truly multimodal AI that can both see and describe lesions in CT scans, offering a practical platform for developers to build the next generation of radiology assistants.

Authors

  • Qingqing Zhu
  • Qiao Jin
  • Tejas S. Mathai
  • Yin Fang
  • Zhizheng Wang
  • Yifan Yang
  • Maame Sarfo-Gyamfi
  • Benjamin Hou
  • Ran Gu
  • Praveen T. S. Balamuralikrishna
  • Kenneth C. Wang
  • Ronald M. Summers
  • Zhiyong Lu

Paper Information

  • arXiv ID: 2602.14879v1
  • Categories: cs.CV, cs.AI
  • Published: February 16, 2026