[Paper] Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models

Published: January 6, 2026 at 01:18 PM EST
4 min read
Source: arXiv - 2601.03232v1

Overview

The paper introduces RXL‑RADSet, a new synthetic benchmark of 1,600 radiology reports that span ten different Reporting and Data Systems (RADS) such as BI‑RADS, LI‑RADS, and Lung‑RADS. By feeding these reports to 41 open‑weight small language models (SLMs) and to a proprietary “GPT‑5.2” model, the authors evaluate how well current LLMs can automatically assign the correct RADS label—a task that is notoriously hard because the guidelines are intricate and the output format is tightly constrained.

Key Contributions

  • RXL‑RADSet dataset: 1,600 radiologist‑verified synthetic reports covering 10 RADS categories and multiple imaging modalities.
  • Comprehensive benchmark: Head‑to‑head evaluation of 41 quantized open‑weight models (0.135 B – 32 B parameters) plus a proprietary GPT‑5.2 model.
  • Prompting study: Systematic comparison of guided prompting (structured prompt with explicit instructions) vs. zero‑shot prompting.
  • Scaling analysis: Empirical evidence that performance improves sharply once the parameter count passes roughly 10 B, with a wide gap between sub‑1 B and ≥10 B models.
  • Error taxonomy: Identification that most accuracy loss on complex RADS stems from classification difficulty rather than malformed outputs.

Methodology

  1. Synthetic report generation – The authors first built scenario “plans” for each RADS category (e.g., typical findings, edge cases) and used existing LLMs to write reports in the style of radiologists.
  2. Two‑stage radiologist verification – A first reviewer checked for factual consistency; a second reviewer confirmed the correct RADS label, yielding a high‑quality ground truth.
  3. Model suite – 41 open‑weight SLMs from 12 families (e.g., LLaMA, Mistral, Falcon) were quantized to run efficiently on commodity GPUs. GPT‑5.2 served as the proprietary baseline.
  4. Prompt design – All models received a fixed guided prompt that explicitly asked for the RADS label and the required output format. A parallel zero‑shot run omitted the guidance.
  5. Evaluation metrics
    • Validity: Does the model output a syntactically correct RADS label?
    • Accuracy: Does the label match the radiologist‑verified ground truth?
      Both metrics were computed per‑report and aggregated across the entire benchmark; a minimal scoring sketch follows this list.
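
The summary above does not reproduce the paper's prompt wording or scoring code, so the following is a minimal Python sketch of how the two metrics could be operationalized. The guided‑prompt template, the regex patterns for what counts as a well‑formed label, and the helper names (`GUIDED_PROMPT`, `score_report`, `aggregate`) are assumptions made for illustration, not the authors' actual implementation.

```python
import re

# Hypothetical guided prompt; the paper's actual wording is not given in this summary.
GUIDED_PROMPT = (
    "You are a radiologist. Read the report below and assign the single most "
    "appropriate {rads_system} category.\n"
    "Answer with the category only, e.g. 'BI-RADS 4'.\n\n"
    "Report:\n{report_text}\n"
)

# Illustrative patterns for a syntactically valid label (real category sets are richer).
VALID_LABEL = {
    "BI-RADS": re.compile(r"^BI-RADS\s*[0-6][ABC]?$", re.IGNORECASE),
    "LI-RADS": re.compile(r"^LI-RADS\s*(1|2|3|4|5|M|TIV)$", re.IGNORECASE),
}

def score_report(rads_system: str, model_output: str, ground_truth: str) -> dict:
    """Per-report validity (well-formed label) and accuracy (matches ground truth)."""
    answer = model_output.strip()
    valid = bool(VALID_LABEL[rads_system].match(answer))
    correct = valid and answer.replace(" ", "").upper() == ground_truth.replace(" ", "").upper()
    return {"valid": valid, "correct": correct}

def aggregate(per_report: list) -> dict:
    """Pool per-report flags into benchmark-level validity and accuracy."""
    n = len(per_report)
    return {
        "validity": sum(r["valid"] for r in per_report) / n,
        "accuracy": sum(r["correct"] for r in per_report) / n,
    }

# Example: one correct prediction and one malformed output.
scores = [
    score_report("BI-RADS", "BI-RADS 4A", "BI-RADS 4A"),
    score_report("LI-RADS", "Probably LR-4", "LI-RADS 4"),
]
print(aggregate(scores))  # {'validity': 0.5, 'accuracy': 0.5}
```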

Results & Findings

Model family (size) | Validity | Accuracy
GPT‑5.2 (proprietary) | 99.8 % | 81.1 %
All SLMs (pooled) | 96.8 % | 61.1 %
Top SLMs (20‑32 B) | ≈99 % | 70‑78 %
  • Scaling effect: Models under 1 B parameters hover around 90 % validity and 45 % accuracy, while those ≥10 B jump to >95 % validity and >70 % accuracy.
  • Prompt impact: Guided prompting raises validity from 96.7 % (zero‑shot) to 99.2 % and lifts accuracy from 69.6 % to 78.5 %.
  • Complexity penalty: RADS schemes with more granular categories (e.g., PI‑RADS, VI‑RADS) show larger drops in accuracy, driven mainly by misclassification rather than malformed outputs (a toy breakdown follows below).
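
To make the misclassification‑versus‑malformed distinction concrete, here is a toy breakdown per RADS system. It assumes prediction records with `system`, `valid`, and `correct` fields (as in the scoring sketch above); `error_breakdown` and the example records are invented for illustration, not taken from the paper.

```python
from collections import Counter, defaultdict

def error_breakdown(records):
    """Split predictions into correct / misclassified (valid but wrong) / malformed
    (invalid output), grouped by RADS system, and report each share as a fraction."""
    buckets = defaultdict(Counter)
    for r in records:
        if not r["valid"]:
            kind = "malformed"
        elif r["correct"]:
            kind = "correct"
        else:
            kind = "misclassified"
        buckets[r["system"]][kind] += 1
    return {
        system: {kind: count / sum(counts.values()) for kind, count in counts.items()}
        for system, counts in buckets.items()
    }

# Example with made-up records: the PI-RADS error here is a misclassification, not a format failure.
records = [
    {"system": "PI-RADS", "valid": True, "correct": False},
    {"system": "PI-RADS", "valid": True, "correct": True},
    {"system": "Lung-RADS", "valid": False, "correct": False},
]
print(error_breakdown(records))
```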

Practical Implications

  • Clinical decision support – Even mid‑size open‑weight models (≈20 B) can reliably extract RADS scores from narrative reports, opening the door to automated triage, audit pipelines, and quality‑control dashboards in radiology departments.
  • Cost‑effective deployment – Quantized SLMs run on a single GPU, meaning hospitals or health‑tech startups can achieve near‑proprietary performance without expensive API calls; a minimal loading sketch appears after this list.
  • Standardization across modalities – Because RXL‑RADSet spans CT, MRI, ultrasound, and mammography, a single model can be fine‑tuned or prompted to handle multi‑modality reporting, reducing the need for modality‑specific parsers.
  • Regulatory reporting – Automated RADS assignment can help meet compliance requirements (e.g., BI‑RADS for breast cancer screening) by flagging reports that lack a proper score.
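
The summary does not state which quantization toolchain the authors used, so as one plausible single‑GPU setup, here is a minimal sketch that loads an open‑weight instruction model in 4‑bit with Hugging Face `transformers` and `bitsandbytes`. The model ID, prompt, and generation settings are placeholders, not the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder open-weight model

# 4-bit quantization so the model fits on a single commodity GPU.
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant_cfg, device_map="auto"
)

prompt = (
    "Read the radiology report below and answer with the BI-RADS category only, "
    "e.g. 'BI-RADS 4'.\n\nReport: Screening mammogram. Scattered fibroglandular "
    "densities. No suspicious mass, calcification, or architectural distortion.\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

GGUF builds served through llama.cpp or Ollama are an equally common route for the smaller models and avoid the Python dependency stack entirely.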

Limitations & Future Work

  • Synthetic nature – Although radiologist‑verified, the reports are generated by LLMs and may not capture the full variability of real‑world dictations, especially rare edge cases.
  • Scope of RADS – The benchmark covers ten RADS systems, but many subspecialties (e.g., pediatric radiology) use additional or customized scoring schemes.
  • Model diversity – Only quantized versions of open‑weight models were tested; larger, sparsely‑activated or retrieval‑augmented models could shift the performance curve.
  • Prompt engineering – The study used a single guided prompt; exploring prompt ensembles or chain‑of‑thought prompting could further close the gap with proprietary models.

Bottom line: RXL‑RADSet provides a much‑needed, openly available yardstick for RADS extraction, and the results suggest that with the right prompting strategy, developers can now build practical, low‑cost LLM‑powered tools for radiology reporting without relying exclusively on closed‑source APIs.

Authors

  • Kartik Bose
  • Abhinandan Kumar
  • Raghuraman Soundararajan
  • Priya Mudgil
  • Samonee Ralmilay
  • Niharika Dutta
  • Manphool Singhal
  • Arun Kumar
  • Saugata Sen
  • Anurima Patra
  • Priya Ghosh
  • Abanti Das
  • Amit Gupta
  • Ashish Verma
  • Dipin Sudhakaran
  • Ekta Dhamija
  • Himangi Unde
  • Ishan Kumar
  • Krithika Rangarajan
  • Prerna Garg
  • Rachel Sequeira
  • Sudhin Shylendran
  • Taruna Yadav
  • Tej Pal
  • Pankaj Gupta

Paper Information

  • arXiv ID: 2601.03232v1
  • Categories: cs.CL, cs.AI
  • Published: January 6, 2026
