[Paper] MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

Published: December 29, 2025 at 03:48 AM EST
4 min read

Source: arXiv - 2512.23304v1

Overview

A new study pits an open‑source, domain‑tuned multimodal model (MedGemma‑4B‑IT) against the heavyweight proprietary GPT‑4 for zero‑shot medical disease classification from imaging data. By fine‑tuning MedGemma with a lightweight LoRA adapter, the authors achieve a mean accuracy of 80.37 %, outpacing zero‑shot GPT‑4’s 69.58 % across six disease categories. The results highlight how targeted adaptation can make open‑source models not only competitive but also more reliable for high‑risk clinical tasks.

Key Contributions

  • Head‑to‑head benchmark of an open‑source multimodal agent (MedGemma) vs. GPT‑4 on six disease classification tasks.
  • LoRA‑based fine‑tuning of the 4‑billion‑parameter MedGemma model, demonstrating that a few hundred thousand trainable parameters can yield large performance gains.
  • Comprehensive evaluation using accuracy, sensitivity, confusion matrices, and classification reports, with a focus on high‑stakes conditions (cancer, pneumonia).
  • Evidence that domain‑specific fine‑tuning reduces hallucinations, making the model’s outputs more trustworthy for clinical decision support.
  • Open‑source reproducibility: the authors release the LoRA weights and inference scripts, enabling the community to build on their work.
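The claim that "a few hundred thousand trainable parameters" suffice follows from how LoRA factorizes each update. A minimal sketch of the arithmetic, using illustrative layer dimensions that are assumptions, not figures from the paper:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by a LoRA adapter on one weight matrix:
    a down-projection A (d_in x rank) plus an up-projection B (rank x d_out),
    while the original d_in x d_out weight stays frozen."""
    return d_in * rank + rank * d_out

# Illustrative numbers (not from the paper): a 2560-wide square
# projection matrix with a rank-8 adapter.
d, rank = 2560, 8
per_matrix = lora_param_count(d, d, rank)   # 40,960 trainable
full_matrix = d * d                         # 6,553,600 frozen
print(f"LoRA trains {per_matrix:,} params per matrix "
      f"vs {full_matrix:,} frozen ({per_matrix / full_matrix:.2%})")
```

The ratio stays under 1 % per adapted matrix at low rank, which is why only a small fraction of the 4‑billion‑parameter model needs gradient updates.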

Methodology

  1. Data Collection – The authors assembled a curated set of medical images (e.g., chest X‑rays, CT scans) labeled for six diseases, ensuring a balanced test split.
  2. Model Preparation
    • MedGemma‑4B‑IT: A 4‑billion‑parameter instruction‑tuned multimodal LLM pre‑trained on medical and general image‑text pairs.
    • GPT‑4: Accessed via the official API, used in a zero‑shot fashion (no task‑specific prompting or fine‑tuning).
  3. Fine‑Tuning with LoRA – Low‑Rank Adaptation injects trainable low‑dimensional matrices into each transformer layer, leaving the base weights frozen. This approach requires <0.5 % of the original parameters, drastically reducing compute and memory needs.
  4. Prompt Engineering – Both models receive the same textual prompt: “Given the following image, list the most likely disease from the set {…}.” The prompt is kept simple to isolate the effect of model architecture and fine‑tuning.
  5. Evaluation – Standard classification metrics (accuracy, precision, recall, F1) are computed per disease, and confusion matrices are visualized to reveal systematic error patterns.

Results & Findings

| Model | Mean Accuracy | Cancer Recall | Pneumonia Recall |
| --- | --- | --- | --- |
| MedGemma‑4B‑IT (LoRA‑tuned) | 80.37 % | 87 % | 84 % |
| GPT‑4 (zero‑shot) | 69.58 % | 71 % | 68 % |
  • Higher Sensitivity: MedGemma shows a 16‑point boost in cancer detection recall, a critical metric for life‑threatening conditions.
  • Reduced Hallucinations: Qualitative analysis reveals fewer instances where the model fabricates disease names not present in the label set.
  • Error Distribution: Confusion matrices indicate that GPT‑4 tends to misclassify pneumonia as “viral infection” (a non‑target class), whereas MedGemma’s errors are more confined to visually similar diseases (e.g., distinguishing bacterial vs. viral pneumonia).
  • Inference Speed: On a single RTX 4090, MedGemma processes an image in ~0.12 s, while GPT‑4’s API latency averages ~0.45 s per request (including network overhead).
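The sensitivity and confusion‑matrix figures above come from standard per‑class metrics. A minimal sketch of how they are computed, on toy labels that are not the paper's data:

```python
from collections import Counter

def confusion_counts(y_true: list[str], y_pred: list[str]) -> Counter:
    """Count (true, predicted) pairs -- the cells of a confusion matrix."""
    return Counter(zip(y_true, y_pred))

def recall(y_true: list[str], y_pred: list[str], positive: str) -> float:
    """Sensitivity for one class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return tp / actual if actual else 0.0

# Toy example: 3 true cancer cases, 2 recovered -> recall ~ 0.67.
truth = ["cancer", "cancer", "pneumonia", "normal", "cancer"]
preds = ["cancer", "normal", "pneumonia", "normal", "cancer"]
print(recall(truth, preds, "cancer"))
print(confusion_counts(truth, preds))
```

Off‑diagonal cells of the confusion counts (e.g. `("cancer", "normal")`) are exactly the systematic error patterns the paper visualizes.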

Practical Implications

  • Cost‑Effective Deployment: Organizations can run MedGemma locally on commodity GPUs, eliminating recurring API fees and data‑privacy concerns associated with cloud‑only solutions like GPT‑4.
  • Regulatory Friendly: Open‑source models with transparent fine‑tuning pipelines simplify audit trails, a key requirement for FDA‑cleared AI medical devices.
  • Rapid Adaptation: LoRA enables teams to re‑train the model for new disease categories or imaging modalities (e.g., MRI) with minimal compute, supporting agile product roadmaps.
  • Edge‑Ready Use Cases: The lightweight inference footprint makes MedGemma suitable for point‑of‑care devices, tele‑radiology platforms, and mobile health apps that need on‑device reasoning.
  • Hybrid Systems: Developers can combine MedGemma’s high sensitivity for critical conditions with GPT‑4’s broader knowledge base for ancillary tasks (e.g., generating patient summaries), achieving a best‑of‑both‑worlds workflow.

Limitations & Future Work

  • Dataset Scope: The benchmark covers only six diseases and a limited number of imaging modalities; broader validation is needed for general clinical adoption.
  • Zero‑Shot GPT‑4 Baseline: The study uses GPT‑4 without any prompt engineering or few‑shot examples, which may understate its true capability. Future work could explore optimized prompting strategies.
  • Explainability: While MedGemma reduces hallucinations, the paper does not provide visual explanations (e.g., attention maps) that clinicians often require for trust.
  • Regulatory Pathway: The authors acknowledge that additional safety testing, bias analysis, and prospective clinical trials are required before deployment in real‑world settings.

Bottom line: With a modest LoRA fine‑tune, an open‑source multimodal LLM can outperform a leading proprietary model on critical medical imaging tasks, opening the door for cost‑effective, privacy‑preserving AI tools in healthcare.

Authors

  • Md. Sazzadul Islam Prottasha
  • Nabil Walid Rafi

Paper Information

  • arXiv ID: 2512.23304v1
  • Categories: cs.CV, cs.AI
  • Published: December 29, 2025