[Paper] MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge
Source: arXiv - 2605.05175v1
Overview
The paper introduces MRI‑Eval, a tiered benchmark that tests large language models (LLMs) on two areas that matter to MRI researchers and technologists: core MRI physics and the practical operation of GE (General Electric) scanners. By going beyond textbook‑style multiple‑choice questions, the authors expose gaps in models’ free‑text recall, especially for vendor‑specific workflow knowledge that can directly affect scan protocols and patient safety.
Key Contributions
- Tiered benchmark design covering 1,365 scored items across nine topic categories and three difficulty levels.
- Dual evaluation modes: standard MCQ (with answer options) and “stem‑only” free‑text prompting, plus a primed variant that tests models against deliberately incorrect user claims.
- Comprehensive content sources: modern textbooks, GE scanner manuals, programming course material, and expert‑crafted questions.
- Cross‑model comparison of five leading LLMs from four model families (GPT‑5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B).
- Empirical finding that high MCQ scores (≈ 93‑97 % accuracy) can mask severe weaknesses in open‑ended recall, particularly for GE‑specific operational knowledge (as low as ~14 % accuracy).
Methodology
- Question Set Construction – The authors curated 1,365 items, splitting them into three difficulty tiers (easy, medium, hard) and nine categories (e.g., basic MRI physics, pulse‑sequence design, safety, GE console navigation, troubleshooting). Sources ranged from standard textbooks to actual GE service manuals and custom questions from domain experts.
- Evaluation Modes
  - MCQ – Traditional multiple‑choice format where the model selects the correct option.
  - Stem‑only – The answer choices are stripped; the model must generate a free‑text answer. An independent LLM judge scores these responses for correctness.
  - Primed Stem‑only – The same stem‑only prompt is prefixed with a plausible but incorrect user claim, testing whether the model corrects the misinformation or defers to it.
- Model Families – Five state‑of‑the‑art LLMs were queried via their public APIs, using the same prompts across all modes to ensure a fair head‑to‑head comparison.
- Scoring – MCQ accuracy is straightforward. For stem‑only, the judge LLM assigns a binary correct/incorrect label based on domain‑specific criteria.
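The three evaluation modes can be sketched as different renderings of the same item. The snippet below is a minimal illustration, not the authors' harness: the `Item` fields, the prompt wording, and the example question are all assumptions made for this sketch, and the judge step for stem‑only answers is left out because the summary does not describe its prompt or criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    """One benchmark question. All field names here are hypothetical."""
    stem: str
    options: dict                      # e.g. {"A": "...", "B": "..."}
    answer_key: str                    # letter of the correct option
    category: str                      # one of the nine topic categories
    tier: str                          # "easy", "medium", or "hard"
    wrong_claim: Optional[str] = None  # used only by the primed variant


def build_prompt(item: Item, mode: str) -> str:
    """Render the same item in one of the three evaluation modes."""
    if mode == "mcq":
        lines = [item.stem]
        lines += [f"{key}. {text}" for key, text in sorted(item.options.items())]
        return "\n".join(lines) + "\nAnswer with a single letter."
    if mode == "stem_only":
        # Answer options are stripped; the model must recall the answer.
        return item.stem + "\nAnswer in one or two sentences."
    if mode == "primed":
        if item.wrong_claim is None:
            raise ValueError("primed mode needs a wrong_claim")
        # Same stem-only prompt, prefixed with an incorrect user claim.
        return f"I believe that {item.wrong_claim}\n\n" + build_prompt(item, "stem_only")
    raise ValueError(f"unknown mode: {mode}")


def score_mcq(item: Item, model_output: str) -> bool:
    """Naive MCQ scoring: the first character of the reply must match the key."""
    return model_output.strip().upper()[:1] == item.answer_key.upper()


def accuracy(labels: list) -> float:
    """Fraction of binary correct/incorrect labels that are correct."""
    return sum(labels) / len(labels) if labels else 0.0


# Illustrative item written for this sketch (not taken from the benchmark).
EXAMPLE = Item(
    stem="What is the approximate Larmor frequency of hydrogen protons at 3 T?",
    options={"A": "64 MHz", "B": "128 MHz", "C": "256 MHz", "D": "300 MHz"},
    answer_key="B",
    category="basic MRI physics",
    tier="easy",
    wrong_claim="the Larmor frequency of hydrogen at 3 T is about 64 MHz.",
)

for mode in ("mcq", "stem_only", "primed"):
    print(f"--- {mode} ---\n{build_prompt(EXAMPLE, mode)}\n")
```

Keeping one item definition and three renderings means any accuracy difference between modes comes from the prompt format rather than from different questions. The first‑letter check in `score_mcq` is deliberately simple and would miss replies such as "The answer is B"; a real harness would need more robust answer extraction.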
Results & Findings
| Model | MCQ Accuracy | Stem‑only Accuracy | GE Ops MCQ | GE Ops Stem‑only |
|---|---|---|---|---|
| GPT‑5.4 | 97.1 % | 61.1 % | 94.6 % | 29.8 % |
| Claude Opus 4.6 | 95.8 % | 58.4 % | 92.3 % | 23.5 % |
| Claude Sonnet 4.6 | 94.9 % | 60.2 % | 90.1 % | 21.7 % |
| Gemini 2.5 Pro | 93.6 % | 59.0 % | 88.2 % | 13.8 % |
| Llama 3.3 70B | 93.2 % | 37.1 % | 89.0 % | 15.4 % |
- High MCQ scores: All models exceed 93 % accuracy, which may indicate memorization of textbook‑style question‑and‑answer patterns.
- Drop in stem‑only: When forced to recall information without answer options as cues, accuracy collapses to 37‑61 %, indicating that the models recognize correct answers far more reliably than they can produce them unaided.
- Vendor‑specific weakness: The GE scanner operations category consistently lags behind physics or safety topics, especially in the stem‑only condition (as low as ~14 % for Gemini).
- Primed tests: Models often reproduced the incorrect claim, indicating susceptibility to user misinformation, which is a critical risk for clinical decision support.
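The size of the MCQ‑to‑stem‑only gap can be read directly off the results table. The short script below copies the table values and computes the drop in percentage points for each model, both overall and for the GE operations category; only the variable and function names are assumptions.

```python
# (overall MCQ, overall stem-only, GE ops MCQ, GE ops stem-only), in percent,
# copied from the results table.
RESULTS = {
    "GPT-5.4": (97.1, 61.1, 94.6, 29.8),
    "Claude Opus 4.6": (95.8, 58.4, 92.3, 23.5),
    "Claude Sonnet 4.6": (94.9, 60.2, 90.1, 21.7),
    "Gemini 2.5 Pro": (93.6, 59.0, 88.2, 13.8),
    "Llama 3.3 70B": (93.2, 37.1, 89.0, 15.4),
}


def gaps(results):
    """Return {model: (overall gap, GE ops gap)} in percentage points."""
    out = {}
    for model, (mcq, stem, ge_mcq, ge_stem) in results.items():
        out[model] = (round(mcq - stem, 1), round(ge_mcq - ge_stem, 1))
    return out


for model, (overall, ge_ops) in gaps(RESULTS).items():
    print(f"{model:<18} overall gap {overall:5.1f}   GE ops gap {ge_ops:5.1f}")
```

On these numbers the overall gap ranges from 34.6 points (Gemini 2.5 Pro) to 56.1 points (Llama 3.3 70B), while the GE operations gap ranges from 64.8 points (GPT‑5.4) to 74.4 points (Gemini 2.5 Pro).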
Practical Implications
- Caution for “AI‑assisted protocol design” – Relying on raw LLM outputs to generate or verify GE‑specific scan parameters could propagate errors, potentially compromising image quality or patient safety.
- Tooling for MRI technologists – MRI‑Eval can serve as a regression suite for vendors building domain‑specific assistants, ensuring that updates improve free‑text recall, not just MCQ performance.
- Hybrid workflows – Combining LLMs with rule‑based checks (e.g., cross‑referencing against official GE console manuals) can mitigate the hallucination risk highlighted by the primed stem‑only experiments.
- Training data considerations – The stark contrast between MCQ and stem‑only performance suggests that many commercial LLMs are heavily tuned on curated QA datasets. Incorporating more procedural documentation (service manuals, SOPs) into fine‑tuning pipelines could close the vendor‑knowledge gap.
- Benchmark adoption – MRI‑Eval provides a reproducible, tiered test bed that can be integrated into CI pipelines for any LLM intended for radiology or research MRI environments.
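One way to use the benchmark as a regression suite in CI is to compare per‑category accuracy of a candidate model against a stored baseline and fail the build on any meaningful drop. The sketch below assumes results are already available as a mapping from `(category, mode)` to accuracy in percent; that layout, the tolerance, and the numbers in the example are hypothetical, not something the paper specifies.

```python
def regression_failures(baseline, candidate, tolerance=1.0):
    """Return the (category, mode) keys whose accuracy dropped by more than
    `tolerance` percentage points, or that are missing from the candidate."""
    failures = []
    for key, base_acc in baseline.items():
        new_acc = candidate.get(key)
        if new_acc is None or base_acc - new_acc > tolerance:
            failures.append(key)
    return sorted(failures)


# Hypothetical numbers for illustration only.
baseline = {
    ("GE ops", "mcq"): 90.0,
    ("GE ops", "stem_only"): 20.0,
    ("safety", "stem_only"): 60.0,
}
candidate = {
    ("GE ops", "mcq"): 92.0,        # improved
    ("GE ops", "stem_only"): 15.0,  # dropped by 5 points: flagged
    ("safety", "stem_only"): 59.5,  # within tolerance
}

failed = regression_failures(baseline, candidate)
print(failed)  # [('GE ops', 'stem_only')]
```

Tracking stem‑only results separately from MCQ results matters here: in the example, the MCQ score improves while free‑text recall regresses, which an aggregate or MCQ‑only check would not catch. A CI job could exit with a non‑zero status whenever `failed` is non‑empty.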
Limitations & Future Work
- Scope limited to GE scanners – Other major vendors (Siemens, Philips) are not covered, so the benchmark’s findings may not generalize across the whole MRI ecosystem.
- Reliance on an LLM judge – The stem‑only scoring depends on another model’s judgment, which could introduce bias; human expert validation would strengthen the results.
- Static question set – While large, the 1,365 items are fixed; future work could include a dynamic question‑generation component to test models on truly novel scenarios.
- Real‑world deployment testing – The study stops at offline evaluation; integrating the benchmark into live clinical decision‑support tools would reveal additional usability and safety considerations.
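The human validation suggested for the LLM judge could be quantified by having experts relabel a sample of stem‑only responses and measuring chance‑corrected agreement with the judge. The sketch below computes Cohen's kappa for binary correct/incorrect labels; this is one standard agreement statistic chosen for illustration, not a procedure described in the paper, and the labels are made up.

```python
def cohens_kappa(judge, human):
    """Cohen's kappa for two equal-length lists of binary (bool) labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance, from each rater's marginal "correct" rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters gave the same single label to every item
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight responses (True = marked correct).
judge_labels = [True, True, False, False, True, False, True, True]
human_labels = [True, False, False, False, True, False, True, True]

print(cohens_kappa(judge_labels, human_labels))  # 0.75
```

In this example the raters agree on 7 of 8 items (0.875 observed agreement) against 0.5 expected by chance, giving a kappa of 0.75. Raw agreement alone can look high when most answers fall into one class, which is why a chance‑corrected measure is useful for low‑accuracy categories such as GE operations.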
Authors
- Perry E. Radau
Paper Information
- arXiv ID: 2605.05175v1
- Categories: eess.IV, cs.CL, physics.med-ph
- Published: May 6, 2026