[Paper] Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

Published: June 10, 2026 at 11:52 AM EDT

Source: arXiv - 2606.12250v1

Overview

Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability because of guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the score of the best model (Qwen3.5-122B) drops by 28.4 percentage points on English exams and 31 percentage points on Polish exams. Although we find little evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.
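
A back-of-the-envelope calculation shows one way guessing can inflate MCQA scores. The sketch below is illustrative only and is not the paper's method: it assumes a model that truly knows some fraction of the answers and guesses uniformly at random on the rest. The function names, the five-option format, and the example scores are all hypothetical.

```python
def expected_mcqa_accuracy(known_fraction: float, num_options: int) -> float:
    """Expected accuracy if a model knows `known_fraction` of the answers
    and guesses uniformly at random among `num_options` on the rest.

    This uniform-guessing model is a simplifying assumption, not the
    evaluation procedure described in the paper.
    """
    if not 0.0 <= known_fraction <= 1.0:
        raise ValueError("known_fraction must be between 0 and 1")
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    return known_fraction + (1.0 - known_fraction) / num_options


def percentage_point_drop(baseline_pct: float, harder_pct: float) -> float:
    """Absolute difference between two accuracies given in percent."""
    return baseline_pct - harder_pct


if __name__ == "__main__":
    # Hypothetical: the model knows half the answers, five options per question.
    inflated = expected_mcqa_accuracy(0.5, 5)
    print(f"Expected MCQA accuracy: {inflated:.1%}")

    # Hypothetical scores, not figures from the paper.
    print(f"Drop: {percentage_point_drop(80.0, 60.0):.1f} pp")
```

Under these assumptions, the reported score is always at least as high as the fraction of answers the model actually knows, which is why a percentage-point drop under a harder setup can be informative.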

Key Contributions

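According to the abstract, the paper contributes the following:

An expanded, more challenging benchmark based on Polish medical exams, with over 15,000 added questions and two new domains.
Four structural modifications intended to reduce MCQA-specific artifacts and better test reasoning.
An evaluation of 21 LLMs showing that evaluation design strongly affects results.
A public release of the benchmark.

The paper is listed under the arXiv category cs.CL (Computation and Language).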

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

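The results suggest that standard MCQA scores should not be treated as a reliable measure of medical competence: under the harder setup, even the best model's score drops by roughly 30 percentage points.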

Authors

Antoni Lasik
Jakub Pokrywka
Łukasz Grzybowski
Jeremi Ignacy Kaczmarek
Gabriela Korzańska
Janusz Świeczkowski-Feiz
Oskar Pastuszek
Paulina Hoffman
Jakub Tomasz Dąbrowski
Wojciech Kusa

Paper Information

arXiv ID: 2606.12250v1
Categories: cs.CL
Published: June 10, 2026
