[Paper] Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?
Source: arXiv - 2606.12250v1
Overview
Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability because models can exploit guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) loses 28.4 and 31 percentage points (pp) on the English and Polish exams, respectively. Although we find little evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.
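To make the idea of answer bias concrete, here is a minimal sketch of one common probe: shuffling the answer options and comparing accuracy before and after. This is an illustration only, and it is not claimed to be one of the paper's four structural modifications, which the abstract does not describe. The names `shuffle_options`, `accuracy`, `pp_drop`, and `always_first`, as well as the toy question set, are hypothetical.

```python
import random


def shuffle_options(options, answer_idx, rng):
    """Return the options in random order plus the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option now sits where its old index appears in `order`.
    return shuffled, order.index(answer_idx)


def accuracy(predict, questions, rng=None):
    """Fraction answered correctly; options are shuffled only if `rng` is given."""
    correct = 0
    for q in questions:
        options, answer_idx = q["options"], q["answer"]
        if rng is not None:
            options, answer_idx = shuffle_options(options, answer_idx, rng)
        correct += predict(q["question"], options) == answer_idx
    return correct / len(questions)


def pp_drop(acc_before, acc_after):
    """Accuracy drop expressed in percentage points (pp)."""
    return round(100 * (acc_before - acc_after), 1)


# Hypothetical stand-in for a model: it always picks the first option,
# i.e. it has pure position bias and no medical knowledge.
def always_first(question, options):
    return 0


# Toy data (assumption): the correct answer is always listed first,
# so the biased "model" looks perfect on the unshuffled set.
questions = [
    {"question": f"Q{i}", "options": ["a", "b", "c", "d", "e"], "answer": 0}
    for i in range(200)
]

plain = accuracy(always_first, questions)
shuffled = accuracy(always_first, questions, rng=random.Random(0))
print(f"plain={plain:.2f} shuffled={shuffled:.2f} drop={pp_drop(plain, shuffled)} pp")
```

On this toy set the position-biased predictor scores perfectly until the options are shuffled, after which it falls toward the chance level for five options. A real evaluation would replace `always_first` with a call to the model under test and use the actual exam questions.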
Key Contributions
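According to the abstract, the paper contributes:
- An expanded, more challenging benchmark based on Polish medical exams, with over 15,000 added questions and two new domains.
- Four structural modifications that reduce MCQA-specific artifacts and better test reasoning.
- An evaluation of 21 LLMs showing that evaluation design strongly affects results.
- A public release of the benchmark to facilitate further research.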
Methodology
Please refer to the full paper for detailed methodology.
Practical Implications
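Standard MCQA scores should be read with caution when judging medical LLMs: the abstract reports that they do not reliably reflect true medical competence, and that even the best model (Qwen3.5-122B) loses roughly 28 to 31 pp under the harder evaluation setup.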
Authors
- Antoni Lasik
- Jakub Pokrywka
- Łukasz Grzybowski
- Jeremi Ignacy Kaczmarek
- Gabriela Korzańska
- Janusz Świeczkowski-Feiz
- Oskar Pastuszek
- Paulina Hoffman
- Jakub Tomasz Dąbrowski
- Wojciech Kusa
Paper Information
- arXiv ID: 2606.12250v1
- Categories: cs.CL
- Published: June 10, 2026