[Paper] Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

Published: June 10, 2026 at 11:52 AM EDT
Source: arXiv - 2606.12250v1

Overview

Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the score of the best model (Qwen3.5-122B) drops by 28.4 and 31 percentage points (pp) on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.
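
The guessing effect mentioned above can be illustrated with a minimal sketch. This is not the paper's method: the four structural modifications are not specified in this summary. The helpers `chance_corrected_accuracy` and `pp_drop` are hypothetical names, and the accuracies and the number of answer options used below are made-up values, not results from the paper. The first helper rescales raw accuracy so that random guessing maps to 0, and the second expresses the gap between two setups in percentage points.

```python
def chance_corrected_accuracy(accuracy: float, n_options: int) -> float:
    """Rescale raw MCQA accuracy so random guessing maps to 0 and a perfect score to 1."""
    if n_options < 2:
        raise ValueError("need at least two answer options")
    chance = 1.0 / n_options  # expected accuracy of uniform random guessing
    return (accuracy - chance) / (1.0 - chance)


def pp_drop(standard_acc: float, harder_acc: float) -> float:
    """Drop in percentage points between two accuracies given as fractions in [0, 1]."""
    return (standard_acc - harder_acc) * 100.0


if __name__ == "__main__":
    # Hypothetical values for illustration only (not taken from the paper).
    standard, harder, k = 0.90, 0.60, 5
    print(round(chance_corrected_accuracy(standard, k), 3))
    print(round(pp_drop(standard, harder), 1))
```

With five options, a model that only guesses still scores about 20% raw accuracy, which is one reason raw MCQA scores can look better than the underlying competence.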

Key Contributions

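According to the abstract, the paper contributes:

  • An expanded, more challenging benchmark based on Polish medical exams, with over 15,000 added questions and two new domains
  • Four structural modifications that reduce MCQA-specific artifacts and better test reasoning
  • An evaluation of 21 LLMs showing that evaluation design strongly affects results
  • A public release of the benchmark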

Methodology

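The abstract describes the setup only at a high level: 21 LLMs are evaluated on Polish medical exam questions, and results are reported for both the standard MCQA format and the structurally modified, harder setup. Please refer to the full paper for the detailed methodology.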

Practical Implications

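Standard MCQA accuracy may overstate an LLM's medical competence. Because even the best model loses roughly 30 percentage points under the harder setup, benchmark scores should be interpreted with the evaluation design in mind.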

Authors

  • Antoni Lasik
  • Jakub Pokrywka
  • Łukasz Grzybowski
  • Jeremi Ignacy Kaczmarek
  • Gabriela Korzańska
  • Janusz Świeczkowski-Feiz
  • Oskar Pastuszek
  • Paulina Hoffman
  • Jakub Tomasz Dąbrowski
  • Wojciech Kusa

Paper Information

  • arXiv ID: 2606.12250v1
  • Categories: cs.CL
  • Published: June 10, 2026