[Paper] Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
Source: arXiv - 2512.18880v1
Overview
This paper investigates whether today’s large language models (LLMs) can gauge the difficulty that human learners experience when solving exam‑style items. By comparing model‑predicted difficulty scores against human judgments across more than 20 LLMs and several domains (medical knowledge, math reasoning, etc.), the authors uncover a systematic misalignment: bigger or more capable models do not become better at estimating how hard a question is for a student.
Key Contributions
- Large‑scale Human‑AI difficulty alignment study – evaluated 20+ LLMs on >10k items spanning multiple subjects.
- Empirical evidence of a “machine consensus” – models converge on a shared notion of difficulty that diverges from human perception, regardless of model size.
- Proficiency‑simulation prompting analysis – explicit prompts asking the model to adopt a low‑proficiency persona still fail to reproduce human‑like difficulty estimates.
- Introspection gap quantification – models cannot reliably predict their own failure modes or confidence, highlighting a lack of self‑awareness.
- Practical benchmark & dataset release – the authors open‑source the item difficulty dataset and evaluation scripts for future research.
Methodology
- Item Collection – Curated a benchmark of thousands of multiple‑choice and open‑ended questions from standardized tests, medical board exams, and math competitions. Each item already has human difficulty ratings (e.g., percent of test‑takers who answer correctly).
- Model Suite – Ran inference on 20+ LLMs ranging from 125 M to 175 B parameters, including open‑source models (LLaMA, Falcon) and commercial APIs (GPT‑4, Claude).
- Prompt Designs – Two prompting styles were compared (a minimal construction sketch follows this list):
  - Direct difficulty query: “On a scale of 1‑10, how hard is this question for a typical high‑school student?”
  - Proficiency simulation: “Answer as if you are a student who only knows basic algebra.”
- Alignment Metrics – Computed Pearson and Spearman correlations between model‑predicted difficulty scores and human difficulty ratings, plus calibration curves to check how well each model’s stated confidence matches its actual correctness (a worked metric sketch follows this list).
- Statistical Controls – Controlled for item length, topic, and answer format to isolate the effect of model size and prompting style.
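The two prompting styles above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors’ released code: the `call_llm` stub and the single‑number reply format are assumptions about how responses would be collected and parsed.

```python
# Illustrative reconstruction of the two prompting styles; `call_llm` is a
# hypothetical stand-in for whichever model API is under evaluation.

def call_llm(prompt: str) -> str:
    """Placeholder: route `prompt` to the LLM being evaluated and return its reply."""
    raise NotImplementedError("plug in an actual model client here")

def build_direct_prompt(question: str) -> str:
    """Direct difficulty query: rate the item for a typical high-school student."""
    return (
        "On a scale of 1-10, how hard is this question for a typical "
        "high-school student? Reply with a single number.\n\n"
        f"Question: {question}"
    )

def build_persona_prompt(question: str) -> str:
    """Proficiency simulation: answer as a low-proficiency learner, then rate difficulty."""
    return (
        "Answer as if you are a student who only knows basic algebra. "
        "After answering, rate how hard the question felt on a scale of 1-10.\n\n"
        f"Question: {question}"
    )

def predict_difficulty(question: str, mode: str = "direct") -> float:
    prompt = build_direct_prompt(question) if mode == "direct" else build_persona_prompt(question)
    return float(call_llm(prompt).strip())  # assumes the model returns a bare numeric rating
```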
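The alignment metrics themselves are standard. Below is a minimal sketch assuming predicted difficulties, human ratings, confidences, and correctness flags are plain arrays aligned by item; the 10‑bin expected calibration error is one common choice and may differ from the paper’s exact setup.

```python
# Sketch of the alignment metrics: linear/rank correlation between model-predicted
# and human difficulty, plus a simple binned expected calibration error (ECE).
import numpy as np
from scipy.stats import pearsonr, spearmanr

def alignment_scores(pred_difficulty, human_difficulty):
    """Correlation between model-predicted and human difficulty ratings."""
    r, _ = pearsonr(pred_difficulty, human_difficulty)
    rho, _ = spearmanr(pred_difficulty, human_difficulty)
    return {"pearson": r, "spearman": rho}

def expected_calibration_error(confidence, correct, n_bins=10):
    """Average gap between stated confidence and empirical accuracy, weighted by bin size."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return ece
```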
Results & Findings
| Metric | Best model (value) | Typical range across models |
|---|---|---|
| Pearson (direct query) | 0.42 (GPT‑4) | 0.15 – 0.35 |
| Pearson (proficiency simulation) | 0.38 (Claude) | 0.10 – 0.30 |
| Calibration error (confidence vs. correctness; lower is better) | 0.22 (GPT‑4) | 0.30 – 0.55 |
- Scaling paradox: Larger models (GPT‑4, Claude) achieve higher raw accuracy on the items, yet they do not become correspondingly better at judging how hard those items are for human test‑takers.
- Shared machine consensus: Across architectures, models agree with one another more than with humans, often rating “tricky” items as easy and vice versa, which suggests they track pattern‑based solvability rather than human cognitive load (a quantification sketch follows this list).
- Prompting limits: Even when forced to “pretend” to be a novice, models still over‑estimate their own ability, producing difficulty scores that remain poorly correlated with human data.
- Introspection failure: Models rarely flag their own uncertainty; confidence scores are poorly calibrated, making it hard to detect when a prediction is likely wrong.
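One rough way to quantify the “machine consensus” claim is to compare average inter‑model rank correlation against average model‑human correlation. The sketch below assumes per‑model difficulty scores stored as arrays over the same item order; the dictionary layout is illustrative, not the paper’s data format.

```python
# Compare how much models agree with each other versus with human difficulty ratings.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def consensus_vs_human(model_scores: dict, human_scores) -> dict:
    """model_scores: {model_name: array of predicted difficulties}; human_scores: array of human ratings."""
    inter_model = [
        spearmanr(model_scores[a], model_scores[b])[0]
        for a, b in combinations(model_scores, 2)
    ]
    model_human = [
        spearmanr(scores, human_scores)[0] for scores in model_scores.values()
    ]
    return {
        "mean_inter_model_rho": float(np.mean(inter_model)),
        "mean_model_human_rho": float(np.mean(model_human)),
    }
```

A large gap between the two averages (high inter‑model agreement, low model‑human agreement) would indicate the shared machine consensus described above.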
Practical Implications
- Automated test design – Relying on LLMs to auto‑grade or generate difficulty‑balanced question banks is risky; human validation remains essential.
- Adaptive learning platforms – Systems that personalize content using LLM‑estimated difficulty may mis‑target learners, potentially causing frustration or disengagement.
- AI‑assisted tutoring – Prompting an LLM to simulate a learner’s knowledge level does not reliably produce appropriate scaffolding; developers should combine LLM outputs with explicit student performance data.
- Model‑driven curriculum analytics – The observed “machine consensus” could help flag items that are easy for models yet hard for humans (and vice versa), informing hybrid assessment strategies.
In short, while LLMs excel at solving problems, they are not yet trustworthy judges of how hard those problems feel to humans. Developers should treat LLM‑generated difficulty scores as rough heuristics, not definitive metrics.
Limitations & Future Work
- Domain coverage – The benchmark focuses on high‑stakes academic subjects; real‑world tasks (coding interviews, soft‑skill assessments) may behave differently.
- Prompt diversity – Only a handful of prompting styles were explored; richer role‑playing or chain‑of‑thought prompts could improve alignment.
- Student modeling granularity – Human difficulty ratings are aggregated; future work could incorporate individual learner profiles to test fine‑grained alignment.
- Model introspection mechanisms – Investigating auxiliary training objectives (e.g., confidence calibration, self‑awareness) may help bridge the introspection gap.
The authors encourage the community to build on their dataset and explore hybrid approaches that combine LLM reasoning with human‑in‑the‑loop feedback for more reliable difficulty estimation.
Authors
- Ming Li
- Han Chen
- Yunze Xiao
- Jian Chen
- Hong Jiao
- Tianyi Zhou
Paper Information
- arXiv ID: 2512.18880v1
- Categories: cs.CL, cs.AI, cs.CY
- Published: December 21, 2025