[Paper] When Elo Lies: Hidden Biases in Codeforces-Based Evaluation of Large Language Models
Source: arXiv - 2602.05891v1
Overview
The paper When Elo Lies: Hidden Biases in Codeforces‑Based Evaluation of Large Language Models exposes a serious reliability problem in the way many researchers and product teams benchmark LLMs on competitive‑programming tasks. By dissecting the hidden variables that sway Codeforces Elo scores, the authors show that the same model can appear dramatically stronger or weaker depending on experimental quirks—raising red flags for anyone using these numbers to guide development or marketing decisions.
Key Contributions
- Systematic audit of Elo‑based LLM evaluation – identifies three major sources of hidden bias: submission order, contest difficulty selection, and stochastic run‑to‑run variability.
- Large‑scale controlled benchmark – runs 13,691 generated test cases across 37 recent Codeforces contests, providing a reproducible dataset for future work.
- Quantitative impact analysis – demonstrates that:
  - Changing the order of submissions can swing Elo by ≈ 394 points.
  - Selecting different subsets of contests can shift scores by up to 1,122 points for the same model.
  - Re‑running the same evaluation yields a mean Elo spread of ≈ 349 points across runs.
- Guidelines for reliable reporting – proposes a minimal set of experimental metadata (seed, contest list, submission schedule) that must accompany any Elo‑based claim.
Methodology
- Model selection & prompting – The authors used several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude, LLaMA‑2) and a uniform “solve‑the‑problem” prompt to keep the interaction style constant.
- Contest pool construction – 37 Codeforces contests released in the last six months were chosen, covering a range of difficulty tiers (Div. 2 A–F, Div. 1).
- Test case generation – In total, 13,691 test inputs were generated automatically across the selected problems (using the official problem generators when available, otherwise via random sampling that respects the stated constraints).
- Elo computation pipeline – Each problem was fed to the LLM, the generated solution was judged against the official checker, and a virtual “player” gained or lost Elo points exactly as a human contestant would.
- Bias experiments:
  - Submission order: Randomly permuted the order of problem instances across 100 runs.
  - Contest selection: Uniformly sampled 10‑contest subsets from the 37‑contest pool to measure how the choice of contests changes the final rating.
  - Run‑to‑run variability: Re‑executed the entire pipeline 30 times with identical settings but different random seeds, capturing the stochastic nature of LLM generation.
All code, data, and the full evaluation script are released under an open‑source license to enable replication.
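As a rough intuition for the rating step in the pipeline, a single update can be sketched with the classic pairwise Elo formula, treating each problem as an opponent rated at its difficulty. This is a simplified stand‑in, not the authors' actual implementation (Codeforces derives ratings from full contest standings), and the K‑factor and difficulties below are illustrative:

```python
def expected_score(r_player: float, r_opponent: float) -> float:
    """Logistic win expectation from the classic Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400))

def elo_update(rating: float, problem_rating: float,
               solved: bool, k: float = 32.0) -> float:
    """One rating step: a solve counts as a win against the problem's difficulty."""
    actual = 1.0 if solved else 0.0
    return rating + k * (actual - expected_score(rating, problem_rating))

# A virtual "player" starting at 1500, attempting three problems.
rating = 1500.0
for difficulty, solved in [(1200, True), (1600, True), (2100, False)]:
    rating = elo_update(rating, difficulty, solved)
print(round(rating, 1))
```

Note that `elo_update` reads the *current* rating before each step, which is exactly the property the order‑permutation experiment probes.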
Results & Findings
| Factor | Observed Elo swing (max) | Interpretation |
|---|---|---|
| Submission order | 394 points | Each Elo update depends on the rating held at the time of the result, so the same outcomes processed in a different order yield a different final rating; shuffling can therefore artificially inflate or deflate scores. |
| Contest selection | 1,122 points | Some contests contain a higher proportion of “trick” problems that LLMs handle poorly; omitting them can make a model look far stronger. |
| Run‑to‑run stochasticity | ≈ 349 points (mean spread) | Temperature‑based sampling and nondeterministic token selection cause answers to vary even on identical inputs, producing non‑trivial rating jitter. |
Overall, the authors conclude that direct Elo comparisons across papers are unreliable unless the exact experimental configuration is disclosed. The magnitude of the swings dwarfs typical performance gaps reported in the literature, meaning many claimed “state‑of‑the‑art” improvements could be artifacts of evaluation design rather than genuine model advances.
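The order dependence is easy to see in miniature: because each Elo step depends on the rating at the time of the result, shuffling the same set of outcomes produces different final ratings. A toy simulation in the spirit of the paper's 100‑permutation experiment, with made‑up difficulties and outcomes rather than the authors' data:

```python
import random

def expected_score(r_player: float, r_opponent: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400))

def final_rating(results, start: float = 1500.0, k: float = 32.0) -> float:
    """Apply Elo updates over (difficulty, solved) pairs in the given order."""
    rating = start
    for difficulty, solved in results:
        rating += k * ((1.0 if solved else 0.0) - expected_score(rating, difficulty))
    return rating

random.seed(0)  # fixed seed so the toy experiment is reproducible
# Made-up outcomes: one attempt per difficulty tier, solved at random.
results = [(d, random.random() < 0.5) for d in range(800, 2800, 100)]

ratings = []
for _ in range(100):  # mirror the paper's 100 random permutations
    random.shuffle(results)
    ratings.append(final_rating(results))

spread = max(ratings) - min(ratings)
print(f"spread across orderings: {spread:.1f} Elo points")
```

The toy spread is far smaller than the paper's reported 394 points (a real pipeline adds penalty rules, many more problems, and contest‑level scoring), but it demonstrates that a nonzero spread arises from ordering alone.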
Practical Implications
- For product teams: Relying on a single Elo number to market an LLM’s “coding prowess” can be misleading. Teams should supplement Elo with more deterministic metrics (e.g., pass‑rate on a fixed test suite) and always report the contest list and submission schedule.
- For researchers: When publishing new LLM benchmarks, include a reproducibility checklist: random seed, problem generator version, exact contest IDs, and the order of problem presentation. This will make peer comparison meaningful.
- For tooling vendors: Automated evaluation platforms (e.g., OpenAI Evals, EvalAI) should expose configuration options for submission ordering and allow users to lock a “canonical” contest set, reducing inadvertent bias.
- For the community: The findings encourage a shift toward aggregate metrics (e.g., average correctness, time‑to‑solve) rather than a single Elo score, especially for large‑scale leaderboards where fairness is paramount.
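The reporting checklist above can be captured as a small machine‑readable record shipped alongside results. The field names and values here are purely illustrative placeholders, not a schema defined by the paper:

```python
import json

# Illustrative reproducibility record; field names and values are
# placeholders, not a schema prescribed by the paper.
eval_metadata = {
    "model": "gpt-4",
    "decoding": {"temperature": 0.7, "top_p": 0.95, "seed": 1234},
    "contest_ids": [1901, 1902, 1903],           # exact Codeforces round IDs used
    "submission_order": ["1901/A", "1901/B"],    # explicit per-problem schedule
    "test_generator_version": "v1.0",            # version of the input generators
    "independent_runs": 30,
}
print(json.dumps(eval_metadata, indent=2))
```

Publishing such a record with every Elo claim would make cross‑paper comparisons meaningful, since a reader can re‑run the exact configuration rather than an approximation of it.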
Limitations & Future Work
- Scope of contests: The study focuses on recent Codeforces rounds; older or non‑Codeforces platforms (AtCoder, LeetCode) might exhibit different bias patterns.
- Model diversity: Only a handful of publicly available LLMs were tested; proprietary models with different decoding strategies could behave differently.
- Prompt engineering: The authors used a fixed prompt; exploring how prompt variations interact with the identified biases is an open question.
- Long‑term stability: Future work could examine how model updates (e.g., fine‑tuning on competitive‑programming data) affect the sensitivity of Elo over time.
By shedding light on these hidden variables, the paper pushes the community toward more transparent, reproducible, and trustworthy evaluation practices for LLMs in competitive programming and beyond.
Authors
- Shenyu Zheng
- Ximing Dong
- Xiaoshuang Liu
- Gustavo Oliva
- Chong Chun Yong
- Dayi Lin
- Boyuan Chen
- Shaowei Wang
- Ahmed E. Hassan
Paper Information
- arXiv ID: 2602.05891v1
- Categories: cs.SE
- Published: February 5, 2026