[Paper] Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

Published: May 7, 2026 at 01:57 PM EDT

Source: arXiv - 2605.06656v1

Overview

The paper "Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML" shows that ranking large language models (LLMs) by a single global score (e.g., Bradley‑Terry or Elo) hides substantial disagreement among users. Analyzing ~89 K pairwise human judgments across 116 languages and 52 LLMs, the authors show that the top‑ranked model is often statistically indistinguishable from many others, and that language‑specific sub‑populations hold preferences that are coherent within each group but conflict across groups.

Key Contributions

Empirical audit of global LLM rankings: Analyzed 89 K human comparisons from the Arena benchmark, revealing that ~66 % of decisive votes cancel each other out and that pairwise win probabilities among the top‑50 models do not exceed 0.53.
Identification of structured heterogeneity: Showed that language (and language families) is the dominant factor driving disagreement; grouping by language inflates Elo spread by two orders of magnitude.
$(\lambda,\nu)$‑portfolio framework: Introduced a formalism for building small sets of models that together satisfy a target error bound $\lambda$ for at least a fraction $\nu$ of users, casting the problem as a set‑cover variant with VC‑dimension guarantees.
Algorithmic solutions with provable coverage: Developed greedy‑style algorithms that recover only 5 distinct BT rankings covering >96 % of votes, compared with just 21 % coverage from a single global ranking.
Real‑world case studies: Constructed a 6‑model portfolio that doubles the vote coverage of the top‑6 globally‑ranked LLMs, and applied the portfolio idea to fairness‑regularized classifiers on the COMPAS dataset to expose “blind spots” useful for policy analysis.
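
To give a sense of how small a 0.53 win probability is, the following worked arithmetic uses the standard Bradley‑Terry parametrization. The log‑strength symbols $\theta_i$ are an assumption made here for illustration; the paper may report scores on a different scale.

$$
P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} = \frac{1}{1 + e^{-(\theta_i - \theta_j)}}
\quad\Longrightarrow\quad
\theta_i - \theta_j = \ln\frac{P(i \succ j)}{1 - P(i \succ j)}
$$

$$
P(i \succ j) = 0.53 \quad\Longrightarrow\quad \theta_i - \theta_j = \ln\frac{0.53}{0.47} \approx 0.12
$$

Under this parametrization, a win probability of 0.53 corresponds to a log‑strength gap of only about 0.12, i.e., the stronger model is expected to win 53 out of 100 head‑to‑head comparisons.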

Methodology

Data collection & preprocessing – The authors used the public Arena dataset, which contains pairwise human preference judgments for 52 LLMs across 116 languages. Each judgment indicates which model’s output a human prefers for a given prompt.
Global Bradley‑Terry (BT) fitting – They first fit a single BT model to all comparisons, yielding a global ranking and associated win probabilities.
Heterogeneity analysis – By slicing the data along language, task type, and time, they measured intra‑group agreement (e.g., Elo variance) versus inter‑group disagreement.
$(\lambda,\nu)$‑portfolio definition – For a set $U$ of users (or votes), a set $P$ of models is a $(\lambda,\nu)$‑portfolio if, for at least a fraction $\nu$ of $U$, some model in $P$ beats the alternative with probability ≥ $1-\lambda$.
Set‑cover formulation – Each model is treated as a “set” of the votes it can satisfy under the $\lambda$ threshold. Finding the smallest portfolio that covers a fraction $\nu$ of the votes is then a partial set‑cover problem, i.e., the set‑cover variant in which only part of the universe must be covered.
Algorithmic solution – A greedy algorithm selects, at each iteration, the model with the largest marginal coverage; theoretical guarantees are derived using the VC dimension of the vote‑model incidence matrix.
Evaluation – The resulting portfolios are evaluated on coverage, error, and diversity, and compared against the global BT ranking and against naive top‑k selections.
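
The set‑cover and greedy steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_portfolio`, the `covers` mapping (model name to the set of vote indices it satisfies under the $\lambda$ threshold), and the toy data are all assumptions made here.

```python
from typing import Dict, List, Set


def greedy_portfolio(covers: Dict[str, Set[int]], n_votes: int, nu: float) -> List[str]:
    """Greedily pick models until at least a fraction nu of votes is covered.

    covers[m] is the (hypothetical, precomputed) set of vote indices that
    model m satisfies under the chosen lambda threshold.
    """
    covered: Set[int] = set()
    portfolio: List[str] = []
    remaining = dict(covers)
    while remaining and len(covered) / n_votes < nu:
        # Pick the model that satisfies the most not-yet-covered votes.
        best = max(remaining, key=lambda m: len(remaining[m] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining model adds coverage; the target is unreachable
        covered |= gain
        portfolio.append(best)
    return portfolio


# Toy instance: 10 votes (indices 0..9); vote 9 is satisfied by no model.
toy_covers = {
    "model_a": {0, 1, 2, 3},
    "model_b": {4, 5, 6},
    "model_c": {7, 8},
    "model_d": {0, 4, 7},
}
toy_portfolio = greedy_portfolio(toy_covers, n_votes=10, nu=0.9)
print(toy_portfolio)
```

In the toy instance, `model_d` overlaps with every other model but never offers the largest marginal gain, so the greedy rule skips it. If the target fraction cannot be reached, the sketch returns the best portfolio found rather than raising an error; a production version might want to report the shortfall explicitly.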

Results & Findings

Aspect	Global BT ranking	Language‑grouped BT rankings	$(\lambda,\nu)$‑portfolios
Coverage of votes	21 % (top‑50 models)	Up to 96 % with 5 language‑specific rankings	96 % with 5‑model portfolio (λ≈0.1)
Elo spread	~0.2 (very flat)	~20–30 (orders of magnitude larger)	Comparable to language‑grouped spread
Top‑6 model comparison	6 models cover ~12 % of votes	N/A (multiple groups)	6‑model portfolio covers ~24 % of votes
Statistical distinguishability	Pairwise win prob ≤ 0.53 within top‑50	Clear separation within language groups	Portfolio ensures ≤ λ error for covered users

Key takeaways:

The “global best” model is statistically indistinguishable from many others.
Language is the primary driver of coherent sub‑preferences; when accounted for, rankings become meaningful.
Small, well‑chosen portfolios dramatically increase the proportion of users who receive a model that meets their expectations.
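
The first two takeaways can be reproduced on synthetic data. The sketch below fits Bradley‑Terry strengths with the standard minorization‑maximization (MM) update, first on pooled votes and then per group. The vote counts, group names (`en_wins`, `other_wins`), and function names are invented for illustration and are not taken from the paper; the fit also assumes every model has at least one win.

```python
import math
from typing import Dict, Tuple

Wins = Dict[Tuple[str, str], int]  # wins[(a, b)] = votes in which a beat b


def fit_bradley_terry(wins: Wins, n_iter: int = 200) -> Dict[str, float]:
    """Return mean-zero BT log-strengths via simultaneous MM updates.

    Assumes every model has at least one win, so strengths stay positive.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(n for (winner, _), n in wins.items() if winner == i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins / denom
        # Strengths are identified only up to scale; fix the geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(models)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return {m: math.log(s) for m, s in strength.items()}


def win_prob(theta: Dict[str, float], a: str, b: str) -> float:
    """BT probability that model a beats model b."""
    return 1.0 / (1.0 + math.exp(-(theta[a] - theta[b])))


def pool(*groups: Wins) -> Wins:
    """Sum vote counts across groups, as a single global fit would."""
    pooled: Wins = {}
    for group in groups:
        for pair, n in group.items():
            pooled[pair] = pooled.get(pair, 0) + n
    return pooled


# Synthetic groups with exactly opposite preferences (illustrative only).
en_wins: Wins = {
    ("A", "B"): 80, ("B", "A"): 20,
    ("B", "C"): 80, ("C", "B"): 20,
    ("A", "C"): 90, ("C", "A"): 10,
}
other_wins: Wins = {(loser, winner): n for (winner, loser), n in en_wins.items()}

theta_pooled = fit_bradley_terry(pool(en_wins, other_wins))
theta_en = fit_bradley_terry(en_wins)
theta_other = fit_bradley_terry(other_wins)

print("pooled  P(A beats C):", win_prob(theta_pooled, "A", "C"))
print("group 1 P(A beats C):", win_prob(theta_en, "A", "C"))
print("group 2 P(A beats C):", win_prob(theta_other, "A", "C"))
```

Because the two synthetic groups mirror each other, the pooled fit assigns all three models the same strength, while each per‑group fit produces a clear ordering. Real data will be far less symmetric, so this only illustrates the mechanism by which pooling can flatten a leaderboard.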

Practical Implications

Product teams can serve multiple “regional” models instead of a single “global” LLM, improving user satisfaction without a massive increase in infrastructure cost.
API providers can expose a “model portfolio” endpoint that returns a short list of candidate models tailored to a user’s language or domain, letting downstream services pick the best fit.
Evaluation pipelines should incorporate heterogeneity checks (e.g., language‑wise Elo variance) before publishing a single leaderboard score.
Fairness audits can leverage portfolios: by constructing ensembles of fairness‑regularized classifiers, stakeholders can identify demographic groups that are poorly served by any single model and target remedial data collection.
Set‑cover‑style algorithms are lightweight and can be integrated into model‑selection services to automatically maintain a minimal yet high‑coverage portfolio as new models are released.
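
One simple form a heterogeneity check could take is sketched below: compare how many votes cancel when all languages are pooled with how many cancel inside each language. The definition of "cancel" used here (for each model pair, every vote for one side offsets one vote for the other) is an assumption made for illustration and may differ from the paper's; the `Vote` tuple layout and function names are likewise hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Vote = Tuple[str, str, str]  # (language, winner, loser)


def cancelled_fraction(votes: Iterable[Vote]) -> float:
    """Fraction of votes offset by an opposite vote on the same model pair."""
    wins = Counter((winner, loser) for _, winner, loser in votes)
    total = sum(wins.values())
    if total == 0:
        return 0.0
    cancelled = 0
    for (winner, loser), n in wins.items():
        if winner < loser:  # visit each unordered pair once
            cancelled += 2 * min(n, wins.get((loser, winner), 0))
    return cancelled / total


def cancelled_by_language(votes: Iterable[Vote]) -> Dict[str, float]:
    """Apply cancelled_fraction separately within each language."""
    groups: Dict[str, List[Vote]] = defaultdict(list)
    for vote in votes:
        groups[vote[0]].append(vote)
    return {lang: cancelled_fraction(group) for lang, group in groups.items()}


# Synthetic votes: two languages with opposite preferences (illustrative only).
toy_votes: List[Vote] = (
    [("en", "A", "B")] * 8 + [("en", "B", "A")] * 2
    + [("hi", "A", "B")] * 2 + [("hi", "B", "A")] * 8
)

print("pooled:", cancelled_fraction(toy_votes))
print("per language:", cancelled_by_language(toy_votes))
```

A pooled cancellation rate well above the per‑language rates, as in this synthetic example, suggests that disagreement is structured by language rather than being noise, which is the situation in which a single leaderboard score is least informative.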

Limitations & Future Work

The analysis is limited to the Arena benchmark; other tasks (e.g., code generation, retrieval‑augmented generation) may exhibit different heterogeneity patterns.
The $(\lambda,\nu)$‑portfolio framework assumes a binary “satisfied/not satisfied” vote model; extending it to graded preferences or multi‑turn interactions remains open.
The set‑cover formulation can become computationally expensive for extremely large model pools; scalable approximations or online updates are a promising direction.
Future work could explore dynamic portfolios that adapt to user feedback in real time, or combine language‑grouping with other axes such as domain expertise, latency constraints, or cost.

Authors

Jai Moondra
Ayela Chughtai
Bhargavi Lanka
Swati Gupta

Paper Information

arXiv ID: 2605.06656v1
Categories: cs.LG, cs.DM, cs.ET, math.OC
Published: May 7, 2026