Study: Platforms that rank the latest LLMs can be unreliable
Source: MIT News - AI
Overview
A firm that wants to use a large language model (LLM) to summarize sales reports or triage customer inquiries can choose from hundreds of LLMs, each available in dozens of variations with slightly different performance.
To narrow down the choice, companies often rely on LLM ranking platforms, which gather user feedback on model interactions to rank the latest LLMs based on how they perform on certain tasks.
MIT researchers discovered that just a handful of user interactions can skew the results, leading someone to mistakenly believe one LLM is the ideal choice for a particular use case. Their study shows that removing a tiny fraction of crowdsourced data can change which models are top‑ranked.
They developed a fast method to test ranking platforms and determine whether they are susceptible to this problem. The evaluation technique identifies the individual votes most responsible for skewing the results so users can inspect these influential votes.
The researchers say this work underscores the need for more rigorous strategies to evaluate model rankings. While they didn’t focus on mitigation in this study, they provide suggestions that may improve the robustness of these platforms, such as gathering more detailed feedback to create the rankings.
The study also offers a word of warning to users who may rely on rankings when making decisions about LLMs that could have far‑reaching and costly impacts on a business or organization.
“We were surprised that these ranking platforms were so sensitive to this problem. If it turns out the top‑ranked LLM depends on only two or three pieces of user feedback out of tens of thousands, then one can’t assume the top‑ranked LLM is going to be consistently outperforming all the other LLMs when it is deployed,”
— Tamara Broderick, Associate Professor, MIT EECS (senior author)
She is joined on the paper by lead authors and EECS graduate students Jenny Huang and Yunyi Shen, as well as Dennis Wei, a senior research scientist at IBM Research. The study will be presented at the International Conference on Learning Representations (ICLR).
Dropping Data
While there are many types of LLM ranking platforms, the most popular variations ask users to submit a query to two models and pick which LLM provides the better response.
The platforms aggregate the results of these match‑ups to produce rankings that show which LLM performed best on certain tasks (e.g., coding, visual understanding).
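The article doesn't specify each platform's scoring rule, but pairwise match-ups like these are commonly aggregated with an Elo-style rating system. A minimal illustrative sketch with made-up model names and votes:

```python
from collections import defaultdict

def elo_rank(votes, k=32, base=1000.0):
    """Rank models from pairwise votes using a simple Elo update.

    votes: list of (winner, loser) pairs, in the order they arrived.
    Returns model names sorted from highest to lowest rating.
    """
    rating = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected score of the winner under the current ratings.
        expected = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400))
        delta = k * (1.0 - expected)
        rating[winner] += delta
        rating[loser] -= delta
    return sorted(rating, key=rating.get, reverse=True)

# Hypothetical votes: each entry is (preferred model, other model).
votes = [("model-a", "model-b"), ("model-a", "model-c"),
         ("model-b", "model-c"), ("model-a", "model-b")]
print(elo_rank(votes))  # model-a leads after winning all its match-ups
```

Because each vote nudges the ratings, a single erroneous vote shifts the final scores, which is what makes near-tied leaderboards fragile.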
By choosing a top‑performing LLM, a user likely expects that model’s top ranking to generalize—i.e., it should outperform other models on similar, but not identical, applications with a new dataset.
The MIT researchers previously studied generalization in statistics and economics. That work revealed cases where dropping a small percentage of data can change a model’s results, indicating that those studies’ conclusions might not hold beyond their narrow setting.
They wanted to see if the same analysis could be applied to LLM ranking platforms.
“At the end of the day, a user wants to know whether they are choosing the best LLM. If only a few prompts are driving this ranking, that suggests the ranking might not be the end‑all‑be‑all,”
— Broderick
Testing the data‑dropping phenomenon by brute force would be infeasible. For example, one ranking they evaluated contained 57,000+ votes. Checking every way of dropping just 0.1 % of the data means removing each possible subset of 57 votes (more than 10^194 subsets) and recomputing the ranking each time.
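The scale of that brute-force check is easy to verify: the number of distinct 57-vote subsets in a pool of 57,000 votes is a 195-digit number.

```python
import math

# Number of ways to choose a 57-vote subset (0.1% of the data)
# from 57,000 total votes: the binomial coefficient C(57000, 57).
subsets = math.comb(57_000, 57)
print(len(str(subsets)))  # 195 digits — far beyond exhaustive search
```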
Instead, the researchers developed an efficient approximation method, based on prior work, and adapted it to fit LLM ranking systems.
“While we have theory to prove the approximation works under certain assumptions, the user doesn’t need to trust that. Our method tells the user the problematic data points at the end, so they can just drop those data points, re‑run the analysis, and check to see if they get a change in the rankings,”
— Broderick
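The verification step Broderick describes requires no trust in the approximation: flag candidate votes, drop them, re-run, and see whether the ranking changes. A toy version of that check, using a simple Elo-style scorer and a naive leave-one-out flagging heuristic (not the paper's method, which scales to tens of thousands of votes):

```python
from collections import defaultdict

def elo_top(votes, k=32):
    """Return the top-rated model from pairwise (winner, loser) votes."""
    r = defaultdict(float)
    for w, l in votes:
        e = 1.0 / (1.0 + 10 ** ((r[l] - r[w]) / 400))
        r[w] += k * (1.0 - e)
        r[l] -= k * (1.0 - e)
    return max(r, key=r.get)

def fragile_votes(votes):
    """Naive leave-one-out: indices of single votes whose removal
    flips which model is ranked first."""
    top = elo_top(votes)
    return [i for i in range(len(votes))
            if elo_top(votes[:i] + votes[i + 1:]) != top]

# Hypothetical near-tied match-ups between two models.
votes = [("m1", "m2"), ("m2", "m1"), ("m1", "m2")]
flagged = fragile_votes(votes)
# Elo updates are order-sensitive: removing the final m1 win (index 2)
# hands the lead to m2, so that vote is flagged as influential.
```

Any flagged vote can then be inspected by hand, exactly the workflow the researchers propose.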
Surprisingly Sensitive
When the technique was applied to popular ranking platforms, the researchers were surprised by how few data points were needed to cause significant changes in the top LLMs:
| Platform | Votes Analyzed | Votes Dropped | % Dropped | Effect |
|---|---|---|---|---|
| Platform A (crowdsourced) | > 57,000 | 2 | 0.0035 % | Top‑ranked model flipped |
| Platform B (expert annotators, higher‑quality prompts) | 2,575 | 83 | ≈ 3 % | Top model changed |
Their examination revealed that many influential votes may have resulted from user error. In some cases, there was a clear answer as to which LLM performed better, but the user selected the other model.
“We can never know what was in the user’s mind at that time, but maybe they mis‑clicked, weren’t paying attention, or honestly didn’t know which one was better. The big takeaway is that you don’t want noise, user error, or an outlier determining which is the top‑ranked LLM,”
— Broderick
Suggested Mitigations
- Collect richer feedback – e.g., ask users to indicate confidence levels for each vote.
- Introduce human mediators to review crowdsourced responses.
- Increase the volume and diversity of evaluations to dilute the impact of any single erroneous vote.
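The first suggestion could plug directly into the aggregation step: scale each vote's rating update by the user's reported confidence. A hypothetical sketch (the confidence values and linear weighting scheme are illustrative assumptions, not from the study):

```python
from collections import defaultdict

def weighted_elo(votes, k=32):
    """Elo-style ratings where each vote carries a confidence in [0, 1].

    votes: list of (winner, loser, confidence) triples. A low-confidence
    vote moves the ratings proportionally less, so one mis-click cannot
    swing the leaderboard as far as a confident judgment can.
    """
    r = defaultdict(float)
    for winner, loser, conf in votes:
        expected = 1.0 / (1.0 + 10 ** ((r[loser] - r[winner]) / 400))
        delta = conf * k * (1.0 - expected)
        r[winner] += delta
        r[loser] -= delta
    return dict(r)

ratings = weighted_elo([("m1", "m2", 0.9),    # confident preference
                        ("m2", "m1", 0.1)])   # likely slip: barely counted
# m1 keeps a clear lead because the low-confidence reversal is down-weighted.
```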
The researchers plan to continue exploring generalization in other contexts while also developing better approximation methods that can capture more examples of non‑robustness.
“Broderick and her students’ work shows how you can get valid estimates of the influence of specific data points, enabling more trustworthy model‑ranking pipelines,”
— Excerpt continues in the full paper
Quote
“It’s tempting to think that downstream processes are robust, despite the intractability of exhaustive calculations given the size of modern machine‑learning models and datasets,” says Jessica Hullman, the Ginni Rometty Professor of Computer Science at Northwestern University, who was not involved with this work.
“The recent work provides a glimpse into the strong data dependencies in routinely applied — but also very fragile — methods for aggregating human preferences and using them to update a model. Seeing how few preferences could really change the behavior of a fine‑tuned model could inspire more thoughtful methods for collecting these data.”
Funding
This research is funded, in part, by:
- Office of Naval Research
- MIT‑IBM Watson AI Lab
- National Science Foundation
- Amazon
- CSAIL seed award