[Paper] Discovering Hidden Gems in Model Repositories
Source: arXiv - 2601.22157v1
Overview
The paper investigates a surprising blind spot in today’s model marketplaces: despite millions of fine‑tuned checkpoints being publicly available, most developers only ever use a handful of “well‑known” models. By systematically evaluating more than 2,000 checkpoints, the authors reveal a wealth of “hidden gems”—models that are rarely downloaded yet dramatically outperform the popular choices, all without extra inference cost.
Key Contributions
- Empirical audit of model repositories – a large‑scale benchmark of more than 2,000 fine‑tuned checkpoints across several families (e.g., Llama‑3.1‑8B).
- Discovery of high‑performing, low‑visibility models – e.g., a rarely‑downloaded Llama‑3.1‑8B variant that lifts math accuracy from 83.2 % to 96.0 % with identical latency.
- Formulation of model discovery as a Multi‑Armed Bandit (MAB) problem – treating each checkpoint as an “arm” to be sampled efficiently.
- Accelerated Sequential Halving algorithm – introduces shared query sets and aggressive elimination schedules, cutting the number of required evaluations by >50× (≈50 queries per candidate).
- Open‑source toolkit – code and benchmark data released to enable the community to replicate and extend the search pipeline.
Methodology
- Benchmark Construction
  - Collected checkpoints from popular public hubs (Hugging Face, ModelScope, etc.).
  - Defined a shared evaluation suite (≈200 diverse prompts covering reasoning, coding, math, and language understanding).
- Baseline Exhaustive Evaluation
  - Ran the full suite on every model to establish a ground‑truth performance ranking (computationally expensive, used only for validation).
- Multi‑Armed Bandit Framing
  - Each model = an arm. Pulling an arm = evaluating the model on a small batch of queries.
  - Goal: identify the top‑k arms with the fewest pulls.
- Sequential Halving with Enhancements (a code sketch follows this list)
  - Shared query pool: the same mini‑batch of prompts is reused across all candidates in a round, reducing variance and overhead.
  - Aggressive elimination: after each round, only the top fraction of models (e.g., 30 %) survives, dramatically shrinking the candidate set.
  - Adaptive budget: early rounds use very few queries (≈10), later rounds allocate more (≈100) to the remaining few models.
- Evaluation
  - Compared the accelerated search against exhaustive evaluation and vanilla Sequential Halving on speed‑accuracy trade‑offs.
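A minimal sketch of the search loop described in the bandit framing and Sequential Halving steps above, assuming an external `evaluate(model, queries)` callable that returns mean accuracy on a batch of prompts. The function name, the round budgets (≈10 queries early, ≈100 late), and the 30 % keep fraction are illustrative stand-ins based on the numbers quoted in this summary, not the authors' released implementation.

```python
import random
from typing import Callable, Sequence

def accelerated_halving(
    models: Sequence[str],
    query_pool: Sequence[str],
    evaluate: Callable[[str, Sequence[str]], float],
    keep_frac: float = 0.3,                                  # aggressive elimination: keep ~top 30% per round
    round_budgets: Sequence[int] = (10, 20, 40, 80, 100),    # adaptive budget: few queries early, more later
    top_k: int = 5,
    seed: int = 0,
) -> list:
    """Bandit-style checkpoint search: each model is an arm; each round pulls
    every surviving arm on one shared mini-batch of queries."""
    rng = random.Random(seed)
    survivors = list(models)

    for budget in round_budgets:
        # Shared query pool: all surviving models are scored on the *same* batch,
        # so comparisons are paired and ranking variance stays low.
        batch = rng.sample(list(query_pool), k=min(budget, len(query_pool)))
        scores = {m: evaluate(m, batch) for m in survivors}

        # Aggressive elimination: only the top fraction advances to the next round.
        survivors.sort(key=lambda m: scores[m], reverse=True)
        survivors = survivors[: max(top_k, int(len(survivors) * keep_frac))]
        if len(survivors) <= top_k:
            break

    return survivors[:top_k]
```

Because every model in a round answers the identical batch, score differences reflect the models rather than the luck of the prompt draw, which is what makes the steep per-round cut workable even at very small batch sizes.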
Results & Findings
| Metric | Exhaustive (baseline) | Accelerated Search |
|---|---|---|
| Avg. queries per model | 200 (full suite) | ≈50 |
| Speed‑up factor | 1× | >50× |
| Top‑5 model recall | 100 % | 96 % |
| Example hidden gem (Llama‑3.1‑8B) | 83.2 % math accuracy (popular checkpoint) | 96.0 % (rare checkpoint) |
- The accelerated method consistently surfaces the highest‑performing checkpoints while using a fraction of the compute.
- Hidden gems were not limited to math; several showed gains in code generation and commonsense reasoning.
- No increase in inference latency or memory footprint was observed for the discovered models, confirming that the performance boost stems from better fine‑tuning rather than larger architectures.
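One way to sanity-check the "no extra inference cost" point yourself: fine-tunes of the same base inherit its architecture, so comparing model configs is enough to confirm the gain cannot come from a bigger network. The candidate repo ID below is a hypothetical placeholder, not the specific checkpoint from the paper, and the gated baseline repo requires a Hugging Face access token.

```python
from transformers import AutoConfig

popular = "meta-llama/Llama-3.1-8B-Instruct"     # widely downloaded baseline (gated repo)
candidate = "some-user/llama-3.1-8b-finetune"    # hypothetical "hidden gem" repo ID

cfg_pop = AutoConfig.from_pretrained(popular)
cfg_can = AutoConfig.from_pretrained(candidate)

# Identical width, depth, and head count imply identical FLOPs per token and memory,
# so any accuracy gap comes from the fine-tuning recipe, not a larger architecture.
for field in ("hidden_size", "num_hidden_layers", "num_attention_heads"):
    print(field, getattr(cfg_pop, field), getattr(cfg_can, field))
```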
Practical Implications
- Model selection pipelines: Teams can integrate the bandit‑based search to automatically surface superior checkpoints before committing to a production rollout, saving both time and cloud costs (a usage sketch follows this list).
- Marketplace curation: Platform operators (e.g., Hugging Face) could run the algorithm in the background to surface “trending‑but‑unseen” models, improving discoverability for creators.
- Continuous fine‑tune evaluation: Developers who regularly upload fine‑tuned variants can receive rapid feedback on whether their checkpoint is a hidden gem, encouraging more diverse experimentation.
- Cost‑effective benchmarking: The shared query set approach means you can evaluate thousands of models on a single GPU cluster in a few hours instead of weeks.
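For the model-selection use case, a pipeline might look roughly like this. `list_models` is a real huggingface_hub call, while `accelerated_halving` refers to the sketch in the Methodology section, and `load_internal_eval_prompts` / `evaluate_on_prompts` are hypothetical stand-ins for a team's own evaluation harness.

```python
from huggingface_hub import list_models

# Pull candidate fine-tunes of the base family directly from the Hub.
# The search string and limit are illustrative, not the paper's setup.
candidates = [m.id for m in list_models(search="llama-3.1-8b", limit=500)]

prompts = load_internal_eval_prompts()   # hypothetical: the team's ~200-prompt suite
shortlist = accelerated_halving(
    candidates, prompts, evaluate=evaluate_on_prompts, top_k=5  # reuses the earlier sketch
)
print("Checkpoints to A/B test before rollout:", shortlist)
```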
Limitations & Future Work
- Query set bias: The shared benchmark, while diverse, may still favor certain task families; models excelling on out‑of‑distribution tasks could be missed.
- Scalability to billions of checkpoints: Even with 50× speed‑up, ultra‑large repositories would need hierarchical or distributed bandit strategies.
- Dynamic updates: The current pipeline assumes a static snapshot of models; handling continuous uploads in real time remains an open challenge.
- Beyond accuracy: Future work could incorporate latency, energy consumption, or safety metrics into the multi‑objective bandit formulation.
Bottom line: By treating model discovery as a bandit problem and cleverly reusing evaluation data, the authors show that the “best” models are often hiding in plain sight—and that we now have a practical, scalable way to bring them to the forefront.
Authors
- Jonathan Kahana
- Eliahu Horwitz
- Yedid Hoshen
Paper Information
- arXiv ID: 2601.22157v1
- Categories: cs.LG, cs.CL
- Published: January 29, 2026