Why I spun my benchmark into its own repo (and why every dev tool with a benchmark should)
Source: Dev.to
Benchmark — Why I moved it out of the tool repo
This week I shipped a benchmark for code‑intelligence MCP servers and posted the results — including the cases where my own tool lost. Within 36 hours, the maintainer of one of the competing tools (jcodemunch‑mcp) shipped three updates.
That whole loop — competing maintainers iterating on the same eval, in opposite directions, in 36 hours — is what a public benchmark is supposed to do. It almost never does, and I think most of the time it’s because the benchmark lives in the wrong place.
The new repo
GitHub:
What’s inside
| File / Directory | Description |
|---|---|
| README | Headline 90‑task results table — replaces “go read the blog”. |
| METHODOLOGY.md | What’s measured, what isn’t, and why these specific datasets (Express, Lodash, the project’s own monorepo). |
| CONTRIBUTING.md | Three contribution paths: 1. Submit a baseline. 2. Challenge the methodology. 3. Add a dataset. |
| tasks/ | Read‑only reference mirroring the ground‑truth seed files. |
| runtime (in the main monorepo) | The actual benchmark execution – kept separate from the showcase repo. |
Why a separate benchmark repo matters
- Separation of concerns – When the benchmark lives in the same repo as the tool it measures, readers see the evaluation surface mixed with the tool’s own code. They can’t easily separate “the eval is methodologically sound” from “the tool that wrote the eval also wrote favorable scoring rules for itself.”
- Independent credibility – The benchmark needs its own commit history, its own PRs, and its own signal of trust, independent of the product.
- Low‑friction contribution – If a competing maintainer wants to argue with the methodology (e.g., “your task‑3 expected output is wrong because my tool returns Y, not X”), they shouldn’t have to fork the entire product repo, navigate a deep directory tree, and hide a 5‑line edit among the product source.
A stand‑alone benchmark repo lets competitors:
- File methodology issues without forking the product repo.
- Submit baseline implementations as PRs to a dedicated surface.
- Track their tool’s score over time as a first‑class concern.
This is the same reason MLPerf isn’t part of any single ML framework’s repo, and why TPC benchmarks aren’t part of any database vendor’s repo – the eval must be portable across implementations, and portability requires it to live independently.
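One way to picture a portable eval is as a small harness that knows nothing about any particular tool. The sketch below is hypothetical: the `Task` shape, the `Baseline` protocol, the toy `EchoBaseline` adapter, and the exact-match scoring rule are all assumptions made for illustration, not the task format or scoring used in the benchmark repo.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Task:
    """One benchmark task with its ground truth (hypothetical shape)."""
    task_id: str
    query: str
    expected: frozenset[str]  # e.g. the symbols or files a correct answer returns


class Baseline(Protocol):
    """Anything that can answer a task; each tool supplies its own adapter."""
    name: str

    def answer(self, task: Task) -> set[str]: ...


def score(baseline: Baseline, tasks: list[Task]) -> float:
    """Fraction of tasks whose answer exactly matches the ground truth."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if baseline.answer(t) == t.expected)
    return passed / len(tasks)


class EchoBaseline:
    """Toy adapter that reads answers from a fixed table, standing in for a real tool."""

    def __init__(self, name: str, table: dict[str, set[str]]) -> None:
        self.name = name
        self._table = table

    def answer(self, task: Task) -> set[str]:
        return self._table.get(task.task_id, set())


# Illustrative tasks and tools; none of these names come from the real benchmark.
TASKS = [
    Task("task-1", "who calls parse()?", frozenset({"load", "main"})),
    Task("task-2", "where is Router defined?", frozenset({"router.js"})),
]

tool_a = EchoBaseline("tool-a", {"task-1": {"load", "main"}, "task-2": {"router.js"}})
tool_b = EchoBaseline("tool-b", {"task-1": {"load"}, "task-2": {"router.js"}})

for tool in (tool_a, tool_b):
    print(f"{tool.name}: {score(tool, TASKS):.2f}")
# tool-a: 1.00
# tool-b: 0.50
```

With a layout like this, disputing a task's ground truth is a change to a single `Task` entry, and adding a competing tool is a new adapter. Neither touches product code.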
Issues I opened on day 1 to seed the repo’s surface
| # | Issue | Rationale |
|---|---|---|
| 1 | Add Python codebase as 4th dataset | The current 3‑dataset matrix covers TypeScript, JavaScript CommonJS, and JavaScript monolithic IIFE. Zero Python – a glaring gap. Anyone deep in the Python ecosystem can pick this up. |
| 2 | Open invitation to GitNexus’s maintainer to refresh their baseline | GitNexus has shipped releases since the original baseline integration was written. Inviting publicly ensures the bench reflects the latest version, not a snapshot. |
| 3 | Open invitation to jcodemunch’s maintainer to refresh against v1.80.9 | v1.80.9 added _meta.mode, max_results, and file_pattern parameters that the current baseline doesn’t exploit. |
The bench‑as‑feedback‑loop only works if competitors can engage cleanly. Those three issues operationalize that.
Three concrete moves to make the benchmark work
- Move the benchmark out of your tool’s repo
  It can be a sibling repo (yourname/yourname-bench), a separate org‑level repo, or a community‑owned repo like MLPerf. The exact structure matters less than the fact that the eval has its own space.
- Publish where you lose
  Every benchmark has an “honesty section”: the slice where the evaluated tool gets beaten by something else. Document those losses prominently. Two reasons:
  - (a) It’s the credibility signal a competitor looks for.
  - (b) It shows the benchmark is not cherry‑picked to favor your own tool.
- Invite competitor maintainers to submit baselines
  Privately or publicly. If they decline, you control the trust narrative (“the bench is open, here’s how to argue with it”). If they engage, the bench becomes the scoreboard for the community.
The first two moves are easy. The third is uncomfortable, but it’s essential – the bench‑as‑feedback‑loop pattern needs all three to fire, and the third is the only one that’s structurally hard to fake.
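The "publish where you lose" move can be generated mechanically from the results rather than written by hand. Here is a minimal sketch, assuming per-task scores are available as a nested mapping. The `RESULTS` data, the tool names, and the `losses` helper are hypothetical and do not reflect the benchmark's real numbers or file format.

```python
# Hypothetical per-task scores: RESULTS[task_id][tool_name] = score in [0, 1].
RESULTS = {
    "task-1": {"my-tool": 1.0, "tool-b": 0.5},
    "task-2": {"my-tool": 0.4, "tool-b": 0.9},
    "task-3": {"my-tool": 0.7, "tool-b": 0.7},
}


def losses(
    results: dict[str, dict[str, float]], own_tool: str
) -> list[tuple[str, str, float]]:
    """Return (task_id, winning_tool, margin) for each task where own_tool is strictly beaten."""
    out = []
    for task_id, scores in sorted(results.items()):
        own = scores[own_tool]
        rivals = {name: s for name, s in scores.items() if name != own_tool}
        if not rivals:
            continue  # nothing to lose against on this task
        best_rival = max(rivals, key=rivals.get)
        margin = rivals[best_rival] - own
        if margin > 0:  # ties are not counted as losses
            out.append((task_id, best_rival, round(margin, 3)))
    return out


for task_id, winner, margin in losses(RESULTS, "my-tool"):
    print(f"{task_id}: lost to {winner} by {margin}")
# task-2: lost to tool-b by 0.5
```

Generating the section from the same data as the headline table keeps the losses from drifting out of date when a baseline is refreshed.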
Links
- Benchmark repo:
- Original benchmark + the bench‑loop story that motivated the spin‑out:
If you maintain a tool that ships a benchmark, consider extracting it into its own repository and following the pattern above. It will make the evaluation more trustworthy, more collaborative, and ultimately more useful for everyone.
If you maintain a code‑intelligence tool and want to argue with the methodology, issues #2 and #3 on the new repo are the cleanest way in.