Did that actually help? Evaluating AI coding assistants with hard numbers

Published: March 2, 2026 at 03:58 AM EST
9 min read
Source: Dev.to

Building Better AI Coding Assistants: Introducing Pitlane

You are building a Skill, an MCP server, or a custom prompt strategy that is supposed to make an AI coding assistant better at a specific job. You make a change. The next session feels smoother. The agent seems to reach for the right context at the right time.

But how do you know?

That question came up in two parallel problems.

  • I was iterating on MCP servers to support a coding agent—new tool, new tool definition, new prompting strategy. Each change felt like an improvement. Sessions seemed smoother, but I had no numbers—just vibes.
  • A colleague was doing the same thing from the other side: building and refining AI coding Skills—structured prompt packs that teach the agent how to work in a specific context. Again, a lot of iteration, a lot of gut feel, no hard signal on whether the changes were actually moving the needle.

We joined forces and built something to fix this. The result is Pitlane—named after the place in motorsport where engineers swap parts, adjust the setup, check the telemetry, and find out if the next lap is faster.


The Problem with Vibes

When you change an MCP server or a Skill, you are changing something about the environment the agent operates in. The agent gets different tools, different context, different instructions.

Those changes can have real effects:

  • Pass rates on tasks go up or down.
  • The agent takes fewer wrong turns.
  • Token costs change.
  • Time‑to‑completion changes.
  • Output quality improves or degrades.

Without measurement, you cannot tell which of those things happened. You cannot tell whether the last commit was an improvement or a regression. You cannot tell whether version 3 of your Skill is better than version 1.

You end up making decisions based on a handful of memorable sessions, which is not a reliable signal. Good sessions feel good. Bad sessions get rationalised. The data you are implicitly collecting is not representative.


What You Actually Need

You need to be able to answer a specific, repeatable question:

With my Skill or MCP present, does the agent complete this task better than without it?

That question has a structure:

  1. A defined task with explicit success criteria.
  2. Two configurations – a baseline (without your changes) and a challenger (with them).
  3. Deterministic assertions that verify success independently of the agent’s own judgement.
  4. A way to compare results across runs.

That structure is an eval—not a generic language‑model benchmark, but a benchmark for your Skill or MCP server, in your context, on tasks that actually matter to you.


What Pitlane Is

Pitlane is an open‑source command‑line tool for running those evals.

  • Define tasks in YAML.
  • Configure a baseline and one or more challengers.
  • Race them against each other.

The results give you numbers rather than impressions, so you can see whether your work is paying off.
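A task definition might look roughly like this. The field names below are illustrative only, not Pitlane's actual schema; check the repository docs for the real format:

```yaml
# Hypothetical task definition -- field names are illustrative, not Pitlane's schema.
name: add-health-endpoint
prompt: "Add a /health endpoint to the Flask app that returns HTTP 200."
fixture: ./fixtures/flask-app        # working directory copied fresh for each run
assertions:
  - type: file_exists
    path: app/health.py
  - type: command_exit_code
    command: pytest tests/test_health.py
    expect: 0
```

The same task file runs unchanged against the baseline and every challenger, which is what makes the comparison meaningful.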

The Loop

  1. Tune your Skill/MCP.
  2. Race the baseline vs. challenger.
  3. Check the telemetry (pass rate, cost, time, token usage).
  4. Repeat.

Deterministic Assertions

  • File‑existence checks.
  • Command exit codes.
  • Pattern matching.

Either the file is there and valid, or it isn’t—no LLM‑as‑judge, no subjectivity baked into the measurement.

When you need fuzzy matching for documentation or generated content, Pitlane offers similarity metrics (ROUGE, BLEU, BERTScore, cosine similarity) with configurable thresholds. These are deterministic numeric metrics, not a second model grading your output.
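To see why a similarity metric is still deterministic, here is a toy bag-of-words cosine with a threshold. This is an assumption about how such a check works in general, not Pitlane's implementation (which offers ROUGE, BLEU, and BERTScore as well):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    # Bag-of-words cosine: the same pair of strings always scores the same.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def fuzzy_assert(candidate: str, reference: str, threshold: float = 0.8) -> bool:
    # A numeric cutoff, not a second model grading the output.
    return cosine_similarity(candidate, reference) >= threshold
```

The threshold turns a continuous score into a pass/fail decision you can aggregate like any other assertion.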

Handling Non‑Determinism

Because agent outputs are non‑deterministic, Pitlane supports repeated runs with aggregated statistics (average, min, max, standard deviation). A Skill that reliably pushes a hard task from 50 % to 70 % pass rate is a meaningful result. A single‑run “improvement” could just be variance.
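The aggregation itself is nothing exotic; a sketch with the standard library (illustrative, not Pitlane code):

```python
from statistics import mean, pstdev

def aggregate(pass_flags: list[bool]) -> dict:
    """Summarise repeated runs of one task/configuration pair."""
    scores = [1.0 if p else 0.0 for p in pass_flags]
    return {
        "runs": len(scores),
        "pass_rate": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": pstdev(scores),  # spread across runs: high stdev = noisy task
    }

# Ten runs each: a consistent 70 % looks very different from one lucky pass.
baseline = aggregate([True] * 5 + [False] * 5)
challenger = aggregate([True] * 7 + [False] * 3)
```

A high standard deviation on a task is itself useful telemetry: it tells you the task is too noisy to support conclusions without more runs.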

Multi‑Metric Reporting

Pitlane tracks:

Metric         What It Shows
Pass rate      Success frequency
Cost           Token usage / monetary cost
Time           Wall‑clock duration
Token usage    Raw token count

A Skill that improves pass rate by 5 % while tripling cost is a different trade‑off than one that achieves the same improvement at the same cost. All four metrics appear as columns in the HTML report so you can see the full picture.
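Comparing the two configurations metric by metric is a one-liner once the numbers exist; a hedged sketch (metric names here are made up for illustration):

```python
def relative_change(baseline: dict, challenger: dict) -> dict:
    # Per-metric relative delta; > 0 means the challenger is higher.
    return {
        metric: (challenger[metric] - baseline[metric]) / baseline[metric]
        for metric in baseline
    }

# A +8 % pass rate that triples cost shows up clearly side by side.
delta = relative_change(
    {"pass_rate": 0.60, "cost_usd": 0.10, "seconds": 40.0},
    {"pass_rate": 0.65, "cost_usd": 0.30, "seconds": 42.0},
)
```

Whether that trade-off is acceptable depends on your context; the point is that you see it instead of guessing.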

Current Provider Support

  • Claude Code
  • Mistral Vibe
  • OpenCode
  • IBM Bob

(as of the time of writing)


Why Not Use an Existing Eval Tool?

There are good, widely‑used tools in this space: promptfoo, Braintrust, LangSmith, DeepEval, and others. They solve real problems. The question is whether they solve this problem without requiring you to build the scaffolding yourself.

Promptfoo as a Representative Example

Promptfoo is mature, well‑documented, and genuinely extensible. It runs real agent sessions via its Claude Agent SDK and Codex SDK providers, so the agent actually executes and files get written. So far, so good.

The Gap: Assertion Layer

Promptfoo’s built‑in assertions are primarily oriented around validating the agent’s returned text. In their coding‑agent guide, one example verification pattern is a JavaScript assertion that parses the agent’s final text for keywords like “passed” or “success”:

const text = String(output).toLowerCase();
const passed = text.includes('passed') || text.includes('success');

That assertion passes when the agent says the tests passed—it does not verify that the tests actually passed. A model that narrates success while producing broken code would pass; a model that silently produces correct code with a terse “done” might fail. This is fine for some workflows, but it is not the same as asserting on the produced artifacts as first‑class primitives.

Promptfoo’s JavaScript assertion API is powerful enough to do better—you can require('fs') and require('child_process') and wire up real filesystem checks yourself. However, you end up writing boilerplate for every benchmark, managing your own working‑directory scoping, and handling fixture isolation manually. Their documentation even acknowledges the gap:

“The agent’s output is treated as the source of truth; if you need to verify side‑effects (files written, commands executed), you must implement that logic yourself.”

Pitlane was built to fill exactly that gap: first‑class, deterministic assertions on side‑effects, with built‑in support for repeated runs, statistical aggregation, and multi‑metric reporting.


TL;DR

  • Problem: Iterating on Skills/MCP servers without quantitative feedback leads to decisions based on vague “vibes.”
  • Solution: Pitlane—a CLI tool that runs deterministic, repeatable evals comparing baseline vs. challenger configurations on real tasks.
  • Benefits: Objective pass‑rate, cost, and time metrics; deterministic assertions (file existence, exit codes, pattern matches); optional fuzzy similarity metrics; statistical aggregation across runs; HTML reports for quick insight.
  • Why not just use existing tools? Existing tools either focus on LLM‑generated text or require you to write a lot of boilerplate to assert on side‑effects. Pitlane gives you that out of the box.

Give Pitlane a spin, and turn those “vibes” into hard data you can trust. 🚀

Benchmarks That Don’t Lie to You

Measurement helps, but it can also mislead. Three failure modes are worth keeping in mind.

Gaming Your Own Benchmark

When a metric becomes a target, behavior adjusts to hit the target rather than the underlying goal (Goodhart's law). Two design choices guard against this:

  • Baseline/challenger structure – you’re not asking “does this pass?” in isolation; you’re asking “does this beat the baseline?”
  • Diverse task set – include tasks your Skill wasn’t specifically designed for. If adjacent tasks regress when your target tasks improve, you have a problem.

Pass Rate Is a Goal Metric, Not the Whole Picture

Pass rate tells you whether the output was correct, but it doesn’t tell you what it cost to get there.

  • Pitlane tracks tokens, cost, and time alongside pass rates.
  • A Skill that lifts pass rate from 60 % to 80 % while doubling token cost is a different trade‑off than one that achieves the same improvement at the same cost.
  • The weighted score is distinct from the binary pass rate – a task where the critical assertion is weighted 3× tells a different story than a flat count.
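The gap between a flat count and a weighted score is easy to see in a sketch (assertion names and weights below are made up for illustration):

```python
def flat_pass_rate(results: dict[str, bool]) -> float:
    # Every assertion counts equally.
    return sum(results.values()) / len(results)

def weighted_score(results: dict[str, bool], weights: dict[str, float]) -> float:
    # A heavily weighted critical assertion dominates the score.
    total = sum(weights.values())
    return sum(weights[name] for name, ok in results.items() if ok) / total

results = {"file_written": True, "lint_clean": True, "tests_pass": False}
weights = {"file_written": 1.0, "lint_clean": 1.0, "tests_pass": 3.0}
# Flat: 2/3 of assertions passed -- looks decent.
# Weighted: 2/5 -- the 3x-weighted critical assertion failing dominates.
```

Two runs with the same flat rate can tell very different stories once the critical assertion is weighted.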

Your Context Is Not Someone Else’s Context

A generic benchmark shows how an assistant performs on generic tasks, but the meaningful signal comes from tasks you write yourself, against fixture directories that reflect your actual project structure, with assertions that match what “done” means in your specific context.

Borrowing a benchmark wholesale and optimizing against it is still measuring someone else’s problem.


What This Changes

The question “Is this actually better?” becomes answerable.

  • When you add a new tool to an MCP server, you can benchmark before and after and see whether the task that motivated the tool now passes more reliably.
  • When you tighten a prompt in a Skill, you can see whether that tightening broke anything on tasks that previously passed.

Without measurement, every change is a vibe. With measurement, you have a signal. The signal is not perfect—benchmarks can be gamed, task sets can be incomplete, and improvements on a small task set may not generalize. But noisy measurement beats no measurement. You can improve your task set over time; you cannot improve intuition alone.

The lap times do not lie.


Try It

Pitlane is open source, takes a few minutes to set up, and is documented at the repository:

https://github.com/pitlane-ai/pitlane

If you are building MCP servers or AI‑coding Skills and you want hard numbers instead of gut feel, this is the tool. We built it because we needed it, and we would rather more people be measuring than guessing.

  • Find a gap? Open an issue.
  • Add support for a new assistant or improve an existing one? Send a PR.

The codebase is Python, the architecture is straightforward, and contributions are welcome.
