Did that actually help? Evaluating AI coding assistants with hard numbers

Published: March 2, 2026 at 03:58 AM EST
9 min read
Source: Dev.to

Building Better AI Coding Assistants: Introducing Pitlane

You are building a Skill, an MCP server, or a custom prompt strategy that is supposed to make an AI coding assistant better at a specific job. You make a change. The next session feels smoother. The agent seems to reach for the right context at the right time.

But how do you know?

That question came up in two parallel problems.

  • I was iterating on MCP servers to support a coding agent—new tool, new tool definition, new prompting strategy. Each change felt like an improvement. Sessions seemed smoother, but I had no numbers—just vibes.
  • A colleague was doing the same thing from the other side: building and refining AI coding Skills—structured prompt packs that teach the agent how to work in a specific context. Again, a lot of iteration, a lot of gut feel, no hard signal on whether the changes were actually moving the needle.

We joined forces and built something to fix this. The result is Pitlane—named after the place in motorsport where engineers swap parts, adjust the setup, check the telemetry, and find out if the next lap is faster.


The Problem with Vibes

When you change an MCP server or a Skill, you are changing something about the environment the agent operates in. The agent gets different tools, different context, different instructions.

Those changes can have real effects:

  • Pass rates on tasks go up or down.
  • The agent takes fewer wrong turns.
  • Token costs change.
  • Time‑to‑completion changes.
  • Output quality improves or degrades.

Without measurement, you cannot tell which of those things happened. You cannot tell whether the last commit was an improvement or a regression. You cannot tell whether version 3 of your Skill is better than version 1.

You end up making decisions based on a handful of memorable sessions, which is not a reliable signal. Good sessions feel good. Bad sessions get rationalised. The data you are implicitly collecting is not representative.


What You Actually Need

You need to be able to answer a specific, repeatable question:

With my Skill or MCP present, does the agent complete this task better than without it?

That question has a structure:

  1. A defined task with explicit success criteria.
  2. Two configurations – a baseline (without your changes) and a challenger (with them).
  3. Deterministic assertions that verify success independently of the agent’s own judgement.
  4. A way to compare results across runs.

That structure is an eval—not a generic language‑model benchmark, but a benchmark for your Skill or MCP server, in your context, on tasks that actually matter to you.


What Pitlane Is

Pitlane is an open‑source command‑line tool for running those evals.

  • Define tasks in YAML.
  • Configure a baseline and one or more challengers.
  • Race them against each other.

The results give you numbers rather than impressions, so you can see whether your work is paying off.
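A task definition might look roughly like this. The field names below are illustrative only, not Pitlane's actual schema; check the repository docs for the real format:

```yaml
# Hypothetical task definition -- field names are illustrative, not Pitlane's schema.
name: add-health-endpoint
prompt: "Add a /health endpoint to the Flask app that returns HTTP 200."
fixture: ./fixtures/flask-app        # working directory copied fresh for each run
assertions:
  - type: file_exists
    path: app/health.py
  - type: command_exit_code
    command: pytest tests/test_health.py
    expect: 0
```

The same task file runs unchanged against the baseline and every challenger, which is what makes the comparison meaningful.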

The Loop

  1. Tune your Skill/MCP.
  2. Race the baseline vs. challenger.
  3. Check the telemetry (pass rate, cost, time, token usage).
  4. Repeat.

Deterministic Assertions

  • File‑existence checks.
  • Command exit codes.
  • Pattern matching.

Either the file is there and valid, or it isn’t—no LLM‑as‑judge, no subjectivity baked into the measurement.

When you need fuzzy matching for documentation or generated content, Pitlane offers similarity metrics (ROUGE, BLEU, BERTScore, cosine similarity) with configurable thresholds. These are deterministic numeric metrics, not a second model grading your output.
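To see why a similarity metric is still deterministic, here is a toy bag-of-words cosine with a threshold. This is an assumption about how such a check works in general, not Pitlane's implementation (which offers ROUGE, BLEU, and BERTScore as well):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    # Bag-of-words cosine: the same pair of strings always scores the same.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def fuzzy_assert(candidate: str, reference: str, threshold: float = 0.8) -> bool:
    # A numeric cutoff, not a second model grading the output.
    return cosine_similarity(candidate, reference) >= threshold
```

The threshold turns a continuous score into a pass/fail decision you can aggregate like any other assertion.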

Handling Non‑Determinism

Because agent outputs are non‑deterministic, Pitlane supports repeated runs with aggregated statistics (average, min, max, standard deviation). A Skill that reliably pushes a hard task from 50 % to 70 % pass rate is a meaningful result. A single‑run “improvement” could just be variance.
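The aggregation itself is nothing exotic; a sketch with the standard library (illustrative, not Pitlane code):

```python
from statistics import mean, pstdev

def aggregate(pass_flags: list[bool]) -> dict:
    """Summarise repeated runs of one task/configuration pair."""
    scores = [1.0 if p else 0.0 for p in pass_flags]
    return {
        "runs": len(scores),
        "pass_rate": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": pstdev(scores),  # spread across runs: high stdev = noisy task
    }

# Ten runs each: a consistent 70 % looks very different from one lucky pass.
baseline = aggregate([True] * 5 + [False] * 5)
challenger = aggregate([True] * 7 + [False] * 3)
```

A high standard deviation on a task is itself useful telemetry: it tells you the task is too noisy to support conclusions without more runs.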

Multi‑Metric Reporting

Pitlane tracks:

Metric         What It Shows
Pass rate      Success frequency
Cost           Token usage / monetary cost
Time           Wall‑clock duration
Token usage    Raw token count

A Skill that improves pass rate by 5 % while tripling cost is a different trade‑off than one that achieves the same improvement at the same cost. All four metrics appear as columns in the HTML report so you can see the full picture.
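Comparing the two configurations metric by metric is a one-liner once the numbers exist; a hedged sketch (metric names here are made up for illustration):

```python
def relative_change(baseline: dict, challenger: dict) -> dict:
    # Per-metric relative delta; > 0 means the challenger is higher.
    return {
        metric: (challenger[metric] - baseline[metric]) / baseline[metric]
        for metric in baseline
    }

# A +8 % pass rate that triples cost shows up clearly side by side.
delta = relative_change(
    {"pass_rate": 0.60, "cost_usd": 0.10, "seconds": 40.0},
    {"pass_rate": 0.65, "cost_usd": 0.30, "seconds": 42.0},
)
```

Whether that trade-off is acceptable depends on your context; the point is that you see it instead of guessing.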

Current Provider Support

  • Claude Code
  • Mistral Vibe
  • OpenCode
  • IBM Bob

(as of the time of writing)


Why Not Use an Existing Eval Tool?

There are good, widely‑used tools in this space: promptfoo, Braintrust, LangSmith, DeepEval, and others. They solve real problems. The question is whether they solve this problem without requiring you to build the scaffolding yourself.

Promptfoo as a Representative Example

Promptfoo is mature, well‑documented, and genuinely extensible. It runs real agent sessions via its Claude Agent SDK and Codex SDK providers, so the agent actually executes and files get written. So far, so good.

The Gap: Assertion Layer

Promptfoo’s built‑in assertions are primarily oriented around validating the agent’s returned text. In their coding‑agent guide, one example verification pattern is a JavaScript assertion that parses the agent’s final text for keywords like “passed” or “success”:

const text = String(output).toLowerCase();
const passed = text.includes('passed') || text.includes('success');

That assertion passes when the agent says the tests passed—it does not verify that the tests actually passed. A model that narrates success while producing broken code would pass; a model that silently produces correct code with a terse “done” might fail. This is fine for some workflows, but it is not the same as asserting on the produced artifacts as first‑class primitives.

Promptfoo’s JavaScript assertion API is powerful enough to do better—you can require('fs') and require('child_process') and wire up real filesystem checks yourself. However, you end up writing boilerplate for every benchmark, managing your own working‑directory scoping, and handling fixture isolation manually. Their documentation even acknowledges the gap:

“The agent’s output is treated as the source of truth; if you need to verify side‑effects (files written, commands executed), you must implement that logic yourself.”

Pitlane was built to fill exactly that gap: first‑class, deterministic assertions on side‑effects, with built‑in support for repeated runs, statistical aggregation, and multi‑metric reporting.


TL;DR

  • Problem: Iterating on Skills/MCP servers without quantitative feedback leads to decisions based on vague “vibes.”
  • Solution: Pitlane—a CLI tool that runs deterministic, repeatable evals comparing baseline vs. challenger configurations on real tasks.
  • Benefits: Objective pass‑rate, cost, and time metrics; deterministic assertions (file existence, exit codes, pattern matches); optional fuzzy similarity metrics; statistical aggregation across runs; HTML reports for quick insight.
  • Why not just use existing tools? Existing tools either focus on LLM‑generated text or require you to write a lot of boilerplate to assert on side‑effects. Pitlane gives you that out of the box.

Give Pitlane a spin, and turn those “vibes” into hard data you can trust. 🚀

Benchmarks That Don’t Lie to You

Measurement helps, but it can also mislead. Three failure modes are worth keeping in mind.

Gaming Your Own Benchmark

When a metric becomes a target, behavior adjusts to hit the target rather than the underlying goal (Goodhart's law). Two design choices guard against this:

  • Baseline/challenger structure – you’re not asking “does this pass?” in isolation; you’re asking “does this beat the baseline?”
  • Diverse task set – include tasks your Skill wasn’t specifically designed for. If adjacent tasks regress when your target tasks improve, you have a problem.

Pass Rate Is a Goal Metric, Not the Whole Picture

Pass rate tells you whether the output was correct, but it doesn’t tell you what it cost to get there.

  • Pitlane tracks tokens, cost, and time alongside pass rates.
  • A Skill that lifts pass rate from 60 % to 80 % while doubling token cost is a different trade‑off than one that achieves the same improvement at the same cost.
  • The weighted score is distinct from the binary pass rate – a task where the critical assertion is weighted 3× tells a different story than a flat count.
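The gap between a flat count and a weighted score is easy to see in a sketch (assertion names and weights below are made up for illustration):

```python
def flat_pass_rate(results: dict[str, bool]) -> float:
    # Every assertion counts equally.
    return sum(results.values()) / len(results)

def weighted_score(results: dict[str, bool], weights: dict[str, float]) -> float:
    # A heavily weighted critical assertion dominates the score.
    total = sum(weights.values())
    return sum(weights[name] for name, ok in results.items() if ok) / total

results = {"file_written": True, "lint_clean": True, "tests_pass": False}
weights = {"file_written": 1.0, "lint_clean": 1.0, "tests_pass": 3.0}
# Flat: 2/3 of assertions passed -- looks decent.
# Weighted: 2/5 -- the 3x-weighted critical assertion failing dominates.
```

Two runs with the same flat rate can tell very different stories once the critical assertion is weighted.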

Your Context Is Not Someone Else’s Context

A generic benchmark shows how an assistant performs on generic tasks, but the meaningful signal comes from tasks you write yourself, against fixture directories that reflect your actual project structure, with assertions that match what “done” means in your specific context.

Borrowing a benchmark wholesale and optimizing against it is still measuring someone else’s problem.


What This Changes

The question “Is this actually better?” becomes answerable.

  • When you add a new tool to an MCP server, you can benchmark before and after and see whether the task that motivated the tool now passes more reliably.
  • When you tighten a prompt in a Skill, you can see whether that tightening broke anything on tasks that previously passed.

Without measurement, every change is a vibe. With measurement, you have a signal. The signal is not perfect—benchmarks can be gamed, task sets can be incomplete, and improvements on a small task set may not generalize. But noisy measurement beats no measurement. You can improve your task set over time; you cannot improve intuition alone.

The lap times do not lie.


Try It

Pitlane is open source, takes a few minutes to set up, and is documented at the repository:

https://github.com/pitlane-ai/pitlane

If you are building MCP servers or AI‑coding Skills and you want hard numbers instead of gut feel, this is the tool. We built it because we needed it, and we would rather more people be measuring than guessing.

  • Find a gap? Open an issue.
  • Add support for a new assistant or improve an existing one? Send a PR.

The codebase is Python, the architecture is straightforward, and contributions are welcome.
