The Death of TDD: Why 'Evaluation Engineering' is the New Source Code
Source: Dev.to
I recently watched a junior engineer try to write a unit test for an LLM agent.
They asserted that response == "I can help with that". The test failed because the AI replied, "I would be happy to help with that." The engineer sighed, updated the string, and ran it again. It failed again, with yet another phrasing.
This is the state of AI engineering today: we are trying to force probabilistic systems into deterministic boxes, and it is breaking our workflows.
In traditional software, we write the implementation (parse_date()) and then the test (assert parse_date("2024-01-01") == date(2024, 1, 1)). But with AI, the AI writes the implementation. Our job is no longer to write the logic; our job is to write the exam.
I call this Evaluation Engineering, and it is the most valuable code you will write this year.
The Paradigm Shift: From TDD to Eval‑DD
In the old world, the Human was the Coder and the Machine was the Executor. In the new world, the AI is the Coder and the Human is the Examiner.
You can’t write a unit test that covers every creative variation of an AI’s answer. Instead, you need to shift from Test‑Driven Development (TDD) to Evaluation‑Driven Development (Eval‑DD).
The 3 Pillars of Evaluation Engineering
I’ve built a simple framework to replace my unit tests. It consists of three core components that every AI codebase needs.
1. The Golden Dataset (The “Spec”)
Stop writing prose specifications. They are useless to an LLM. In Eval‑DD, the dataset is the specification.
# The Dataset IS the Spec
dataset.add_case(
    id="edge_001",
    input="Parse this date: 2024-13-01",  # Invalid month
    expected_output="ERROR",
    tags=["invalid", "edge_case"],
    difficulty="hard"
)
This dataset defines exactly what “good” looks like. It is the single source of truth. If the AI passes this dataset, it is ready for production; if it fails, it is not.
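The article doesn't show the dataset class itself, so here is a minimal sketch of what a `GoldenDataset` supporting `add_case` might look like; the class and field names beyond those used in the snippet above are assumptions, not the author's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    """One row of the spec: an input and the output we demand for it."""
    id: str
    input: str
    expected_output: str
    tags: list = field(default_factory=list)
    difficulty: str = "normal"

class GoldenDataset:
    """A named collection of cases. The dataset IS the spec."""
    def __init__(self, name: str):
        self.name = name
        self.cases: list[Case] = []

    def add_case(self, **kwargs) -> "GoldenDataset":
        self.cases.append(Case(**kwargs))
        return self  # allow chaining case definitions

dataset = GoldenDataset("date_parsing")
dataset.add_case(
    id="edge_001",
    input="Parse this date: 2024-13-01",  # invalid month
    expected_output="ERROR",
    tags=["invalid", "edge_case"],
    difficulty="hard",
)
```

Tags and difficulty make it easy to slice the report later, e.g. "pass rate on hard edge cases only."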
2. The Scoring Rubric (The “Judge”)
Most teams grade on binary correctness, but real‑world answers can be correct but toxic or safe but useless. A ScoringRubric class allows multi‑dimensional grading across weighted axes.
rubric = ScoringRubric("Customer Service Rubric", "Evaluates correctness AND tone")
# Correctness is important...
rubric.add_criteria(
    dimension="correctness",
    weight=0.5,
    description="Does it solve the problem?",
    evaluator=correctness_evaluator
)
# ...but so is not being a jerk.
rubric.add_criteria(
    dimension="tone",
    weight=0.4,
    description="Is it polite and empathetic?",
    evaluator=tone_evaluator
)
If the AI answers "Just click forgot password, duh," it gets:
- Correctness: 10/10
- Tone: 0/10
- Final Score: 5/10 (0.5 × 10 + 0.4 × 0) — Fail
This captures nuance that a simple assert statement misses.
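A minimal sketch of such a `ScoringRubric`, assuming each evaluator maps a response to a 0–10 score and the final score is the weighted sum. The keyword-matching evaluators below are toy stand-ins for the LLM-as-judge or classifier you would use in practice; note that with the article's weights (0.5 + 0.4 = 0.9) the maximum attainable score is 9/10.

```python
class ScoringRubric:
    """Grades a response along weighted dimensions, each scored 0-10."""
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
        self.criteria = []

    def add_criteria(self, dimension, weight, description, evaluator):
        self.criteria.append((dimension, weight, description, evaluator))

    def score(self, response: str) -> float:
        # Weighted sum of per-dimension scores (no normalization).
        return sum(w * ev(response) for _, w, _, ev in self.criteria)

# Toy evaluators -- real ones would call an LLM judge or classifier.
correctness_evaluator = lambda r: 10.0 if "forgot password" in r else 0.0
tone_evaluator = lambda r: 0.0 if "duh" in r else 10.0

rubric = ScoringRubric("Customer Service Rubric", "Evaluates correctness AND tone")
rubric.add_criteria("correctness", 0.5, "Does it solve the problem?", correctness_evaluator)
rubric.add_criteria("tone", 0.4, "Is it polite and empathetic?", tone_evaluator)

print(rubric.score("Just click forgot password, duh"))  # 5.0 -> Fail
```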
3. The Evaluation Runner (The “Test Suite”)
A runner executes the AI against the Golden Dataset and grades it with the Rubric, replacing pytest. It reports pass rates and indicates whether your prompt engineering worked.
runner = EvaluationRunner(dataset, rubric, my_ai_function)
results = runner.run(verbose=True)

if results['pass_rate'] > 0.9:
    print("🎉 AI meets requirements!")
else:
    print("❌ AI needs improvement")
Why This Matters
It changes how you work.
- Write the Rubric first. Before crafting prompts, define what success looks like.
- Iterate on the Prompt, not the Code. When a test fails, tweak the system prompt or few‑shot examples rather than rewriting Python logic.
- The “Source Code” moves. The intellectual property is no longer the wrapper code; the IP is the Evaluation Suite.
The Senior Engineer’s New Job
If you're worried about AI taking your coding job, don't be. The job just changed.
The hard part isn't generating code anymore (Cursor and Copilot can do that). The hard part is:
- Defining the Golden Dataset (capturing edge cases).
- Tuning the Rubric (encoding engineering judgment into weights).
- Analyzing the Failures (figuring out why the AI messed up).
We are leaving the era of deterministic logic and entering the era of probabilistic engineering. Stop begging your AI to be good via prompts. Start grading it.