The Fixed Metric

Published: March 9, 2026 at 09:42 AM EDT
6 min read
Source: Dev.to

The Core Insight

Andrej Karpathy’s autoresearch repository has exactly three files that matter:

| File | Role |
| --- | --- |
| Domain file 1 | The agent edits the training code, runs five‑minute experiments, checks validation loss, decides whether to keep or discard the result, and loops – all night, without you. |
| Domain file 2 | Same as above – the second "agent domain" file. |
| program.md | Your file. You write it once, then go to sleep. The agent never touches it, and you never touch the agent's files. It is the arena – the interface between you and the agent. |

program.md is the most important artifact in the system.
It is not the training code, nor the model architecture. It is the document that tells the agent how to think about the problem.
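
The keep-or-discard loop described in the table above can be sketched in a few lines. This is an illustrative reconstruction, not code from the repository; every name here (`overnight_loop`, `mutate`, `run_and_evaluate`, `revert`) is hypothetical:

```python
def overnight_loop(mutate, run_and_evaluate, revert, baseline, iterations):
    """Sketch of the agent's nightly loop: mutate the training code,
    run for a fixed budget, keep the change only if validation
    improves. Lower scores (e.g. val_bpb) are better."""
    best = baseline
    kept = 0
    for _ in range(iterations):
        change = mutate()                 # agent edits the training code
        score = run_and_evaluate(change)  # fixed five-minute run
        if score < best:
            best = score                  # keep the change
            kept += 1
        else:
            revert(change)                # discard it
    return best, kept
```

The human never appears inside the loop; program.md only shapes what `mutate` is allowed to try and what `run_and_evaluate` measures.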

Why the Fixed 5‑Minute Budget Matters

  • Every run gets exactly 5 minutes – not “roughly 5 minutes”, not “5 minutes unless the experiment looks promising”.
  • The same vocabulary‑size‑independent metric is used for every run.

This fixed budget does most of the heavy lifting:

  • Without it you would have 100 data points that can’t be meaningfully compared.
  • A run that trains for 8 min on a smaller architecture versus a run that trains for 3 min on a larger one – which is better? You can’t know; the comparison breaks.

The fixed metric is what makes the arena work. It is the one thing the agent cannot change. The agent may modify architecture, optimizer, hyper‑parameters, etc., but the evaluation criterion stays human‑defined.
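
The metric in question, val_bpb, is validation loss expressed as bits per byte of raw text. A minimal sketch of the conversion, assuming the loss is summed in nats over the validation set (the numbers in the comment are made up):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a
    validation set into bits per byte of raw text.

    Dividing by *bytes* rather than *tokens* is what makes the metric
    vocabulary-size independent: two runs with different tokenizers
    produce different token counts, but the byte count of the
    validation text is the same for both.
    """
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical numbers: a run whose validation loss sums to 1.2e6 nats
# over a 1 MB validation set.
bpb = bits_per_byte(1_200_000, 1_000_000)
```

Because the denominator is a property of the data, not of the model, the agent can swap tokenizers freely and the runs remain comparable.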

Design Question for Every Arena: What is the one thing that cannot move?

Without an anchoring criterion, the arena has no walls. The agent optimizes toward a target that is also shifting, leaving you with a sea of incomparable experiments.

Patterns Across Domains

| Domain | Artifact (human‑written) | Purpose |
| --- | --- | --- |
| ML research | program.md | Tells the agent how to think about research |
| Code generation (Claude Code) | CLAUDE.md | Logical router: hard rules, decision trees, end conditions |
| Task definition | TASK_CONTRACT.md | Defines "done" before a session starts |
| Governance | Governance framework (policy docs) | Defines authority boundaries, escalation paths, audit trails |

All of these are specifications that shape how the agent thinks – the primary artifact of human creative work.

Concrete Examples

1. Team Workflow with Modes

A team running agents on a shared codebase defines mode separation:

  1. Discover – explore the problem space.
  2. Plan – outline a solution.
  3. Build – implement the plan.
  4. Review – evaluate the output.
  5. Bugfix – fix issues.
  6. Loop – iterate.

Each mode is a different arena with its own rules. The agent cannot jump straight to Build; it must pass through Discover first. The modes act as walls preventing the agent from running indefinitely in the wrong direction.
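
One way such walls can be enforced is an explicit transition table that the agent traverses but cannot edit. A minimal sketch; the specific transition rules below are illustrative, inferred from the mode list above:

```python
from enum import Enum, auto

class Mode(Enum):
    DISCOVER = auto()
    PLAN = auto()
    BUILD = auto()
    REVIEW = auto()
    BUGFIX = auto()
    LOOP = auto()

# Each mode lists the modes that may follow it. The table is part of
# the human-written spec; the agent can only move along its edges.
TRANSITIONS = {
    Mode.DISCOVER: {Mode.PLAN},
    Mode.PLAN: {Mode.BUILD},
    Mode.BUILD: {Mode.REVIEW},
    Mode.REVIEW: {Mode.BUGFIX, Mode.LOOP},
    Mode.BUGFIX: {Mode.REVIEW},
    Mode.LOOP: {Mode.DISCOVER},
}

def advance(current: Mode, requested: Mode) -> Mode:
    """Refuse any transition the spec does not allow."""
    if requested not in TRANSITIONS[current]:
        raise PermissionError(
            f"cannot jump from {current.name} to {requested.name}")
    return requested
```

Jumping straight from Discover to Build raises an error, which is exactly the wall the workflow is meant to provide.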

2. Organizational Governance

An organization deploying agents across workflows builds a governance framework:

  • Authorization scope – what the agent can do alone.
  • Escalation paths – when human sign‑off is required.
  • Audit trails – logging of decisions and actions.

This framework is an arena at institutional scale, providing the same anchoring principle as program.md.
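
At institutional scale the same anchoring idea can be sketched as a small authorization layer. The class and action names below are hypothetical, not drawn from any particular governance framework:

```python
from dataclasses import dataclass, field

@dataclass
class Governance:
    """Institutional-scale arena: fixed rules the agent operates inside."""
    autonomous_actions: frozenset   # what the agent may do alone
    escalation_actions: frozenset   # what requires human sign-off
    audit_log: list = field(default_factory=list)

    def authorize(self, action: str) -> bool:
        """Log every decision; return whether the agent may proceed alone."""
        if action in self.autonomous_actions:
            self.audit_log.append((action, "autonomous"))
            return True
        if action in self.escalation_actions:
            self.audit_log.append((action, "escalated"))
            return False
        self.audit_log.append((action, "denied"))
        return False
```

The three bullets map directly onto the three pieces: the two frozensets are the authorization scope and escalation paths, and the append-only log is the audit trail.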

3. Failure Without an Arena

Victor Taelin’s experiment
Budget: $1,000, 2 days.
Outcome: 18 rounds of wrong work.

There was no fixed metric, no anchoring criterion, no arena. The agents optimized efficiently toward the wrong answer because nothing in the system could tell them they were answering the wrong question. This is the failure mode the arena prevents.

The Discipline Lies in the Specification

  • The discipline is not in the agent; it is in the specification (the human‑written file).
  • Karpathy’s program.md is already ~90 % AI‑written, but the anchor – the validation metric val_bpb – does not move. The agent can rewrite everything else, but the evaluation criterion stays human‑defined.

Recursive Self‑Improvement Principle

Design Principle: Recursive self‑improvement requires at least one layer that cannot self‑improve.
The evaluation criterion for a lower layer must come from a higher, immutable layer.

Foundation’s answer to this is provenance: the evaluator decides what knowledge gets promoted to semantic memory, but the criteria themselves remain fixed.
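
The layering can be made concrete: the promotion decision is a function the lower layer calls but cannot rewrite. A toy sketch; the threshold value and names (`EVAL_THRESHOLD`, `keep_or_discard`) are illustrative, not from Foundation or the repository:

```python
# Human-defined and immutable within the loop: require at least a 5%
# improvement in validation bits-per-byte before a change is kept.
EVAL_THRESHOLD = 0.95

def keep_or_discard(baseline_bpb: float, candidate_bpb: float) -> bool:
    """Promote a candidate only if it clearly beats the baseline.

    The agent may rewrite how candidates are *generated*, but this
    criterion lives in a higher layer it cannot self-improve.
    """
    return candidate_bpb < baseline_bpb * EVAL_THRESHOLD
```

Everything below this function is free to evolve; the function itself is the layer that cannot.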

Takeaways

  1. Identify the immutable metric (the arena’s wall) before letting agents iterate.
  2. Write a single, clear specification file (program.md, CLAUDE.md, TASK_CONTRACT.md, etc.) that encodes this metric and the high‑level strategy.
  3. Allow agents to freely modify everything else (code, hyper‑parameters, architecture) while respecting the fixed evaluation criterion.
  4. Use mode separation or governance frameworks to create additional walls at larger scales.

When the arena is well‑defined, agents become powerful research partners; without it, they are just fast but directionless automata.

The Evolution of the Bottleneck

Evaluation is human‑defined and human‑maintained.
The agent generates knowledge candidates, but humans decide what counts as knowledge worth keeping. That layer does not automate.

The loop can run indefinitely, but the anchoring criterion must stay in place.

From Code to Prompt to program.md

  • A year ago the bottleneck was writing the code.
  • Then it became writing the prompt.
  • Now the bottleneck is writing program.md – the document that tells the agent how to think about the entire problem domain.

“The best labs won’t have the most engineers. They’ll have the best program.md.”

Why This Matters

  • The leverage available to a single person with a well‑designed arena is genuinely enormous.
  • Karpathy runs 100 ML experiments overnight on one GPU.
  • A developer with a good CLAUDE.md ships what used to require a whole team.
  • A researcher with a precise specification wakes up to 15 kept improvements from 83 experiments she didn’t run herself.

But the bottleneck moved – it didn’t disappear.

What Writing program.md Really Requires

  1. Deep understanding of the problem domain – you must define what success looks like before the experiments begin.
  2. Choosing the right metric and knowing why it matters.
  3. Building walls that are:
    • Tight enough to prevent drift,
    • Loose enough to allow genuine discovery.

That’s harder than writing code.

  • Code either compiles or it doesn’t.
  • An arena either produces useful experiments or it produces a hundred comparable runs toward the wrong answer – and you might not know which until you’ve slept through several nights of agent activity.

The Real Shift

The shift is not from hard work to easy work.
It is from the hard work of execution to the harder work of specification.

  • The agent handles the experiments.
  • The human handles the judgment about what the experiments should be measuring.

That judgment doesn’t automate. The loop runs, but the arena stays yours.

Series Context

This post is part of a series on what AI actually changes in software development.
Previous pieces:

  • The Gatekeeping Panic
  • The Meter Was Always Running
  • Who Said What to Whom
  • The Token Economy
  • I Shipped Broken Code and Wrote an Article About It