The Fixed Metric
Source: Dev.to
The Core Insight
Andrej Karpathy’s autoresearch repository has exactly three files that matter:
| File | Role |
|---|---|
| Domain file 1 | The agent edits the training code, runs five‑minute experiments, checks validation loss, decides whether to keep or discard the result, and loops – all night, without you. |
| Domain file 2 | Same as above – the second “agent domain” file. |
| program.md | Your file. You write it once, then go to sleep. The agent never touches it, and you never touch the agent’s files. It is the arena – the interface between you and the agent. |
program.md is the most important artifact in the system.
It is not the training code, nor the model architecture. It is the document that tells the agent how to think about the problem.
Why the Fixed 5‑Minute Budget Matters
- Every run gets exactly 5 minutes – not “roughly 5 minutes”, not “5 minutes unless the experiment looks promising”.
- The same vocabulary‑size‑independent metric is used for every run.
This fixed budget does most of the heavy lifting:
- Without it you would have 100 data points that can’t be meaningfully compared.
- A run that trains for 8 min on a smaller architecture versus a run that trains for 3 min on a larger one – which is better? You can’t know; the comparison breaks.
The fixed metric is what makes the arena work. It is the one thing the agent cannot change. The agent may modify architecture, optimizer, hyper‑parameters, etc., but the evaluation criterion stays human‑defined.
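The keep-or-discard loop under a fixed budget can be sketched as a small driver. This is an illustrative reconstruction, not Karpathy’s actual code: `run_experiment`, `propose`, and `revert` are hypothetical names, and a generic validation loss stands in for the real metric.

```python
import time

def run_experiment(train_step, eval_metric, budget_s=300):
    # Every run gets exactly the same wall-clock budget (5 minutes),
    # so scores from different architectures stay comparable.
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        train_step()
    return eval_metric()  # the one thing the agent cannot change

def overnight_loop(propose, revert, train_step, eval_metric,
                   rounds=100, budget_s=300):
    best = run_experiment(train_step, eval_metric, budget_s)  # baseline
    kept = 0
    for _ in range(rounds):
        propose()                                  # agent edits training code
        score = run_experiment(train_step, eval_metric, budget_s)
        if score < best:                           # lower validation loss wins
            best, kept = score, kept + 1           # keep the improvement
        else:
            revert()                               # discard and keep looping
    return kept, best
```

Because `budget_s` and `eval_metric` are fixed parameters rather than something `propose` can touch, every one of the 100 scores lands on the same scale.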
Design Question for Every Arena: What is the one thing that cannot move?
Without an anchoring criterion, the arena has no walls. The agent optimizes toward a target that is also shifting, leaving you with a sea of incomparable experiments.
Patterns Across Domains
| Domain | Artifact (Human‑written) | Purpose |
|---|---|---|
| ML research | program.md | Tells the agent how to think about research |
| Code generation (Claude Code) | CLAUDE.md | Logical router: hard rules, decision trees, end conditions |
| Task definition | TASK_CONTRACT.md | Defines “done” before a session starts |
| Governance | Governance framework (policy docs) | Defines authority boundaries, escalation paths, audit trails |
All of these are specifications that shape how the agent thinks – the primary artifact of human creative work.
Concrete Examples
1. Team Workflow with Modes
A team running agents on a shared codebase defines mode separation:
- Discover – explore the problem space.
- Plan – outline a solution.
- Build – implement the plan.
- Review – evaluate the output.
- Bugfix – fix issues.
- Loop – iterate.
Each mode is a different arena with its own rules. The agent cannot jump straight to Build; it must pass through Discover first. The modes act as walls preventing the agent from running indefinitely in the wrong direction.
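Mode separation is effectively a state machine with human-written transition rules. A minimal sketch, assuming one plausible transition table (the mode names come from the list above; the specific allowed edges, and modeling “Loop” as the Review → Discover transition, are assumptions):

```python
from enum import Enum

class Mode(Enum):
    DISCOVER = "discover"
    PLAN = "plan"
    BUILD = "build"
    REVIEW = "review"
    BUGFIX = "bugfix"

# Allowed transitions: walls the agent cannot cross.
# "Loop" is modeled here as REVIEW -> DISCOVER, starting another iteration.
ALLOWED = {
    Mode.DISCOVER: {Mode.PLAN},
    Mode.PLAN:     {Mode.BUILD, Mode.DISCOVER},
    Mode.BUILD:    {Mode.REVIEW},
    Mode.REVIEW:   {Mode.BUGFIX, Mode.DISCOVER},
    Mode.BUGFIX:   {Mode.REVIEW},
}

def transition(current: Mode, requested: Mode) -> Mode:
    # The agent requests a mode change; the arena grants or blocks it.
    if requested not in ALLOWED[current]:
        raise PermissionError(f"{current.value} -> {requested.value} blocked")
    return requested
```

The agent can do whatever it likes inside a mode, but jumping from Discover straight to Build raises an error instead of silently drifting.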
2. Organizational Governance
An organization deploying agents across workflows builds a governance framework:
- Authorization scope – what the agent can do alone.
- Escalation paths – when human sign‑off is required.
- Audit trails – logging of decisions and actions.
This framework is an arena at institutional scale, providing the same anchoring principle as program.md.
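The three governance components map directly onto a gate in front of every agent action. The sketch below is hypothetical throughout: the action names, the two-tier split, and the sign-off mechanism are illustrative stand-ins for real policy documents.

```python
# Actions the agent may take alone (authorization scope) versus actions
# that require human sign-off (escalation path). Names are illustrative.
AUTHORIZED = {"read_repo", "run_tests", "open_draft_pr"}
ESCALATE = {"merge_pr", "deploy", "modify_policy"}

audit_log = []  # audit trail: every decision is recorded, allowed or not

def request_action(agent, action, approved_by=None):
    if action in AUTHORIZED:
        verdict = "allowed"
    elif action in ESCALATE and approved_by:
        verdict = f"allowed (signed off by {approved_by})"
    else:
        verdict = "denied: escalation required"
    audit_log.append((agent, action, verdict))
    return verdict
```

Note that the gate itself, like program.md, is human-written and sits outside anything the agent can edit.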
3. Failure Without an Arena
Victor Taelin’s experiment
Budget: $1,000, 2 days.
Outcome: 18 rounds of wrong work.
There was no fixed metric, no anchoring criterion, no arena. The agents optimized efficiently toward the wrong answer because nothing in the system could tell them they were answering the wrong question. This is the failure mode the arena prevents.
The Discipline Lies in the Specification
- The discipline is not in the agent; it is in the specification (the human‑written file).
- Karpathy’s program.md is already ~90 % AI‑written, but the anchor – the validation metric val_bpb – does not move. The agent can rewrite everything else, but the evaluation criterion stays human‑defined.
Recursive Self‑Improvement Principle
Design Principle: Recursive self‑improvement requires at least one layer that cannot self‑improve.
The evaluation criterion for a lower layer must come from a higher, immutable layer.
Foundation’s answer to this is provenance: the evaluator decides what knowledge gets promoted to semantic memory, but the criteria themselves remain fixed.
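That layering can be made concrete as a promotion gate whose criteria live outside the agent’s reach. Everything here is an assumption for illustration – the thresholds, field names, and `promote` function are invented, not Foundation’s actual interface:

```python
# Fixed, human-chosen criteria: module-level constants the agent never edits.
MIN_CONFIDENCE = 0.9
REQUIRED_FIELDS = {"claim", "evidence", "source"}

def promote(candidate, semantic_memory):
    # The evaluator applies the immutable criteria; only candidates that
    # pass are promoted into semantic memory.
    meets = (REQUIRED_FIELDS.issubset(candidate)
             and candidate.get("confidence", 0.0) >= MIN_CONFIDENCE)
    if meets:
        semantic_memory.append(candidate)
    return meets
```

The agent can generate as many candidates as it likes; the layer that decides what counts as knowledge does not self-improve.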
Takeaways
- Identify the immutable metric (the arena’s wall) before letting agents iterate.
- Write a single, clear specification file (program.md, CLAUDE.md, TASK_CONTRACT.md, etc.) that encodes this metric and the high‑level strategy.
- Allow agents to freely modify everything else (code, hyper‑parameters, architecture) while respecting the fixed evaluation criterion.
- Use mode separation or governance frameworks to create additional walls at larger scales.
When the arena is well‑defined, agents become powerful research partners; without it, they are just fast but directionless automata.
The Evolution of the Bottleneck
Evaluation is human‑defined and human‑maintained.
The agent generates knowledge candidates, but humans decide what counts as knowledge worth keeping. That layer does not automate.
The loop can run indefinitely, but the anchoring criterion must stay in place.
From Code to Prompt to program.md
- A year ago the bottleneck was writing the code.
- Then it became writing the prompt.
- Now the bottleneck is writing program.md – the document that tells the agent how to think about the entire problem domain.
“The best labs won’t have the most engineers. They’ll have the best program.md.”
Why This Matters
- The leverage available to a single person with a well‑designed arena is genuinely enormous.
- Karpathy runs 100 ML experiments overnight on one GPU.
- A developer with a good CLAUDE.md ships what used to require a whole team.
- A researcher with a precise specification wakes up to 15 kept improvements from 83 experiments she didn’t run herself.
But the bottleneck moved – it didn’t disappear.
What Writing program.md Really Requires
- Deep understanding of the problem domain – you must define what success looks like before the experiments begin.
- Choosing the right metric and knowing why it matters.
- Building walls that are:
- Tight enough to prevent drift,
- Loose enough to allow genuine discovery.
That’s harder than writing code.
- Code either compiles or it doesn’t.
- An arena either produces useful experiments or it produces a hundred comparable runs toward the wrong answer – and you might not know which until you’ve slept through several nights of agent activity.
The Real Shift
The shift is not from hard work to easy work.
It is from the hard work of execution to the harder work of specification.
- The agent handles the experiments.
- The human handles the judgment about what the experiments should be measuring.
That judgment doesn’t automate. The loop runs, but the arena stays yours.
Series Context
This post is part of a series on what AI actually changes in software development.
Previous pieces:
- The Gatekeeping Panic
- The Meter Was Always Running
- Who Said What to Whom
- The Token Economy
- I Shipped Broken Code and Wrote an Article About It