The Fixed Metric

Published: March 9, 2026 at 09:42 AM EDT
6 min read
Source: Dev.to

The Core Insight

Andrej Karpathy’s autoresearch repository has exactly three files that matter:

| File | Role |
| --- | --- |
| Domain file 1 | The agent edits the training code, runs five‑minute experiments, checks validation loss, decides whether to keep or discard the result, and loops – all night, without you. |
| Domain file 2 | Same as above – the second "agent domain" file. |
| program.md | Your file. You write it once, then go to sleep. The agent never touches it, and you never touch the agent's files. It is the arena – the interface between you and the agent. |

program.md is the most important artifact in the system.
It is not the training code, nor the model architecture. It is the document that tells the agent how to think about the problem.
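
The keep-or-discard loop described in the table above can be sketched in a few lines. This is an illustrative reconstruction, not code from the repository; every name here (`overnight_loop`, `mutate`, `run_and_evaluate`, `revert`) is hypothetical:

```python
def overnight_loop(mutate, run_and_evaluate, revert, baseline, iterations):
    """Sketch of the agent's nightly loop: mutate the training code,
    run for a fixed budget, keep the change only if validation
    improves. Lower scores (e.g. val_bpb) are better."""
    best = baseline
    kept = 0
    for _ in range(iterations):
        change = mutate()                 # agent edits the training code
        score = run_and_evaluate(change)  # fixed five-minute run
        if score < best:
            best = score                  # keep the change
            kept += 1
        else:
            revert(change)                # discard it
    return best, kept
```

The human never appears inside the loop; program.md only shapes what `mutate` is allowed to try and what `run_and_evaluate` measures.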

Why the Fixed 5‑Minute Budget Matters

  • Every run gets exactly 5 minutes – not “roughly 5 minutes”, not “5 minutes unless the experiment looks promising”.
  • The same vocabulary‑size‑independent metric is used for every run.

This fixed budget does most of the heavy lifting:

  • Without it you would have 100 data points that can’t be meaningfully compared.
  • A run that trains for 8 min on a smaller architecture versus a run that trains for 3 min on a larger one – which is better? You can’t know; the comparison breaks.

The fixed metric is what makes the arena work. It is the one thing the agent cannot change. The agent may modify architecture, optimizer, hyper‑parameters, etc., but the evaluation criterion stays human‑defined.
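
The metric in question, val_bpb, is validation loss expressed as bits per byte of raw text. A minimal sketch of the conversion, assuming the loss is summed in nats over the validation set (the numbers in the comment are made up):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a
    validation set into bits per byte of raw text.

    Dividing by *bytes* rather than *tokens* is what makes the metric
    vocabulary-size independent: two runs with different tokenizers
    produce different token counts, but the byte count of the
    validation text is the same for both.
    """
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical numbers: a run whose validation loss sums to 1.2e6 nats
# over a 1 MB validation set.
bpb = bits_per_byte(1_200_000, 1_000_000)
```

Because the denominator is a property of the data, not of the model, the agent can swap tokenizers freely and the runs remain comparable.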

Design Question for Every Arena: What is the one thing that cannot move?

Without an anchoring criterion, the arena has no walls. The agent optimizes toward a target that is also shifting, leaving you with a sea of incomparable experiments.

Patterns Across Domains

| Domain | Artifact (human‑written) | Purpose |
| --- | --- | --- |
| ML research | program.md | Tells the agent how to think about research |
| Code generation (Claude Code) | CLAUDE.md | Logical router: hard rules, decision trees, end conditions |
| Task definition | TASK_CONTRACT.md | Defines "done" before a session starts |
| Governance | Governance framework (policy docs) | Defines authority boundaries, escalation paths, audit trails |

All of these are specifications that shape how the agent thinks – the primary artifact of human creative work.

Concrete Examples

1. Team Workflow with Modes

A team running agents on a shared codebase defines mode separation:

  1. Discover – explore the problem space.
  2. Plan – outline a solution.
  3. Build – implement the plan.
  4. Review – evaluate the output.
  5. Bugfix – fix issues.
  6. Loop – iterate.

Each mode is a different arena with its own rules. The agent cannot jump straight to Build; it must pass through Discover first. The modes act as walls preventing the agent from running indefinitely in the wrong direction.
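
One way such walls can be enforced is an explicit transition table that the agent traverses but cannot edit. A minimal sketch; the specific transition rules below are illustrative, inferred from the mode list above:

```python
from enum import Enum, auto

class Mode(Enum):
    DISCOVER = auto()
    PLAN = auto()
    BUILD = auto()
    REVIEW = auto()
    BUGFIX = auto()
    LOOP = auto()

# Each mode lists the modes that may follow it. The table is part of
# the human-written spec; the agent can only move along its edges.
TRANSITIONS = {
    Mode.DISCOVER: {Mode.PLAN},
    Mode.PLAN: {Mode.BUILD},
    Mode.BUILD: {Mode.REVIEW},
    Mode.REVIEW: {Mode.BUGFIX, Mode.LOOP},
    Mode.BUGFIX: {Mode.REVIEW},
    Mode.LOOP: {Mode.DISCOVER},
}

def advance(current: Mode, requested: Mode) -> Mode:
    """Refuse any transition the spec does not allow."""
    if requested not in TRANSITIONS[current]:
        raise PermissionError(
            f"cannot jump from {current.name} to {requested.name}")
    return requested
```

Jumping straight from Discover to Build raises an error, which is exactly the wall the workflow is meant to provide.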

2. Organizational Governance

An organization deploying agents across workflows builds a governance framework:

  • Authorization scope – what the agent can do alone.
  • Escalation paths – when human sign‑off is required.
  • Audit trails – logging of decisions and actions.

This framework is an arena at institutional scale, providing the same anchoring principle as program.md.
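
At institutional scale the same anchoring idea can be sketched as a small authorization layer. The class and action names below are hypothetical, not drawn from any particular governance framework:

```python
from dataclasses import dataclass, field

@dataclass
class Governance:
    """Institutional-scale arena: fixed rules the agent operates inside."""
    autonomous_actions: frozenset   # what the agent may do alone
    escalation_actions: frozenset   # what requires human sign-off
    audit_log: list = field(default_factory=list)

    def authorize(self, action: str) -> bool:
        """Log every decision; return whether the agent may proceed alone."""
        if action in self.autonomous_actions:
            self.audit_log.append((action, "autonomous"))
            return True
        if action in self.escalation_actions:
            self.audit_log.append((action, "escalated"))
            return False
        self.audit_log.append((action, "denied"))
        return False
```

The three bullets map directly onto the three pieces: the two frozensets are the authorization scope and escalation paths, and the append-only log is the audit trail.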

3. Failure Without an Arena

Victor Taelin’s experiment
Budget: $1,000, 2 days.
Outcome: 18 rounds of wrong work.

There was no fixed metric, no anchoring criterion, no arena. The agents optimized efficiently toward the wrong answer because nothing in the system could tell them they were answering the wrong question. This is the failure mode the arena prevents.

The Discipline Lies in the Specification

  • The discipline is not in the agent; it is in the specification (the human‑written file).
  • Karpathy’s program.md is already ~90 % AI‑written, but the anchor – the validation metric val_bpb – does not move. The agent can rewrite everything else, but the evaluation criterion stays human‑defined.

Recursive Self‑Improvement Principle

Design Principle: Recursive self‑improvement requires at least one layer that cannot self‑improve.
The evaluation criterion for a lower layer must come from a higher, immutable layer.

Foundation’s answer to this is provenance: the evaluator decides what knowledge gets promoted to semantic memory, but the criteria themselves remain fixed.
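
The layering can be made concrete: the promotion decision is a function the lower layer calls but cannot rewrite. A toy sketch; the threshold value and names (`EVAL_THRESHOLD`, `keep_or_discard`) are illustrative, not from Foundation or the repository:

```python
# Human-defined and immutable within the loop: require at least a 5%
# improvement in validation bits-per-byte before a change is kept.
EVAL_THRESHOLD = 0.95

def keep_or_discard(baseline_bpb: float, candidate_bpb: float) -> bool:
    """Promote a candidate only if it clearly beats the baseline.

    The agent may rewrite how candidates are *generated*, but this
    criterion lives in a higher layer it cannot self-improve.
    """
    return candidate_bpb < baseline_bpb * EVAL_THRESHOLD
```

Everything below this function is free to evolve; the function itself is the layer that cannot.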

Takeaways

  1. Identify the immutable metric (the arena’s wall) before letting agents iterate.
  2. Write a single, clear specification file (program.md, CLAUDE.md, TASK_CONTRACT.md, etc.) that encodes this metric and the high‑level strategy.
  3. Allow agents to freely modify everything else (code, hyper‑parameters, architecture) while respecting the fixed evaluation criterion.
  4. Use mode separation or governance frameworks to create additional walls at larger scales.

When the arena is well‑defined, agents become powerful research partners; without it, they are just fast but directionless automata.

The Evolution of the Bottleneck

Evaluation is human‑defined and human‑maintained.
The agent generates knowledge candidates, but humans decide what counts as knowledge worth keeping. That layer does not automate.

The loop can run indefinitely, but the anchoring criterion must stay in place.

From Code to Prompt to program.md

  • A year ago the bottleneck was writing the code.
  • Then it became writing the prompt.
  • Now the bottleneck is writing program.md – the document that tells the agent how to think about the entire problem domain.

“The best labs won’t have the most engineers. They’ll have the best program.md.”

Why This Matters

  • The leverage available to a single person with a well‑designed arena is genuinely enormous.
  • Karpathy runs 100 ML experiments overnight on one GPU.
  • A developer with a good CLAUDE.md ships what used to require a whole team.
  • A researcher with a precise specification wakes up to 15 kept improvements from 83 experiments she didn’t run herself.

But the bottleneck moved – it didn’t disappear.

What Writing program.md Really Requires

  1. Deep understanding of the problem domain – you must define what success looks like before the experiments begin.
  2. Choosing the right metric and knowing why it matters.
  3. Building walls that are:
    • Tight enough to prevent drift,
    • Loose enough to allow genuine discovery.

That’s harder than writing code.

  • Code either compiles or it doesn’t.
  • An arena either produces useful experiments or it produces a hundred comparable runs toward the wrong answer – and you might not know which until you’ve slept through several nights of agent activity.

The Real Shift

The shift is not from hard work to easy work.
It is from the hard work of execution to the harder work of specification.

  • The agent handles the experiments.
  • The human handles the judgment about what the experiments should be measuring.

That judgment doesn’t automate. The loop runs, but the arena stays yours.

Series Context

This post is part of a series on what AI actually changes in software development.
Previous pieces:

  • The Gatekeeping Panic
  • The Meter Was Always Running
  • Who Said What to Whom
  • The Token Economy
  • I Shipped Broken Code and Wrote an Article About It