Karpathy Just Automated the Researcher: What autoresearch Means for the Future of AI Development
Source: Dev.to
By Alex Chen
The Setup – Three Files Do All the Work
| File | Role |
|---|---|
| prepare.py | Constants, data-prep, tokenizer training. Fixed – the agent never touches it. |
| train.py | Full GPT model, optimizer (Muon + AdamW), and training loop. Only file the agent edits. |
| program.md | Markdown instructions for the agent. Only file the human edits. |
The Loop – Brutally Simple
- Agent reads `program.md` to understand the research org’s goals.
- Agent modifies `train.py` – architecture, hyper-parameters, optimizer, batch size, anything.
- Training runs for exactly 5 minutes (wall-clock).
- Metric: `val_bpb` (validation bits-per-byte) – lower is better.
- If improved → keep the change; if not → discard.
- Repeat overnight.
- At ~12 experiments/hour you get roughly 100 experiments while you sleep.
- In the morning you have a log of what the agent tried, what worked, and what didn’t.
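
The `val_bpb` metric in the loop above can be derived from an ordinary cross-entropy loss. The following is a minimal sketch, assuming the loss has been summed in nats over the validation tokens and that the byte count is that of the raw validation text. `bits_per_byte` is an illustrative helper, not a function taken from the repo.

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes the total to a per-byte figure.
    return total_loss_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than by tokens means the score does not depend on how the text was split into tokens.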
Why the 5‑minute budget matters
- Makes every experiment comparable regardless of what the agent changed (model size, sequence length, attention pattern, optimizer settings, etc.).
- Forces autoresearch to optimize specifically for your hardware – the best model in 5 min on an RTX 3090 differs from the best model in 5 min on an H100.
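
A fixed wall-clock budget can be sketched as a loop that keeps stepping until a deadline passes. This is an illustration only, not the repo's training loop: `train_with_budget` and `step_fn` are hypothetical names, and a step that starts just before the deadline is allowed to finish.

```python
import time

def train_with_budget(step_fn, budget_seconds: float) -> int:
    """Call step_fn(step_index) repeatedly until the wall-clock budget is spent.

    Returns the number of completed steps.
    """
    # time.monotonic() is unaffected by system clock adjustments.
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        step_fn(steps)
        steps += 1
    return steps

if __name__ == "__main__":
    # Toy stand-in for a training step; a real run would use budget_seconds=300.
    completed = train_with_budget(lambda i: sum(range(1000)), budget_seconds=0.1)
    print(f"completed {completed} steps")
```

Because the budget is time rather than step count, faster hardware completes more steps in the same window, which is why the best configuration ends up hardware-specific.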
Insight Most Coverage Misses
| Traditional ML Research Workflow | autoresearch Workflow |
|---|---|
| Human reads papers → forms hypothesis → modifies training code → runs experiment → analyzes results → updates mental model → repeat | Human writes program.md (research‑org instructions) → AI agent runs the inner loop indefinitely |
The human has moved up one level of abstraction.
You’re no longer programming Python; you’re programming the research methodology in Markdown. The AI does the Python.
“You are programming the `program.md` Markdown files that provide context to the AI agents and set up your autonomous research org.” – Karpathy
program.md is your meta‑program. It encodes:
- Your hypotheses about what’s worth trying.
- Evaluation criteria.
- Architectural priors.
The agent is your compiler.
The default program.md in the repo is intentionally bare‑bones – Karpathy leaves it as an open research surface. The obvious next step is to iterate on the research‑org instructions themselves, i.e., meta‑optimizing the “org code” that yields the fastest research progress.
A General Pattern Emerging in Autonomous Systems
observe current state → propose a change → apply the change →
measure outcome against objective → keep if better, discard if worse → repeat

- This is hill-climbing, but at the software-modification level.
- The agent isn’t just searching hyper‑parameter space; it’s searching over the space of programs that train models.
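
The propose, measure, keep-or-discard cycle can be written as a generic greedy loop. The sketch below is a toy under stated assumptions: `hill_climb`, `propose`, and `evaluate` are hypothetical names, the "program" is a single number, and the "metric" is its distance from 3. In autoresearch the candidate would be an edited train.py and the score would be `val_bpb` from a training run.

```python
import random

def hill_climb(initial, propose, evaluate, n_experiments):
    """Greedy keep/discard loop; a lower score is better."""
    best = initial
    best_score = evaluate(best)
    log = []
    for i in range(n_experiments):
        candidate = propose(best)      # propose a change to the current best
        score = evaluate(candidate)    # measure outcome against the objective
        kept = score < best_score      # keep only strict improvements
        log.append({"experiment": i, "score": score, "kept": kept})
        if kept:
            best, best_score = candidate, score
    return best, best_score, log

if __name__ == "__main__":
    rng = random.Random(0)
    best, score, log = hill_climb(
        initial=0.0,
        propose=lambda x: x + rng.uniform(-1.0, 1.0),
        evaluate=lambda x: abs(x - 3.0),
        n_experiments=100,
    )
    kept = sum(entry["kept"] for entry in log)
    print(f"best={best:.3f} score={score:.3f} kept={kept} of {len(log)}")
```

The returned log plays the role of the morning report: every attempt is recorded, whether or not it was kept.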
The same loop appears in recursive self‑improvement (RSI) frameworks:
| Autoresearch | RSI Agent‑Infrastructure |
|---|---|
| Operates on ML experiment code & validation loss | Operates on tool configs, skill files & task success rates |
| Search space = train.py (model definition, optimizer, training loop) | Search space = infrastructure scripts, routing logic, skill modules |
Both are try → measure → keep/discard cycles; only the abstraction level differs. This convergence suggests a general principle: the unit of improvement is the experiment, and the researcher’s job is to design the experiment space.
Concrete Search Space in autoresearch
train.py contains the full GPT model definition, the Muon + AdamW optimizer, and the training loop. Everything is fair game:
- Transformer architecture (depth, width, attention heads)
- Attention patterns (default uses “SSSL” – alternating banded attention)
- Optimizer settings & schedules
- Batch size & sequence length
- Regularization strategies
- Any new architecture component the agent wishes to implement
The agent can make arbitrarily creative changes – it isn’t limited to a grid search over predefined parameters. A sufficiently capable agent could:
- Implement flash‑attention variants
- Propose new normalization schemes
- Change positional encodings
The only constraints are the 5‑minute training budget and the single‑file edit scope.
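
One way to honor the single-file edit scope is to snapshot the file before each experiment and restore it when the result does not improve. The article does not say how the repo handles rollback, so the sketch below assumes a plain file copy; `try_edit`, `apply_edit`, and `evaluate` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path

def try_edit(path: Path, apply_edit, evaluate, best_score: float) -> float:
    """Apply an edit to one file; keep it only if the score improves (lower is better)."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)        # snapshot before the file is touched
    try:
        apply_edit(path)              # hypothetical: the agent rewrites the file in place
        score = evaluate(path)        # hypothetical: run training, return the metric
    except Exception:
        score = float("inf")          # a crashing edit counts as a failed experiment
    if score < best_score:
        backup.unlink()               # keep the edit, drop the snapshot
        return score
    backup.replace(path)              # discard the edit by restoring the snapshot
    return best_score

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "train.py"
        target.write_text("LR = 0.1\n")
        best = try_edit(target, lambda p: p.write_text("LR = 0.2\n"), lambda p: 0.9, 1.0)
        print(best, target.read_text().strip())
```

Treating an exception as an infinitely bad score means a broken edit is rolled back the same way as an unhelpful one.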
Key point: The search space isn’t predefined. It’s partly defined by `program.md` and partly by the agent’s own code-generation capabilities. As frontier models improve, the same framework gets more powerful without any infrastructure changes.
What This Means for ML Researchers
| Parts automated by autoresearch | Parts still human (for now) |
|---|---|
| Generating implementation hypotheses | Defining the objective metric |
| Writing training code | Designing the evaluation setup |
| Running experiments | Writing program.md – encoding research intuitions |
| Tracking which changes improved performance | Interpreting results at a higher level |
| Avoiding previously‑failed approaches | Deciding what problem to work on |
Notice the pattern: humans retain the high‑level strategic role (goal setting, metric design, problem selection), while the AI handles the low‑level execution (code generation, experiment running, bookkeeping).
TL;DR
- Karpathy’s `autoresearch` isn’t just automating experiments; it’s automating the experimenter.
- The human writes `program.md` – a meta-program that tells the AI what to explore.
- The AI iteratively edits `train.py`, runs a 5-minute training job, and keeps improvements.
- This loop mirrors a broader autonomous-system pattern (observe → propose → apply → measure → keep/discard).
- As models get better, the same minimal infrastructure will scale, pushing researchers further up the abstraction ladder.
If you work on ML research, start thinking about how you’ll program the research methodology rather than the code that implements it. The future of AI development may be less about writing Python and more about writing instructions for autonomous agents that do the heavy lifting for you.
The Goal‑Setting and Interpretation Layers
The execution layer is being automated. This isn’t unique to research — it’s happening across knowledge work broadly. But it’s happening to ML research specifically now, which is ironic given that ML is the technology doing the automating.
Practical Implication
- The skill that matters isn’t “can you implement a transformer?” – that’s increasingly table stakes.
- The skill that matters is “can you write a `program.md` that produces good research?”
- This requires understanding the problem space deeply enough to encode your hypotheses as agent instructions.
- It’s closer to research design than to research execution.
One Underrated Aspect of Autoresearch: Time Economics
Previously, running experiments overnight meant a single researcher running one carefully chosen experiment (high setup cost, limited attention).
Autoresearch turns “overnight” into ≈ 100 experiments, each comparing cleanly to all others within the same fixed time budget.
- The cost of a wrong hypothesis drops dramatically.
- You can afford to include wild ideas in `program.md` because the agent will discard them if they don’t work, and you’ll see that in the morning log.
- Successful experiments surface automatically.
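
The "≈ 100 experiments" figure follows from simple arithmetic on the 5-minute budget. A small sketch, where the 8-hour night and the per-experiment overhead are assumptions rather than numbers from the article, and `experiments_per_night` is an illustrative helper:

```python
def experiments_per_night(hours: float, minutes_per_experiment: float,
                          overhead_minutes: float = 0.0) -> int:
    """Number of whole experiments that fit into the given number of hours."""
    return int(hours * 60 // (minutes_per_experiment + overhead_minutes))

if __name__ == "__main__":
    # 5-minute runs with no overhead, over an assumed 8-hour night.
    print(experiments_per_night(8, 5))
    # The same night if each run carried one extra minute of setup (assumed).
    print(experiments_per_night(8, 5, overhead_minutes=1))
```

Any time the agent spends between runs eats directly into the count, so the real number depends on how quickly edits are proposed.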
Shift in the Research Bottleneck
The bottleneck moves from experiment throughput to hypothesis‑generation quality—the very area where frontier models are getting good.
Karpathy’s Framing
“The 10,205th generation of the codebase, a self‑modifying binary grown beyond human comprehension — that’s science fiction, but the trajectory is clearly real.”
What Autoresearch Demonstrates
It isn’t just “AI can write training code.” It shows that the research loop itself (hypothesis → implementation → experiment → evaluation → iteration) can be automated at a level that’s useful right now, on a single GPU, with three files.
The New Meta‑Skill
- Researchers who thrive won’t be the ones who can implement attention most cleanly.
- They’ll be the ones who understand the problem well enough to program the research organization: to write the `program.md` that encodes the right hypotheses, the right search space, and the right success criteria.
Programming the program, not the program itself.
That’s the new meta‑skill.
Author
Alex Chen builds autonomous agent infrastructure. Opinions are operational, not academic.