Sixteen Claude AI agents working together created a new C compiler
Source: Ars Technica
Background
Amid a push toward AI agents, with both Anthropic and OpenAI shipping multi‑agent tools this week, Anthropic is showcasing some of its more daring AI coding experiments. As with many AI‑related claims, there are important caveats to consider.
Experiment
On Thursday, Anthropic researcher Nicholas Carlini published a blog post describing how he set 16 instances of the company’s Claude Opus 4.6 model loose on a shared codebase with minimal supervision, tasking them with building a C compiler from scratch.
Over two weeks and nearly 2,000 Claude Code sessions, at a cost of about $20,000 in API fees, the AI agents reportedly produced a 100,000-line Rust-based compiler capable of building a bootable Linux 6.9 kernel on x86, ARM, and RISC-V architectures.
Carlini, a research scientist on Anthropic’s Safeguards team who previously spent seven years at Google Brain and DeepMind, used a new feature launched with Claude Opus 4.6 called agent teams. In practice, the setup worked as follows:
- Each Claude instance ran inside its own Docker container.
- All instances cloned a shared Git repository.
- Tasks were claimed by writing lock files.
- Completed code was pushed back upstream.
- No central orchestration agent directed traffic; each instance independently identified the most obvious problem to tackle next.
- Merge conflicts were resolved autonomously by the AI instances.
Results
The resulting compiler, which Anthropic has released on GitHub, can compile a range of major open-source projects, including PostgreSQL, SQLite, Redis, FFmpeg, and QEMU. It achieved a 99% pass rate on the GCC torture test suite and, in what Carlini called “the developer’s ultimate litmus test,” compiled and ran Doom.
Implications
A C compiler is a near-ideal task for semi-autonomous AI coding, for three reasons:
- The specification is decades old and well‑defined.
- Comprehensive test suites already exist.
- There’s a known‑good reference compiler for verification.
Most real‑world software projects lack these advantages. The hard part of development is often defining the right tests, not merely writing code that passes them.
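The value of a known-good reference compiler can be illustrated with a differential test: build the same program with both compilers, run both binaries, and flag any difference in behavior. The Python sketch below is an assumption-laden illustration, not the harness used in the experiment. It assumes both compilers accept a `compiler source -o output` command line, which the article does not confirm for the new compiler, and the names `Outcome`, `build_and_run`, and `find_mismatches` are hypothetical.

```python
import os
import subprocess
import tempfile
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Outcome:
    """Observable behavior of one compiled program."""
    exit_code: int
    stdout: str


def build_and_run(compiler: str, source_path: str, timeout: float = 10.0) -> Outcome:
    """Compile source_path with the given compiler, run the binary, capture its behavior.

    Assumes a `compiler source -o output` command line (an assumption for
    any compiler other than GCC-compatible ones).
    """
    with tempfile.TemporaryDirectory() as tmp:
        exe = os.path.join(tmp, "prog")
        subprocess.run([compiler, source_path, "-o", exe],
                       check=True, capture_output=True, timeout=timeout)
        result = subprocess.run([exe], capture_output=True, text=True, timeout=timeout)
        return Outcome(result.returncode, result.stdout)


def find_mismatches(sources: Iterable[str],
                    run_reference: Callable[[str], Outcome],
                    run_candidate: Callable[[str], Outcome]) -> list[str]:
    """Return the sources whose behavior differs between the two compilers."""
    return [src for src in sources if run_reference(src) != run_candidate(src)]
```

A caller might pass `lambda p: build_and_run("gcc", p)` as the reference and a similar lambda pointing at the compiler under test. The design choice worth noting is that the reference compiler serves as the test oracle, so no one has to write expected outputs by hand. That is exactly the advantage most real-world projects do not have.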