# The secret isn't the model. It's the harness.
Source: Dev.to
## Introduction
Getting AI agents to write code is no longer new. The real challenge isn’t how smart the model is; it’s that agents lack robust environments for long‑running work.
## Harness Engineering
Harness Engineering is the discipline focused on building those environments.
- Anthropic published a blog post in November 2025 describing “effective harnesses for long‑running agents.” [link]
- OpenAI released a similar post in February 2026. [link]
OpenAI reported that a team of seven people generated about 1 million lines of code across 1,500 pull requests in five months—without writing a single line by hand (self‑reported).
On X, the post “the 10x skill of 2026 is Evaluation Engineering” went viral, highlighting the shift from “writing code” to “building environments where agents write good code.”
## Agent Harness
The Agent Harness handles execution:
- Automates environment setup.
- Passes progress between sessions using progress files and Git.
- Builds one feature at a time.
- Runs end‑to‑end (E2E) tests automatically.
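The loop above can be sketched in a few lines. This is a minimal illustration, not Anthropic's or OpenAI's actual harness: `run_step` stands in for an agent session (write code, write tests, run E2E checks), and the progress file name is arbitrary.

```python
import json
from pathlib import Path
from typing import Callable

def run_harness(features: list[str], progress_path: Path,
                run_step: Callable[[str], bool]) -> list[str]:
    """Minimal harness loop: build one feature at a time, persist progress.

    run_step stands in for an agent session and returns True when the
    feature's end-to-end tests pass.
    """
    done: list[str] = (json.loads(progress_path.read_text())
                       if progress_path.exists() else [])
    for feature in features:
        if feature in done:
            continue                    # resume: skip work finished in earlier sessions
        if not run_step(feature):
            break                       # stop here; the next session retries this feature
        done.append(feature)
        progress_path.write_text(json.dumps(done))  # checkpoint after each feature
    return done
```

Because progress lives on disk rather than in the model's context, a fresh session can pick up exactly where the last one stopped.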
## Evaluation Harness
The Evaluation Harness provides quantitative scoring of AI output:
- EleutherAI maintains 60+ benchmarks.
- Inspect AI offers 100+ pre‑built evaluations.
- LLM‑as‑a‑judge lets AI grade AI.
- These evaluations integrate with CI/CD gates and safety testing (e.g., MLCommons AILuminate’s 59,624 test prompts).
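A CI/CD gate of the kind described above reduces to a simple check: each eval gets a score, each score has a floor, and the pipeline fails if any score is below its floor. The sketch below is generic, not the API of Inspect AI or any specific framework.

```python
def ci_gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the evals whose score fell below its gate threshold.

    An eval with no score counts as failing. A CI step would fail the
    build whenever this list is non-empty.
    """
    return sorted(name for name, floor in thresholds.items()
                  if scores.get(name, 0.0) < floor)
```

For example, `ci_gate({"safety": 0.92, "coding": 0.71}, {"safety": 0.95, "coding": 0.60})` flags only `safety`.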
## Anthropic’s Two‑Step System
- Setup Agent creates `init.sh` and a feature list in JSON.
- Coding Agent iterates over each feature:
- Write code.
- Write tests.
- Commit changes.
- Repeat.
Progress is persisted via `claude-progress.txt` and Git history.
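The Setup Agent's two artifacts can be sketched as follows. The file name `init.sh` comes from the post; the init commands, the feature-list file name, and the feature schema here are illustrative guesses.

```python
import json
import stat
from pathlib import Path

def write_setup(root: Path, features: list[dict]) -> None:
    """Sketch of the Setup Agent's output: an init script plus a JSON feature list.

    The init commands and the feature schema are hypothetical; only the
    init.sh name follows the source post.
    """
    init = root / "init.sh"
    init.write_text("#!/bin/sh\nnpm install\nnpm run build\n")  # hypothetical setup steps
    init.chmod(init.stat().st_mode | stat.S_IEXEC)              # make it executable
    (root / "features.json").write_text(json.dumps(features, indent=2))
```

The Coding Agent then works through `features.json` one entry at a time, committing after each.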
The repository includes an `AGENTS.md` file (≈100 lines) that defines the rules for the entire codebase. Custom linters and CI enforce these rules automatically, removing the need to embed constraints in prompts.
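A custom linter of this kind can be very small. The rule below (no bare `print()` calls in library code) is a made-up example of the sort of constraint an `AGENTS.md` might state; the point is that CI enforces it mechanically instead of relying on the prompt.

```python
import re
from pathlib import Path

# Hypothetical AGENTS.md-style rule: no bare print() calls in library code.
BANNED = re.compile(r"\bprint\(")

def lint(paths: list[Path]) -> list[str]:
    """Return 'file:line' violations; a CI step fails when the list is non-empty."""
    hits = []
    for path in paths:
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if BANNED.search(line):
                hits.append(f"{path.name}:{lineno}")
    return hits
```

Run over the repo in CI, this turns a written rule into a hard gate that agent-generated code cannot silently violate.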
## OpenAI’s Approach
OpenAI’s environment is highly customized for a single repository. It emphasizes:
- Tight integration with the repo’s CI pipeline.
- Automated generation of PRs and test suites.
Because it is tailored to one project, it cannot be directly copied to other codebases without substantial adaptation.
## Comparison and Limitations
| Aspect | Anthropic | OpenAI |
|---|---|---|
| Target domain | Full‑stack web development | Single, highly customized repo |
| Portability | More generic, but still web‑focused | Low – not directly reusable |
| Scalability | Breaks work into small steps, enforces repo‑wide rules | Relies on bespoke tooling per project |
| Untested areas | Scientific research, financial modeling | Other domains beyond the original repo |
Both companies converge on the same core principles:
- Store knowledge in the repository.
- Enforce rules with tooling (linters, CI).
- Decompose work into small, traceable steps.
## Conclusion
Models will continue to become smarter, but even the most advanced model cannot sustain long‑running development without a well‑designed environment. The decisive factor isn’t the choice of model; it’s how you build the harness that supports it.
I cover AI agent designs, skills, and context engineering from the perspective of integrating AI into real teams and workflows. Analysis grounded in primary sources.