The secret isn't the model. It's the harness.

Published: March 7, 2026 at 01:04 PM EST
3 min read
Source: Dev.to

Introduction

Getting AI agents to write code is no longer new. The real challenge isn't how smart the model is; it's that agents lack robust environments for long-running work.

Harness Engineering

Harness Engineering is the discipline focused on building those environments.

  • Anthropic published a blog post in November 2025 describing “effective harnesses for long‑running agents.” [link]
  • OpenAI released a similar post in February 2026. [link]

OpenAI reported that a team of seven people generated about 1 million lines of code across 1,500 pull requests in five months—without writing a single line by hand (self‑reported).

On X, the post “the 10x skill of 2026 is Evaluation Engineering” went viral, highlighting the shift from “writing code” to “building environments where agents write good code.”

Agent Harness

The Agent Harness handles execution:

  • Automates environment setup.
  • Passes progress between sessions using progress files and Git.
  • Builds one feature at a time.
  • Runs end‑to‑end (E2E) tests automatically.
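A minimal sketch of how a harness might persist progress between sessions. The file format and function names here are assumptions for illustration, not Anthropic's or OpenAI's actual implementation:

```python
import json
import pathlib

def next_feature(progress_path, features):
    """Return the first feature not yet recorded as done in the progress file."""
    p = pathlib.Path(progress_path)
    done = set(json.loads(p.read_text())["done"]) if p.exists() else set()
    for feature in features:
        if feature not in done:
            return feature
    return None  # all features complete

def mark_done(progress_path, feature):
    """Append a completed feature to the progress file.
    A real harness would also commit this file to Git for durability."""
    p = pathlib.Path(progress_path)
    state = json.loads(p.read_text()) if p.exists() else {"done": []}
    state["done"].append(feature)
    p.write_text(json.dumps(state))
```

Because state lives in a file (and, in practice, in Git history), a fresh agent session can pick up exactly where the last one stopped.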

Evaluation Harness

The Evaluation Harness provides quantitative scoring of AI output:

  • EleutherAI maintains 60+ benchmarks.
  • Inspect AI offers 100+ pre‑built evaluations.
  • LLM‑as‑a‑judge lets AI grade AI.
  • These evaluations integrate with CI/CD gates and safety testing (e.g., MLCommons AILuminate’s 59,624 test prompts).
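A toy sketch of the two pieces an evaluation harness combines: a judge prompt (the actual model call is omitted) and a CI gate over the resulting scores. The function names and threshold are illustrative assumptions:

```python
def judge_prompt(task, answer):
    """Build a hypothetical LLM-as-a-judge prompt; the model call itself is omitted."""
    return (
        "Score the following answer to the task on a 0-1 scale.\n"
        f"Task: {task}\n"
        f"Answer: {answer}\n"
        "Score:"
    )

def ci_gate(scores, threshold=0.8):
    """CI/CD gate: pass only if the mean evaluation score clears the threshold."""
    return sum(scores) / len(scores) >= threshold
```

In a pipeline, `ci_gate` returning `False` would fail the build, blocking a PR whose evaluated output regressed.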

Anthropic’s Two‑Step System

  1. Setup Agent creates init.sh and a feature list in JSON.
  2. Coding Agent iterates over each feature:
    • Write code.
    • Write tests.
    • Commit changes.
    • Repeat.

Progress is persisted via claude-progress.txt and Git history.
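The coding-agent loop described above can be sketched as iteration over the setup agent's JSON feature list. The callbacks stand in for real agent actions and `git commit`; all names here are hypothetical, not Anthropic's actual code:

```python
import json

def coding_agent_loop(features_json, write_code, write_tests, commit):
    """Iterate the setup agent's feature list: write code, write tests,
    commit, repeat. Each callback stands in for a real agent action."""
    completed = []
    for feature in json.loads(features_json):
        write_code(feature)
        write_tests(feature)
        commit(f"feat: {feature['name']}")
        completed.append(feature["name"])
    return completed
```

Keeping each feature in its own commit makes every step traceable and easy to roll back if a later one fails.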

The repository includes an AGENTS.md file (≈100 lines) that defines the rules for the entire codebase. Custom linters and CI enforce these rules automatically, removing the need to embed constraints in prompts.
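For a sense of what "enforce rules with tooling" means in practice, here is a hypothetical custom lint rule, the kind of repo-wide constraint an AGENTS.md file might state ("no bare print calls in library code") and CI would check mechanically instead of repeating it in every prompt:

```python
import ast

def find_print_calls(source):
    """Hypothetical lint rule: return line numbers of bare print() calls.
    CI would fail the build if this list is non-empty."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            violations.append(node.lineno)
    return violations
```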

OpenAI’s Approach

OpenAI’s environment is highly customized for a single repository. It emphasizes:

  • Tight integration with the repo’s CI pipeline.
  • Automated generation of PRs and test suites.

Because it is tailored to one project, it cannot be directly copied to other codebases without substantial adaptation.

Comparison and Limitations

  • Target domain: Anthropic targets full-stack web development; OpenAI targets a single, highly customized repo.
  • Portability: Anthropic's harness is more generic but still web-focused; OpenAI's is low, not directly reusable.
  • Scalability: Anthropic breaks work into small steps and enforces repo-wide rules; OpenAI relies on bespoke tooling per project.
  • Untested areas: Anthropic's approach is unproven for scientific research and financial modeling; OpenAI's is unproven beyond its original repo.

Both companies converge on the same core principles:

  1. Store knowledge in the repository.
  2. Enforce rules with tooling (linters, CI).
  3. Decompose work into small, traceable steps.

Conclusion

Models will continue to become smarter, but even the most advanced model cannot sustain long‑running development without a well‑designed environment. The decisive factor isn’t the choice of model; it’s how you build the harness that supports it.

I cover AI agent designs, skills, and context engineering from the perspective of integrating AI into real teams and workflows. Analysis grounded in primary sources.
