# The secret isn't the model. It's the harness.
Source: Dev.to
## Introduction
Getting AI agents to write code is no longer new. The real challenge isn’t how smart the model is; it’s that agents lack robust environments for long‑running work.
## Harness Engineering
Harness Engineering is the discipline focused on building those environments.
- Anthropic published a blog post in November 2025 describing “effective harnesses for long‑running agents.” [link]
- OpenAI released a similar post in February 2026. [link]
OpenAI reported that a team of seven people generated about 1 million lines of code across 1,500 pull requests in five months—without writing a single line by hand (self‑reported).
On X, the post “the 10x skill of 2026 is Evaluation Engineering” went viral, highlighting the shift from “writing code” to “building environments where agents write good code.”
## Agent Harness
The Agent Harness handles execution:
- Automates environment setup.
- Passes progress between sessions using progress files and Git.
- Builds one feature at a time.
- Runs end‑to‑end (E2E) tests automatically.
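The loop above can be sketched in a few lines. This is a minimal illustration, not Anthropic's or OpenAI's actual harness: `run_step` stands in for an agent session (write code, write tests, run E2E checks), and the progress file name is arbitrary.

```python
import json
from pathlib import Path
from typing import Callable

def run_harness(features: list[str], progress_path: Path,
                run_step: Callable[[str], bool]) -> list[str]:
    """Minimal harness loop: build one feature at a time, persist progress.

    run_step stands in for an agent session and returns True when the
    feature's end-to-end tests pass.
    """
    done: list[str] = (json.loads(progress_path.read_text())
                       if progress_path.exists() else [])
    for feature in features:
        if feature in done:
            continue                    # resume: skip work finished in earlier sessions
        if not run_step(feature):
            break                       # stop here; the next session retries this feature
        done.append(feature)
        progress_path.write_text(json.dumps(done))  # checkpoint after each feature
    return done
```

Because progress lives on disk rather than in the model's context, a fresh session can pick up exactly where the last one stopped.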
## Evaluation Harness
The Evaluation Harness provides quantitative scoring of AI output:
- EleutherAI maintains 60+ benchmarks.
- Inspect AI offers 100+ pre‑built evaluations.
- LLM‑as‑a‑judge lets AI grade AI.
- These evaluations integrate with CI/CD gates and safety testing (e.g., MLCommons AILuminate’s 59,624 test prompts).
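A CI/CD gate of the kind described above reduces to a simple check: each eval gets a score, each score has a floor, and the pipeline fails if any score is below its floor. The sketch below is generic, not the API of Inspect AI or any specific framework.

```python
def ci_gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the evals whose score fell below its gate threshold.

    An eval with no score counts as failing. A CI step would fail the
    build whenever this list is non-empty.
    """
    return sorted(name for name, floor in thresholds.items()
                  if scores.get(name, 0.0) < floor)
```

For example, `ci_gate({"safety": 0.92, "coding": 0.71}, {"safety": 0.95, "coding": 0.60})` flags only `safety`.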
## Anthropic’s Two‑Step System
- Setup Agent creates `init.sh` and a feature list in JSON.
- Coding Agent iterates over each feature:
- Write code.
- Write tests.
- Commit changes.
- Repeat.
Progress is persisted via `claude-progress.txt` and Git history.
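The Setup Agent's two artifacts can be sketched as follows. The file name `init.sh` comes from the post; the init commands, the feature-list file name, and the feature schema here are illustrative guesses.

```python
import json
import stat
from pathlib import Path

def write_setup(root: Path, features: list[dict]) -> None:
    """Sketch of the Setup Agent's output: an init script plus a JSON feature list.

    The init commands and the feature schema are hypothetical; only the
    init.sh name follows the source post.
    """
    init = root / "init.sh"
    init.write_text("#!/bin/sh\nnpm install\nnpm run build\n")  # hypothetical setup steps
    init.chmod(init.stat().st_mode | stat.S_IEXEC)              # make it executable
    (root / "features.json").write_text(json.dumps(features, indent=2))
```

The Coding Agent then works through `features.json` one entry at a time, committing after each.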
The repository includes an `AGENTS.md` file (≈100 lines) that defines the rules for the entire codebase. Custom linters and CI enforce these rules automatically, removing the need to embed constraints in prompts.
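A custom linter of this kind can be very small. The rule below (no bare `print()` calls in library code) is a made-up example of the sort of constraint an `AGENTS.md` might state; the point is that CI enforces it mechanically instead of relying on the prompt.

```python
import re
from pathlib import Path

# Hypothetical AGENTS.md-style rule: no bare print() calls in library code.
BANNED = re.compile(r"\bprint\(")

def lint(paths: list[Path]) -> list[str]:
    """Return 'file:line' violations; a CI step fails when the list is non-empty."""
    hits = []
    for path in paths:
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if BANNED.search(line):
                hits.append(f"{path.name}:{lineno}")
    return hits
```

Run over the repo in CI, this turns a written rule into a hard gate that agent-generated code cannot silently violate.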
## OpenAI’s Approach
OpenAI’s environment is highly customized for a single repository. It emphasizes:
- Tight integration with the repo’s CI pipeline.
- Automated generation of PRs and test suites.
Because it is tailored to one project, it cannot be directly copied to other codebases without substantial adaptation.
## Comparison and Limitations
| Aspect | Anthropic | OpenAI |
|---|---|---|
| Target domain | Full‑stack web development | Single, highly customized repo |
| Portability | More generic, but still web‑focused | Low – not directly reusable |
| Scalability | Breaks work into small steps, enforces repo‑wide rules | Relies on bespoke tooling per project |
| Untested areas | Scientific research, financial modeling | Other domains beyond the original repo |
Both companies converge on the same core principles:
- Store knowledge in the repository.
- Enforce rules with tooling (linters, CI).
- Decompose work into small, traceable steps.
## Conclusion
Models will continue to become smarter, but even the most advanced model cannot sustain long‑running development without a well‑designed environment. The decisive factor isn’t the choice of model; it’s how you build the harness that supports it.
I cover AI agent designs, skills, and context engineering from the perspective of integrating AI into real teams and workflows. Analysis grounded in primary sources.