[Paper] Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Published: April 28, 2026 at 12:55 PM EDT

Source: arXiv - 2604.25850v1

Overview

The paper introduces Agentic Harness Engineering (AHE), a system that automatically improves the “harnesses” that connect large language model (LLM) coding agents to code repositories, tools, and execution environments. By making every edit to a harness observable and verifiable, AHE lets the harness evolve on its own, achieving higher success rates than manually crafted setups.

Key Contributions

  • Observability‑driven framework that instruments the three core steps of harness evolution (editing, inspection, decision) with explicit, reversible representations.
  • Component observability: each editable part of a harness is treated as a distinct file, turning a vague edit space into a concrete, version‑controlled one.
  • Experience observability: raw execution traces (millions of tokens) are distilled into a layered evidence corpus that agents can actually read and learn from.
  • Decision observability: every edit is paired with a self‑declared prediction of its impact, which is later validated against real task outcomes, turning edits into falsifiable contracts.
  • Empirical gains: after ten autonomous AHE iterations, pass@1 on the Terminal‑Bench 2 benchmark jumps from 69.7 % to 77.0 %, beating the human‑engineered Codex‑CLI (71.9 %) and prior self‑evolving baselines (ACE, TF‑GRPO).
  • Cross‑model transferability: the final harness, frozen after evolution, improves performance on unrelated benchmarks (SWE‑bench‑verified) and yields 5–10 pp gains when used with three different model families, indicating the learned engineering knowledge is general rather than benchmark‑specific.
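The decision-observability idea, pairing each edit with a self-declared prediction that is later checked against real outcomes, can be sketched as a small data structure. Everything below (`EditContract`, the tolerance threshold, the validation rule) is an illustrative assumption, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class EditContract:
    """One harness edit paired with a falsifiable prediction (illustrative)."""
    diff: str                  # file-level diff applied to a harness component
    metric: str                # e.g. "pass@1" or "compile_error_rate"
    predicted_delta: float     # self-declared expected change in the metric
    tolerance: float = 0.02    # slack allowed before the claim is rejected

    def validate(self, observed_delta: float) -> bool:
        # The claim holds if the observed change moves in the predicted
        # direction and lands within tolerance of the prediction.
        same_direction = (observed_delta * self.predicted_delta) >= 0
        close_enough = abs(observed_delta - self.predicted_delta) <= self.tolerance
        return same_direction and close_enough

contract = EditContract(diff="+retry on tool timeout", metric="pass@1",
                        predicted_delta=0.03)
print(contract.validate(0.025))  # same direction, within tolerance -> True
```

An edit whose observed effect contradicts its prediction would fail validation, giving the loop a concrete signal to roll the diff back.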

Methodology

  1. Instrumentation of the engineering loop – The authors wrap the typical iterative process (edit → run → evaluate) with three “observability pillars”.
    • Component observability: Harness components (e.g., CLI wrappers, tool adapters) are stored as separate files. Edits become explicit diffs that can be rolled back, making the action space tractable.
    • Experience observability: Execution trajectories (tool calls, file edits, error logs) are compressed into a hierarchical evidence corpus (high‑level summaries → detailed logs). This reduces the raw token count from millions to a size an LLM can ingest.
    • Decision observability: Before committing an edit, the agent writes a short hypothesis (“this change should reduce compile errors by 20 %”). After the next iteration, the hypothesis is automatically checked against the observed outcomes.
  2. Agentic evolution – A language model (the “agent”) proposes edits, consumes the distilled evidence, and updates its own hypothesis‑making module. The loop runs autonomously for multiple generations.
  3. Evaluation – After each generation, the harness is tested on Terminal‑Bench 2 (a suite of command‑line coding tasks) and SWE‑bench‑verified (software‑engineering tasks with verification). Metrics include pass@1, token usage, and cross‑family performance.
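The three pillars come together in the edit → run → evaluate loop. The toy, runnable sketch below makes heavy assumptions: `ToyHarness`, `run_benchmark`, and the proposal list are stand-ins for the paper's agent, benchmark runs, and LLM-proposed edits, and the "benchmark" is a trivial string check rather than real coding tasks:

```python
# Toy sketch of the AHE loop. All names here are illustrative stand-ins.

class ToyHarness:
    def __init__(self, components):
        self.components = dict(components)   # component name -> file contents

    def apply(self, name, text):
        edited = ToyHarness(self.components) # edits never mutate in place,
        edited.components[name] = text       # so a rejected diff is simply dropped
        return edited

def run_benchmark(harness):
    # Toy metric: fraction of components that handle timeouts.
    hits = sum("timeout" in body for body in harness.components.values())
    return hits / len(harness.components)

def ahe_loop(harness, proposals):
    score, evidence = run_benchmark(harness), []
    for name, text, predicted in proposals:          # edit + declared prediction
        candidate = harness.apply(name, text)
        observed = run_benchmark(candidate) - score  # decision observability:
        evidence.append((name, predicted, observed,  # every claim is checked
                         abs(observed - predicted) <= 0.1))
        if observed >= 0:                            # commit only non-regressions
            harness, score = candidate, score + observed
    return harness, score, evidence

seed = ToyHarness({"cli.sh": "run task", "adapter.py": "call tool"})
proposals = [("cli.sh", "run task; retry on timeout", 0.5)]
_, final_score, log = ahe_loop(seed, proposals)
print(final_score)  # 0.5: one of two components now handles timeouts
```

Because each component is a separate file and edits return a fresh harness, rolling back a bad edit amounts to simply not committing the candidate, which mirrors the paper's version-controlled edit space.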

Results & Findings

| Metric | Seed Harness | After 10 AHE Iterations | Human‑crafted Codex‑CLI | Prior Auto‑evolving Baselines |
|---|---|---|---|---|
| pass@1 (Terminal‑Bench 2) | 69.7 % | 77.0 % | 71.9 % | ACE / TF‑GRPO < 73 % |
| Token efficiency (SWE‑bench‑verified) | Baseline | −12 % tokens vs. seed | – | – |
| Cross‑family gain | – | +5.1 pp to +10.1 pp across three model families | – | – |
  • The harness improvements are stable: freezing the harness after evolution still yields gains on new benchmarks without further tuning.
  • The evidence corpus dramatically reduces the amount of raw data the agent needs to process, enabling it to reason about past failures and successes.
  • Decision observability prevents the loop from devolving into blind trial‑and‑error; each edit is held accountable to a measurable claim.
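The evidence-corpus idea can be illustrated with a trivial distiller: raw execution events are filtered down to a short top layer that still indexes back into the full log. Here the "summarization" is simple filtering and truncation, whereas the paper distills with an LLM; every name below is hypothetical:

```python
# Illustrative distillation of a raw trace into a layered evidence corpus.

def distill(events, budget=3):
    """Collapse a raw event trace into a short top layer of failure
    summaries that index back into the detailed log (names are assumptions)."""
    failures = [e for e in events if not e["ok"]]
    top = [f"{e['tool']}: {e['error'][:40]}" for e in failures[:budget]]
    return {"summary": top,                                   # read first
            "detail_index": [events.index(e) for e in failures[:budget]],
            "raw": events}                                    # kept for drill-down

trace = [
    {"tool": "compile", "ok": False, "error": "missing header foo.h"},
    {"tool": "test",    "ok": True,  "error": ""},
    {"tool": "run",     "ok": False, "error": "timeout after 30s"},
]
corpus = distill(trace)
print(corpus["summary"])
# ['compile: missing header foo.h', 'run: timeout after 30s']
```

The layering is the point: the agent reasons over the short summaries and only drills into `raw` when a summary warrants it, which is how millions of trace tokens become tractable.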

Practical Implications

  • Continuous improvement pipelines: Development teams can embed AHE into CI/CD to keep their LLM‑based code assistants up‑to‑date without manual harness tweaking.
  • Reduced engineering overhead: By treating harness components as version‑controlled files, developers can audit, revert, or share harness changes just like regular code.
  • Tool‑agnostic adapters: The evolved harness encodes generic engineering heuristics (e.g., better error‑recovery, smarter tool selection) that work across different LLM families, lowering the cost of switching models.
  • Cost savings: The token‑efficiency gains translate directly into lower API usage bills for services like OpenAI or Anthropic when running large‑scale code‑generation pipelines.
  • Better reliability for production‑grade agents: With observable contracts, teams can certify that a harness edit will not degrade performance, a crucial requirement for safety‑critical or regulated software environments.

Limitations & Future Work

  • Scalability of evidence synthesis: While the hierarchical corpus reduces token load, generating high‑quality summaries still requires a powerful LLM, which may be a bottleneck for very large codebases.
  • Benchmark focus: The experiments center on Terminal‑Bench 2 and SWE‑bench‑verified; broader real‑world workloads (e.g., multi‑repo microservice ecosystems) remain untested.
  • Human interpretability: Although edits are file‑level diffs, the rationale stored in the decision observability logs can be verbose and may need tooling to surface the most actionable insights.
  • Model dependency: The current agent is a single LLM; future work could explore ensembles or meta‑learning to further generalize the harness evolution across heterogeneous model architectures.

By addressing these points, the community can move toward fully autonomous, observability‑driven harness engineering that scales to the diverse, ever‑changing landscape of AI‑augmented software development.

Authors

  • Jiahang Lin
  • Shichun Liu
  • Chengjun Pan
  • Lizhi Lin
  • Shihan Dou
  • Xuanjing Huang
  • Hang Yan
  • Zhenhua Han
  • Tao Gui

Paper Information

  • arXiv ID: 2604.25850v1
  • Categories: cs.CL, cs.SE
  • Published: April 28, 2026
