[Paper] Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Published: April 28, 2026 at 12:55 PM EDT

Source: arXiv - 2604.25850v1

Overview

The paper introduces Agentic Harness Engineering (AHE), a system that automatically improves the “harnesses” that connect large language model (LLM) coding agents to code repositories, tools, and execution environments. By making every edit to a harness observable and verifiable, AHE lets the harness evolve on its own, achieving higher success rates than manually crafted setups.

Key Contributions

  • Observability‑driven framework that instruments the three core steps of harness evolution (editing, inspection, decision) with explicit, reversible representations.
  • Component observability: each editable part of a harness is treated as a distinct file, turning a vague edit space into a concrete, version‑controlled one.
  • Experience observability: raw execution traces (millions of tokens) are distilled into a layered evidence corpus that agents can actually read and learn from.
  • Decision observability: every edit is paired with a self‑declared prediction of its impact, which is later validated against real task outcomes, turning edits into falsifiable contracts.
  • Empirical gains: after ten autonomous AHE iterations, pass@1 on the Terminal‑Bench 2 benchmark jumps from 69.7 % to 77.0 %, beating the human‑engineered Codex‑CLI (71.9 %) and prior self‑evolving baselines (ACE, TF‑GRPO).
  • Cross‑model transferability: the final harness, frozen after evolution, improves performance on unrelated benchmarks (SWE‑bench‑verified) and yields 5–10 pp gains when used with three different model families, indicating the learned engineering knowledge is general rather than benchmark‑specific.
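The decision-observability idea, pairing each edit with a self-declared prediction that is later checked against real outcomes, can be sketched as a small data structure. Everything below (`EditContract`, the tolerance threshold, the validation rule) is an illustrative assumption, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class EditContract:
    """One harness edit paired with a falsifiable prediction (illustrative)."""
    diff: str                  # file-level diff applied to a harness component
    metric: str                # e.g. "pass@1" or "compile_error_rate"
    predicted_delta: float     # self-declared expected change in the metric
    tolerance: float = 0.02    # slack allowed before the claim is rejected

    def validate(self, observed_delta: float) -> bool:
        # The claim holds if the observed change moves in the predicted
        # direction and lands within tolerance of the prediction.
        same_direction = (observed_delta * self.predicted_delta) >= 0
        close_enough = abs(observed_delta - self.predicted_delta) <= self.tolerance
        return same_direction and close_enough

contract = EditContract(diff="+retry on tool timeout", metric="pass@1",
                        predicted_delta=0.03)
print(contract.validate(0.025))  # same direction, within tolerance -> True
```

An edit whose observed effect contradicts its prediction would fail validation, giving the loop a concrete signal to roll the diff back.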

Methodology

  1. Instrumentation of the engineering loop – The authors wrap the typical iterative process (edit → run → evaluate) with three “observability pillars”.
    • Component observability: Harness components (e.g., CLI wrappers, tool adapters) are stored as separate files. Edits become explicit diffs that can be rolled back, making the action space tractable.
    • Experience observability: Execution trajectories (tool calls, file edits, error logs) are compressed into a hierarchical evidence corpus (high‑level summaries → detailed logs). This reduces the raw token count from millions to a size an LLM can ingest.
    • Decision observability: Before committing an edit, the agent writes a short hypothesis (“this change should reduce compile errors by 20 %”). After the next iteration, the hypothesis is automatically checked against the observed outcomes.
  2. Agentic evolution – A language model (the “agent”) proposes edits, consumes the distilled evidence, and updates its own hypothesis‑making module. The loop runs autonomously for multiple generations.
  3. Evaluation – After each generation, the harness is tested on Terminal‑Bench 2 (a suite of command‑line coding tasks) and SWE‑bench‑verified (software‑engineering tasks with verification). Metrics include pass@1, token usage, and cross‑family performance.
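The three pillars come together in the edit → run → evaluate loop. The toy, runnable sketch below makes heavy assumptions: `ToyHarness`, `run_benchmark`, and the proposal list are stand-ins for the paper's agent, benchmark runs, and LLM-proposed edits, and the "benchmark" is a trivial string check rather than real coding tasks:

```python
# Toy sketch of the AHE loop. All names here are illustrative stand-ins.

class ToyHarness:
    def __init__(self, components):
        self.components = dict(components)   # component name -> file contents

    def apply(self, name, text):
        edited = ToyHarness(self.components) # edits never mutate in place,
        edited.components[name] = text       # so a rejected diff is simply dropped
        return edited

def run_benchmark(harness):
    # Toy metric: fraction of components that handle timeouts.
    hits = sum("timeout" in body for body in harness.components.values())
    return hits / len(harness.components)

def ahe_loop(harness, proposals):
    score, evidence = run_benchmark(harness), []
    for name, text, predicted in proposals:          # edit + declared prediction
        candidate = harness.apply(name, text)
        observed = run_benchmark(candidate) - score  # decision observability:
        evidence.append((name, predicted, observed,  # every claim is checked
                         abs(observed - predicted) <= 0.1))
        if observed >= 0:                            # commit only non-regressions
            harness, score = candidate, score + observed
    return harness, score, evidence

seed = ToyHarness({"cli.sh": "run task", "adapter.py": "call tool"})
proposals = [("cli.sh", "run task; retry on timeout", 0.5)]
_, final_score, log = ahe_loop(seed, proposals)
print(final_score)  # 0.5: one of two components now handles timeouts
```

Because each component is a separate file and edits return a fresh harness, rolling back a bad edit amounts to simply not committing the candidate, which mirrors the paper's version-controlled edit space.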

Results & Findings

| Metric | Seed Harness | After 10 AHE Iterations | Human‑crafted Codex‑CLI | Prior Auto‑evolving Baselines |
|---|---|---|---|---|
| pass@1 (Terminal‑Bench 2) | 69.7 % | 77.0 % | 71.9 % | ACE / TF‑GRPO < 73 % |
| Token efficiency (SWE‑bench‑verified) | Baseline | −12 % tokens vs. seed | – | – |
| Cross‑family gain | – | +5.1 pp to +10.1 pp across three model families | – | – |
  • The harness improvements are stable: freezing the harness after evolution still yields gains on new benchmarks without further tuning.
  • The evidence corpus dramatically reduces the amount of raw data the agent needs to process, enabling it to reason about past failures and successes.
  • Decision observability prevents the loop from devolving into blind trial‑and‑error; each edit is held accountable to a measurable claim.
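The evidence-corpus idea can be illustrated with a trivial distiller: raw execution events are filtered down to a short top layer that still indexes back into the full log. Here the "summarization" is simple filtering and truncation, whereas the paper distills with an LLM; every name below is hypothetical:

```python
# Illustrative distillation of a raw trace into a layered evidence corpus.

def distill(events, budget=3):
    """Collapse a raw event trace into a short top layer of failure
    summaries that index back into the detailed log (names are assumptions)."""
    failures = [e for e in events if not e["ok"]]
    top = [f"{e['tool']}: {e['error'][:40]}" for e in failures[:budget]]
    return {"summary": top,                                   # read first
            "detail_index": [events.index(e) for e in failures[:budget]],
            "raw": events}                                    # kept for drill-down

trace = [
    {"tool": "compile", "ok": False, "error": "missing header foo.h"},
    {"tool": "test",    "ok": True,  "error": ""},
    {"tool": "run",     "ok": False, "error": "timeout after 30s"},
]
corpus = distill(trace)
print(corpus["summary"])
# ['compile: missing header foo.h', 'run: timeout after 30s']
```

The layering is the point: the agent reasons over the short summaries and only drills into `raw` when a summary warrants it, which is how millions of trace tokens become tractable.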

Practical Implications

  • Continuous improvement pipelines: Development teams can embed AHE into CI/CD to keep their LLM‑based code assistants up‑to‑date without manual harness tweaking.
  • Reduced engineering overhead: By treating harness components as version‑controlled files, developers can audit, revert, or share harness changes just like regular code.
  • Tool‑agnostic adapters: The evolved harness encodes generic engineering heuristics (e.g., better error‑recovery, smarter tool selection) that work across different LLM families, lowering the cost of switching models.
  • Cost savings: The token‑efficiency gains translate directly into lower API usage bills for services like OpenAI or Anthropic when running large‑scale code‑generation pipelines.
  • Better reliability for production‑grade agents: With observable contracts, teams can certify that a harness edit will not degrade performance, a crucial requirement for safety‑critical or regulated software environments.

Limitations & Future Work

  • Scalability of evidence synthesis: While the hierarchical corpus reduces token load, generating high‑quality summaries still requires a powerful LLM, which may be a bottleneck for very large codebases.
  • Benchmark focus: The experiments center on Terminal‑Bench 2 and SWE‑bench‑verified; broader real‑world workloads (e.g., multi‑repo microservice ecosystems) remain untested.
  • Human interpretability: Although edits are file‑level diffs, the rationale stored in the decision observability logs can be verbose and may need tooling to surface the most actionable insights.
  • Model dependency: The current agent is a single LLM; future work could explore ensembles or meta‑learning to further generalize the harness evolution across heterogeneous model architectures.

By addressing these points, the community can move toward fully autonomous, observability‑driven harness engineering that scales to the diverse, ever‑changing landscape of AI‑augmented software development.

Authors

  • Jiahang Lin
  • Shichun Liu
  • Chengjun Pan
  • Lizhi Lin
  • Shihan Dou
  • Xuanjing Huang
  • Hang Yan
  • Zhenhua Han
  • Tao Gui

Paper Information

  • arXiv ID: 2604.25850v1
  • Categories: cs.CL, cs.SE
  • Published: April 28, 2026
