[Paper] BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

Published: May 7, 2026 at 08:35 AM EDT
Source: arXiv - 2605.06136v1

Overview

The BUILD‑AND‑FIND paper proposes an evaluation protocol for code‑generation agents that goes beyond “does the program run correctly?” and asks whether a downstream agent can understand and reuse a repository created by an upstream agent. The work treats a generated codebase as a communication artifact and measures how much effort a second agent needs to recover the original design intent.

Key Contributions

  • Effort‑aware evaluation protocol – introduces builder and finder roles that separate code creation from intent recovery, allowing measurement of both correctness and inspectability.
  • Multi‑dimensional metrics – defines recovery accuracy, repeatability, implementation coverage, and inspection effort (e.g., number of lines inspected, time spent).
  • Control conditions – includes “question‑only” and “spec‑only” baselines to isolate the value of the generated artifact itself.
  • Auditing mechanism – checks whether a finder’s correct answers are actually backed by evidence in the repository, distinguishing lucky guesses from genuine understanding.
  • Open benchmark pack – releases a high‑prior task suite where behavioral correctness is already saturated, making inspection effort the primary differentiator among agents.
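
The control conditions can be read as different views of the same task handed to the finder. The sketch below is one way to express that idea; the names `Condition`, `Task`, and `finder_view` are hypothetical and do not come from the paper's released code.

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    REPOSITORY = "repository"        # finder sees the generated repo
    QUESTION_ONLY = "question-only"  # finder sees only the questionnaire
    SPEC_ONLY = "spec-only"          # finder sees the hidden spec, no code


@dataclass(frozen=True)
class Task:
    spec: str                    # hidden specification given to the builder
    questions: tuple[str, ...]   # spec-traced multiple-choice questions
    repo_files: dict[str, str]   # path -> contents, produced by the builder


def finder_view(task: Task, condition: Condition) -> dict:
    """Return what the finder is allowed to see under a given condition."""
    view = {"questions": task.questions}  # every condition gets the questions
    if condition is Condition.REPOSITORY:
        view["repo_files"] = task.repo_files
    elif condition is Condition.SPEC_ONLY:
        view["spec"] = task.spec
    return view


# Illustrative task; the contents are made up for this example.
task = Task(
    spec="Use binary search for lookups.",
    questions=("Which algorithm was intended for lookups?",),
    repo_files={"search.py": "def binary_search(items, target): ..."},
)
print(sorted(finder_view(task, Condition.QUESTION_ONLY)))
```

Comparing finder scores across these three views is what lets the protocol attribute accuracy to the repository itself rather than to the questions or the spec.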

Methodology

  1. Task Definition – Each benchmark task comes with a hidden specification (the intended behavior and design choices).
  2. Builder Phase – An AI agent (the builder) receives the hidden spec and must produce a full repository (source files, README, tests, etc.).
  3. Finder Phase – A second agent (the finder) sees only the generated repository and a spec‑traced multiple‑choice questionnaire that asks about the original design decisions (e.g., “Which algorithm was intended for X?”).
  4. Metrics Collection
    • Recovery Accuracy: percentage of correctly answered questions.
    • Repeatability: consistency of answers across multiple runs of the same finder.
    • Implementation Coverage: proportion of the spec that is actually represented in the code.
    • Inspection Effort: measured by the amount of code the finder had to read (lines, tokens, or wall‑clock time) before answering each question.
  5. Controls & Audits
    • Question‑only: finder gets only the questionnaire (no code).
    • Spec‑only: finder gets the hidden spec (no code).
    • Audit: verifies that a correct answer can be traced back to a concrete artifact (e.g., a comment, function name, test case).
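
The first three metrics above could be computed roughly as follows. This is a minimal sketch under stated assumptions: the summary does not give the paper's exact formulas, so repeatability is modeled here as agreement with the most common answer per question, and all function names are hypothetical.

```python
from collections import Counter


def recovery_accuracy(answers: list[str], key: list[str]) -> float:
    """Fraction of questions answered correctly in one finder run."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def repeatability(runs: list[list[str]]) -> float:
    """Mean share of runs that agree with the most common answer per question.

    Assumption: this is only one possible reading of "consistency across runs".
    """
    per_question = zip(*runs)  # regroup answers by question
    shares = [Counter(q).most_common(1)[0][1] / len(runs) for q in per_question]
    return sum(shares) / len(shares)


def implementation_coverage(spec_items: set[str], implemented: set[str]) -> float:
    """Proportion of spec items that are represented in the code."""
    return len(spec_items & implemented) / len(spec_items)


# Made-up data: four questions, three runs of the same finder.
key = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "A"],
    ["A", "C", "B", "D"],
    ["A", "B", "B", "D"],
]
print([recovery_accuracy(r, key) for r in runs])  # [0.75, 1.0, 0.75]
print(round(repeatability(runs), 2))              # 0.83
print(implementation_coverage({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s3"}))  # 0.75
```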

The protocol treats effort as meaningful only when accuracy and repeatability pass predefined gates, ensuring that low effort isn’t simply due to random guessing.
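
The gating rule can be sketched as a filter applied before any effort comparison. The thresholds of 0.9 below are placeholders chosen for illustration, not values reported in the paper, and `gated_effort` is a hypothetical helper.

```python
from typing import Optional


def gated_effort(accuracy: float, repeatability: float, effort: float,
                 min_accuracy: float = 0.9,
                 min_repeatability: float = 0.9) -> Optional[float]:
    """Return effort only if both gates pass; otherwise None (not comparable)."""
    if accuracy >= min_accuracy and repeatability >= min_repeatability:
        return effort
    return None


# Made-up finders: (accuracy, repeatability, lines inspected).
finders = {
    "finder_a": (0.96, 0.95, 120.0),
    "finder_b": (0.97, 0.93, 800.0),
    "finder_c": (0.40, 0.55, 10.0),  # cheapest, but fails both gates
}

comparable = {}
for name, (acc, rep, eff) in finders.items():
    gated = gated_effort(acc, rep, eff)
    if gated is not None:
        comparable[name] = gated

best = min(comparable, key=comparable.get)
print(best)  # finder_a
```

In this toy data, `finder_c` reads the fewest lines but is excluded from the ranking, which is the behavior the gates are meant to enforce.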

Results & Findings

  • On the released high‑prior task pack, recovery accuracy is already near the ceiling (≈ 95 % or higher), suggesting that the builder agents are able to embed the intended design in the repository.
  • Inspection effort varies widely across finder agents: some models locate the needed information after scanning only a few dozen lines, while others must read the entire repository.
  • The question‑only baseline scores dramatically lower, suggesting that the generated artifact carries information the finder cannot guess from the questions alone.
  • Audits show that ≈ 88 % of correct answers are backed by explicit evidence in the code (e.g., naming conventions, comments, test assertions), indicating that most correct answers reflect evidence in the repository rather than lucky guesses.
  • Finder‑specific effects emerge: fine‑tuned retrieval models achieve up to 30 % lower inspection effort than generic LLMs, even when both achieve the same accuracy.
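
The comparisons in these findings reduce to simple arithmetic. The sketch below shows how such figures could be derived; the function names are hypothetical and the input numbers are invented for illustration, not taken from the paper.

```python
def accuracy_lift(with_repo: float, question_only: float) -> float:
    """Accuracy gained from seeing the repository, in absolute points."""
    return with_repo - question_only


def evidence_backed_rate(audits: list[tuple[bool, bool]]) -> float:
    """Share of correct answers with traceable evidence.

    Each entry is (is_correct, has_evidence); incorrect answers are ignored.
    """
    correct = [has_evidence for is_correct, has_evidence in audits if is_correct]
    if not correct:
        raise ValueError("no correct answers to audit")
    return sum(correct) / len(correct)


def relative_effort_reduction(baseline: float, candidate: float) -> float:
    """Fractional effort saved by the candidate relative to the baseline."""
    return (baseline - candidate) / baseline


# Invented inputs for illustration only.
audits = [(True, True), (True, True), (True, False), (False, False), (True, True)]
print(round(accuracy_lift(0.95, 0.60), 2))       # 0.35
print(evidence_backed_rate(audits))              # 0.75
print(relative_effort_reduction(1000.0, 700.0))  # 0.3
```

Note that the evidence‑backed rate is computed over correct answers only, so a finder that guesses correctly without support lowers this rate even though its accuracy is unaffected.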

Practical Implications

  • Tooling for AI‑augmented development – Companies building code‑generation assistants can use BUILD‑AND‑FIND to benchmark not only whether the code works, but whether it is maintainable by other agents (or human developers).
  • Continuous integration pipelines – Automated reviewers could be evaluated on how quickly they can spot regressions or security concerns in AI‑generated repos, leading to more efficient CI checks.
  • Agent collaboration frameworks – The builder/finder split mirrors real‑world workflows where one AI drafts a library and another audits or extends it; the protocol gives a quantitative way to compare collaboration strategies.
  • Documentation generation – Lower inspection effort correlates with clearer in‑code documentation and structure, suggesting that the protocol can serve as a proxy metric for “self‑explanatory” code.
  • Model selection – Teams can pick the finder model that offers the best trade‑off between speed (effort) and reliability, optimizing for rapid codebase onboarding.

Limitations & Future Work

  • Task diversity – The current benchmark focuses on high‑prior, well‑specified tasks; more complex, ambiguous domains (e.g., legacy code refactoring) remain untested.
  • Human baseline – The study does not compare finder agents against skilled human developers, leaving open the question of how AI effort stacks up against human inspection time.
  • Effort measurement granularity – Counting lines or tokens is a coarse proxy for cognitive effort; future work could incorporate eye‑tracking or interaction logs for finer granularity.
  • Scalability – Running the full protocol (builder + multiple finder runs) is computationally expensive; streamlined variants are needed for large‑scale model evaluation.

The authors plan to expand the task suite, integrate human participants, and explore richer effort metrics in upcoming releases.

Authors

  • Jhen-Ke Lin

Paper Information

  • arXiv ID: 2605.06136v1
  • Categories: cs.SE, cs.AI
  • Published: May 7, 2026