[Paper] BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

Published: May 7, 2026 at 08:35 AM EDT

Source: arXiv - 2605.06136v1

Overview

The paper BUILD‑AND‑FIND proposes an evaluation protocol for code‑generation agents that goes beyond asking “does the program run correctly?” It asks whether downstream agents can understand and reuse a repository created by an upstream agent. In other words, the work treats a generated codebase as a communication artifact and measures how much effort a second agent needs to recover the original design intent.

Key Contributions

Effort‑aware evaluation protocol – introduces builder and finder roles that separate code creation from intent recovery, allowing measurement of both correctness and inspectability.
Multi‑dimensional metrics – defines recovery accuracy, repeatability, implementation coverage, and inspection effort (e.g., number of lines inspected, time spent).
Control conditions – includes “question‑only” and “spec‑only” baselines to isolate the value of the generated artifact itself.
Auditing mechanism – checks whether a finder’s correct answers are actually backed by evidence in the repository, distinguishing lucky guesses from genuine understanding.
Open benchmark pack – releases a high‑prior task suite where behavioral correctness is already saturated, making inspection effort the primary differentiator among agents.

Methodology

Task Definition – Each benchmark task comes with a hidden specification (the intended behavior and design choices).
Builder Phase – An AI agent (the builder) receives the hidden spec and must produce a full repository (source files, README, tests, etc.).
Finder Phase – A second agent (the finder) only sees the generated repository and a spec‑traced multiple‑choice questionnaire that asks about the original design decisions (e.g., “Which algorithm was intended for X?”).
Metrics Collection
- Recovery Accuracy: percentage of correctly answered questions.
- Repeatability: consistency of answers across multiple runs of the same finder.
- Implementation Coverage: proportion of the spec that is actually represented in the code.
- Inspection Effort: measured by how much the finder had to read, or how long it took, before answering each question (lines, tokens, or wall‑clock time).
Controls & Audits
- Question‑only: finder gets only the questionnaire (no code).
- Spec‑only: finder gets the hidden spec (no code).
- Audit: verifies that a correct answer can be traced back to a concrete artifact (e.g., a comment, function name, test case).
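The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `FinderAnswer` record layout, the function names, and the choice of "agreement with the most common answer" as the repeatability measure are all assumptions made for the example, and the sample data is invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FinderAnswer:
    """One finder answer to one questionnaire item (hypothetical record layout)."""
    question_id: str
    run_id: int
    choice: str           # option the finder selected
    lines_inspected: int  # effort proxy: lines read before answering


def recovery_accuracy(answers, answer_key):
    """Fraction of answers that match the spec-traced answer key."""
    correct = sum(1 for a in answers if answer_key[a.question_id] == a.choice)
    return correct / len(answers)


def repeatability(answers):
    """Mean share of runs agreeing with the most common answer per question.

    This is one possible consistency measure, assumed for illustration.
    """
    by_question = {}
    for a in answers:
        by_question.setdefault(a.question_id, []).append(a.choice)
    shares = [
        Counter(choices).most_common(1)[0][1] / len(choices)
        for choices in by_question.values()
    ]
    return sum(shares) / len(shares)


def mean_inspection_effort(answers):
    """Average number of lines inspected per answer."""
    return sum(a.lines_inspected for a in answers) / len(answers)


# Invented sample data: two questions, two runs each.
answer_key = {"q1": "B", "q2": "A"}
answers = [
    FinderAnswer("q1", 0, "B", 40),
    FinderAnswer("q1", 1, "B", 60),
    FinderAnswer("q2", 0, "A", 100),
    FinderAnswer("q2", 1, "C", 200),
]
print(recovery_accuracy(answers, answer_key))  # 0.75
print(repeatability(answers))                  # 0.75
print(mean_inspection_effort(answers))         # 100.0
```

In this toy run the finder is fully consistent on q1 but splits on q2, so both accuracy and repeatability come out at 0.75.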

The protocol treats effort as meaningful only when accuracy and repeatability pass predefined gates, ensuring that low effort isn’t simply due to random guessing.
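One way to picture this gating rule is a function that reports effort only when both gates pass. The threshold values below are assumptions chosen for the example, since this summary does not state the gates used in the paper, and `gated_effort` is a hypothetical helper name.

```python
from typing import Optional

ACCURACY_GATE = 0.90       # assumed threshold, not taken from the paper
REPEATABILITY_GATE = 0.90  # assumed threshold, not taken from the paper


def gated_effort(accuracy: float, repeatability: float, effort: float) -> Optional[float]:
    """Return effort only when both gates pass; otherwise None (not comparable)."""
    if accuracy >= ACCURACY_GATE and repeatability >= REPEATABILITY_GATE:
        return effort
    return None


print(gated_effort(0.96, 0.95, 120.0))  # 120.0
print(gated_effort(0.50, 0.95, 5.0))    # None
```

The second call shows why the gate matters: a finder that reads five lines and answers at chance level would otherwise look like the most efficient one.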

Results & Findings

On the released high‑prior task pack, recovery accuracy is already near the ceiling (≈ 95 %+), confirming that the builder agents are able to embed the intended design.
Inspection effort varies widely across finder agents: some models locate the needed information after scanning only a few dozen lines, while others must read the entire repository.
The question‑only baseline scores dramatically lower, indicating that the generated repository carries substantial information beyond what the questions alone reveal.
Audits show that ≈ 88 % of correct answers are backed by explicit evidence in the code (e.g., naming conventions, comments, test assertions), indicating that most correct answers reflect genuine inspection of the repository rather than lucky guesses.
Finder‑specific effects emerge: fine‑tuned retrieval models achieve up to 30 % lower inspection effort than generic LLMs, even when both achieve the same accuracy.
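The audit rate and the relative effort reduction quoted above are simple ratios. The sketch below shows how such figures could be computed. The function names and the `(is_correct, has_evidence)` pair layout are assumptions, and the sample numbers are invented for illustration rather than taken from the paper.

```python
def evidence_backed_rate(audited):
    """Share of correct answers whose audit found supporting evidence.

    `audited` is a list of (is_correct, has_evidence) pairs (hypothetical layout).
    """
    correct = [has_evidence for is_correct, has_evidence in audited if is_correct]
    if not correct:
        raise ValueError("no correct answers to audit")
    return sum(correct) / len(correct)


def relative_effort_reduction(baseline_effort, candidate_effort):
    """Fractional effort saved by the candidate relative to the baseline."""
    return (baseline_effort - candidate_effort) / baseline_effort


# Invented sample data: four correct answers, three with evidence.
audited = [(True, True), (True, True), (True, False), (False, False), (True, True)]
print(evidence_backed_rate(audited))            # 0.75
print(relative_effort_reduction(200.0, 140.0))  # 0.3
```

Note that incorrect answers are excluded from the audit denominator, matching the idea that the audit asks only whether correct answers were earned.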

Practical Implications

Tooling for AI‑augmented development – Companies building code‑generation assistants can use BUILD‑AND‑FIND to benchmark not only whether the code works, but whether it is maintainable by other agents (or human developers).
Continuous integration pipelines – Automated reviewers could be evaluated on how quickly they can spot regressions or security concerns in AI‑generated repos, leading to more efficient CI checks.
Agent collaboration frameworks – The builder/finder split mirrors real‑world workflows where one AI drafts a library and another audits or extends it; the protocol gives a quantitative way to compare collaboration strategies.
Documentation generation – Lower inspection effort correlates with clearer in‑code documentation and structure, suggesting that the protocol can serve as a proxy metric for “self‑explanatory” code.
Model selection – Teams can pick the finder model that offers the best trade‑off between speed (effort) and reliability, optimizing for rapid codebase onboarding.
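As a rough illustration of that trade‑off, a team could filter finders by reliability and then choose the one with the lowest effort. Everything in this sketch is hypothetical: the `FinderResult` fields, the default floors of 0.9, and the three example finders are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FinderResult:
    name: str
    accuracy: float
    repeatability: float
    mean_effort: float  # e.g., mean lines inspected per question


def select_finder(results, min_accuracy=0.9, min_repeatability=0.9):
    """Pick the lowest-effort finder among those meeting both reliability floors."""
    eligible = [
        r for r in results
        if r.accuracy >= min_accuracy and r.repeatability >= min_repeatability
    ]
    return min(eligible, key=lambda r: r.mean_effort) if eligible else None


# Invented candidates: finder_c is cheapest but fails the accuracy floor.
results = [
    FinderResult("finder_a", 0.96, 0.97, 180.0),
    FinderResult("finder_b", 0.95, 0.94, 120.0),
    FinderResult("finder_c", 0.70, 0.99, 30.0),
]
best = select_finder(results)
print(best.name)  # finder_b
```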

Limitations & Future Work

Task diversity – The current benchmark focuses on high‑prior, well‑specified tasks; more complex, ambiguous domains (e.g., legacy code refactoring) remain untested.
Human baseline – The study does not compare finder agents against skilled human developers, leaving open the question of how AI effort stacks up against human inspection time.
Effort measurement granularity – Counting lines or tokens is a coarse proxy for cognitive effort; future work could incorporate eye‑tracking or interaction logs for finer granularity.
Scalability – Running the full protocol (builder + multiple finder runs) is computationally expensive; streamlined variants are needed for large‑scale model evaluation.

The authors plan to expand the task suite, integrate human participants, and explore richer effort metrics in upcoming releases.

Authors

Jhen-Ke Lin

Paper Information

arXiv ID: 2605.06136v1
Categories: cs.SE, cs.AI
Published: May 7, 2026
