[Paper] BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases
Source: arXiv - 2605.06136v1
Overview
The paper BUILD‑AND‑FIND proposes a new evaluation protocol for code‑generation agents that goes beyond “does the program run correctly?”: it asks whether a downstream agent can understand and reuse a repository created by an upstream agent. The work treats a generated codebase as a communication artifact and measures how much effort a second agent needs to recover the original design intent.
Key Contributions
- Effort‑aware evaluation protocol – introduces builder and finder roles that separate code creation from intent recovery, allowing measurement of both correctness and inspectability.
- Multi‑dimensional metrics – defines recovery accuracy, repeatability, implementation coverage, and inspection effort (e.g., number of lines inspected, time spent).
- Control conditions – includes “question‑only” and “spec‑only” baselines to isolate the value of the generated artifact itself.
- Auditing mechanism – checks whether a finder’s correct answers are actually backed by evidence in the repository, distinguishing lucky guesses from genuine understanding.
- Open benchmark pack – releases a high‑prior task suite where behavioral correctness is already saturated, making inspection effort the primary differentiator among agents.
Methodology
- Task Definition – Each benchmark task comes with a hidden specification (the intended behavior and design choices).
- Builder Phase – An AI agent (the builder) receives the hidden spec and must produce a full repository (source files, README, tests, etc.).
- Finder Phase – A second agent (the finder) only sees the generated repository and a spec‑traced multiple‑choice questionnaire that asks about the original design decisions (e.g., “Which algorithm was intended for X?”).
- Metrics Collection
  - Recovery Accuracy: percentage of correctly answered questions.
  - Repeatability: consistency of answers across multiple runs of the same finder.
  - Implementation Coverage: proportion of the spec that is actually represented in the code.
  - Inspection Effort: measured by the amount of code the finder had to read (lines, tokens, or wall‑clock time) before answering each question.
- Controls & Audits
  - Question‑only: finder gets only the questionnaire (no code).
  - Spec‑only: finder gets the hidden spec (no code).
  - Audit: verifies that a correct answer can be traced back to a concrete artifact (e.g., a comment, function name, test case).
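The main condition and the two controls differ only in what the finder is allowed to see. A minimal Python sketch of that separation follows. The names `Condition`, `FinderInput`, and `build_finder_input` are hypothetical and not taken from the paper; the sketch only illustrates the idea of withholding the repository or the spec per condition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Condition(Enum):
    REPOSITORY = "repository"        # main condition: finder sees the built repo
    QUESTION_ONLY = "question_only"  # control: questionnaire alone, no code
    SPEC_ONLY = "spec_only"          # control: hidden spec instead of code


@dataclass(frozen=True)
class FinderInput:
    """Hypothetical container for what a finder receives."""
    questionnaire: List[str]
    repository_path: Optional[str] = None
    spec_text: Optional[str] = None


def build_finder_input(condition, questionnaire, repository_path, spec_text):
    """Return only the material the finder may see under `condition`."""
    if condition is Condition.REPOSITORY:
        return FinderInput(questionnaire, repository_path=repository_path)
    if condition is Condition.SPEC_ONLY:
        return FinderInput(questionnaire, spec_text=spec_text)
    # Question-only: neither the repository nor the spec is exposed.
    return FinderInput(questionnaire)
```

Keeping the questionnaire identical across all three conditions means any accuracy difference can be attributed to the material that was added or withheld.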
The protocol treats effort as meaningful only when accuracy and repeatability pass predefined gates, ensuring that low effort isn’t simply due to random guessing.
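The gating rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the function names, the 0.9 thresholds, the choice of lines read as the effort unit, and the definition of repeatability as agreement with the most common answer per question are all hypothetical.

```python
from collections import Counter
from statistics import mean


def recovery_accuracy(answers, key):
    """Fraction of questions answered correctly in one finder run."""
    return sum(answers.get(q) == key[q] for q in key) / len(key)


def repeatability(runs):
    """Mean share of runs that agree with the most common answer per question.

    Assumed definition; the paper may measure consistency differently.
    """
    shares = []
    for q in runs[0]:
        counts = Counter(run.get(q) for run in runs)
        shares.append(counts.most_common(1)[0][1] / len(runs))
    return mean(shares)


def gated_effort(runs, key, lines_read, min_accuracy=0.9, min_repeatability=0.9):
    """Return mean lines inspected per run, or None if either gate fails."""
    accuracy = mean(recovery_accuracy(run, key) for run in runs)
    if accuracy < min_accuracy or repeatability(runs) < min_repeatability:
        return None  # effort is not reported when accuracy or consistency is too low
    return mean(lines_read)


key = {"q1": "A", "q2": "C"}
stable_runs = [{"q1": "A", "q2": "C"}, {"q1": "A", "q2": "C"}]
guessing_runs = [{"q1": "A", "q2": "B"}, {"q1": "D", "q2": "C"}]

print(gated_effort(stable_runs, key, [40, 60]))   # 50
print(gated_effort(guessing_runs, key, [5, 5]))   # None
```

In the second call the finder read very little but answered inconsistently, so its low effort is discarded rather than rewarded.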
Results & Findings
- On the released high‑prior task pack, recovery accuracy is already near the ceiling (≈ 95 %+), confirming that the builder agents are able to embed the intended design.
- Inspection effort varies widely across finder agents: some models locate the needed information after scanning only a few dozen lines, while others must read the entire repository.
- The question‑only baseline scores dramatically lower, indicating that the generated artifact carries substantial explanatory value beyond what a finder can guess from the questions alone.
- Audits show that ≈ 88 % of correct answers are backed by explicit evidence in the code (e.g., naming conventions, comments, test assertions), indicating that most correct answers reflect genuine inspection of the repository rather than lucky guesses.
- Finder‑specific effects emerge: fine‑tuned retrieval models achieve up to 30 % lower inspection effort than generic LLMs, even when both achieve the same accuracy.
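An evidence‑backed rate like the audit figure above could be computed roughly as follows. This is a hedged sketch: `AuditedAnswer`, `is_backed`, and `evidence_backed_rate` are hypothetical names, the repository is modeled as an in‑memory mapping from path to file contents, and a literal substring match stands in for whatever tracing procedure the paper's audit actually uses.

```python
from dataclasses import dataclass


@dataclass
class AuditedAnswer:
    """Hypothetical record of one finder answer and the evidence it cited."""
    question_id: str
    correct: bool
    evidence_path: str = ""  # file the finder cited, empty if none
    evidence_text: str = ""  # snippet the finder cited, empty if none


def is_backed(answer, repo_files):
    """True if the cited snippet literally occurs in the cited file."""
    if not answer.evidence_text:
        return False
    return answer.evidence_text in repo_files.get(answer.evidence_path, "")


def evidence_backed_rate(answers, repo_files):
    """Share of correct answers that are traceable to a concrete artifact."""
    correct = [a for a in answers if a.correct]
    if not correct:
        return 0.0
    return sum(is_backed(a, repo_files) for a in correct) / len(correct)


repo_files = {"sort.py": "def merge_sort(items):\n    # stable O(n log n) sort\n    ..."}
answers = [
    AuditedAnswer("q1", True, "sort.py", "def merge_sort"),  # correct and backed
    AuditedAnswer("q2", True),                               # correct, no evidence
    AuditedAnswer("q3", False, "sort.py", "def merge_sort"), # incorrect, ignored
]
print(evidence_backed_rate(answers, repo_files))  # 0.5
```

Only correct answers enter the denominator, so the rate separates supported answers from correct ones that may have been guessed.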
Practical Implications
- Tooling for AI‑augmented development – Companies building code‑generation assistants can use BUILD‑AND‑FIND to benchmark not only whether the code works, but whether it is maintainable by other agents (or human developers).
- Continuous integration pipelines – Automated reviewers could be evaluated on how quickly they can spot regressions or security concerns in AI‑generated repos, leading to more efficient CI checks.
- Agent collaboration frameworks – The builder/finder split mirrors real‑world workflows where one AI drafts a library and another audits or extends it; the protocol gives a quantitative way to compare collaboration strategies.
- Documentation generation – Lower inspection effort correlates with clearer in‑code documentation and structure, suggesting that the protocol can serve as a proxy metric for “self‑explanatory” code.
- Model selection – Teams can pick the finder model that offers the best trade‑off between speed (effort) and reliability, optimizing for rapid codebase onboarding.
Limitations & Future Work
- Task diversity – The current benchmark focuses on high‑prior, well‑specified tasks; more complex, ambiguous domains (e.g., legacy code refactoring) remain untested.
- Human baseline – The study does not compare finder agents against skilled human developers, leaving open the question of how AI effort stacks up against human inspection time.
- Effort measurement granularity – Counting lines or tokens is a coarse proxy for cognitive effort; future work could incorporate eye‑tracking or interaction logs for finer granularity.
- Scalability – Running the full protocol (builder + multiple finder runs) is computationally expensive; streamlined variants are needed for large‑scale model evaluation.
The authors plan to expand the task suite, integrate human participants, and explore richer effort metrics in upcoming releases.
Authors
- Jhen-Ke Lin
Paper Information
- arXiv ID: 2605.06136v1
- Categories: cs.SE, cs.AI
- Published: May 7, 2026