[Paper] Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Published: February 12, 2026 at 09:15 AM EST

Source: arXiv - 2602.11988v1

Overview

The paper Evaluating AGENTS.md: Are Repository‑Level Context Files Helpful for Coding Agents? asks a simple but crucial question: do the “AGENTS.md” files that many teams add to their repos actually help AI‑powered coding assistants finish real development tasks? By testing both automatically generated and developer‑written context files across several benchmark suites, the authors show that these files often hurt performance while adding noticeable inference cost.

Key Contributions

  • Empirical evaluation of repository‑level context files (both LLM‑generated and human‑written) on two large testbeds: SWE‑bench tasks and a new collection of real‑world GitHub issues.
  • Comprehensive comparison across multiple coding agents (e.g., CodeGen, Claude, GPT‑4) and LLM back‑ends, revealing a consistent drop in task success when context files are present.
  • Cost analysis demonstrating >20 % extra token usage (and thus latency/price) when agents ingest AGENTS.md files.
  • Behavioral insight that context files steer agents toward broader exploration (more file traversal, richer test suites) but also cause them to over‑complicate solutions.
  • Guidelines for writing minimal, requirement‑focused context files that avoid unnecessary constraints.

Methodology

  1. Benchmarks

    • SWE‑bench: a curated set of coding tasks from popular open‑source repositories, each with a clear specification and test suite.
    • Real‑world issue set: 1,200+ GitHub issues drawn from projects that already include an AGENTS.md file committed by developers.
  2. Context File Generation

    • LLM‑generated: The authors prompted state‑of‑the‑art LLMs (GPT‑4, Claude, etc.) to produce AGENTS.md files following best‑practice recommendations from agent developers.
    • Human‑written: Directly used the AGENTS.md files already present in the real‑world issue set.
  3. Coding Agents

    • Several open‑source and commercial agents were evaluated (e.g., CodeGen‑2B, Claude‑Sonnet, GPT‑4‑Code). Each agent received either (a) no repository context, (b) only the context file, or (c) full repository snapshot plus the context file.
  4. Metrics

    • Task success rate (passing all tests).
    • Inference cost (total tokens processed).
    • Behavioral traces (file accesses, test generation, instruction compliance).
  5. Statistical Analysis

    • Paired t‑tests and bootstrap confidence intervals to assess significance of differences between “with‑context” and “without‑context” conditions.
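The paper does not release its analysis code, but the bootstrap step described above is straightforward. Below is a minimal sketch, assuming per‑task pass/fail outcomes (1 = all tests passed) recorded for the same tasks under both conditions; the function name and the example outcome lists are illustrative, not taken from the paper.

```python
import random

def bootstrap_ci(with_ctx, without_ctx, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the paired
    difference in success rates (with-context minus without-context).

    with_ctx / without_ctx: lists of 0/1 outcomes for the SAME tasks,
    so differences are taken task-by-task (paired design).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(with_ctx, without_ctx)]
    n = len(diffs)
    boot_means = []
    for _ in range(n_boot):
        # Resample task-level differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, (lo, hi)

# Hypothetical data mirroring the reported rates: 48 % without context,
# 38 % with an LLM-generated AGENTS.md, over 100 paired tasks.
diff, (lo, hi) = bootstrap_ci([1] * 38 + [0] * 62, [1] * 48 + [0] * 52)
```

If the resulting interval excludes zero, the drop in success rate is unlikely to be sampling noise, which is the conclusion the authors draw.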

Results & Findings

| Condition | Success Rate (avg.) | Token Overhead |
| --- | --- | --- |
| No context | 48 % | baseline |
| LLM‑generated AGENTS.md | 38 % | +22 % |
| Human‑written AGENTS.md | 41 % | +19 % |
  • Success drops: Adding a context file reduced the probability of passing all tests by 7‑10 percentage points across agents.
  • Higher cost: The extra tokens needed to read and follow the context file translated into longer latency and higher API bills (≈ $0.03 per 1 k tokens for GPT‑4).
  • Exploratory behavior: Agents with a context file opened ≈ 30 % more files and generated ≈ 15 % more unit tests, indicating they were trying to satisfy broader requirements.
  • Instruction adherence: Agents reliably followed the explicit directives in AGENTS.md (e.g., “use library X”), but many of those directives were unnecessary for the task at hand, leading to over‑engineering.
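To make the cost figure concrete, here is a back‑of‑envelope estimator using the numbers reported above (≈ $0.03 per 1k GPT‑4 tokens, ~20 % token overhead). The 50,000‑token request size in the example is an illustrative assumption, not a figure from the paper.

```python
def context_overhead_cost(base_tokens, overhead_pct, price_per_1k):
    """Estimate the extra spend per request caused by ingesting a
    repository context file such as AGENTS.md.

    base_tokens   -- tokens the request would use without the file
    overhead_pct  -- extra token usage as a percentage (e.g. 22)
    price_per_1k  -- API price in dollars per 1,000 tokens
    """
    extra_tokens = base_tokens * overhead_pct / 100
    return extra_tokens / 1000 * price_per_1k

# A 50k-token task with the +22 % overhead measured for
# LLM-generated AGENTS.md files, at $0.03 per 1k tokens:
cost = context_overhead_cost(50_000, 22, 0.03)  # ≈ $0.33 extra per request
```

Multiplied across thousands of CI runs or agent invocations, that per‑request overhead is where the "significant at scale" caveat in the paper comes from.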

Overall, the study suggests that unnecessary or overly prescriptive context files make the coding problem harder for LLM agents.

Practical Implications

  • Skip or prune AGENTS.md: For most repositories, omitting a detailed AGENTS.md (or keeping it extremely short) yields better success rates and lower costs.
  • Minimalist design: If a context file is needed (e.g., to expose a custom build script or secret token), limit it to essential, task‑agnostic requirements.
  • Tooling updates: IDE plugins and CI integrations that auto‑inject AGENTS.md into prompts should be configurable to disable the feature by default or to filter out non‑essential sections.
  • Cost budgeting: Teams using paid LLM APIs can estimate a 20 % cost increase per request when including context files, which can be significant at scale.
  • Agent instruction handling: Developers can trust that agents will obey explicit instructions, but they should avoid adding constraints that are not strictly required (e.g., “must use library Y” when the task can be solved without it).
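Putting the bullets above together, a context file that follows the paper's guidance would be short and limited to facts the agent cannot discover on its own. The file below is a hypothetical sketch (the commands and paths are invented for illustration), not an example from the paper:

```markdown
# AGENTS.md

## Build & test
- Build with `make build` (a custom script; `npm run build` will not work here).
- Run the test suite with `make test` before submitting changes.

## Hard requirements only
- Target Python 3.11; CI rejects older syntax.

<!-- Deliberately omitted: style preferences, library suggestions,
     and task-specific hints -- the constraints the study found
     lead agents to over-engineer solutions. -->
```

Everything else (coding style, preferred libraries, architectural advice) is exactly the kind of non‑essential directive the study found agents dutifully obey to their own detriment.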

Limitations & Future Work

  • Scope of repositories: The benchmarks focus on popular open‑source projects; results may differ for highly domain‑specific or monolithic codebases.
  • LLM diversity: Only a handful of leading LLMs were tested; newer or fine‑tuned models could react differently to context files.
  • Static vs. dynamic context: The study examined static AGENTS.md files; future work could explore dynamic context generation (e.g., on‑the‑fly extraction of relevant files).
  • User studies: The paper does not assess developer satisfaction or downstream maintenance impact; integrating human‑in‑the‑loop evaluations would round out the picture.

Bottom line: While the idea of "telling" an AI assistant about your repo sounds helpful, the evidence shows that less is more: keep repository‑level context files lean, or skip them altogether, to get faster, cheaper, and more reliable code generation.

Authors

  • Thibaud Gloaguen
  • Niels Mündler
  • Mark Müller
  • Veselin Raychev
  • Martin Vechev

Paper Information

  • arXiv ID: 2602.11988v1
  • Categories: cs.SE, cs.AI
  • Published: February 12, 2026
