[Paper] FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Published: February 11, 2026 at 11:06 AM EST
4 min read
Source: arXiv (2602.10975v1)

Overview

The paper introduces FeatureBench, a new benchmark that evaluates how well large‑language‑model (LLM)‑powered coding agents can develop complete software features, not just fix isolated bugs. By automatically extracting feature‑level tasks from real open‑source projects and testing them in executable environments, the authors expose a stark gap between current agent capabilities and the demands of real‑world development.

Key Contributions

  • Feature‑oriented benchmark: First systematic suite that measures end‑to‑end feature development across multiple commits and pull requests.
  • Execution‑based evaluation: Every task runs in a fully provisioned environment, guaranteeing that reported scores reflect actual, runnable code.
  • Scalable, test‑driven task generation: A lightweight toolkit traces unit‑test dependencies to automatically carve out feature tasks from existing repositories, requiring minimal human curation.
  • Large, diverse dataset: 200 challenging feature tasks and 3,825 executable environments drawn from 24 popular open‑source projects.
  • Baseline performance gap: State‑of‑the‑art agent Claude 4.5 Opus solves only 11 % of FeatureBench tasks (vs. 74 % on the older SWE‑bench), highlighting a new research frontier.

Methodology

  1. Repository selection – Choose mature open‑source projects with rich test suites.
  2. Dependency graph construction – Build a graph linking unit tests to the source files they touch.
  3. Feature extraction – Starting from a test, walk upstream through the graph to collect all code changes (commits/PRs) that affect the test, forming a self‑contained “feature” slice.
  4. Isolation & verification – Spin up a Docker‑style container for each slice, run the full test suite, and confirm that unrelated features still pass.
  5. Task definition – Present the agent with the initial repository state and the target test; the agent must produce the missing code to make the test pass.
  6. Automated scoring – Success is binary: the generated code compiles, the target test passes, and no regression is introduced.

The pipeline is fully scriptable, enabling continuous addition of new tasks as projects evolve.
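The core of steps 2–3 is linking each unit test to the source files it exercises. As a rough, minimal sketch (not the authors' actual toolkit), one could approximate this dependency graph with Python's built‑in `sys.settrace`; the helper name `trace_test_dependencies` is hypothetical:

```python
import sys
from pathlib import Path

def trace_test_dependencies(test_fn, repo_root):
    """Run a test callable under a tracer and record which source files
    inside repo_root it executes -- a crude proxy for the test-to-source
    dependency graph that drives feature-slice extraction."""
    repo_root = Path(repo_root).resolve()
    touched = set()

    def tracer(frame, event, arg):
        if event == "call":
            path = Path(frame.f_code.co_filename)
            try:
                # Keep only files that live inside the repository.
                touched.add(str(path.resolve().relative_to(repo_root)))
            except ValueError:
                pass  # file is outside the repo (stdlib, site-packages)
        return tracer

    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)
    return touched
```

From the set of touched files, the extraction step would then walk the commit history to collect the changes that shaped those files into a self‑contained feature slice. A real implementation would use coverage tooling rather than a hand‑rolled tracer, but the principle is the same.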

Results & Findings

| Model | Success on SWE‑bench | Success on FeatureBench |
| --- | --- | --- |
| Claude 4.5 Opus | 74.4 % | 11.0 % |
| Other strong agents (reported) | ~60 % | 8–12 % |
  • Sharp performance drop: Agents that excel on single‑PR bug‑fix benchmarks stumble when required to synthesize multi‑commit feature logic.
  • Task difficulty: The curated tasks involve cross‑file dependencies, API design, and integration concerns that are rarely present in existing benchmarks.
  • Scalability proof: Using the automated toolkit, the authors generated the full dataset in a few hours, demonstrating that the benchmark can be refreshed regularly to avoid data leakage.

Practical Implications

  • Tooling developers: Companies building AI pair‑programmers should treat FeatureBench as a more realistic validation suite before shipping agents to production.
  • CI/CD integration: The execution‑based framework can be repurposed as a continuous evaluation harness for internal LLM‑coding agents, catching regressions early.
  • Training data curation: The automatically extracted feature slices provide high‑quality, verifiable examples that can be fed back into model fine‑tuning pipelines.
  • Developer workflow: Understanding the current limits helps teams set appropriate expectations—agents are still better suited for micro‑tasks (e.g., scaffolding, refactoring) than for delivering full‑blown features without human oversight.
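The CI/CD repurposing above amounts to applying the paper's binary scoring rule (step 6 of the methodology) to every agent‑generated patch. A minimal, hypothetical harness might look like this; `evaluate_patch` and its command arguments are illustrative, and in practice both commands would run inside the provisioned container:

```python
import subprocess

def evaluate_patch(repo_dir, target_test_cmd, full_suite_cmd):
    """Binary, execution-based verdict in the spirit of FeatureBench:
    a patch succeeds only if the target test passes AND the full test
    suite shows no regressions."""
    target = subprocess.run(target_test_cmd, cwd=repo_dir, capture_output=True)
    if target.returncode != 0:
        return {"passed": False, "reason": "target test failed"}
    suite = subprocess.run(full_suite_cmd, cwd=repo_dir, capture_output=True)
    if suite.returncode != 0:
        return {"passed": False, "reason": "regression in full test suite"}
    return {"passed": True, "reason": "target test passed, no regressions"}
```

In a real pipeline the two commands would typically be `pytest` invocations, and the verdict would gate whether the agent's patch is surfaced for human review.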

Limitations & Future Work

  • Scope of languages: The current version focuses on a handful of languages (primarily Python/JavaScript); extending to compiled languages will require more sophisticated build environments.
  • Test quality dependence: The benchmark’s reliability hinges on the completeness of the underlying unit tests; poorly written tests could misrepresent a feature’s true difficulty.
  • Human‑in‑the‑loop evaluation: While execution‑based scoring is objective, it does not capture code readability, style, or maintainability—areas where developers still need to intervene.
  • Future directions: The authors plan to (1) broaden language coverage, (2) incorporate higher‑level functional specifications (e.g., user stories), and (3) explore multi‑agent collaboration scenarios where one LLM drafts a feature and another reviews it.

FeatureBench shines a light on the next frontier for LLM‑driven coding agents: moving from isolated bug fixes to the orchestration of complex, multi‑file features that power real software products. For developers and AI tool builders, it offers both a realistic yardstick and a roadmap for the capabilities that still need to be built.

Authors

  • Qixing Zhou
  • Jiacheng Zhang
  • Haiyang Wang
  • Rui Hao
  • Jiahe Wang
  • Minghao Han
  • Yuxue Yang
  • Shuzhe Wu
  • Feiyang Pan
  • Lue Fan
  • Dandan Tu
  • Zhaoxiang Zhang

Paper Information

  • arXiv ID: 2602.10975v1
  • Categories: cs.SE, cs.AI
  • Published: February 11, 2026