[Paper] Scaling Test-Driven Code Generation from Functions to Classes: An Empirical Study
Source: arXiv - 2602.03557v1
Overview
The paper investigates how to extend test‑driven development (TDD) techniques—already shown to boost large language model (LLM) code generation for single functions—to the more complex realm of class‑level synthesis. By introducing an iterative TDD framework that respects method dependencies inside a class, the authors demonstrate a sizable jump in correctness for generated classes across several state‑of‑the‑art LLMs.
Key Contributions
- Iterative class‑level TDD framework that (1) analyzes intra‑class method dependencies, (2) schedules generation order, and (3) applies reflection‑style test feedback with bounded repair loops.
- ClassEval‑TDD benchmark, a cleaned, deterministic, and fully public‑test‑covered version of the existing ClassEval suite, enabling reproducible class‑level evaluation.
- Comprehensive empirical study on eight LLMs, comparing the new TDD pipeline against the strongest direct‑generation baselines (holistic, incremental, compositional).
- Quantitative gains: class‑level correctness improves by 12–26 absolute points, with the best model producing fully correct implementations for up to 71 % of classes.
- Open‑source release of code, data, and evaluation scripts for community reuse.
Methodology
- Dependency Analysis – For each target class, the system parses the public API to build a directed graph of method calls and shared‑state accesses. This graph yields a feasible generation schedule (e.g., constructors first, then methods that rely on them).
- Iterative Generation – Methods are generated one‑by‑one using an LLM prompted with the method signature, docstring, and any already‑generated code.
- Public Test Execution – After a method is generated, its public unit tests are run in an isolated environment via Python's `exec`/`eval` (reflection). Test failures are captured as concrete error messages.
- Bounded Repair Loop – If a test fails, the LLM receives the failure trace and a limited number of repair attempts (default ≤ 3). The loop stops once the test passes or the repair budget is exhausted.
- ClassEval‑TDD Benchmark – The authors curated 1,200 classes from open‑source projects, stripped nondeterministic behavior, and supplied a full set of method‑level public tests. This ensures that every generated class can be evaluated automatically and fairly.
The whole pipeline is automated, requiring only the LLM API key and the benchmark dataset.
Results & Findings
| Model | Direct‑gen (best) | Class‑level TDD | Δ Correctness (pts) | Fully Correct Classes |
|---|---|---|---|---|
| GPT‑4 | 48 % | 71 % | +23 | 71 % |
| Claude‑2 | 42 % | 64 % | +22 | 64 % |
| LLaMA‑2‑70B | 31 % | 55 % | +24 | 55 % |
| … | … | … | … | … |
- Improvement range: 12–26 absolute percentage points across all evaluated models.
- Repair efficiency: Most methods required ≤ 2 repair iterations; the average number of LLM calls per class grew by only ~1.3× compared to direct generation.
- Error patterns: Remaining failures were dominated by subtle state‑management bugs (e.g., forgetting to update an attribute) rather than syntax or API misuse, indicating that TDD helps with surface‑level correctness but deeper design issues persist.
Practical Implications
- Higher reliability for AI‑assisted IDEs – Integrating the class‑level TDD loop can turn “autocomplete‑style” suggestions into test‑validated code snippets, reducing the manual debugging burden for developers.
- Accelerated code review – Generated classes that already pass public tests can be merged faster, allowing reviewers to focus on architectural concerns.
- Bootstrapping legacy codebases – When modernizing monolithic modules, developers can ask an LLM to rewrite a class while the existing test suite serves as the executable specification, ensuring functional parity.
- Educational tools – The framework can be repurposed for teaching OOP concepts: students write tests first, then watch an LLM iteratively satisfy them, reinforcing the TDD mindset.
- Benchmarking new LLMs – ClassEval‑TDD offers a standardized, reproducible yardstick for measuring progress on class‑level generation, which is more representative of real‑world software than isolated functions.
Limitations & Future Work
- Public‑test coverage: The study assumes comprehensive method‑level tests; in practice, many codebases have sparse or flaky tests, which would limit the framework’s effectiveness.
- State‑complexity ceiling: Classes with intricate inheritance hierarchies or metaprogramming patterns were excluded; extending the dependency analysis to handle such cases remains open.
- Repair budget trade‑off: While a small number of repair iterations sufficed for most methods, certain edge cases required more attempts, inflating latency. Adaptive budgeting strategies could improve efficiency.
- Cross‑language applicability: The current implementation targets Python; applying the same pipeline to statically typed languages (Java, C#) will need richer type‑inference and compilation feedback loops.
The authors suggest exploring automated test generation to complement missing public tests, and integrating static analysis to catch deeper design flaws that TDD alone cannot resolve.
Authors
- Yunhao Liang
- Ruixuan Ying
- Shiwen Ni
- Zhe Cui
Paper Information
- arXiv ID: 2602.03557v1
- Categories: cs.SE
- Published: February 3, 2026