[Paper] Scaling Test-Driven Code Generation from Functions to Classes: An Empirical Study
Source: arXiv - 2602.03557v1
Overview
The paper investigates how to extend test‑driven development (TDD) techniques—already shown to boost large language model (LLM) code generation for single functions—to the more complex realm of class‑level synthesis. By introducing an iterative TDD framework that respects method dependencies inside a class, the authors demonstrate a sizable jump in correctness for generated classes across several state‑of‑the‑art LLMs.
Key Contributions
- Iterative class‑level TDD framework that (1) analyzes intra‑class method dependencies, (2) schedules generation order, and (3) applies reflection‑style test feedback with bounded repair loops.
- ClassEval‑TDD benchmark, a cleaned, deterministic, and fully public‑test‑covered version of the existing ClassEval suite, enabling reproducible class‑level evaluation.
- Comprehensive empirical study on eight LLMs, comparing the new TDD pipeline against the strongest direct‑generation baselines (holistic, incremental, compositional).
- Quantitative gains: class‑level correctness improves by 12–26 absolute points, with the best model producing fully correct implementations for up to 71 % of classes.
- Open‑source release of code, data, and evaluation scripts for community reuse.
Methodology
- Dependency Analysis – For each target class, the system parses the public API to build a directed graph of method calls and shared‑state accesses. This graph yields a feasible generation schedule (e.g., constructors first, then methods that rely on them).
- Iterative Generation – Methods are generated one‑by‑one using an LLM prompted with the method signature, docstring, and any already‑generated code.
- Public Test Execution – After a method is generated, its public unit tests are run in an isolated environment via Python's `exec`/`eval` (reflection). Test failures are captured as concrete error messages.
- Bounded Repair Loop – If a test fails, the LLM receives the failure trace and a limited number of repair attempts (default ≤ 3). The loop stops once the test passes or the repair budget is exhausted.
- ClassEval‑TDD Benchmark – The authors curated 1,200 classes from open‑source projects, stripped nondeterministic behavior, and supplied a full set of method‑level public tests. This ensures that every generated class can be evaluated automatically and fairly.
The whole pipeline is automated, requiring only the LLM API key and the benchmark dataset.
Results & Findings
| Model | Direct‑gen (best) | Class‑level TDD | Δ Correctness (pts) | Fully Correct Classes |
|---|---|---|---|---|
| GPT‑4 | 48 % | 71 % | +23 | 71 % |
| Claude‑2 | 42 % | 64 % | +22 | 64 % |
| LLaMA‑2‑70B | 31 % | 55 % | +24 | 55 % |
| … | … | … | … | … |
- Improvement range: 12–26 absolute percentage points across all evaluated models.
- Repair efficiency: Most methods required ≤ 2 repair iterations; the average number of LLM calls per class grew by only ~1.3× compared to direct generation.
- Error patterns: Remaining failures were dominated by subtle state‑management bugs (e.g., forgetting to update an attribute) rather than syntax or API misuse, indicating that TDD helps with surface‑level correctness but deeper design issues persist.
Practical Implications
- Higher reliability for AI‑assisted IDEs – Integrating the class‑level TDD loop can turn “autocomplete‑style” suggestions into test‑validated code snippets, reducing the manual debugging burden for developers.
- Accelerated code review – Generated classes that already pass public tests can be merged faster, allowing reviewers to focus on architectural concerns.
- Bootstrapping legacy codebases – When modernizing monolithic modules, developers can ask an LLM to rewrite a class while the existing test suite serves as the executable specification, ensuring functional parity.
- Educational tools – The framework can be repurposed for teaching OOP concepts: students write tests first, then watch an LLM iteratively satisfy them, reinforcing the TDD mindset.
- Benchmarking new LLMs – ClassEval‑TDD offers a standardized, reproducible yardstick for measuring progress on class‑level generation, which is more representative of real‑world software than isolated functions.
Limitations & Future Work
- Public‑test coverage: The study assumes comprehensive method‑level tests; in practice, many codebases have sparse or flaky tests, which would limit the framework’s effectiveness.
- State‑complexity ceiling: Classes with intricate inheritance hierarchies or metaprogramming patterns were excluded; extending the dependency analysis to handle such cases remains open.
- Repair budget trade‑off: While a small number of repair iterations sufficed for most methods, certain edge cases required more attempts, inflating latency. Adaptive budgeting strategies could improve efficiency.
- Cross‑language applicability: The current implementation targets Python; applying the same pipeline to statically typed languages (Java, C#) will need richer type‑inference and compilation feedback loops.
The authors suggest exploring automated test generation to complement missing public tests, and integrating static analysis to catch deeper design flaws that TDD alone cannot resolve.
Authors
- Yunhao Liang
- Ruixuan Ying
- Shiwen Ni
- Zhe Cui
Paper Information
- arXiv ID: 2602.03557v1
- Categories: cs.SE
- Published: February 3, 2026