[Paper] RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements
Source: arXiv - 2604.25862v1
Overview
The paper introduces RESTestBench, a new benchmark designed to evaluate how well large language models (LLMs) can generate REST‑API test cases directly from natural‑language (NL) requirements. By moving beyond traditional code‑coverage metrics, the authors provide a way to measure whether generated tests actually validate the functional intent expressed in the requirements—a crucial step for developers who rely on AI‑assisted testing.
Key Contributions
- RESTestBench benchmark: three realistic REST services paired with manually verified NL requirements, with each requirement offered in both a precise and a vague version.
- Requirements‑based mutation testing metric: Extends Bartocci et al.’s property‑based mutation approach to quantify fault‑detection power of a test case with respect to a specific requirement.
- Empirical evaluation of two LLM‑driven strategies:
- Non‑refinement generation (pure prompt‑to‑test).
- Refinement generation (iterative interaction with the running SUT, including mutated versions).
- Insights on refinement usefulness: Shows that exposing the generator to faulty code can actually hurt test effectiveness, especially when requirements are vague.
Methodology
- Benchmark construction – The authors built three open‑source REST APIs (e.g., a book‑store, a todo list, and a user‑profile service). For each endpoint they wrote two NL requirement statements: a precise description (e.g., “POST /books must reject a request when the ISBN already exists”) and a vague one (e.g., “Adding a book should work correctly”).
- Mutation generation – Each service was automatically mutated (e.g., by flipping a conditional, removing a validation) to create faulty versions that violate specific requirements.
- LLM test generation – Six state‑of‑the‑art LLMs (GPT‑4, Claude, Llama 2, etc.) were prompted to produce test cases in a popular testing framework (e.g., REST‑Assured or Postman). Two pipelines were compared:
- Non‑refinement: Prompt → test case.
- Refinement: Prompt → test case → run against the live SUT → feed back the response → optionally adjust the test.
- Evaluation metric – For each generated test, the mutation testing metric checks whether the test fails on the mutated SUT that violates the target requirement while passing on the correct implementation. This yields a requirement‑specific fault detection score (0–1).
Results & Findings
| Scenario | Precise Req. | Vague Req. |
|---|---|---|
| Non‑refinement (baseline) | 0.78 avg. detection | 0.45 avg. detection |
| Refinement on correct SUT | 0.81 (↑) | 0.48 (↑) |
| Refinement on mutated SUT | 0.62 (↓) | 0.30 (↓) |
- Precise requirements already give LLMs a strong signal; refinement adds only a modest boost.
- Vague requirements suffer dramatically when the generator sees a mutated implementation—tests start to over‑fit the buggy behavior and miss the intended functionality.
- The benefit of refinement disappears entirely for vague specs, suggesting that “show me the code” may be counter‑productive unless the requirement is crystal‑clear.
Practical Implications
- For DevOps / QA teams: RESTestBench offers a ready‑to‑use suite to benchmark any in‑house LLM‑based test generator before integrating it into CI pipelines.
- Test‑case generation tooling: Vendors should expose a requirement‑clarity flag; when requirements are ambiguous, the tool should avoid runtime refinement and rely on pure prompting.
- Cost‑effective testing: Developers can focus effort on writing precise, testable requirements (e.g., using Given/When/Then style) to reap the full benefits of LLM‑generated tests without needing extra execution cycles for refinement.
- Safety‑critical APIs: The mutation‑based metric provides a concrete, requirement‑driven safety gauge that can be incorporated into compliance checklists (e.g., for fintech or healthcare APIs).
Limitations & Future Work
- Benchmark size – Only three services were used; broader domain coverage (e.g., streaming, GraphQL) would improve generalizability.
- LLM prompt engineering – The study kept prompts simple; exploring richer prompt templates or few‑shot examples could change the refinement dynamics.
- Mutation realism – Automated mutations may not capture the full spectrum of real bugs developers encounter; future work could involve human‑in‑the‑loop bug injection.
- Extending metrics – The current metric focuses on binary pass/fail per requirement; incorporating severity or multi‑requirement interactions is an open avenue.
Bottom line: RESTestBench shines a light on the missing link between AI‑generated tests and the functional intent they’re supposed to verify. By providing a concrete, requirement‑centric evaluation framework, it helps developers decide when to trust LLM‑driven testing and when traditional manual test design remains indispensable.
Authors
- Leon Kogler
- Stefan Hangler
- Maximilian Ehrhart
- Benedikt Dornauer
- Roland Wuersching
- Peter Schrammel
Paper Information
- arXiv ID: 2604.25862v1
- Categories: cs.SE, cs.AI
- Published: April 28, 2026