[Paper] RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements
Source: arXiv - 2604.25862v1
Overview
The paper introduces RESTestBench, a new benchmark designed to evaluate how well large language models (LLMs) can generate REST‑API test cases directly from natural‑language (NL) requirements. By moving beyond traditional code‑coverage metrics, the authors provide a way to measure whether generated tests actually validate the functional intent expressed in the requirements—a crucial step for developers who rely on AI‑assisted testing.
Key Contributions
- RESTestBench benchmark: three realistic REST services paired with manually verified NL requirements, with each requirement offered in both a precise and a vague version.
- Requirements‑based mutation testing metric: Extends Bartocci et al.’s property‑based mutation approach to quantify fault‑detection power of a test case with respect to a specific requirement.
- Empirical evaluation of two LLM‑driven strategies:
- Non‑refinement generation (pure prompt‑to‑test).
- Refinement generation (iterative interaction with the running SUT, including mutated versions).
- Insights on refinement usefulness: Shows that exposing the generator to faulty code can actually hurt test effectiveness, especially when requirements are vague.
Methodology
- Benchmark construction – The authors built three open‑source REST APIs (e.g., a book‑store, a todo list, and a user‑profile service). For each endpoint they wrote two NL requirement statements: a precise description (e.g., “POST /books must reject a request when the ISBN already exists”) and a vague one (e.g., “Adding a book should work correctly”).
- Mutation generation – Each service was automatically mutated (e.g., by flipping a conditional, removing a validation) to create faulty versions that violate specific requirements.
- LLM test generation – Six state‑of‑the‑art LLMs (GPT‑4, Claude, Llama 2, etc.) were prompted to produce test cases in a popular testing framework (e.g., REST‑Assured or Postman). Two pipelines were compared:
- Non‑refinement: Prompt → test case.
- Refinement: Prompt → test case → run against the live SUT → feed back the response → optionally adjust the test.
- Evaluation metric – For each generated test, the mutation testing metric checks whether the test fails on the mutated SUT that violates the target requirement while passing on the correct implementation. This yields a requirement‑specific fault detection score (0–1).
Results & Findings
| Scenario | Precise Req. | Vague Req. |
|---|---|---|
| Non‑refinement (baseline) | 0.78 avg. detection | 0.45 avg. detection |
| Refinement on correct SUT | 0.81 (↑) | 0.48 (↑) |
| Refinement on mutated SUT | 0.62 (↓) | 0.30 (↓) |
- Precise requirements already give LLMs a strong signal; refinement adds only a modest boost.
- Vague requirements suffer dramatically when the generator sees a mutated implementation—tests start to over‑fit the buggy behavior and miss the intended functionality.
- The benefit of refinement disappears entirely for vague specs, suggesting that “show me the code” may be counter‑productive unless the requirement is crystal‑clear.
Practical Implications
- For DevOps / QA teams: RESTestBench offers a ready‑to‑use suite to benchmark any in‑house LLM‑based test generator before integrating it into CI pipelines.
- Test‑case generation tooling: Vendors should expose a requirement‑clarity flag; when requirements are ambiguous, the tool should avoid runtime refinement and rely on pure prompting.
- Cost‑effective testing: Developers can focus effort on writing precise, testable requirements (e.g., using Given/When/Then style) to reap the full benefits of LLM‑generated tests without needing extra execution cycles for refinement.
- Safety‑critical APIs: The mutation‑based metric provides a concrete, requirement‑driven safety gauge that can be incorporated into compliance checklists (e.g., for fintech or healthcare APIs).
Limitations & Future Work
- Benchmark size – Only three services were used; broader domain coverage (e.g., streaming, GraphQL) would improve generalizability.
- LLM prompt engineering – The study kept prompts simple; exploring richer prompt templates or few‑shot examples could change the refinement dynamics.
- Mutation realism – Automated mutations may not capture the full spectrum of real bugs developers encounter; future work could involve human‑in‑the‑loop bug injection.
- Extending metrics – The current metric focuses on binary pass/fail per requirement; incorporating severity or multi‑requirement interactions is an open avenue.
Bottom line: RESTestBench shines a light on the missing link between AI‑generated tests and the functional intent they’re supposed to verify. By providing a concrete, requirement‑centric evaluation framework, it helps developers decide when to trust LLM‑driven testing and when traditional manual test design remains indispensable.
Authors
- Leon Kogler
- Stefan Hangler
- Maximilian Ehrhart
- Benedikt Dornauer
- Roland Wuersching
- Peter Schrammel
Paper Information
- arXiv ID: 2604.25862v1
- Categories: cs.SE, cs.AI
- Published: April 28, 2026