[Paper] RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements

Published: April 28, 2026 at 12:59 PM EDT
4 min read
Source: arXiv - 2604.25862v1

Overview

The paper introduces RESTestBench, a new benchmark designed to evaluate how well large language models (LLMs) can generate REST‑API test cases directly from natural‑language (NL) requirements. By moving beyond traditional code‑coverage metrics, the authors provide a way to measure whether generated tests actually validate the functional intent expressed in the requirements—a crucial step for developers who rely on AI‑assisted testing.

Key Contributions

  • RESTestBench benchmark: three realistic REST services paired with manually verified NL requirements, each offered in a precise and a vague version.
  • Requirements‑based mutation testing metric: Extends Bartocci et al.’s property‑based mutation approach to quantify fault‑detection power of a test case with respect to a specific requirement.
  • Empirical evaluation of two LLM‑driven strategies:
    1. Non‑refinement generation (pure prompt‑to‑test).
    2. Refinement generation (iterative interaction with the running SUT, including mutated versions).
  • Insights on refinement usefulness: Shows that exposing the generator to faulty code can actually hurt test effectiveness, especially when requirements are vague.
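The two strategies can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `GenerateFn`/`RunFn` interfaces, the `as_expected` feedback field, and the round limit are all assumptions.

```python
from typing import Callable

# Hypothetical interfaces -- not specified by the paper:
#   generate: prompt text -> candidate test code
#   run:      test code   -> observation from executing it against the SUT
GenerateFn = Callable[[str], str]
RunFn = Callable[[str], dict]

def non_refinement(requirement: str, generate: GenerateFn) -> str:
    """Strategy 1: pure prompt-to-test, a single LLM call."""
    return generate(requirement)

def refinement(requirement: str, generate: GenerateFn, run: RunFn,
               max_rounds: int = 3) -> str:
    """Strategy 2: execute the candidate test against the running SUT
    (possibly a mutated version) and feed the observation back into
    the prompt until the outcome matches expectations."""
    test = generate(requirement)
    for _ in range(max_rounds):
        observation = run(test)
        if observation.get("as_expected"):  # outcome matches the test's expectation
            break
        test = generate(f"{requirement}\nObserved: {observation}")
    return test
```

The key difference is simply whether execution feedback flows back into the prompt; the paper's finding is that this loop helps only when the requirement is precise and the SUT is correct.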

Methodology

  1. Benchmark construction – The authors built three open‑source REST APIs (e.g., a book‑store, a todo list, and a user‑profile service). For each endpoint they wrote two NL requirement statements: a precise description (e.g., “POST /books must reject a request when the ISBN already exists”) and a vague one (e.g., “Adding a book should work correctly”).
  2. Mutation generation – Each service was automatically mutated (e.g., by flipping a conditional, removing a validation) to create faulty versions that violate specific requirements.
  3. LLM test generation – Six state‑of‑the‑art LLMs (GPT‑4, Claude, Llama 2, etc.) were prompted to produce test cases in a popular testing framework (e.g., REST‑Assured or Postman). Two pipelines were compared:
    • Non‑refinement: Prompt → test case.
    • Refinement: Prompt → test case → run against the live SUT → feed back the response → optionally adjust the test.
  4. Evaluation metric – For each generated test, the mutation testing metric checks whether the test fails on the mutated SUT that violates the target requirement while passing on the correct implementation. This yields a requirement‑specific fault detection score (0–1).
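The metric in step 4 can be sketched as a small scoring function. This is a hedged reconstruction from the summary above, not the authors' code; `test_passes` is a hypothetical runner, and scoring as a kill fraction over the requirement's mutants is an assumption.

```python
def fault_detection_score(test_passes, correct_sut, mutants) -> float:
    """Requirement-specific fault detection (sketch): a useful test must
    pass on the correct implementation and fail on mutants that violate
    the target requirement.

    test_passes(sut) -> bool runs the generated test against one SUT
    variant; `mutants` are the faulty variants created for one requirement.
    """
    if not test_passes(correct_sut):
        return 0.0                       # test rejects the oracle: no credit
    killed = sum(1 for m in mutants if not test_passes(m))
    return killed / len(mutants)         # fraction of violating mutants caught
```

A test that passes everywhere (asserting nothing meaningful) scores 0, as does one that fails on the correct implementation; only tests that discriminate the intended behavior from its violations score highly.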

Results & Findings

Scenario                     | Precise Req.        | Vague Req.
Non‑refinement (baseline)    | 0.78 avg. detection | 0.45 avg. detection
Refinement on correct SUT    | 0.81 (↑)            | 0.48 (↑)
Refinement on mutated SUT    | 0.62 (↓)            | 0.30 (↓)
  • Precise requirements already give LLMs a strong signal; refinement adds only a modest boost.
  • Vague requirements suffer dramatically when the generator sees a mutated implementation—tests start to over‑fit the buggy behavior and miss the intended functionality.
  • The benefit of refinement disappears entirely for vague specs, suggesting that “show me the code” may be counter‑productive unless the requirement is crystal‑clear.
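The over-fitting failure mode can be illustrated with a toy duplicate-ISBN endpoint (function names, payloads, and status codes here are illustrative, not taken from the benchmark):

```python
# Toy SUT: the correct implementation rejects a duplicate ISBN with 409;
# the mutant has that validation removed and answers 201 instead.
def post_book(isbn: str, existing: set, mutated: bool = False) -> int:
    if not mutated and isbn in existing:
        return 409
    return 201

# Test derived from the requirement's intent:
# passes on the correct SUT, fails on (kills) the mutant.
def intent_test(mutated: bool) -> bool:
    return post_book("978-1", {"978-1"}, mutated) == 409

# Test "refined" against the mutant encodes the buggy behavior instead:
# it passes on the mutant but fails on the correct implementation.
def overfit_test(mutated: bool) -> bool:
    return post_book("978-1", {"978-1"}, mutated) == 201
```

Under the requirements-based mutation metric, `intent_test` scores 1 (passes on the oracle, kills the mutant) while `overfit_test` scores 0, which is exactly the degradation the refinement-on-mutated-SUT rows show.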

Practical Implications

  • For DevOps / QA teams: RESTestBench offers a ready‑to‑use suite to benchmark any in‑house LLM‑based test generator before integrating it into CI pipelines.
  • Test‑case generation tooling: Vendors should expose a requirement‑clarity flag; when requirements are ambiguous, the tool should avoid runtime refinement and rely on pure prompting.
  • Cost‑effective testing: Developers can focus effort on writing precise, testable requirements (e.g., using Given/When/Then style) to reap the full benefits of LLM‑generated tests without needing extra execution cycles for refinement.
  • Safety‑critical APIs: The mutation‑based metric provides a concrete, requirement‑driven safety gauge that can be incorporated into compliance checklists (e.g., for fintech or healthcare APIs).
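For example, a vague requirement like "creating a todo should work" can be sharpened into a Given/When/Then statement with a directly checkable outcome. The service below is a hypothetical in-memory sketch, not the benchmark's code:

```python
# Requirement (precise): Given a fresh service,
#   When a client POSTs a todo whose title is blank,
#   Then the service responds 400 and stores nothing.
class TodoService:
    """Minimal in-memory stand-in for a REST todo endpoint."""
    def __init__(self):
        self.items: dict[int, str] = {}
        self._next_id = 1

    def create(self, title: str):
        if not title.strip():
            return 400, None             # Then: reject blank titles
        todo_id, self._next_id = self._next_id, self._next_id + 1
        self.items[todo_id] = title
        return 201, todo_id

def test_blank_title_rejected() -> bool:
    svc = TodoService()                  # Given: fresh service
    status, _ = svc.create("   ")        # When: blank title posted
    return status == 400 and not svc.items  # Then: 400 and nothing stored
```

A requirement phrased this way gives the LLM an explicit oracle (status code plus state change), which is precisely the "strong signal" the precise-requirement results attribute their higher detection scores to.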

Limitations & Future Work

  • Benchmark size – Only three services were used; broader domain coverage (e.g., streaming, GraphQL) would improve generalizability.
  • LLM prompt engineering – The study kept prompts simple; exploring richer prompt templates or few‑shot examples could change the refinement dynamics.
  • Mutation realism – Automated mutations may not capture the full spectrum of real bugs developers encounter; future work could involve human‑in‑the‑loop bug injection.
  • Extending metrics – The current metric focuses on binary pass/fail per requirement; incorporating severity or multi‑requirement interactions is an open avenue.

Bottom line: RESTestBench shines a light on the missing link between AI‑generated tests and the functional intent they’re supposed to verify. By providing a concrete, requirement‑centric evaluation framework, it helps developers decide when to trust LLM‑driven testing and when traditional manual test design remains indispensable.

Authors

  • Leon Kogler
  • Stefan Hangler
  • Maximilian Ehrhart
  • Benedikt Dornauer
  • Roland Wuersching
  • Peter Schrammel

Paper Information

  • arXiv ID: 2604.25862v1
  • Categories: cs.SE, cs.AI
  • Published: April 28, 2026