[Paper] Generating REST API Tests With Descriptive Names
Source: arXiv - 2512.01690v1
Overview
The paper tackles a surprisingly common annoyance for developers working with automatically generated REST‑API tests: the tests come with opaque names like test0 or test1. By proposing deterministic techniques that produce human‑readable, descriptive names, the authors show that test suites can become far easier to understand, maintain, and adopt in real projects.
Key Contributions
- Three deterministic naming algorithms (rule‑based heuristics) specifically designed for REST‑API tests generated by the EvoMaster fuzzer.
- Comprehensive empirical comparison of eight naming approaches: the three new rule-based techniques plus five existing methods, including LLM-based approaches such as Gemini, GPT‑4o, and GPT‑3.5.
- Two human‑subject surveys (with up to 39 participants) to assess perceived clarity and usefulness of the generated names.
- Industrial validation with Volkswagen AG developers, evaluating 74 test cases across four APIs and confirming real‑world readability gains.
- Evidence that lightweight deterministic methods can match or beat heavyweight LLMs in naming quality while avoiding the cost, latency, and security concerns of calling external AI services.
Methodology
- Test Generation – EvoMaster, an open‑source REST‑API fuzzing tool, produced 10 test cases for each of nine open‑source APIs (≈ 90 tests total).
- Naming Techniques –
  - Rule‑based: extract the HTTP method, endpoint path, status code, and key payload fields to build a concise name (e.g., GET_UserById_Returns200_WithName).
  - LLM‑based: prompt GPT‑3.5, GPT‑4o, Gemini, etc., with the raw test code and ask them to suggest a name.
  - Hybrid/heuristic: combine simple string manipulation with frequency‑based token selection.
- Human Evaluation – Two separate surveys asked participants to rate each name on clarity, conciseness, and usefulness for debugging on a 5‑point Likert scale.
- Industrial Case Study – Four Volkswagen developers used EvoMaster on four internal APIs, then answered a questionnaire about the readability and maintenance impact of the generated names.
- Statistical Analysis – Non‑parametric tests (Wilcoxon signed‑rank) compared median scores across techniques; effect sizes were reported to gauge practical significance.
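The rule‑based pattern described in the methodology (method + endpoint path + status code + key payload field) might be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the exact tokenization and casing rules are assumptions:

```python
import re

def build_test_name(method, path, status, key_field=""):
    """Build a descriptive test name from an HTTP method, endpoint
    path, status code, and optional key payload field, in the spirit
    of the paper's rule-based pattern (exact rules are assumed)."""
    parts = []
    for segment in path.strip("/").split("/"):
        if segment.startswith("{") and segment.endswith("}"):
            # Path parameters like "{id}" become a "By<Param>" token.
            parts.append("By" + segment[1:-1].capitalize())
        else:
            # Camel-case ordinary segments, dropping punctuation.
            clean = re.sub(r"[^0-9A-Za-z]", " ", segment)
            parts.append("".join(w.capitalize() for w in clean.split()))
    name = f"{method.upper()}_{''.join(parts)}_Returns{status}"
    if key_field:
        name += f"_With{key_field.capitalize()}"
    return name

# build_test_name("get", "/users/{id}", 200, "name")
#   -> "GET_UsersById_Returns200_WithName"
```

Because the mapping is deterministic, the same test always gets the same name across regenerations, which is what keeps diffs in version control stable.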
Results & Findings
| Technique | Median Clarity Score (out of 5) | Relative Performance |
|---|---|---|
| Rule‑based heuristic (best deterministic) | 4.3 | On par with GPT‑4o (4.4) and Gemini (4.4) |
| GPT‑4o (LLM) | 4.4 | Slight edge, but not statistically significant vs. rule‑based |
| Gemini | 4.4 | Same as GPT‑4o |
| GPT‑3.5 | 3.2 | Significantly worse (p < 0.01) |
| Naïve heuristics (e.g., test0) | 2.1 | Baseline |
- Deterministic rule‑based names achieved the highest clarity among non‑LLM methods and were statistically indistinguishable from the top LLMs.
- Cost: rule‑based naming required < 1 ms per test, while GPT‑4o calls averaged ~250 ms each and incurred API usage fees.
- Industrial feedback: 92 % of Volkswagen participants agreed that descriptive names “made it easier to locate failing tests” and “reduced onboarding time for new team members.”
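The non‑parametric comparison the paper relies on can be illustrated with a minimal, hand‑rolled Wilcoxon signed‑rank statistic on paired ratings. The numbers below are made up for illustration and are not the paper's data:

```python
def wilcoxon_signed_rank(a, b):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired
    samples: zero differences are dropped, tied |d| values get
    average ranks (a teaching sketch, not a production routine)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical paired clarity ratings for two naming techniques:
rule_based = [4, 5, 3, 4, 5, 4]
baseline = [3, 4, 3, 2, 4, 4]
```

A small W relative to the number of pairs indicates that one technique's ratings are consistently higher; in practice one would use scipy.stats.wilcoxon, which also computes the p‑value.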
Practical Implications
- Immediate adoption – Projects already using EvoMaster (or similar API fuzzers) can plug in the rule‑based naming module with a single configuration change; no external API keys or latency penalties.
- Better CI/CD diagnostics – Descriptive test names surface in build logs and dashboards, allowing developers to pinpoint failing scenarios without digging into generated code.
- Facilitates test maintenance – When tests are stored in version control, meaningful names survive refactors, making it easier to track which API contract changes broke which behavior.
- Cost‑effective alternative to LLMs – Organizations with strict data‑privacy policies (e.g., automotive, finance) can avoid sending proprietary request/response payloads to cloud LLM services while still gaining most of the readability benefit.
- Extensible to other domains – The rule‑based pattern (method + endpoint + status + key fields) can be adapted for GraphQL, gRPC, or even UI‑level test generation tools.
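A black‑box EvoMaster run against an OpenAPI schema might look like the sketch below. The core flags (--blackBox, --bbSwaggerUrl, --outputFormat, --maxTime) are standard EvoMaster options, but the naming‑related flag and its value are assumptions illustrating the paper's idea, not verified options; check `java -jar evomaster.jar --help` for the actual name:

```shell
# Black-box fuzzing of a REST API from its OpenAPI/Swagger schema.
# --namingStrategy is hypothetical here, standing in for the
# configuration change the paper describes.
java -jar evomaster.jar \
  --blackBox true \
  --bbSwaggerUrl http://localhost:8080/v3/api-docs \
  --outputFormat JAVA_JUNIT_5 \
  --maxTime 60s \
  --namingStrategy ACTION
```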
Limitations & Future Work
- Scope limited to REST‑style APIs; naming heuristics may need redesign for event‑driven or streaming services.
- Survey participants were self‑selected (LinkedIn recruitment), which could bias clarity ratings toward more enthusiastic developers.
- Rule coverage – Extremely complex endpoints (deeply nested resources, dynamic path parameters) sometimes produced overly long names; a truncation strategy is needed.
- Future directions suggested by the authors include:
- Integrating static analysis to capture business‑level intent (e.g., “creates user with admin role”).
- Evaluating the approach on larger industrial test suites.
- Exploring hybrid models that use a lightweight on‑prem LLM to handle edge cases where rule‑based naming falls short.
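The truncation issue noted above could be handled by cutting at token boundaries and appending a short hash so shortened names remain unique. This sketch is our illustration, not the strategy the authors propose:

```python
import hashlib

def truncate_name(name, max_len=60):
    """Shorten an overly long generated test name at an underscore
    boundary, appending a 6-character hash of the full name so that
    two long names truncated to the same prefix stay distinct."""
    if len(name) <= max_len:
        return name
    suffix = "_" + hashlib.sha1(name.encode()).hexdigest()[:6]
    budget = max_len - len(suffix)
    kept = ""
    for token in name.split("_"):
        candidate = token if not kept else kept + "_" + token
        if len(candidate) > budget:
            break
        kept = candidate
    # Fall back to a hard character cut if even the first token is too long.
    return (kept or name[:budget]) + suffix
```

Deriving the suffix from the full name keeps the scheme deterministic, so regenerated suites still produce identical names.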
Authors
- Philip Garrett
- Juan P. Galeotti
- Andrea Arcuri
- Alexander Poth
- Olsi Rrjolli
Paper Information
- arXiv ID: 2512.01690v1
- Categories: cs.SE
- Published: December 1, 2025