[Paper] RESTifAI: LLM-Based Workflow for Reusable REST API Testing

Published: December 9, 2025
Source: arXiv - 2512.08706v1

Overview

The paper presents RESTifAI, a novel workflow that leverages large language models (LLMs) to automatically generate reusable, CI/CD‑ready tests for REST APIs. By focusing on happy‑path scenarios first and then deriving negative cases, the approach aims to produce reliable test suites that verify both successful (2xx) and error (4xx) responses—something many existing tools overlook.

Key Contributions

  • LLM‑driven test generation that creates end‑to‑end REST API tests ready for CI pipelines.
  • Happy‑path centric strategy: builds valid request sequences before synthesizing edge‑case (negative) tests.
  • Reusable test artifacts: generated tests are modular and can be easily integrated across multiple services or versions.
  • Oracle simplification: uses response status codes (2xx/4xx) as primary test oracles, reducing the need for complex custom assertions (see the sketch after this list).
  • Empirical comparison with state‑of‑the‑art tools (AutoRestTest, LogiAgent), showing comparable coverage while improving reusability and integration ease.
  • Open‑source implementation (GitHub) and a short demo video, facilitating immediate adoption.
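
To make the status-code oracle concrete, here is a minimal sketch of what a generated happy-path test could look like, written with pytest and requests. The `/users` endpoint, payload fields, and `API_BASE_URL` variable are illustrative assumptions, not details taken from the paper:

```python
import os

import requests

# Hypothetical service under test; in practice the base URL would come
# from the CI environment or be derived from the OpenAPI specification.
BASE_URL = os.environ.get("API_BASE_URL", "http://localhost:8080")


def test_create_user_happy_path():
    # Valid payload satisfying the (assumed) API contract.
    payload = {"name": "Alice", "email": "alice@example.com"}
    response = requests.post(f"{BASE_URL}/users", json=payload, timeout=10)
    # Status-code oracle: any 2xx response counts as success,
    # so no payload-level assertions are needed.
    assert 200 <= response.status_code < 300
```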

Methodology

  1. API Specification Ingestion – RESTifAI consumes OpenAPI/Swagger definitions (or can infer from live traffic).
  2. Prompt Engineering for LLM – Carefully crafted prompts guide a large language model (e.g., GPT‑4) to generate realistic request payloads and sequences that satisfy the API’s contract.
  3. Happy‑Path Test Synthesis – The LLM produces a set of valid request/response pairs that exercise the intended workflow, asserting that the service returns a 2xx status.
  4. Negative Test Derivation – For each happy‑path test, the system mutates inputs (e.g., missing fields, out‑of‑range values) to provoke 4xx responses, checking that the API correctly rejects malformed or business‑rule‑violating requests (a sketch follows this list).
  5. Test Code Generation – The tool emits ready‑to‑run test scripts (e.g., in Python/pytest or JavaScript/Jest) with minimal boilerplate, making them CI/CD‑compatible.
  6. Evaluation – Experiments on a benchmark of industrial REST services compare coverage, fault detection, and reusability against competing LLM‑based test generators.
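
As an illustration of step 4, a negative case can be derived by mutating a payload that is known to succeed, for example dropping a required field or pushing a value out of range, and asserting that the service answers with a 4xx status. The endpoint, field names, and mutation helpers below are hypothetical; the paper's actual mutation operators may differ:

```python
import copy
import os

import pytest
import requests

BASE_URL = os.environ.get("API_BASE_URL", "http://localhost:8080")

# A payload known to succeed, taken from a happy-path test.
VALID_PAYLOAD = {"name": "Alice", "email": "alice@example.com", "age": 30}


def drop_field(payload, field):
    """Return a copy of the payload with one required field removed."""
    mutated = copy.deepcopy(payload)
    del mutated[field]
    return mutated


@pytest.mark.parametrize(
    "mutated_payload",
    [
        drop_field(VALID_PAYLOAD, "email"),  # missing required field
        {**VALID_PAYLOAD, "age": -5},        # out-of-range value
    ],
)
def test_create_user_rejects_invalid_input(mutated_payload):
    response = requests.post(f"{BASE_URL}/users", json=mutated_payload, timeout=10)
    # Status-code oracle for negative cases: expect a client error (4xx).
    assert 400 <= response.status_code < 500
```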

Results & Findings

  • Coverage parity: RESTifAI achieved similar functional coverage (≈ 85 % of documented endpoints) as AutoRestTest and LogiAgent.
  • Fault detection: In injected bug scenarios, the tool caught 92 % of the defects, matching the best‑in‑class tools.
  • Reusability boost: Test suites generated by RESTifAI could be reused across API version upgrades with only a 10 % modification rate, versus > 30 % for the baselines.
  • Oracle simplicity: By focusing on status‑code assertions, the generated tests avoided flaky behavior common in payload‑level comparisons.
  • Integration ease: The output scripts required no additional scaffolding to run in typical CI pipelines (GitHub Actions, Jenkins, GitLab CI).

Practical Implications

  • Faster onboarding: Teams can spin up a baseline test suite for a newly designed API in minutes, reducing the manual effort of writing boilerplate tests.
  • Continuous testing: Because the generated tests are CI‑ready, developers can catch regressions early, especially when APIs evolve.
  • Improved reliability: The happy‑path + negative‑case strategy ensures that both functional success and proper error handling are validated, leading to more robust services.
  • Cost‑effective QA: Leveraging an LLM reduces the need for extensive domain‑specific test authoring, freeing QA engineers to focus on higher‑level scenario design.
  • Open‑source adoption: With the code publicly available, organizations can customize prompt templates or integrate the workflow into existing API management tooling (e.g., Kong, Apigee); a minimal illustration of such a template follows this list.
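
For teams who want to adapt the workflow, prompt customization might be as simple as a template that embeds the OpenAPI fragment for one operation. The template text, variable names, and helper function below are purely illustrative and are not the prompts shipped with RESTifAI:

```python
# Illustrative prompt template; not the actual RESTifAI prompt.
HAPPY_PATH_PROMPT = """\
You are generating API tests. Given this OpenAPI operation:

{operation_spec}

Produce a JSON request body and the sequence of prerequisite calls
needed so that the operation returns a 2xx response.
"""


def build_prompt(operation_spec: str) -> str:
    """Fill the template with the OpenAPI fragment for one operation."""
    return HAPPY_PATH_PROMPT.format(operation_spec=operation_spec)
```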

Limitations & Future Work

  • LLM dependence: Test quality hinges on the underlying language model; outdated or smaller models may produce less realistic payloads.
  • Oracle granularity: Relying solely on status codes may miss subtle business‑logic errors that require deeper payload validation.
  • Specification quality: Incomplete or inaccurate OpenAPI docs can limit the tool’s ability to generate meaningful tests.
  • Future directions: The authors plan to incorporate richer oracles (schema validation, contract‑based assertions), support for GraphQL/end‑to‑end workflows, and adaptive prompting that learns from previous test outcomes.

Authors

  • Leon Kogler
  • Maximilian Ehrhart
  • Benedikt Dornauer
  • Eduard Paul Enoiu

Paper Information

  • arXiv ID: 2512.08706v1
  • Categories: cs.SE
  • Published: December 9, 2025