[Paper] Automated Test Suite Enhancement Using Large Language Models with Few-shot Prompting

Published: February 12, 2026 at 01:42 PM EST
4 min read
Source: arXiv - 2602.12256v1

Overview

The paper explores how large language models (LLMs) like GPT‑4o can be coaxed into writing better unit tests by feeding them a handful of example tests (few‑shot prompting). By comparing human‑written, search‑based (SBST), and LLM‑generated examples, the authors show that the right mix of prompts can boost test correctness, coverage, and readability—key qualities for real‑world codebases that already blend human and AI‑produced tests.

Key Contributions

  • Few‑shot prompting framework for unit‑test generation that works with heterogeneous example sources (human, SBST, LLM).
  • Empirical evaluation on two benchmark suites (HumanEval & ClassEval) using GPT‑4o, the model behind GitHub Copilot.
  • Retrieval‑based example selection method that ranks candidate prompts by combined similarity of problem description and source code.
  • Multi‑dimensional quality metrics (correctness, coverage, readability, cognitive complexity, maintainability) rather than just pass/fail.
  • Evidence that human‑written examples consistently yield the highest coverage and correctness, while similarity‑based retrieval gives the most robust few‑shot prompts across all sources.

Methodology

  1. Dataset preparation – The authors used HumanEval (Python functions) and ClassEval (Java classes) as test beds. Each target function/class had a natural‑language description and a reference implementation.
  2. Prompt construction – For each target, a prompt was built that (a) described the problem, (b) included k example tests (k = 1‑3) drawn from one of three pools: human‑written, SBST‑generated, or LLM‑generated.
  3. Example retrieval – A lightweight retrieval pipeline computed a similarity score between the target’s description+code and each candidate example, selecting the top‑scoring examples for the prompt.
  4. Test generation – GPT‑4o was invoked with the constructed prompt to produce a new unit test.
  5. Evaluation metrics – Generated tests were run against the reference implementation to measure:
    • Correctness (does the test pass/fail as expected)
    • Code coverage (statement/branch coverage)
    • Readability (lexical and structural metrics)
    • Cognitive complexity (nesting, branching)
    • Maintainability (size, naming consistency)
  6. Statistical analysis – Results across prompt types and retrieval strategies were compared using paired t‑tests and effect‑size calculations.
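
Steps 2 and 3 above can be sketched in code. The paper's exact scoring function is not reproduced here; this is a minimal illustration assuming the simple TF-IDF + code-token-overlap similarity mentioned in the Limitations section, with an equal 0.5/0.5 weighting between description and code similarity (an assumption, not the paper's tuned value). All function and field names are hypothetical.

```python
# Hypothetical sketch of similarity-based example retrieval (methodology step 3).
# Assumes TF-IDF cosine similarity over descriptions plus Jaccard overlap of
# code tokens; the paper's actual metric may differ.
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word/identifier tokens."""
    return re.findall(r"[A-Za-z_]\w*", text.lower())


def tfidf_cosine(a, b, corpus):
    """Cosine similarity between two token lists under corpus IDF weights."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    n = len(corpus)

    def vec(doc):
        tf = Counter(doc)
        return {t: tf[t] * math.log((n + 1) / (df[t] + 1)) for t in tf}

    va, vb = vec(a), vec(b)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def token_overlap(a, b):
    """Jaccard overlap of code tokens."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_examples(target, candidates, k=3, w_desc=0.5):
    """Rank candidate example tests against the target and keep the top k.

    Each candidate is a dict with "description" and "code" keys; the combined
    score weights description similarity and code overlap equally (assumed).
    """
    corpus = [tokens(c["description"]) for c in candidates]
    t_desc, t_code = tokens(target["description"]), tokens(target["code"])

    def score(c):
        s_desc = tfidf_cosine(t_desc, tokens(c["description"]), corpus)
        s_code = token_overlap(t_code, tokens(c["code"]))
        return w_desc * s_desc + (1 - w_desc) * s_code

    return sorted(candidates, key=score, reverse=True)[:k]
```

The selected examples would then be spliced into the prompt (step 2) ahead of the target's description and source code.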

Results & Findings

| Prompt source  | Avg. correctness ↑ | Avg. coverage ↑ | Readability score ↑ |
|----------------|--------------------|-----------------|---------------------|
| Human examples | 92 %               | 84 %            | 0.78                |
| LLM examples   | 86 %               | 78 %            | 0.71                |
| SBST examples  | 78 %               | 70 %            | 0.65                |
  • Human‑written examples consistently produced the highest correctness and coverage, confirming that developers’ intuition still beats automated generators for guiding LLMs.
  • Similarity‑based retrieval (combining description‑code similarity) outperformed random or purely description‑based selection, yielding a 5‑7 % lift in coverage across all source pools.
  • Readability & maintainability of generated tests were comparable to human‑written tests when the prompt included at least two human examples, suggesting that few‑shot prompting can mitigate the “synthetic” feel of pure SBST output.
  • The approach required only a few seconds per test on commodity cloud GPUs, making it practical for CI pipelines.

Practical Implications

  • CI/CD integration – Teams can augment existing test suites automatically during pull‑request checks by pulling a handful of recent human tests from the same repository as few‑shot examples.
  • Test‑suite maintenance – Generated tests inherit the naming conventions and style of the supplied examples, reducing the need for post‑generation linting.
  • Hybrid codebases – Projects that already mix Copilot‑generated tests with hand‑written ones can now systematically improve the AI‑generated portion without manual curation.
  • Developer productivity – Instead of writing every unit test from scratch, a developer can prompt the LLM with a few relevant examples and obtain a ready‑to‑run test in seconds, freeing time for higher‑level design work.
  • Tooling roadmap – IDE plugins (e.g., Copilot, IntelliJ) could expose a “Boost Test Suite” button that automatically selects the most similar existing tests and runs a few‑shot generation pass.

Limitations & Future Work

  • Dataset scope – Experiments were limited to Python (HumanEval) and Java (ClassEval); results may differ for other languages or large‑scale monorepos.
  • Prompt size – The study capped examples at three; scaling to larger prompt windows could further improve quality but may hit token limits.
  • Retrieval simplicity – The similarity metric is a basic TF‑IDF + code token overlap; more sophisticated embeddings (e.g., CodeBERT) could yield better example selection.
  • Human evaluation – Readability and maintainability were measured with automated metrics; a user study would validate perceived quality.
  • Security considerations – Automatically generated tests may inadvertently expose internal APIs; future work should explore safe‑guarding mechanisms.

Overall, the paper demonstrates that a modest amount of well‑chosen example tests can dramatically lift the usefulness of LLM‑generated unit tests, offering a practical pathway for developers to blend AI assistance into their everyday testing workflow.

Authors

  • Alex Chudic
  • Gül Çalıklı

Paper Information

  • arXiv ID: 2602.12256v1
  • Categories: cs.SE
  • Published: February 12, 2026
