[Paper] Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

Published: April 27, 2026 at 01:07 PM EDT
4 min read
Source: arXiv - 2604.24703v1

Overview

The paper Defective Task Descriptions in LLM‑Based Code Generation: Detection and Analysis examines a hidden but critical problem: when developers feed large language models (LLMs) vague, incomplete, or malformed task specifications, the generated code often fails. The authors introduce SpecValidator, a lightweight classifier that automatically spots three common defect types in task descriptions, and they show how fixing these defects can dramatically improve code‑generation reliability.

Key Contributions

  • SpecValidator: a small model, fine‑tuned with parameter‑efficient methods, that detects lexical vagueness, under‑specification, and syntax/formatting errors in natural‑language programming prompts.
  • Comprehensive benchmark evaluation across three datasets with varying description complexity, demonstrating robust detection performance (F1 = 0.804, MCC = 0.745).
  • Empirical analysis showing that LLM code‑generation robustness hinges more on prompt quality than on model size; under‑specification is the most damaging defect.
  • Generalisation evidence: SpecValidator can uncover previously unknown under‑specification issues in real‑world benchmark prompts.
  • Guidance on how richer contextual grounding (e.g., LiveCodeBench) mitigates the impact of defective prompts.

Methodology

  1. Defect Taxonomy – The authors define three concrete defect categories:

    • Lexical Vagueness: ambiguous wording, synonyms, or missing key terms.
    • Under‑Specification: missing essential constraints, inputs, or expected behavior.
    • Syntax‑Formatting: malformed markdown, code fences, or broken bullet lists.
  2. Data Collection – Existing code‑generation benchmarks (e.g., HumanEval, MBPP, LiveCodeBench) are annotated manually to label each description with the defect(s) it contains.

  3. Model Design – A compact transformer (≈ 80 M parameters) is fine‑tuned using parameter‑efficient adapters (LoRA) on the annotated data, keeping training cost low while preserving the base model’s knowledge; a minimal sketch of this setup follows the list.

  4. Evaluation Protocol

    • Metrics: F1‑score and Matthews Correlation Coefficient (MCC) to capture both precision/recall balance and correlation with true labels.
    • Baselines: GPT‑5‑mini and Claude Sonnet 4 prompted to perform the same binary classification.
    • Cross‑benchmark tests to assess generalisation to unseen description styles.
  5. Impact Study – The authors feed defective vs. cleaned prompts to several code‑generation LLMs (including GPT‑4, Claude 2, and open‑source CodeLlama) and measure the change in pass@k scores.
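
The paper does not include training code, so the following is a minimal sketch of the described setup, assuming a Hugging Face encoder with PEFT/LoRA adapters. The checkpoint name, label encoding, and hyperparameters are illustrative stand‑ins, not the authors' exact configuration.

```python
# Illustrative sketch: checkpoint and hyperparameters are assumptions,
# not the authors' published configuration.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

DEFECT_LABELS = ["lexical_vagueness", "under_specification", "syntax_formatting"]

base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",                  # ~66M params, close to the ~80M scale
    num_labels=len(DEFECT_LABELS),
    problem_type="multi_label_classification",  # a prompt may carry several defects
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

lora_cfg = LoraConfig(
    task_type="SEQ_CLS",                # keep the classification head trainable
    r=8,                                # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],  # DistilBERT attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()      # typically well under 1% of base weights
```

Freezing the base weights and training only the low‑rank adapters is what keeps training cost low while preserving the pretrained model's knowledge, as the paper notes.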

Results & Findings

Model / Setting                 F1      MCC
SpecValidator (proposed)        0.804   0.745
GPT‑5‑mini (prompt‑based)       0.469   0.281
Claude Sonnet 4 (prompt‑based)  0.518   0.359
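
For context, MCC uses all four confusion‑matrix cells and therefore stays informative under class imbalance, which F1 alone can mask. Both numbers are straightforward to reproduce from predictions with scikit‑learn (the labels below are toy values, not the paper's data):

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Toy labels: 1 = defective description, 0 = clean (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))           # 0.75 - precision/recall balance
print(matthews_corrcoef(y_true, y_pred))  # 0.50 - correlation over all four cells
```
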
  • Detection superiority: SpecValidator outperforms two state‑of‑the‑art LLMs by a large margin, despite being far smaller.
  • Generalisation: When evaluated on unseen description styles, the classifier still maintains > 0.75 F1, and it discovers hidden under‑specification defects in the original benchmark prompts.
  • Effect on code generation: Cleaning just 10 % of under‑specified prompts raises GPT‑4’s pass@1 on HumanEval by ~ 12 percentage points, confirming that prompt quality is a bottleneck (the pass@k metric itself is sketched after this list).
  • Defect severity hierarchy: Under‑specification → Lexical vagueness → Syntax/formatting (most to least harmful).
  • Benchmark resilience: Datasets with richer context (e.g., LiveCodeBench includes file‑level scaffolding) suffer less performance drop when prompts are defective.
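
The pass@k numbers in the impact study can be read via the standard unbiased estimator from Chen et al. (2021); whether the authors use exactly this estimator is our assumption, but it is the de facto convention. A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of
    k samples, drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a correct one is guaranteed
    # Numerically stable form of 1 - C(n - c, k) / C(n, k)
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

print(pass_at_k(n=20, c=5, k=1))  # 0.25: pass@1 reduces to the raw success rate
```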

Practical Implications

  • Integrate a pre‑flight validator: Teams can embed SpecValidator (or a similar lightweight classifier) into CI pipelines or IDE extensions to flag ambiguous or incomplete task descriptions before they reach the LLM (a hypothetical hook is sketched after this list).
  • Prompt‑authoring best practices: The taxonomy gives concrete checklist items—ensure all inputs/outputs are enumerated, avoid vague adjectives, and keep markdown syntax clean.
  • Cost savings: By catching defects early, developers avoid costly “trial‑and‑error” LLM calls, reducing API usage and speeding up prototyping.
  • Tooling for education platforms: Automated homework graders that rely on LLMs can use SpecValidator to guarantee that student‑written problem statements are well‑formed, leading to fairer grading.
  • Model‑agnostic benefit: Since the defect impact is largely independent of LLM size, even smaller, on‑premise models (e.g., CodeLlama‑7B) gain reliability when paired with a prompt validator.
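
To make the pre‑flight idea concrete, a CI gate could look like the sketch below. Everything here is hypothetical: SpecValidator has no published API that this summary is aware of, so `classify` is a stand‑in for whatever defect classifier a team wires in.

```python
# Hypothetical CI gate; `classify` is an assumed stand-in returning
# per-label probabilities, not a published SpecValidator API.
import sys
from typing import Callable, Dict, List

DEFECTS = ("lexical_vagueness", "under_specification", "syntax_formatting")

def preflight(prompt: str,
              classify: Callable[[str], Dict[str, float]],
              threshold: float = 0.5) -> List[str]:
    """Return the defect labels flagged for a task description."""
    scores = classify(prompt)
    return [d for d in DEFECTS if scores.get(d, 0.0) >= threshold]

if __name__ == "__main__":
    task_description = sys.stdin.read()
    flagged = preflight(task_description, classify=lambda p: {})  # wire in a real model
    if flagged:
        print("Defective task description:", ", ".join(flagged))
        sys.exit(1)  # fail the pipeline before any LLM tokens are spent
```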

Limitations & Future Work

  • Scope of defects: The study focuses on three defect types; real‑world prompts may exhibit logical contradictions or domain‑specific ambiguities not covered.
  • Dataset bias: Annotated benchmarks are primarily English‑centric and derived from academic sources; industrial code bases with mixed languages may present new challenges.
  • Dynamic prompts: The classifier works on static text; interactive or multi‑turn prompt engineering (e.g., chat‑style refinement) is not yet addressed.
  • Future directions suggested by the authors include expanding the defect taxonomy, training multilingual validators, and exploring closed‑loop systems where the LLM itself suggests prompt refinements after a defect is detected.

Authors

  • Amal Akli
  • Mike Papadakis
  • Maxime Cordy
  • Yves Le Traon

Paper Information

  • arXiv ID: 2604.24703v1
  • Categories: cs.SE, cs.AI
  • Published: April 27, 2026