[Paper] Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
Source: arXiv - 2604.24703v1
Overview
The paper Defective Task Descriptions in LLM‑Based Code Generation: Detection and Analysis examines a hidden but critical problem: when developers feed large language models (LLMs) with vague, incomplete, or malformed task specifications, the generated code often fails. The authors introduce SpecValidator, a lightweight classifier that automatically spots three common defect types in task descriptions, and they show how fixing these defects can dramatically improve code‑generation reliability.
Key Contributions
- SpecValidator: a small model, fine‑tuned with parameter‑efficient adapters, that detects lexical vagueness, under‑specification, and syntax/formatting errors in natural‑language programming prompts.
- Comprehensive benchmark evaluation across three datasets with varying description complexity, demonstrating robust detection performance (F1 = 0.804, MCC = 0.745).
- Empirical analysis showing that LLM code‑generation robustness hinges more on prompt quality than on model size; under‑specification is the most damaging defect.
- Generalisation evidence: SpecValidator can uncover previously unknown under‑specification issues in real‑world benchmark prompts.
- Insightful guidelines on how richer contextual grounding (e.g., LiveCodeBench) mitigates the impact of defective prompts.
Methodology
1. Defect Taxonomy – The authors define three concrete defect categories:
   - Lexical Vagueness: ambiguous wording, synonyms, or missing key terms.
   - Under‑Specification: missing essential constraints, inputs, or expected behavior.
   - Syntax‑Formatting: malformed markdown, code fences, or broken bullet lists.
2. Data Collection – Existing code‑generation benchmarks (e.g., HumanEval, MBPP, LiveCodeBench) are manually annotated to label each description with the defect(s) it contains.
3. Model Design – A compact transformer (≈ 80 M parameters) is fine‑tuned with parameter‑efficient adapters (LoRA) on the annotated data, keeping training cost low while preserving the base model’s knowledge.
4. Evaluation Protocol –
   - Metrics: F1‑score and Matthews Correlation Coefficient (MCC), capturing both the precision/recall balance and correlation with the true labels.
   - Baselines: GPT‑5‑mini and Claude Sonnet 4 prompted to perform the same binary classification.
   - Cross‑benchmark tests to assess generalisation to unseen description styles.
5. Impact Study – The authors feed defective vs. cleaned prompts to several code‑generation LLMs (including GPT‑4, Claude 2, and the open‑source CodeLlama) and measure the change in pass@k scores.
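The evaluation metrics above are easy to compute directly. As a minimal sketch (not the paper's code), F1 and MCC follow from confusion‑matrix counts, and pass@k can use the standard unbiased estimator popularized by the HumanEval benchmark: from n generated samples of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k).

```python
import math

def f1_and_mcc(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """F1 and Matthews Correlation Coefficient from confusion-matrix counts."""
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    sampled from n generated (c of them correct) passes the tests."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct completion
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

MCC is the stricter of the two scores here: unlike F1, it also rewards correct negatives, which is why the gap between SpecValidator and the prompted baselines is wider under MCC than under F1.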
Results & Findings
| Model / Setting | F1 | MCC |
|---|---|---|
| SpecValidator (proposed) | 0.804 | 0.745 |
| GPT‑5‑mini (prompt‑based) | 0.469 | 0.281 |
| Claude Sonnet 4 (prompt‑based) | 0.518 | 0.359 |
- Detection superiority: SpecValidator outperforms two state‑of‑the‑art LLMs by a large margin, despite being far smaller.
- Generalisation: When evaluated on unseen description styles, the classifier still maintains > 0.75 F1, and it discovers hidden under‑specification defects in the original benchmark prompts.
- Effect on code generation: Cleaning just 10 % of under‑specified prompts raises GPT‑4’s pass@1 on HumanEval by ~ 12 percentage points, confirming that prompt quality is a bottleneck.
- Defect severity hierarchy: Under‑specification → Lexical vagueness → Syntax/formatting (most to least harmful).
- Benchmark resilience: Datasets with richer context (e.g., LiveCodeBench includes file‑level scaffolding) suffer less performance drop when prompts are defective.
Practical Implications
- Integrate a pre‑flight validator: Teams can embed SpecValidator (or a similar lightweight classifier) into CI pipelines or IDE extensions to flag ambiguous or incomplete task descriptions before they reach the LLM.
- Prompt‑authoring best practices: The taxonomy gives concrete checklist items—ensure all inputs/outputs are enumerated, avoid vague adjectives, and keep markdown syntax clean.
- Cost savings: By catching defects early, developers avoid costly “trial‑and‑error” LLM calls, reducing API usage and speeding up prototyping.
- Tooling for education platforms: Automated homework graders that rely on LLMs can use SpecValidator to guarantee that student‑written problem statements are well‑formed, leading to fairer grading.
- Model‑agnostic benefit: Since the defect impact is largely independent of LLM size, even smaller, on‑premise models (e.g., CodeLlama‑7B) gain reliability when paired with a prompt validator.
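SpecValidator itself is a fine‑tuned transformer, but the pre‑flight idea can be illustrated with a purely heuristic stand‑in. The vague‑word list and section checks below are illustrative assumptions, not the paper's method; they merely show where such a validator would sit in a pipeline, flagging each of the three defect types before a prompt reaches the LLM.

```python
# Illustrative stand-in for a learned validator: cheap heuristics that
# flag the paper's three defect types in a task description.
VAGUE_TERMS = {"somehow", "etc", "appropriate", "nice", "fast", "properly"}

def preflight_check(description: str) -> list[str]:
    """Return a list of warnings for a natural-language task description."""
    warnings = []
    words = {w.lower().strip(".,") for w in description.split()}
    if words & VAGUE_TERMS:
        warnings.append("lexical-vagueness: vague wording detected")
    lower = description.lower()
    if "input" not in lower or "output" not in lower:
        warnings.append("under-specification: inputs/outputs not enumerated")
    if description.count("```") % 2 != 0:
        warnings.append("syntax-formatting: unclosed code fence")
    return warnings
```

In a CI pipeline or IDE extension, a non‑empty warning list would block or annotate the prompt; a real deployment would swap these heuristics for the trained classifier.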
Limitations & Future Work
- Scope of defects: The study focuses on three defect types; real‑world prompts may exhibit logical contradictions or domain‑specific ambiguities not covered.
- Dataset bias: Annotated benchmarks are primarily English‑centric and derived from academic sources; industrial code bases with mixed languages may present new challenges.
- Dynamic prompts: The classifier works on static text; interactive or multi‑turn prompt engineering (e.g., chat‑style refinement) is not yet addressed.
- Future directions suggested by the authors include expanding the defect taxonomy, training multilingual validators, and exploring closed‑loop systems where the LLM itself suggests prompt refinements after a defect is detected.
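The closed‑loop direction the authors suggest can be sketched as a detect‑refine‑retry loop. In the sketch below, `detect_defects`, `refine_prompt`, and `generate_code` are hypothetical stubs standing in for the validator and the LLM calls; only the control flow is the point.

```python
from typing import Callable

def closed_loop_generate(
    prompt: str,
    detect_defects: Callable[[str], list[str]],
    refine_prompt: Callable[[str, list[str]], str],
    generate_code: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    """Detect prompt defects, request refinements, then generate code."""
    for _ in range(max_rounds):
        defects = detect_defects(prompt)
        if not defects:
            break
        # e.g., an LLM rewrite step conditioned on the detected defects
        prompt = refine_prompt(prompt, defects)
    return generate_code(prompt)
```

The `max_rounds` cap keeps API cost bounded when the validator and the refiner disagree and the loop would otherwise never converge.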
Authors
- Amal Akli
- Mike Papadakis
- Maxime Cordy
- Yves Le Traon
Paper Information
- arXiv ID: 2604.24703v1
- Categories: cs.SE, cs.AI
- Published: April 27, 2026