[Paper] Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
Source: arXiv - 2604.24703v1
Overview
The paper Defective Task Descriptions in LLM‑Based Code Generation: Detection and Analysis examines a hidden but critical problem: when developers feed large language models (LLMs) with vague, incomplete, or malformed task specifications, the generated code often fails. The authors introduce SpecValidator, a lightweight classifier that automatically spots three common defect types in task descriptions, and they show how fixing these defects can dramatically improve code‑generation reliability.
Key Contributions
- SpecValidator: a small model, fine‑tuned with parameter‑efficient adapters, that detects lexical vagueness, under‑specification, and syntax/formatting errors in natural‑language programming prompts.
- Comprehensive benchmark evaluation across three datasets with varying description complexity, demonstrating robust detection performance (F1 = 0.804, MCC = 0.745).
- Empirical analysis showing that LLM code‑generation robustness hinges more on prompt quality than on model size; under‑specification is the most damaging defect.
- Generalisation evidence: SpecValidator can uncover previously unknown under‑specification issues in real‑world benchmark prompts.
- Insightful guidelines on how richer contextual grounding (e.g., LiveCodeBench) mitigates the impact of defective prompts.
Methodology
1. Defect Taxonomy – The authors define three concrete defect categories:
   - Lexical Vagueness: ambiguous wording, synonyms, or missing key terms.
   - Under‑Specification: missing essential constraints, inputs, or expected behavior.
   - Syntax‑Formatting: malformed markdown, code fences, or broken bullet lists.
2. Data Collection – Existing code‑generation benchmarks (e.g., HumanEval, MBPP, LiveCodeBench) are manually annotated to label each description with the defect(s) it contains.
3. Model Design – A compact transformer (≈ 80 M parameters) is fine‑tuned with parameter‑efficient adapters (LoRA) on the annotated data, keeping training cost low while preserving the base model’s knowledge.
4. Evaluation Protocol –
   - Metrics: F1‑score and Matthews Correlation Coefficient (MCC), capturing both the precision/recall balance and correlation with the true labels.
   - Baselines: GPT‑5‑mini and Claude Sonnet 4 prompted to perform the same binary classification.
   - Cross‑benchmark tests to assess generalisation to unseen description styles.
5. Impact Study – The authors feed defective vs. cleaned prompts to several code‑generation LLMs (including GPT‑4, Claude 2, and the open‑source CodeLlama) and measure the change in pass@k scores.
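The evaluation metrics above are easy to compute directly. As a minimal sketch (not the paper's code), F1 and MCC follow from confusion‑matrix counts, and pass@k can use the standard unbiased estimator popularized by the HumanEval benchmark: from n generated samples of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k).

```python
import math

def f1_and_mcc(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """F1 and Matthews Correlation Coefficient from confusion-matrix counts."""
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    sampled from n generated (c of them correct) passes the tests."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct completion
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

MCC is the stricter of the two scores here: unlike F1, it also rewards correct negatives, which is why the gap between SpecValidator and the prompted baselines is wider under MCC than under F1.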
Results & Findings
| Model / Setting | F1 | MCC |
|---|---|---|
| SpecValidator (proposed) | 0.804 | 0.745 |
| GPT‑5‑mini (prompt‑based) | 0.469 | 0.281 |
| Claude Sonnet 4 (prompt‑based) | 0.518 | 0.359 |
- Detection superiority: SpecValidator outperforms two state‑of‑the‑art LLMs by a large margin, despite being far smaller.
- Generalisation: When evaluated on unseen description styles, the classifier still maintains > 0.75 F1, and it discovers hidden under‑specification defects in the original benchmark prompts.
- Effect on code generation: Cleaning just 10 % of under‑specified prompts raises GPT‑4’s pass@1 on HumanEval by ~ 12 percentage points, confirming that prompt quality is a bottleneck.
- Defect severity hierarchy: Under‑specification → Lexical vagueness → Syntax/formatting (most to least harmful).
- Benchmark resilience: Datasets with richer context (e.g., LiveCodeBench includes file‑level scaffolding) suffer less performance drop when prompts are defective.
Practical Implications
- Integrate a pre‑flight validator: Teams can embed SpecValidator (or a similar lightweight classifier) into CI pipelines or IDE extensions to flag ambiguous or incomplete task descriptions before they reach the LLM.
- Prompt‑authoring best practices: The taxonomy gives concrete checklist items—ensure all inputs/outputs are enumerated, avoid vague adjectives, and keep markdown syntax clean.
- Cost savings: By catching defects early, developers avoid costly “trial‑and‑error” LLM calls, reducing API usage and speeding up prototyping.
- Tooling for education platforms: Automated homework graders that rely on LLMs can use SpecValidator to guarantee that student‑written problem statements are well‑formed, leading to fairer grading.
- Model‑agnostic benefit: Since the defect impact is largely independent of LLM size, even smaller, on‑premise models (e.g., CodeLlama‑7B) gain reliability when paired with a prompt validator.
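SpecValidator itself is a fine‑tuned transformer, but the pre‑flight idea can be illustrated with a purely heuristic stand‑in. The vague‑word list and section checks below are illustrative assumptions, not the paper's method; they merely show where such a validator would sit in a pipeline, flagging each of the three defect types before a prompt reaches the LLM.

```python
# Illustrative stand-in for a learned validator: cheap heuristics that
# flag the paper's three defect types in a task description.
VAGUE_TERMS = {"somehow", "etc", "appropriate", "nice", "fast", "properly"}

def preflight_check(description: str) -> list[str]:
    """Return a list of warnings for a natural-language task description."""
    warnings = []
    words = {w.lower().strip(".,") for w in description.split()}
    if words & VAGUE_TERMS:
        warnings.append("lexical-vagueness: vague wording detected")
    lower = description.lower()
    if "input" not in lower or "output" not in lower:
        warnings.append("under-specification: inputs/outputs not enumerated")
    if description.count("```") % 2 != 0:
        warnings.append("syntax-formatting: unclosed code fence")
    return warnings
```

In a CI pipeline or IDE extension, a non‑empty warning list would block or annotate the prompt; a real deployment would swap these heuristics for the trained classifier.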
Limitations & Future Work
- Scope of defects: The study focuses on three defect types; real‑world prompts may exhibit logical contradictions or domain‑specific ambiguities not covered.
- Dataset bias: Annotated benchmarks are primarily English‑centric and derived from academic sources; industrial code bases with mixed languages may present new challenges.
- Dynamic prompts: The classifier works on static text; interactive or multi‑turn prompt engineering (e.g., chat‑style refinement) is not yet addressed.
- Future directions suggested by the authors include expanding the defect taxonomy, training multilingual validators, and exploring closed‑loop systems where the LLM itself suggests prompt refinements after a defect is detected.
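The closed‑loop direction the authors suggest can be sketched as a detect‑refine‑retry loop. In the sketch below, `detect_defects`, `refine_prompt`, and `generate_code` are hypothetical stubs standing in for the validator and the LLM calls; only the control flow is the point.

```python
from typing import Callable

def closed_loop_generate(
    prompt: str,
    detect_defects: Callable[[str], list[str]],
    refine_prompt: Callable[[str, list[str]], str],
    generate_code: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    """Detect prompt defects, request refinements, then generate code."""
    for _ in range(max_rounds):
        defects = detect_defects(prompt)
        if not defects:
            break
        # e.g., an LLM rewrite step conditioned on the detected defects
        prompt = refine_prompt(prompt, defects)
    return generate_code(prompt)
```

The `max_rounds` cap keeps API cost bounded when the validator and the refiner disagree and the loop would otherwise never converge.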
Authors
- Amal Akli
- Mike Papadakis
- Maxime Cordy
- Yves Le Traon
Paper Information
- arXiv ID: 2604.24703v1
- Categories: cs.SE, cs.AI
- Published: April 27, 2026