[Paper] When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Source: arXiv - 2604.24712v1
Overview
Large language models (LLMs) are now a staple of automatic code generation, but the quality of the produced code hinges heavily on how the programming task is described. This paper investigates a surprising twist: leaving parts of the prompt vague (under‑specification) can sometimes improve the correctness of the generated code, especially when the prompt contains rich, redundant information. The authors compare two benchmark families, the terse HumanEval suite and the more verbose LiveCodeBench, to show that prompt structure matters far more than raw model capability.
Key Contributions
- Empirical comparison across 10 LLMs on both minimal (HumanEval) and structurally rich (LiveCodeBench) code‑generation benchmarks.
- Demonstration that under‑specification is not uniformly harmful: the same vague prompt mutation that hurts performance on HumanEval has almost no impact on LiveCodeBench.
- Discovery that certain under‑specifications can increase correctness by breaking misleading lexical or structural cues that otherwise trigger wrong, retrieval‑style solutions.
- Manual taxonomy of beneficial prompt modifications, e.g., removing over‑fitted terminology, discarding spurious constraints, and eliminating identifier triggers.
- Practical guidelines for writing robust prompts that leverage redundancy to mitigate brittleness and even boost accuracy.
Methodology
Benchmarks
- HumanEval: 164 short Python functions with a single natural‑language description and a few test cases—minimal redundancy.
- LiveCodeBench: 1,000+ tasks that include a detailed problem statement, explicit constraints, example I/O pairs, and sometimes a high‑level algorithmic hint—high redundancy.
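To make the contrast concrete, here is a hypothetical pair of prompts in the two styles (illustrative only; neither is an actual benchmark task):

```python
# Hypothetical prompts, illustrative of the two benchmark styles only;
# neither is an actual HumanEval or LiveCodeBench task.

# Minimal, HumanEval-style: one short description, little redundancy.
minimal_prompt = (
    "def second_largest(nums):\n"
    '    """Return the second largest distinct value in nums."""\n'
)

# Rich, LiveCodeBench-style: the same task restated as a description,
# explicit constraints, and worked examples, so a vague or missing piece
# can often be recovered from the other parts.
rich_prompt = (
    "Problem: Given a list of integers, return the second largest "
    "distinct value.\n"
    "\n"
    "Constraints:\n"
    "- 2 <= len(nums) <= 10**5\n"
    "- nums contains at least two distinct values\n"
    "\n"
    "Examples:\n"
    "second_largest([3, 1, 4, 4, 2]) -> 3\n"
    "second_largest([5, 5, 1]) -> 1\n"
)
```

The rich variant restates the task several ways; this is the redundancy the paper credits with absorbing missing or vague details.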
Prompt Mutations
- Systematically removed or paraphrased parts of the prompt (e.g., omitted constraints, shortened descriptions, altered variable names).
- Created “under‑specified” variants that deliberately left out information that would normally be present.
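A rough sketch of what such a mutation could look like on a sectioned, LiveCodeBench-style prompt; the function and the transformations are illustrative assumptions, not the authors' actual tooling:

```python
import re

def under_specify(prompt: str,
                  drop_constraints: bool = True,
                  drop_examples: bool = False,
                  rename_identifiers: bool = False) -> str:
    """Produce an under-specified variant of a rich, sectioned prompt.

    Rough sketch only; the paper's actual mutation operators may differ.
    """
    out, skipping = [], False
    for line in prompt.splitlines():
        header = line.strip().lower()
        if (drop_constraints and header.startswith("constraints")) or \
           (drop_examples and header.startswith("example")):
            skipping = True          # start dropping this section
            continue
        if skipping and not line.strip():
            skipping = False         # a blank line ends the dropped section
            continue
        if skipping:
            continue
        out.append(line)
    mutated = "\n".join(out)
    if rename_identifiers:
        # Swap a suggestive identifier for a neutral one (illustrative).
        mutated = re.sub(r"\bnums\b", "xs", mutated)
    return mutated

# Example: strip the constraints section from a rich prompt.
rich = (
    "Problem: Return the second largest distinct value in nums.\n"
    "\n"
    "Constraints:\n"
    "- 2 <= len(nums) <= 10**5\n"
    "\n"
    "Examples:\n"
    "second_largest([3, 1, 4, 4, 2]) -> 3\n"
)
print(under_specify(rich))
```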
Models Tested
- Ten state‑of‑the‑art LLMs ranging from open‑source (LLaMA‑2, Mistral) to closed‑source (GPT‑4, Claude).
Evaluation
- Generated code for each prompt variant, then ran the official test suites.
- Measured pass@k (k = 1, 10, 100) and recorded any change in correctness relative to the original prompt (the standard estimator is sketched after this list).
- Conducted a qualitative analysis on a subset of cases where under‑specification helped to uncover the underlying mechanisms.
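For reference, pass@k is normally computed with the unbiased estimator introduced alongside HumanEval; the minimal sketch below assumes that estimator, though the paper's exact computation may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, given n
    generations per prompt of which c pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 100 generations for one prompt, 37 of which pass the official tests
print(pass_at_k(n=100, c=37, k=1))   # ≈ 0.37
print(pass_at_k(n=100, c=37, k=10))  # ≈ 0.99
```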
Results & Findings
| Benchmark | Typical impact of under‑specification | Notable positive effect |
|---|---|---|
| HumanEval | ↓ Pass@1 by 12‑18 % on average; models become more prone to “hallucinated” solutions. | Rare; only 2 % of mutated prompts showed any gain. |
| LiveCodeBench | Near‑zero net change (±1 %); redundancy absorbs missing details. | ↑ Pass@1 by up to 7 % for several models when misleading constraints are stripped. |
- Redundancy shields against brittleness: multiple descriptions (natural language, constraints, examples) allow the model to infer missing pieces.
- Misleading cues are a bigger problem than missing information: certain lexical patterns (e.g., “use a stack”) trigger retrieval of an incorrect template; removing those cues lets the model reason from first principles (see the illustrative pair after this list).
- Model‑agnostic trend: both open‑source and proprietary models exhibited the same pattern, suggesting the phenomenon is tied to prompt design rather than model size.
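As a hypothetical illustration of the misleading-cue effect (neither prompt is taken from the paper's data):

```python
# Hypothetical prompt pair illustrating the "misleading cue" finding;
# neither prompt comes from the paper's benchmarks.

# The hint resembles the textbook "valid parentheses" problem, so it may
# trigger a memorized template that ignores the extra rule in the last clause.
with_cue = (
    "Use a stack to decide whether a string of '(' and ')' is balanced, "
    "except that nesting deeper than two levels is not allowed."
)

# Under-specified variant: stripping the hint removes the lexical trigger,
# so the model has to account for the depth restriction explicitly.
without_cue = (
    "Decide whether a string of '(' and ')' is balanced, "
    "except that nesting deeper than two levels is not allowed."
)
```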
Practical Implications
- Prompt design matters more than model size for many real‑world coding assistants. Teams can achieve higher reliability by embedding multiple, overlapping specifications (description, constraints, examples).
- Deliberate under‑specification can be a debugging tool: if a generated solution repeatedly fails, try stripping away overly specific wording that might be nudging the model toward a wrong pattern.
- Tooling opportunity: IDE plugins could automatically suggest “robustified” prompts, adding redundant constraints or removing potentially misleading keywords, to improve code‑generation success rates; a minimal sketch of the keyword‑flagging idea follows this list.
- Testing pipelines: when evaluating LLM‑based code generators, include both minimal and rich prompt variants to get a realistic picture of robustness.
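A minimal sketch of the tooling idea above: a plugin could start by simply flagging hint-like phrases for the developer to review. The phrase list and heuristic are assumptions for illustration, not something the paper proposes or evaluates.

```python
# Assumed phrase list; a real plugin would learn or curate these.
HINT_PHRASES = (
    "use a stack", "use recursion", "use dynamic programming",
    "in a single pass", "in o(1) space",
)

def flag_misleading_cues(prompt: str) -> list[str]:
    """Return hint-like phrases found in the prompt, so a developer can
    decide whether each hint is genuinely required or safe to strip."""
    lowered = prompt.lower()
    return [p for p in HINT_PHRASES if p in lowered]

print(flag_misleading_cues(
    "Use a stack to find the first non-repeating character in a string."
))  # ['use a stack']
```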
Limitations & Future Work
- The study focuses on Python and on benchmark‑style tasks; applicability to other languages or large‑scale software engineering problems remains to be validated.
- Under‑specification was explored only through removal or simple paraphrasing; more sophisticated prompt transformations (e.g., multi‑turn dialogue) could yield different dynamics.
- The authors note that while redundancy helps, it also increases prompt length, which may hit token limits for some models. Future research could explore optimal redundancy levels or compression techniques.
Bottom line for developers: crafting prompts with built‑in redundancy and being mindful of potentially misleading terminology can make LLM‑driven code generation more reliable—and sometimes even more correct—than a perfectly worded but overly terse instruction. Use these insights to fine‑tune your prompts, build smarter tooling, and set realistic expectations when integrating LLMs into your development workflow.
Authors
- Amal Akli
- Mike Papadakis
- Maxime Cordy
- Yves Le Traon
Paper Information
- arXiv ID: 2604.24712v1
- Categories: cs.SE
- Published: April 27, 2026