[Paper] Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Published: February 24, 2026 at 01:43 PM EST
Source: arXiv (2602.21189v1)

Overview

The paper investigates a puzzling phenomenon that many practitioners have observed when fine‑tuning large language models (LLMs) for verifiable tasks such as code generation, math problem solving, or short‑answer QA. Optimizing for the popular Pass@k metric (success if any of k sampled outputs passes a verifier) often boosts Pass@k scores but simultaneously hurts Pass@1—the success rate of a single generated answer. Since real‑world deployments usually rely on a single inference due to latency, cost, or verifier limitations, understanding and fixing this trade‑off is crucial.
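To make the metric concrete: Pass@k is usually estimated with the standard unbiased combinatorial estimator (popularized by the HumanEval/Codex evaluation protocol). The paper's exact estimator is not reproduced here; this is the conventional formulation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@k estimator: given n sampled completions
    of which c pass the verifier, the probability that at least one of
    k completions drawn without replacement is correct."""
    if n - c < k:  # fewer failures than draws: a success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per prompt, 3 verified correct:
print(round(pass_at_k(10, 3, 1), 3))  # Pass@1 estimate: 0.3
print(round(pass_at_k(10, 3, 5), 3))  # Pass@5 estimate: 0.917
```

Note how the same completion statistics (n = 10, c = 3) yield very different scores at k = 1 versus k = 5, which is exactly the gap the two objectives pull toward.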

Key Contributions

  • Theoretical analysis of gradient conflict: Shows how optimizing a Pass@k‑oriented loss can create gradients that oppose those that improve Pass@1, due to prompt interference.
  • Definition of “negatively interfering” prompts: Formalizes prompts whose up‑weighting during Pass@k optimization actually drags the model away from the optimal Pass@1 direction.
  • Proof‑of‑concept experiments: Empirical validation on large‑scale LLMs (e.g., GPT‑NeoX, LLaMA) solving verifiable math problems, confirming the predicted degradation of Pass@1.
  • Guidelines for practitioners: Provides actionable insights on when Pass@k optimization is safe and when it may be counter‑productive.

Methodology

  1. Problem Formalization

    • Treat each prompt (task description + few‑shot examples) as a “policy” that maps to a distribution over possible completions.
    • Define two loss functions: one that directly estimates the gradient of Pass@k (by re‑weighting sampled completions) and another for Pass@1.
  2. Gradient Conflict Analysis

    • Derive the expected gradient of the Pass@k loss and decompose it into contributions from high‑success and low‑success prompts.
    • Show that the Pass@k loss implicitly up‑weights low‑success prompts (because they have more “room” for improvement), which can be negatively interfering if those prompts are fundamentally harder for the model.
  3. Experimental Setup

    • Fine‑tune a pre‑trained LLM on a suite of mathematically verifiable benchmarks (e.g., MATH, GSM‑8K).
    • Compare three training regimes: (a) standard supervised fine‑tuning, (b) Pass@k‑targeted optimization, (c) hybrid loss mixing Pass@1 and Pass@k.
    • Evaluate both Pass@1 and Pass@k on held‑out test sets, measuring the gradient alignment between the two objectives.
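The up-weighting mechanism in step 2 can be seen in a toy model. Under the simplifying assumption (mine, not the paper's) that a prompt's k samples are independent with per-sample success probability p, the Pass@k objective is 1 − (1 − p)^k, and its derivative with respect to p shows how much harder the loss pushes on low-success prompts:

```python
def passk_weight(p: float, k: int) -> float:
    """d/dp [1 - (1 - p)^k]: how strongly a Pass@k objective rewards
    raising a prompt's per-sample success probability p. Pass@1's
    weight is the constant 1, so large values mean Pass@k up-weights
    the prompt relative to Pass@1."""
    return k * (1.0 - p) ** (k - 1)

# At k = 8, a nearly-unsolved prompt (p = 0.05) gets ~5.6x the Pass@1
# weight, while an easy prompt (p = 0.7) is all but ignored.
for p in (0.05, 0.3, 0.7):
    print(f"p={p}: weight={passk_weight(p, 8):.4f}")
```

If the nearly-unsolved prompts are the "negatively interfering" ones, this skewed weighting is precisely what drags the gradient away from the Pass@1 direction.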

Results & Findings

  Training Regime                  | Pass@1          | Pass@k      | Gradient Alignment (cos θ)
  Supervised FT                    | baseline        | baseline    | —
  Pass@k-only                      | ↓ 4–7 %         | ↑ 12–18 %   | negative (≈ −0.3)
  Hybrid (0.7 Pass@1, 0.3 Pass@k)  | ≈ 0 % (stable)  | ↑ 6–9 %     | mildly positive (≈ 0.15)
  • Pass@k improves as expected when directly optimized.
  • Pass@1 consistently drops under pure Pass@k training, confirming the gradient conflict hypothesis.
  • The hybrid loss mitigates the trade‑off, suggesting a practical compromise.
  • Visualizing prompt‑level success rates reveals that the up‑weighted prompts during Pass@k training are indeed those with historically low Pass@1 scores—exactly the “negatively interfering” set the theory predicts.

Practical Implications

  1. Deployment Pipelines

    • If your service can afford k parallel generations (e.g., batch code synthesis), Pass@k optimization is beneficial.
    • For latency‑sensitive APIs that return a single answer, stick to Pass@1‑oriented fine‑tuning or use a mixed loss to avoid hidden regressions.
  2. Prompt Engineering

    • Identify and filter out negatively interfering prompts (e.g., ambiguous problem statements) before training.
    • Use the paper’s gradient‑alignment diagnostic to flag prompts that are likely to cause trade‑offs.
  3. Cost Management

    • The hybrid loss reduces the need for expensive k-sample inference while still gaining a modest Pass@k boost—useful when compute budgets are tight.
  4. Verifier Design

    • Since Pass@k relies heavily on the verifier’s coverage, improving verifier recall can lessen the incentive to up‑weight low‑success prompts, indirectly protecting Pass@1.
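The prompt-filtering advice above can be sketched as a simple pre-training pass. This is a crude stand-in for the paper's gradient-alignment diagnostic: it uses the independence-based Pass@k weight as a proxy, and the function name, data shape, and threshold are illustrative, not from the paper.

```python
def flag_interfering_prompts(success_rates, k=8, threshold=2.0):
    """Flag prompts that a Pass@k loss would up-weight more than
    `threshold` times relative to Pass@1 (whose weight is constant 1).
    `success_rates` maps prompt IDs to historical per-sample Pass@1
    rates. Assumes independent samples; illustrative only."""
    flagged = []
    for prompt_id, p in success_rates.items():
        weight = k * (1.0 - p) ** (k - 1)  # d/dp of 1 - (1 - p)^k
        if weight > threshold:
            flagged.append(prompt_id)
    return flagged

historical = {"ambiguous_q": 0.02, "medium_q": 0.45, "easy_q": 0.85}
print(flag_interfering_prompts(historical))  # ['ambiguous_q']
```

Prompts flagged this way are candidates for removal or down-weighting before Pass@k training, per implication 2 above.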

Limitations & Future Work

  • Scope of Tasks: Experiments focus on mathematical reasoning; other verifiable domains (e.g., code synthesis for large projects, factual QA) may exhibit different dynamics.
  • Model Scale: Findings are demonstrated on models up to ~13 B parameters; whether the trade‑off persists at frontier scale (e.g., GPT‑4‑class models) remains an open question.
  • Verifier Imperfections: The analysis assumes a reasonably accurate verifier; noisy or biased verifiers could amplify or mask the observed trade‑off.

Future Directions

  • Extending the gradient‑conflict framework to adaptive k‑selection strategies.
  • Developing automated prompt‑filtering tools that detect negatively interfering prompts early in the data pipeline.
  • Exploring curriculum‑style training where low‑success prompts are introduced only after the model reaches a Pass@1 plateau.

Bottom Line

Optimizing for Pass@k isn’t a free lunch. By understanding the underlying gradient conflict and the role of negatively interfering prompts, developers can make informed choices—either embracing multi‑sample inference when it makes sense or safeguarding single‑shot performance with hybrid training strategies.

Authors

  • Anas Barakat
  • Souradip Chakraborty
  • Khushbu Pahwa
  • Amrit Singh Bedi

Paper Information

  • arXiv ID: 2602.21189v1
  • Categories: cs.LG, cs.AI
  • Published: February 24, 2026
