[Paper] Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Published: February 24, 2026 at 01:43 PM EST
Source: arXiv (2602.21189v1)

Overview

The paper investigates a puzzling phenomenon that many practitioners have observed when fine‑tuning large language models (LLMs) for verifiable tasks such as code generation, math problem solving, or short‑answer QA. Optimizing for the popular Pass@k metric (success if any of k sampled outputs passes a verifier) often boosts Pass@k scores but simultaneously hurts Pass@1—the success rate of a single generated answer. Since real‑world deployments usually rely on a single inference due to latency, cost, or verifier limitations, understanding and fixing this trade‑off is crucial.
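To make the metric concrete: Pass@k is usually estimated with the standard unbiased combinatorial estimator (popularized by the HumanEval/Codex evaluation protocol). The paper's exact estimator is not reproduced here; this is the conventional formulation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@k estimator: given n sampled completions
    of which c pass the verifier, the probability that at least one of
    k completions drawn without replacement is correct."""
    if n - c < k:  # fewer failures than draws: a success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per prompt, 3 verified correct:
print(round(pass_at_k(10, 3, 1), 3))  # Pass@1 estimate: 0.3
print(round(pass_at_k(10, 3, 5), 3))  # Pass@5 estimate: 0.917
```

Note how the same completion statistics (n = 10, c = 3) yield very different scores at k = 1 versus k = 5, which is exactly the gap the two objectives pull toward.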

Key Contributions

  • Theoretical analysis of gradient conflict: Shows how optimizing a Pass@k‑oriented loss can create gradients that oppose those that improve Pass@1, due to prompt interference.
  • Definition of “negatively interfering” prompts: Formalizes prompts whose up‑weighting during Pass@k optimization actually drags the model away from the optimal Pass@1 direction.
  • Proof‑of‑concept experiments: Empirical validation on large‑scale LLMs (e.g., GPT‑NeoX, LLaMA) solving verifiable math problems, confirming the predicted degradation of Pass@1.
  • Guidelines for practitioners: Provides actionable insights on when Pass@k optimization is safe and when it may be counter‑productive.

Methodology

  1. Problem Formalization

    • Treat each prompt (task description + few‑shot examples) as a “policy” that maps to a distribution over possible completions.
    • Define two loss functions: one that directly estimates the gradient of Pass@k (by re‑weighting sampled completions) and another for Pass@1.
  2. Gradient Conflict Analysis

    • Derive the expected gradient of the Pass@k loss and decompose it into contributions from high‑success and low‑success prompts.
    • Show that the Pass@k loss implicitly up‑weights low‑success prompts (because they have more “room” for improvement), which can be negatively interfering if those prompts are fundamentally harder for the model.
  3. Experimental Setup

    • Fine‑tune a pre‑trained LLM on a suite of mathematically verifiable benchmarks (e.g., MATH, GSM‑8K).
    • Compare three training regimes: (a) standard supervised fine‑tuning, (b) Pass@k‑targeted optimization, (c) hybrid loss mixing Pass@1 and Pass@k.
    • Evaluate both Pass@1 and Pass@k on held‑out test sets, measuring the gradient alignment between the two objectives.
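The up-weighting mechanism in step 2 can be seen in a toy model. Under the simplifying assumption (mine, not the paper's) that a prompt's k samples are independent with per-sample success probability p, the Pass@k objective is 1 − (1 − p)^k, and its derivative with respect to p shows how much harder the loss pushes on low-success prompts:

```python
def passk_weight(p: float, k: int) -> float:
    """d/dp [1 - (1 - p)^k]: how strongly a Pass@k objective rewards
    raising a prompt's per-sample success probability p. Pass@1's
    weight is the constant 1, so large values mean Pass@k up-weights
    the prompt relative to Pass@1."""
    return k * (1.0 - p) ** (k - 1)

# At k = 8, a nearly-unsolved prompt (p = 0.05) gets ~5.6x the Pass@1
# weight, while an easy prompt (p = 0.7) is all but ignored.
for p in (0.05, 0.3, 0.7):
    print(f"p={p}: weight={passk_weight(p, 8):.4f}")
```

If the nearly-unsolved prompts are the "negatively interfering" ones, this skewed weighting is precisely what drags the gradient away from the Pass@1 direction.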

Results & Findings

  Training Regime                  | Pass@1          | Pass@k      | Gradient Alignment (cos θ)
  Supervised FT                    | baseline        | baseline    | —
  Pass@k-only                      | ↓ 4–7 %         | ↑ 12–18 %   | negative (≈ −0.3)
  Hybrid (0.7 Pass@1, 0.3 Pass@k)  | ≈ 0 % (stable)  | ↑ 6–9 %     | mildly positive (≈ 0.15)
  • Pass@k improves as expected when directly optimized.
  • Pass@1 consistently drops under pure Pass@k training, confirming the gradient conflict hypothesis.
  • The hybrid loss mitigates the trade‑off, suggesting a practical compromise.
  • Visualizing prompt‑level success rates reveals that the up‑weighted prompts during Pass@k training are indeed those with historically low Pass@1 scores—exactly the “negatively interfering” set the theory predicts.

Practical Implications

  1. Deployment Pipelines

    • If your service can afford k parallel generations (e.g., batch code synthesis), Pass@k optimization is beneficial.
    • For latency‑sensitive APIs that return a single answer, stick to Pass@1‑oriented fine‑tuning or use a mixed loss to avoid hidden regressions.
  2. Prompt Engineering

    • Identify and filter out negatively interfering prompts (e.g., ambiguous problem statements) before training.
    • Use the paper’s gradient‑alignment diagnostic to flag prompts that are likely to cause trade‑offs.
  3. Cost Management

    • The hybrid loss reduces the need for expensive k-sample inference while still gaining a modest Pass@k boost—useful when compute budgets are tight.
  4. Verifier Design

    • Since Pass@k relies heavily on the verifier’s coverage, improving verifier recall can lessen the incentive to up‑weight low‑success prompts, indirectly protecting Pass@1.
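The prompt-filtering advice above can be sketched as a simple pre-training pass. This is a crude stand-in for the paper's gradient-alignment diagnostic: it uses the independence-based Pass@k weight as a proxy, and the function name, data shape, and threshold are illustrative, not from the paper.

```python
def flag_interfering_prompts(success_rates, k=8, threshold=2.0):
    """Flag prompts that a Pass@k loss would up-weight more than
    `threshold` times relative to Pass@1 (whose weight is constant 1).
    `success_rates` maps prompt IDs to historical per-sample Pass@1
    rates. Assumes independent samples; illustrative only."""
    flagged = []
    for prompt_id, p in success_rates.items():
        weight = k * (1.0 - p) ** (k - 1)  # d/dp of 1 - (1 - p)^k
        if weight > threshold:
            flagged.append(prompt_id)
    return flagged

historical = {"ambiguous_q": 0.02, "medium_q": 0.45, "easy_q": 0.85}
print(flag_interfering_prompts(historical))  # ['ambiguous_q']
```

Prompts flagged this way are candidates for removal or down-weighting before Pass@k training, per implication 2 above.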

Limitations & Future Work

  • Scope of Tasks: Experiments focus on mathematical reasoning; other verifiable domains (e.g., code synthesis for large projects, factual QA) may exhibit different dynamics.
  • Model Scale: Findings are demonstrated on models up to ~13 B parameters; whether the trade‑off persists at frontier scale (e.g., GPT‑4‑class models) remains an open question.
  • Verifier Imperfections: The analysis assumes a reasonably accurate verifier; noisy or biased verifiers could amplify or mask the observed trade‑off.

Future Directions

  • Extending the gradient‑conflict framework to adaptive k‑selection strategies.
  • Developing automated prompt‑filtering tools that detect negatively interfering prompts early in the data pipeline.
  • Exploring curriculum‑style training where low‑success prompts are introduced only after the model reaches a Pass@1 plateau.

Bottom Line

Optimizing for Pass@k isn’t a free lunch. By understanding the underlying gradient conflict and the role of negatively interfering prompts, developers can make informed choices—either embracing multi‑sample inference when it makes sense or safeguarding single‑shot performance with hybrid training strategies.

Authors

  • Anas Barakat
  • Souradip Chakraborty
  • Khushbu Pahwa
  • Amrit Singh Bedi

Paper Information

  • arXiv ID: 2602.21189v1
  • Categories: cs.LG, cs.AI
  • Published: February 24, 2026
