[Paper] Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation
Source: arXiv - 2605.28022v1
Overview
Large language models (LLMs) are now a staple for automatically generating code, but evaluating them is tricky. The common metric, Pass@k, measures the probability that at least one of k sampled programs passes a test suite. Reinforcement learning with verifiable rewards (RLVR) boosts correctness, yet it can cause the model to churn out many near‑duplicate solutions, wasting the limited sampling budget. This paper analyzes that redundancy problem, proposes a simple anti‑redundancy reward based on the plagiarism detector JPlag, and reports that it consistently improves code‑generation performance under a fixed sampling budget.
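For reference, Pass@k is commonly estimated with the unbiased estimator introduced alongside HumanEval: draw n ≥ k samples, count the c that pass, and compute 1 − C(n−c, k) / C(n, k). The sketch below is a minimal illustration of that standard formula. The function name `pass_at_k` is chosen for this example, and this summary does not state which estimator the paper's own evaluation uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which pass the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random size-k
    # subset of the n samples contains no passing program.
    # math.comb returns 0 when n - c < k, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 10 samples and c = 1 passing, `pass_at_k(10, 1, 1)` is about 0.1, while `pass_at_k(10, 1, 10)` is 1.0 because the single passing sample is always included.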
Key Contributions
- Redundancy analysis: First systematic study of implementation‑level duplication in sampled code using JPlag across several models and benchmarks.
- Empirical finding: Pure correctness‑only RLVR collapses diversity, concentrating generations around a few identical implementations, while Pass@k‑aware objectives keep the sample set more varied.
- Anti‑redundancy reward: Introduces a lightweight JPlag‑based similarity penalty that can be added to any RLVR objective.
- Broad validation: Experiments on three LLMs (CodeGen, StarCoder, GPT‑NeoX) and three benchmark suites (HumanEval, MBPP, and a custom LeetCode‑style set) show consistent gains in executable success under finite sampling budgets.
- Practical recipe: Shows that the anti‑redundancy term often matches or exceeds the performance of more complex Pass@k‑aware RL objectives, without needing bespoke reward engineering.
Methodology
- Baseline RLVR – The authors start from a standard verifier‑based RL setup: a generator LLM proposes a program, a verifier (often a test runner) returns a binary reward (pass/fail). The policy is updated via PPO.
- Redundancy measurement – For each sampling budget k, they collect the k generated programs and run JPlag to compute pairwise similarity scores. High average similarity indicates redundancy.
- Pass@k‑aware RLVR – As a comparison, they implement a reward that explicitly encourages at least one passing sample within the budget (e.g., using a discounted reward that decays after a pass is observed).
- Anti‑redundancy reward – They augment the RLVR reward with a penalty term −λ · JPlagSim(g_i, G_{<i}), where g_i is the current candidate and G_{<i} is the set of previously generated samples. The similarity is normalized to [0, 1]; λ controls the strength of the penalty.
- Training & evaluation – Models are fine‑tuned on the same code‑generation data, then evaluated on the three benchmarks with budgets k ∈ {1, 5, 10, 20}. Metrics reported: Pass@k, average JPlag similarity, and runtime overhead.
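The anti‑redundancy penalty described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: JPlag is a separate Java tool, so `difflib.SequenceMatcher` is swapped in as a stand‑in similarity in [0, 1]; the helper names `similarity` and `shaped_rewards` are hypothetical; and taking the maximum similarity over earlier samples is an assumed way to aggregate over G_{<i}, which the summary does not specify.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for JPlagSim: character-level ratio in [0, 1] (NOT JPlag)."""
    return SequenceMatcher(None, a, b).ratio()


def shaped_rewards(samples, passed, lam=0.5):
    """Return pass/fail reward minus lam * similarity to earlier samples."""
    rewards = []
    for i, (code, ok) in enumerate(zip(samples, passed)):
        earlier = samples[:i]
        # Assumption: aggregate over G_{<i} with the maximum similarity;
        # the first sample has no predecessors and gets no penalty.
        penalty = max((similarity(code, prev) for prev in earlier), default=0.0)
        rewards.append(float(ok) - lam * penalty)
    return rewards
```

Under this sketch, an exact duplicate of an earlier passing sample earns 1 − λ instead of 1, while a structurally different passing sample keeps more of its reward.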
Results & Findings
| Model / Benchmark | Pass@5 (baseline RLVR) | Pass@5 (Pass@k‑aware) | Pass@5 (Anti‑Redundancy) |
|---|---|---|---|
| CodeGen‑6B (HumanEval) | 42.1 % | 45.3 % | 46.0 % |
| StarCoder‑15B (MBPP) | 38.7 % | 41.2 % | 42.5 % |
| GPT‑NeoX‑20B (LeetCode) | 31.4 % | 34.0 % | 34.8 % |
- Redundancy (average JPlag similarity) dropped from ~0.68 (baseline) to ~0.32 with the anti‑redundancy reward.
- The anti‑redundancy term adds < 2 % extra wall‑clock time per batch (mostly due to the lightweight JPlag call).
- Gains are more pronounced as k grows: at k = 20, the anti‑redundancy model outperforms Pass@k‑aware by up to 3 percentage points.
Interpretation: By penalizing near‑duplicate outputs, the model explores a richer set of implementations, increasing the odds that at least one of them satisfies the hidden test suite within the limited budget.
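The redundancy figure reported above is an average of pairwise similarity scores over a set of k samples. A minimal sketch of that aggregation is shown below, assuming a generic string similarity: `difflib.SequenceMatcher` is used here as a stand‑in for JPlag, so the numbers it produces are not comparable to the JPlag scores in the paper. The function name `mean_pairwise_similarity` is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(samples):
    """Average similarity over all unordered pairs of samples.

    Returns 0.0 when there are fewer than two samples. The similarity is a
    character-level stand-in, NOT JPlag.
    """
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```

A value near 1 means the k samples are close to copies of one another, so extra samples add little chance of a new passing program; a lower value indicates the budget is spread over more distinct implementations.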
Practical Implications
- Better CI/CD assistants – Tools that generate patches or boilerplate code can now return a diverse shortlist, reducing the need for developers to manually sift through identical suggestions.
- Cost‑effective sampling – In production APIs where each generation call incurs latency or monetary cost, maximizing the utility of a small k is crucial. Anti‑redundancy rewards squeeze more correctness out of the same budget.
- Plug‑and‑play improvement – The JPlag‑based penalty is model‑agnostic and can be added to existing RL‑fine‑tuned code generators without re‑architecting the verifier.
- Enhanced security auditing – Diverse implementations make it harder for an attacker to infer the underlying model’s “canonical” solution, adding a layer of obfuscation for code‑generation services.
- Educational tools – Platforms that auto‑grade student submissions can present multiple correct solutions, helping learners see alternative idioms and coding styles.
Limitations & Future Work
- Similarity metric overhead – While JPlag is fast for small k, scaling to thousands of candidates could become a bottleneck; approximate hashing (e.g., MinHash) might be needed.
- Reward tuning – The λ hyper‑parameter requires careful calibration; too high a penalty can push the model toward exotic but incorrect code.
- Test‑suite bias – The study assumes unit tests fully capture correctness; in domains with fuzzy specifications, redundancy reduction may not translate to functional gains.
- Generalization to other languages – Experiments were limited to Python; extending the approach to statically typed languages (Java, C++) may need language‑specific similarity tools.
- Long‑term diversity – The current reward only looks at the current batch; future work could incorporate a memory of past generations across sessions to avoid “global” duplication.
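The MinHash idea mentioned in the first limitation can be sketched as follows. This is an illustrative approximation of Jaccard similarity over whitespace‑token shingles, not a replacement for JPlag's token‑based comparison and not something the paper is stated to implement. The helper names (`shingles`, `minhash_signature`, `estimated_jaccard`) and the parameters (3‑token shingles, 64 hash functions) are assumptions made for this example.

```python
import hashlib


def shingles(code: str, n: int = 3) -> set:
    """Set of n-token shingles from whitespace-split code (never empty)."""
    tokens = code.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """One minimum 64-bit hash per salted BLAKE2b hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")  # distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Each signature is computed once per sample, so comparing two candidates costs a fixed number of integer comparisons regardless of program length, which is what makes this kind of scheme attractive when the number of candidates grows.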
Overall, the paper argues that looking beyond raw correctness and explicitly managing redundancy yields tangible benefits for code‑generation pipelines that sample under a finite budget. Developers building AI‑assisted coding tools should consider integrating anti‑redundancy signals into their reinforcement‑learning loops.
Authors
- Florian Le Bronnec
- Alexandre Verine
- Rio Yokota
- Benjamin Negrevergne
Paper Information
- arXiv ID: 2605.28022v1
- Categories: cs.CL, cs.SE
- Published: May 27, 2026