[Paper] Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation

Published: May 27, 2026 at 02:26 AM EDT

Source: arXiv - 2605.28022v1

Overview

Large language models (LLMs) are now a staple of automatic code generation, but evaluating them is tricky. The common metric, Pass@k, measures the probability that at least one of k sampled programs passes a test suite. Reinforcement learning with verifiable rewards (RLVR) boosts correctness, yet it can push the model to produce many near‑duplicate solutions, wasting the limited sampling budget. This paper identifies that redundancy problem, proposes a simple anti‑redundancy reward built on the plagiarism detector JPlag, and reports consistent improvements in code‑generation performance under a fixed sampling budget.
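To make the metric concrete, here is a minimal sketch of the widely used unbiased Pass@k estimator: draw n samples, count the c that pass, and estimate the chance that a random subset of size k contains at least one passing program. The summary does not say which estimator the paper uses, so treat this as the standard formulation and not necessarily the authors' exact procedure; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples of which c passed the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a uniformly random size-k subset contains no passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every subset includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 samples and c = 3 passes, `pass_at_k(10, 3, 1)` is about 0.3, which is simply the fraction of passing samples.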

Key Contributions

Redundancy analysis: First systematic study of implementation‑level duplication in sampled code using JPlag across several models and benchmarks.
Empirical finding: Correctness‑only RLVR collapses diversity, concentrating generations around a few near‑identical implementations, while Pass@k‑aware objectives keep the sample set more varied.
Anti‑redundancy reward: Introduces a lightweight JPlag‑based similarity penalty that can be added to any RLVR objective.
Broad validation: Experiments on three LLMs (CodeGen, StarCoder, GPT‑NeoX) and three benchmark suites (HumanEval, MBPP, and a custom LeetCode‑style set) demonstrate consistent gains in executable success under finite sampling budgets.
Practical recipe: Shows that the anti‑redundancy term often matches or exceeds the performance of more complex Pass@k‑aware RL objectives, without needing bespoke reward engineering.

Methodology

Baseline RLVR – The authors start from a standard verifier‑based RL setup: a generator LLM proposes a program, a verifier (often a test runner) returns a binary reward (pass/fail). The policy is updated via PPO.
Redundancy measurement – For each sampling budget k, they collect the k generated programs and run JPlag to compute pairwise similarity scores. High average similarity indicates redundancy.
Pass@k‑aware RLVR – As a comparison, they implement a reward that explicitly encourages at least one passing sample within the budget (e.g., using a discounted reward that decays after a pass is observed).
Anti‑redundancy reward – They augment the RLVR reward with a penalty term −λ · JPlagSim(g_i, G_{<i}), where g_i is the current candidate and G_{<i} is the set of previously generated samples. The similarity is normalized to [0, 1]; λ controls the strength of the penalty.
Training & evaluation – Models are fine‑tuned on the same code‑generation data, then evaluated on the three benchmarks with budgets k ∈ {1, 5, 10, 20}. Metrics reported: Pass@k, average JPlag similarity, and runtime overhead.
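The anti‑redundancy reward described above can be sketched as follows. Two things here are assumptions rather than details from the paper: JPlag is replaced by a crude token‑level similarity from Python's standard library (`token_similarity` is a hypothetical stand‑in), and the similarity to the set of earlier samples is aggregated with a maximum, since the summary does not say how pairwise scores are combined.

```python
import re
from difflib import SequenceMatcher
from typing import Sequence


def token_similarity(a: str, b: str) -> float:
    """Stand-in for JPlag: similarity in [0, 1] over crude lexical tokens."""
    def tokenize(source: str) -> list:
        # Words/numbers and single punctuation marks; whitespace is ignored.
        return re.findall(r"\w+|[^\w\s]", source)

    return SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False).ratio()


def shaped_reward(
    passed: bool,
    candidate: str,
    previous: Sequence[str],
    lam: float = 0.5,
) -> float:
    """Binary verifier reward minus lam times similarity to earlier samples.

    Using max() over `previous` is an assumption; a mean would also be
    consistent with the description.
    """
    penalty = max(
        (token_similarity(candidate, earlier) for earlier in previous),
        default=0.0,  # first sample in a group has nothing to duplicate
    )
    return float(passed) - lam * penalty
```

With this shaping, a passing sample that exactly repeats an earlier one earns 1 − λ, while a passing sample that shares no tokens with earlier ones keeps the full reward of 1.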

Results & Findings

Model / Benchmark	Pass@5 (baseline RLVR)	Pass@5 (Pass@k‑aware)	Pass@5 (Anti‑Redundancy)
CodeGen‑6B (HumanEval)	42.1 %	45.3 %	46.0 %
StarCoder‑15B (MBPP)	38.7 %	41.2 %	42.5 %
GPT‑NeoX‑20B (LeetCode)	31.4 %	34.0 %	34.8 %

Redundancy (average JPlag similarity) dropped from ~0.68 (baseline) to ~0.32 with the anti‑redundancy reward.
The anti‑redundancy term adds < 2 % extra wall‑clock time per batch (mostly due to the lightweight JPlag call).
Gains are more pronounced as k grows: at k = 20, the anti‑redundancy model outperforms Pass@k‑aware by up to 3 percentage points.
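The redundancy figure reported above is an average of pairwise similarity scores over the k samples. A minimal sketch of that aggregation is shown below; the similarity function is passed in as a parameter, and the `jaccard` helper is only a toy stand‑in for JPlag, included so the example runs on its own.

```python
from itertools import combinations
from typing import Callable, Sequence


def mean_pairwise_similarity(
    samples: Sequence[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Average similarity over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0  # zero or one sample: nothing to compare
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for JPlag: Jaccard overlap of whitespace-split tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


# Two identical samples and one unrelated sample: one of three pairs matches.
score = mean_pairwise_similarity(["a b", "a b", "c d"], jaccard)
```

Here `score` is 1/3, because only one of the three pairs is a duplicate.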

Interpretation: By penalizing near‑duplicate outputs, the model explores a richer set of implementations, increasing the odds that at least one of them satisfies the hidden test suite within the limited budget.
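A back‑of‑the‑envelope calculation illustrates this interpretation. It assumes, purely for illustration, that each distinct implementation passes independently with the same probability p; real samples are neither fully independent nor exact copies, so the numbers bound the effect and do not predict it.

```python
def at_least_one_pass(p: float, distinct: int) -> float:
    """Chance that at least one of `distinct` independent attempts passes."""
    return 1.0 - (1.0 - p) ** distinct


p, k = 0.3, 5
collapsed = at_least_one_pass(p, 1)  # k copies of one implementation: 0.3
diverse = at_least_one_pass(p, k)    # k independent implementations: roughly 0.83
```

Under these idealized assumptions, five duplicates are worth no more than a single sample, while five genuinely different attempts nearly triple the chance of a pass.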

Practical Implications

Better CI/CD assistants – Tools that generate patches or boilerplate code can now return a diverse shortlist, reducing the need for developers to manually sift through identical suggestions.
Cost‑effective sampling – In production APIs where each generation call incurs latency or monetary cost, maximizing the utility of a small k is crucial. Anti‑redundancy rewards squeeze more correctness out of the same budget.
Plug‑and‑play improvement – The JPlag‑based penalty is model‑agnostic and can be added to existing RL‑fine‑tuned code generators without re‑architecting the verifier.
Enhanced security auditing – Diverse implementations make it harder for an attacker to infer the underlying model’s “canonical” solution, adding a layer of obfuscation for code‑generation services.
Educational tools – Platforms that auto‑grade student submissions can present multiple correct solutions, helping learners see alternative idioms and coding styles.

Limitations & Future Work

Similarity metric overhead – While JPlag is fast for small k, scaling to thousands of candidates could become a bottleneck; approximate hashing (e.g., MinHash) might be needed.
Reward tuning – The λ hyper‑parameter requires careful calibration; too high a penalty can push the model toward exotic but incorrect code.
Test‑suite bias – The study assumes unit tests fully capture correctness; in domains with fuzzy specifications, redundancy reduction may not translate to functional gains.
Generalization to other languages – Experiments were limited to Python; extending the approach to statically typed languages (Java, C++) may need language‑specific similarity tools.
Long‑term diversity – The current reward only looks at the current batch; future work could incorporate a memory of past generations across sessions to avoid “global” duplication.
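As a rough illustration of the approximate‑hashing idea mentioned under similarity metric overhead, the sketch below estimates Jaccard similarity between token shingles with MinHash. It is not part of the paper: the shingle size, the number of hash functions, and all function names are assumptions, and MinHash approximates set overlap, not JPlag's structural matching.

```python
import hashlib
import re
from typing import List, Set


def shingles(code: str, n: int = 3) -> Set[str]:
    """Set of overlapping n-token windows (the whole snippet if it is short)."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) <= n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """One minimum per salted hash function; `items` must be non-empty."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions that agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_a = minhash_signature(shingles("def f(x): return x + 1"))
sig_b = minhash_signature(shingles("def f(x): return x + 1"))
sig_c = minhash_signature(shingles("while queue: node = queue.pop()"))
same = estimated_jaccard(sig_a, sig_b)
different = estimated_jaccard(sig_a, sig_c)
```

Each signature is computed once per sample, so comparing two samples costs a fixed number of integer comparisons regardless of program length. Avoiding the all‑pairs comparison entirely would need an additional indexing step such as locality‑sensitive hashing, which is not shown here.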

Overall, the paper demonstrates that thinking beyond raw correctness, by explicitly managing redundancy, yields tangible benefits for real‑world code‑generation pipelines. Developers building AI‑assisted coding tools should consider integrating anti‑redundancy signals into their reinforcement‑learning loops.

Authors

Florian Le Bronnec
Alexandre Verine
Rio Yokota
Benjamin Negrevergne

Paper Information

arXiv ID: 2605.28022v1
Categories: cs.CL, cs.SE
Published: May 27, 2026