[Paper] LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

Published: March 2, 2026 at 01:07 PM EST
4 min read
Source: arXiv - 2603.02146v1

Overview

The paper LongRLVR tackles a fundamental roadblock in teaching large language models (LLMs) to reason over long documents. While prior Reinforcement Learning with Verifiable Rewards (RLVR) methods improve factual reasoning, they stumble when the answer depends on locating and using information scattered across a lengthy context. The authors show that rewarding only the final answer makes the learning signal too sparse, and they propose a dense “context reward” that explicitly praises the model for picking the right evidence.

Key Contributions

  • Theoretical analysis proving that answer‑only rewards cause vanishing gradients for the grounding (evidence‑selection) step in long‑context tasks.
  • LongRLVR framework that augments the sparse answer reward with a dense, verifiable context reward, providing a clear learning signal for evidence retrieval.
  • Empirical validation on multiple long‑context benchmarks (RULER‑QA, LongBench v2) using Qwen and LLaMA families, showing consistent, large performance gains (e.g., 14B model ↑ from 73.17 → 88.90 on RULER‑QA).
  • Open‑source implementation released on GitHub, enabling reproducibility and easy integration into existing RLVR pipelines.

Methodology

  1. Problem Setup – The task is framed as a two‑stage process: (a) grounding: select relevant passages from a long context; (b) answer generation: produce the final answer.
  2. Reward Design
    • Answer Reward (R_ans): binary/verifiable reward based on whether the final answer matches the ground‑truth.
    • Context Reward (R_ctx): dense reward computed by checking the overlap between the model‑selected passages and a set of gold evidence passages (e.g., using ROUGE or exact match).
    • The total reward is a weighted sum: R_total = λ * R_ans + (1‑λ) * R_ctx.
  3. Training Loop – Standard policy‑gradient RL (e.g., PPO) is applied, but gradients now flow through both the grounding and answer generation components thanks to R_ctx.
  4. Verification – The “verifiable” part means that both rewards can be computed automatically without human annotation at training time, using existing evidence annotations in the benchmark datasets.
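The reward design above can be sketched in a few lines. This is a hypothetical illustration, not the paper's released implementation: it uses exact-match F1 over passage identifiers for R_ctx (the paper mentions ROUGE or exact match as options) and a binary R_ans, combined with the weighted sum R_total = λ·R_ans + (1−λ)·R_ctx.

```python
def context_reward(selected, gold):
    """Dense context reward R_ctx: overlap between model-selected and
    gold evidence passages, scored as exact-match F1 over passage IDs."""
    if not selected or not gold:
        return 0.0
    hits = len(set(selected) & set(gold))
    precision = hits / len(selected)
    recall = hits / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def total_reward(answer_correct, selected, gold, lam=0.5):
    """R_total = lam * R_ans + (1 - lam) * R_ctx, with a binary answer reward."""
    r_ans = 1.0 if answer_correct else 0.0
    r_ctx = context_reward(selected, gold)
    return lam * r_ans + (1 - lam) * r_ctx
```

Even when the final answer is wrong (R_ans = 0), a rollout that selected some of the gold passages still earns partial credit through R_ctx, which is exactly the dense signal that keeps gradients flowing to the grounding step.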

The approach is deliberately simple: add a second, dense signal that tells the model what it got right during the evidence‑selection phase, thereby avoiding the sparse‑reward problem.

Results & Findings

| Model (size) | Baseline RLVR (RULER‑QA) | LongRLVR (RULER‑QA) | Baseline RLVR (LongBench v2) | LongRLVR (LongBench v2) |
| --- | --- | --- | --- | --- |
| Qwen‑14B | 73.17 | 88.90 (+15.73) | 39.8 | 46.5 (+6.7) |
| LLaMA‑13B | 68.4 | 82.1 (+13.7) | 35.2 | 42.0 (+6.8) |
| LLaMA‑7B | 61.5 | 74.3 (+12.8) | 30.1 | 37.9 (+7.8) |
  • Across all model sizes and both benchmarks, LongRLVR delivers significant lifts over the vanilla RLVR baseline.
  • Ablation studies (varying λ, removing R_ctx) confirm that the context reward is the primary driver of improvement; without it, performance regresses to baseline levels.
  • Gradient analysis shows that the context reward restores non‑vanishing gradients for the grounding module, enabling stable training even with very long inputs (up to several thousand tokens).

Practical Implications

  • Better Retrieval‑Augmented Generation (RAG): Developers building QA or summarization systems over large corpora (e.g., legal docs, codebases, scientific literature) can plug the LongRLVR reward scheme into their RL fine‑tuning pipelines to get more reliable evidence selection.
  • Reduced Hallucinations: By explicitly rewarding correct grounding, models are less likely to fabricate answers, a critical safety improvement for downstream applications like customer support bots or medical assistants.
  • Scalable to Existing LLMs: The method works with off‑the‑shelf Qwen/LLaMA checkpoints; no architectural changes are required, making it a low‑friction upgrade for teams already using RLVR.
  • Potential for Tool‑Use: The dense context reward can be adapted to reward successful calls to external tools (search APIs, databases), opening a path toward more robust tool‑augmented agents.
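One way the tool-use adaptation could look (a speculative sketch; the paper does not specify this): treat a gold trace of expected tool invocations like gold evidence passages and score the fraction the agent actually made. Names and the (tool, args) tuple format here are illustrative assumptions.

```python
def tool_call_reward(made_calls, expected_calls):
    """Dense tool-use reward by analogy with R_ctx: the fraction of
    expected (tool_name, normalized_args) invocations the agent made.
    An empty expectation list means no tools were required."""
    if not expected_calls:
        return 1.0
    expected = set(expected_calls)
    hits = len(set(made_calls) & expected)
    return hits / len(expected)
```

As with passage grounding, this rewards the *process* (calling the right search API or database) rather than only the final answer, so an agent that fetched the right records but stumbled at the last step still receives a learning signal.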

Limitations & Future Work

  • Dependence on Gold Evidence: The context reward assumes access to annotated evidence passages; in domains lacking such labels, the reward must be approximated (e.g., via weak supervision).
  • Reward Weight Sensitivity: Choosing the λ balance between answer and context rewards requires validation; a sub‑optimal λ can diminish gains.
  • Scalability of Verification: Computing R_ctx for very large corpora may become costly; future work could explore approximate or learned verification models.
  • Extending Beyond QA: The paper focuses on QA benchmarks; applying the same principle to tasks like long‑form generation, code synthesis, or multi‑turn dialogue remains an open avenue.

LongRLVR demonstrates that rewarding the process of grounding information is as important as rewarding the outcome. For developers building LLM‑powered systems that must sift through massive context, this insight offers a practical recipe to boost accuracy and trustworthiness without overhauling existing models.

Authors

  • Guanzheng Chen
  • Michael Qizhe Shieh
  • Lidong Bing

Paper Information

  • arXiv ID: 2603.02146v1
  • Categories: cs.CL
  • Published: March 2, 2026