[Paper] From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Published: (April 15, 2026 at 01:59 PM EDT)
Source: arXiv - 2604.14142v1

Overview

The paper introduces PreRL, a reinforcement‑learning (RL) framework that operates directly on a language model’s pre‑train distribution $P(y)$ instead of the usual conditional distribution $P(y|x)$. By shaping the marginal output space, the authors show that reasoning abilities can be amplified while preserving the model’s broad generative capacity, something that conventional RL on top of a frozen LLM cannot achieve.

Key Contributions

  • Pre‑train Space RL (PreRL): First method that applies reward‑driven updates to the marginal distribution $P(y)$ of a frozen LLM.
  • Theoretical Gradient Alignment: Proof and empirical evidence that $\nabla \log P(y)$ aligns closely with $\nabla \log P(y|x)$, justifying PreRL as a surrogate for standard RL.
  • Negative Sample Reinforcement (NSR): A targeted “negative‑sample” signal that aggressively prunes implausible reasoning paths, boosting transition and reflection thoughts by ≈ 15× and ≈ 6.5× respectively.
  • Dual Space RL (DSRL): A two‑stage training recipe that first runs NSR‑PreRL to expand the reasoning horizon, then switches to conventional RL for fine‑grained policy refinement.
  • Empirical Superiority: DSRL consistently beats strong baselines (e.g., standard RLHF, PPO‑based fine‑tuning) across multiple reasoning benchmarks (MathQA, GSM‑8K, and logical deduction tasks).
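
For readers who prefer symbols, the gradient-alignment claim can be restated with the standard score-function (REINFORCE) identity. The parameter symbol $\theta$ and the names $g_{\text{pre}}$ and $g_{\text{cond}}$ are notation assumed here for illustration, not taken from the paper; $R(y)$ denotes a scalar reward on an output $y$.

$$
\nabla_\theta \, \mathbb{E}_{y \sim P_\theta}\big[R(y)\big] = \mathbb{E}_{y \sim P_\theta}\big[R(y)\, \nabla_\theta \log P_\theta(y)\big]
$$

$$
\cos(g_{\text{pre}}, g_{\text{cond}}) = \frac{\langle g_{\text{pre}}, g_{\text{cond}} \rangle}{\lVert g_{\text{pre}} \rVert \, \lVert g_{\text{cond}} \rVert}, \qquad g_{\text{pre}} = \nabla_\theta \log P_\theta(y), \quad g_{\text{cond}} = \nabla_\theta \log P_\theta(y|x)
$$

A cosine close to 1 means an update computed in the pre-train space points in nearly the same direction as the corresponding conditional update, which is the sense in which one can stand in for the other.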

Methodology

  1. Starting Point – Frozen LLM: The base model is kept unchanged; its pre‑train distribution $P(y)$ is treated as a policy over all possible token sequences.
  2. Reward Definition: A task‑specific reward function $R(y)$ evaluates the quality of a generated answer (correctness, logical consistency, etc.).
  3. Online Update of $P(y)$: Using a policy‑gradient style update, the model maximizes $\mathbb{E}_{y\sim P}[R(y)]$. Because the gradient of $\log P(y)$ can be computed via the model’s own logits, no extra forward passes are needed.
  4. Negative Sample Reinforcement (NSR):
    • Generate a batch of negative samples (high‑probability but incorrect answers).
    • Apply a strong negative reward, effectively pushing down their probability mass.
    • This “pruning” forces the model to allocate probability to more diverse, potentially correct reasoning trajectories.
  5. Dual Space RL (DSRL) Pipeline:
    • Phase 1 – NSR‑PreRL: Run several epochs of NSR‑driven updates to broaden the reasoning space and eliminate obvious dead‑ends.
    • Phase 2 – Standard RL (e.g., PPO): Fine‑tune the now‑pruned policy on the original reward, allowing precise optimization of the conditional distribution $P(y|x)$.
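
The negative-sample update in step 4 can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a softmax policy over three hypothetical candidate answers, a fixed reward of -1 for a negative sample, and a learning rate chosen only for illustration. It shows how a single REINFORCE-style step on one incorrect answer moves probability mass away from it.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_step(logits, negative_idx, lr=0.5, reward=-1.0):
    """One REINFORCE-style update on a single negative sample.

    For a softmax policy, d log p[k] / d logits = onehot(k) - probs,
    so a negative reward lowers the sampled answer's logit and raises the rest.
    """
    probs = softmax(logits)
    grad_log_p = [(1.0 if i == negative_idx else 0.0) - p
                  for i, p in enumerate(probs)]
    return [z + lr * reward * g for z, g in zip(logits, grad_log_p)]

# Hypothetical toy setup: three candidate answers, where index 0 is a
# high-probability but incorrect one (values invented for illustration).
logits = [2.0, 1.0, 0.0]
before = softmax(logits)
after = softmax(nsr_step(logits, negative_idx=0))
print(before[0] > after[0])  # the negative sample lost probability mass
```

In this toy case the mass removed from the negative sample is redistributed to the remaining candidates, which mirrors the "pruning" intuition described above.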

All steps are compatible with existing transformer libraries (e.g., Hugging Face Transformers) and require only modest additional compute compared with a typical RLHF run.
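
To make the two-phase pipeline in step 5 more concrete, here is a minimal tabular sketch. Everything in it is an assumption made for illustration: the "pre-train space" is modelled as context-free logits `shared`, the conditional distribution as `shared` plus a prompt-specific `offset`, rewards are 1 for the correct answer and 0 otherwise, and exact expected gradients replace sampling. It is not the authors' training code and says nothing about how the method behaves on a real LLM.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_gradient(probs, correct_idx):
    """Expected negative-sample gradient w.r.t. the logits.

    Each incorrect answer y contributes p[y] * (-1) * (onehot(y) - probs).
    """
    grad = [0.0] * len(probs)
    for y, p_y in enumerate(probs):
        if y == correct_idx:
            continue  # NSR uses negative samples only
        for i, p_i in enumerate(probs):
            grad[i] += p_y * -1.0 * ((1.0 if i == y else 0.0) - p_i)
    return grad

def policy_gradient(probs, rewards):
    """Exact gradient of E[R] w.r.t. softmax logits: p_i * (R_i - E[R])."""
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

def dual_space_train(shared, offset, correct_idx, steps=20, lr=1.0):
    """Phase 1 updates the context-free logits; phase 2 updates the prompt offset."""
    rewards = [1.0 if i == correct_idx else 0.0 for i in range(len(shared))]
    for _ in range(steps):  # Phase 1: NSR on the marginal softmax(shared)
        grad = nsr_gradient(softmax(shared), correct_idx)
        shared = [b + lr * g for b, g in zip(shared, grad)]
    for _ in range(steps):  # Phase 2: policy gradient on softmax(shared + offset)
        cond = softmax([b + w for b, w in zip(shared, offset)])
        grad = policy_gradient(cond, rewards)
        offset = [w + lr * g for w, g in zip(offset, grad)]
    return shared, offset

shared0 = [1.5, 0.5, 0.0, 0.0]  # hypothetical prior that favors wrong answer 0
offset0 = [0.0, 0.0, 0.0, 0.0]  # hypothetical prompt-specific offset
shared1, offset1 = dual_space_train(shared0, offset0, correct_idx=2)
p_start = softmax(shared0)[2]
p_phase1 = softmax(shared1)[2]
p_final = softmax([b + w for b, w in zip(shared1, offset1)])[2]
print(p_start < p_phase1 < p_final)
```

The point of the sketch is the schedule: the first loop touches only the context-free parameters, and the second loop refines the conditional distribution on top of whatever the first loop left behind.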

Results & Findings

| Benchmark | Baseline (PPO‑RLHF) | PreRL (NSR only) | DSRL (NSR → PPO) |
|---|---|---|---|
| GSM‑8K (accuracy) | 71.2 % | 73.8 % | 77.5 % |
| MathQA (accuracy) | 68.5 % | 70.1 % | 74.3 % |
| Logical Deduction (exact match) | 62.0 % | 64.7 % | 68.9 % |

  • Transition thoughts (the number of distinct reasoning steps before reaching a solution) increased by 14.89× under NSR‑PreRL.
  • Reflection thoughts (self‑correction loops) grew by 6.54×, indicating more internal “think‑aloud” behavior.
  • Ablation studies confirm that the gradients of $\log P(y)$ and $\log P(y|x)$ stay aligned, with cosine similarity above 0.92 throughout training, validating the theoretical claim.
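
The alignment figure quoted above is a cosine similarity between two gradient vectors. As a small illustration of how such a number is computed, the sketch below works on plain Python lists; the function name and the example gradient values are hypothetical and are not data from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical flattened gradients, invented for illustration only.
grad_marginal = [0.9, -0.4, 0.1]
grad_conditional = [1.0, -0.5, 0.0]
alignment = cosine_similarity(grad_marginal, grad_conditional)
print(f"{alignment:.3f}")
```

In practice the two inputs would be the flattened parameter gradients of the marginal and conditional log-likelihoods for the same sample, tracked over the course of training.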

Practical Implications

  • Faster Reasoning Fine‑Tuning: By pruning the wrong answer space early, developers can achieve higher accuracy with fewer RLHF epochs, saving GPU hours.
  • Better Generalization: Since PreRL works on the marginal distribution, the model retains its ability to generate diverse, creative text outside the fine‑tuned task—useful for chatbots that need both factual correctness and open‑ended generation.
  • Plug‑and‑Play RL Component: The NSR‑PreRL stage can be added on top of any existing LLM checkpoint without re‑training the entire model, making it attractive for SaaS providers that ship “base‑model + RL layer” packages.
  • Safety & Alignment: Negative‑sample reinforcement naturally suppresses toxic or hallucinated outputs that have high prior probability, offering a lightweight alignment tool before more expensive RLHF passes.

Limitations & Future Work

  • Reward Design Dependency: The approach still hinges on a well‑crafted reward function; poorly calibrated rewards can misguide the pruning process.
  • Scalability to Very Large Models: Experiments were run on 7‑B and 13‑B parameter models; extending to 70‑B+ scales may require gradient‑checkpointing tricks to keep memory usage tractable.
  • Static Corpus Shift: While PreRL mitigates distribution shift, the underlying pre‑training corpus remains static; future work could explore continual pre‑train‑space updates with streaming data.
  • Broader Task Spectrum: The paper focuses on reasoning‑heavy benchmarks; applying NSR‑PreRL to generation‑centric tasks (e.g., code synthesis, story generation) is an open avenue.

Authors

  • Yuqiao Tan
  • Minzheng Wang
  • Bo Liu
  • Zichen Liu
  • Tian Liang
  • Shizhu He
  • Jun Zhao
  • Kang Liu

Paper Information

  • arXiv ID: 2604.14142v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: April 15, 2026