[Paper] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Source: arXiv - 2512.07783v1
Overview
The paper investigates why reinforcement‑learning (RL) fine‑tuning sometimes makes language models (LMs) better at reasoning, and when it actually adds new capabilities beyond what the model learned during pre‑training. By building a fully controllable synthetic benchmark, the authors isolate the separate effects of (1) massive pre‑training, (2) a focused “mid‑training” stage, and (3) RL‑based post‑training. Their findings demystify the conditions under which RL truly improves reasoning and point to a previously under‑appreciated role for mid‑training.
Key Contributions
- Controlled experimental framework: synthetic reasoning tasks with explicit atomic operations and traceable step‑by‑step solutions, enabling causal attribution of performance gains.
- Three‑phase training analysis: systematic comparison of pre‑training, mid‑training, and RL fine‑tuning under identical compute budgets.
- Boundary‑condition insight: RL yields genuine capability gains only when the model still has “headroom” after pre‑training and the RL data sit at the edge of the model’s competence.
- Contextual transfer: Minimal pre‑training exposure to varied surface forms (e.g., paraphrases) is sufficient for RL‑trained reasoning to generalize across them.
- Mid‑training advantage: A targeted supervised mid‑training phase (no RL) consistently outperforms RL‑only fine‑tuning at the same compute budget.
- Process‑level rewards: Rewarding correct intermediate reasoning steps reduces reward‑hacking and improves the fidelity of the generated reasoning traces.
Methodology
- Synthetic Reasoning Suite – The authors construct a set of toy problems (e.g., arithmetic on lists, symbolic manipulation) that can be broken down into a sequence of atomic operations (add, multiply, lookup, etc.). Each problem comes with a ground‑truth reasoning trace, making it easy to verify whether a model’s answer follows the correct steps (a minimal sketch of such a task generator follows this list).
- Training Phases
  - Pre‑training: Large‑scale language modeling on a generic corpus (simulated with random text) to give the model basic linguistic knowledge.
  - Mid‑training: A focused supervised phase on a subset of the synthetic tasks, designed to teach the model the structure of the reasoning operations without any RL signal.
  - RL post‑training: Proximal Policy Optimization (PPO), where the reward is based on final‑answer correctness and (in the process‑reward variant) on the correctness of each intermediate step.
- Evaluation Axes
  - Extrapolative generalization: Test on longer or more deeply nested compositions than seen during training.
  - Contextual generalization: Test on the same logical tasks expressed with different wording or formatting.
- Controlled Variables – Compute budget, model size, and data distribution are held constant across experiments, allowing a clean causal comparison of the three phases.
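To make the benchmark concrete, below is a minimal sketch, assuming a small arithmetic domain, of how such a synthetic suite could be generated: each problem is a composition of atomic operations, and the ground‑truth reasoning trace is stored alongside the final answer. The operation set, the `depth` parameter, and the trace format are illustrative assumptions rather than the authors' implementation; increasing `depth` beyond the training range mimics the extrapolative‑generalization probe.

```python
# Hypothetical generator in the spirit of the paper's synthetic suite.
# Operation names, prompt wording, and trace format are assumptions.
import random
from dataclasses import dataclass


ATOMIC_OPS = {
    "add": lambda x, y: x + y,
    "mul": lambda x, y: x * y,
    "max": lambda x, y: max(x, y),
}


@dataclass
class Problem:
    prompt: str        # surface-form statement of the task
    trace: list[str]   # ground-truth step-by-step reasoning
    answer: int        # final answer used for outcome rewards


def sample_problem(depth: int, seed: int | None = None) -> Problem:
    """Compose `depth` atomic operations; depths larger than those seen in
    training are used to probe extrapolative generalization."""
    rng = random.Random(seed)
    value = rng.randint(0, 9)
    steps, phrases = [], [f"start with {value}"]
    for _ in range(depth):
        op_name = rng.choice(list(ATOMIC_OPS))
        operand = rng.randint(0, 9)
        result = ATOMIC_OPS[op_name](value, operand)
        steps.append(f"{op_name}({value}, {operand}) = {result}")
        phrases.append(f"{op_name} {operand}")
        value = result
    prompt = ", then ".join(phrases).capitalize() + "."
    return Problem(prompt=prompt, trace=steps, answer=value)


# A depth-3 training item versus a depth-8 extrapolation probe.
train_item = sample_problem(depth=3, seed=0)
eval_item = sample_problem(depth=8, seed=0)
print(train_item.prompt, train_item.trace, train_item.answer)
```

The contextual‑generalization axis could be exercised in the same way by rendering the same underlying trace under several different prompt templates.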
Results & Findings
| Training Regime | Extrapolation (pass@128) | Contextual Transfer | Compute Efficiency |
|---|---|---|---|
| Pre‑train only | Low (≈10 %) | Near‑random | Baseline |
| Mid‑train only (no RL) | Moderate (≈35 %) | Good (≈70 %) | 1× |
| RL only (post‑pre‑train) | High only when pre‑train headroom exists (≈55 %) | Good if pre‑train gave minimal exposure | 1× |
| Mid‑train + RL | Best overall (≈70 % extrapolation, ≈85 % contextual) | Highest transfer | Same compute as RL‑only |
- RL gains are conditional: When pre‑training already saturates the task distribution, RL adds little; when the model is still “on the edge,” RL pushes it over the line.
- Process‑level rewards cut down on “reward hacking” (e.g., models learning to output the correct answer without proper reasoning) and improve trace correctness by ~15 % (a toy reward sketch follows this list).
- Mid‑training shines: With the same compute budget, a short supervised phase on the target reasoning patterns yields larger jumps than RL alone, suggesting that teaching the shape of the problem first is crucial.
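As a hedged illustration of the process‑reward finding above (a toy sketch under assumed trace and weighting conventions, not the paper's exact scheme): an outcome‑only reward scores just the final answer, while a process‑level reward also checks each generated step against the ground‑truth trace, so a trajectory that guesses the right answer with wrong steps no longer earns full credit.

```python
# Toy comparison of outcome-only vs. process-level rewards; the equality-based
# step check and the 50/50 weighting are illustrative assumptions.

def outcome_reward(pred_answer: int, gold_answer: int) -> float:
    """Score only the final answer; susceptible to reward hacking."""
    return 1.0 if pred_answer == gold_answer else 0.0


def process_reward(pred_steps: list[str], pred_answer: int,
                   gold_steps: list[str], gold_answer: int,
                   step_weight: float = 0.5) -> float:
    """Blend final-answer correctness with the fraction of intermediate
    steps that match the ground-truth trace."""
    matched = sum(p == g for p, g in zip(pred_steps, gold_steps))
    step_score = matched / max(len(gold_steps), 1)
    return step_weight * step_score + (1.0 - step_weight) * outcome_reward(
        pred_answer, gold_answer)


gold_trace = ["add(5, 3) = 8", "mul(8, 2) = 16"]
hacked_trace = ["add(5, 3) = 9", "mul(9, 2) = 16"]   # wrong steps, lucky answer
print(outcome_reward(16, 16))                         # 1.0
print(process_reward(hacked_trace, 16, gold_trace, 16))  # 0.5: hacking penalized
```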
Practical Implications
- Designing RL pipelines: Before launching expensive RL fine‑tuning, verify that the base model still has headroom on the target task. Use a curriculum that presents RL data right at the competence boundary rather than far beyond it (a sketch of such a filter follows this list).
- Mid‑training as a cheap boost: Insert a short, supervised “mid‑training” stage that focuses on the core reasoning primitives of your downstream task (e.g., code analysis, math, logical inference). This can be far cheaper than RL and yields comparable or better gains.
- Reward engineering: Incorporate intermediate step verification (e.g., unit tests, symbolic checks) into the RL reward to enforce faithful reasoning, which is especially relevant for safety‑critical applications like automated theorem proving or financial decision support.
- Transfer across contexts: Minimal exposure to diverse surface forms during pre‑training (or a quick “contextual fine‑tuning” pass) is enough for RL to generalize reasoning to new phrasings, reducing the need for exhaustive data augmentation.
- Compute budgeting: For a fixed compute budget, allocate a portion to mid‑training before RL; the paper shows this yields higher overall performance than spending it all on RL.
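One way to act on the competence‑boundary advice above is sketched below; the helper `sample_answers`, the sampling budget, and the pass‑rate thresholds are assumptions for illustration, not part of the paper, and `problem.answer` follows the toy Problem structure sketched in the Methodology section. The idea is to keep only RL prompts that the base model sometimes, but not always, solves, i.e. prompts where measurable headroom remains.

```python
# Hypothetical curriculum filter that keeps RL prompts near the base model's
# competence boundary. `sample_answers` stands in for whatever generation API
# is available; it is an assumed helper, not a real library call.
from typing import Callable, Sequence


def pass_rate(problem, sample_answers: Callable[..., Sequence[int]],
              k: int = 16) -> float:
    """Fraction of k sampled completions whose final answer is correct."""
    answers = sample_answers(problem, k)
    return sum(a == problem.answer for a in answers) / k


def boundary_curriculum(problems, sample_answers,
                        low: float = 0.1, high: float = 0.7):
    """Drop prompts the base model never solves (too hard to learn from yet)
    or almost always solves (no headroom left for RL to add capability)."""
    return [p for p in problems
            if low <= pass_rate(p, sample_answers) <= high]
```

The thresholds and sampling budget would need tuning per task; the point is only that RL data selection is driven by the base model's measured headroom, matching the boundary‑condition finding.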
Limitations & Future Work
- Synthetic domain: The benchmark uses toy tasks with clean, deterministic operations; real‑world reasoning (e.g., commonsense, code synthesis) is messier and may not follow the same patterns.
- Scale: Experiments are run on modest‑size models (≈125 M parameters). It remains open how the findings translate to multi‑billion‑parameter LMs.
- Reward design complexity: Process‑level rewards require a way to automatically verify intermediate steps, which may be non‑trivial for unstructured domains.
- Future directions: Extending the framework to semi‑synthetic or natural‑language reasoning datasets, exploring automated curriculum generation for RL boundary data, and testing the interplay on larger models and multi‑modal inputs.
Authors
- Charlie Zhang
- Graham Neubig
- Xiang Yue
Paper Information
- arXiv ID: 2512.07783v1
- Categories: cs.CL
- Published: December 8, 2025