[Paper] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Source: arXiv - 2512.07783v1
Overview
The paper investigates why reinforcement‑learning (RL) fine‑tuning sometimes makes language models (LMs) better at reasoning, and when it actually adds new capabilities beyond what the model learned during pre‑training. By building a fully controllable synthetic benchmark, the authors isolate the separate effects of (1) massive pre‑training, (2) a focused “mid‑training” stage, and (3) RL‑based post‑training. Their findings demystify the conditions under which RL truly improves reasoning and point to a previously under‑appreciated role for mid‑training.
Key Contributions
- Controlled experimental framework: synthetic reasoning tasks with explicit atomic operations and traceable step‑by‑step solutions, enabling causal attribution of performance gains.
- Three‑phase training analysis: systematic comparison of pre‑training, mid‑training, and RL fine‑tuning under identical compute budgets.
- Boundary‑condition insight: RL yields genuine capability gains only when the model still has “headroom” after pre‑training and the RL data sit at the edge of the model’s competence.
- Contextual transfer: Minimal pre‑training exposure to varied surface forms (e.g., paraphrases) is sufficient for RL‑trained reasoning to generalize across them.
- Mid‑training advantage: A targeted supervised mid‑training phase (no RL) consistently outperforms RL‑only fine‑tuning at the same compute budget.
- Process‑level rewards: Rewarding correct intermediate reasoning steps reduces reward‑hacking and improves the fidelity of the generated reasoning traces.
Methodology
- Synthetic Reasoning Suite – The authors construct a set of toy problems (e.g., arithmetic on lists, symbolic manipulation) that can be broken down into a sequence of atomic operations (add, multiply, lookup, etc.). Each problem comes with a ground‑truth reasoning trace, making it easy to verify whether a model’s answer follows the correct steps (a minimal sketch of such a task generator follows this list).
- Training Phases
  - Pre‑training: Large‑scale language modeling on a generic corpus (simulated with random text) to give the model basic linguistic knowledge.
  - Mid‑training: A focused supervised phase on a subset of the synthetic tasks, designed to teach the model the structure of the reasoning operations without any RL signal.
  - RL post‑training: Proximal Policy Optimization (PPO), where the reward is based on final‑answer correctness and (in the process‑reward variant) on the correctness of each intermediate step.
- Evaluation Axes
  - Extrapolative generalization: Test on longer or more deeply nested compositions than seen during training.
  - Contextual generalization: Test on the same logical tasks expressed with different wording or formatting.
- Controlled Variables – Compute budget, model size, and data distribution are held constant across experiments, allowing a clean causal comparison of the three phases.
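To make the benchmark concrete, below is a minimal sketch, assuming a small arithmetic domain, of how such a synthetic suite could be generated: each problem is a composition of atomic operations, and the ground‑truth reasoning trace is stored alongside the final answer. The operation set, the `depth` parameter, and the trace format are illustrative assumptions rather than the authors' implementation; increasing `depth` beyond the training range mimics the extrapolative‑generalization probe.

```python
# Hypothetical generator in the spirit of the paper's synthetic suite.
# Operation names, prompt wording, and trace format are assumptions.
import random
from dataclasses import dataclass


ATOMIC_OPS = {
    "add": lambda x, y: x + y,
    "mul": lambda x, y: x * y,
    "max": lambda x, y: max(x, y),
}


@dataclass
class Problem:
    prompt: str        # surface-form statement of the task
    trace: list[str]   # ground-truth step-by-step reasoning
    answer: int        # final answer used for outcome rewards


def sample_problem(depth: int, seed: int | None = None) -> Problem:
    """Compose `depth` atomic operations; depths larger than those seen in
    training are used to probe extrapolative generalization."""
    rng = random.Random(seed)
    value = rng.randint(0, 9)
    steps, phrases = [], [f"start with {value}"]
    for _ in range(depth):
        op_name = rng.choice(list(ATOMIC_OPS))
        operand = rng.randint(0, 9)
        result = ATOMIC_OPS[op_name](value, operand)
        steps.append(f"{op_name}({value}, {operand}) = {result}")
        phrases.append(f"{op_name} {operand}")
        value = result
    prompt = ", then ".join(phrases).capitalize() + "."
    return Problem(prompt=prompt, trace=steps, answer=value)


# A depth-3 training item versus a depth-8 extrapolation probe.
train_item = sample_problem(depth=3, seed=0)
eval_item = sample_problem(depth=8, seed=0)
print(train_item.prompt, train_item.trace, train_item.answer)
```

The contextual‑generalization axis could be exercised in the same way by rendering the same underlying trace under several different prompt templates.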
Results & Findings
| Training Regime | Extrapolation (pass@128) | Contextual Transfer | Compute Efficiency |
|---|---|---|---|
| Pre‑train only | Low (≈10 %) | Near‑random | Baseline |
| Mid‑train only (no RL) | Moderate (≈35 %) | Good (≈70 %) | 1× |
| RL only (post‑pre‑train) | High only when pre‑train headroom exists (≈55 %) | Good if pre‑train gave minimal exposure | 1× |
| Mid‑train + RL | Best overall (≈70 % extrapolation, ≈85 % contextual) | Highest transfer | Same compute as RL‑only |
- RL gains are conditional: When pre‑training already saturates the task distribution, RL adds little; when the model is still “on the edge,” RL pushes it over the line.
- Process‑level rewards cut down on “reward hacking” (e.g., models learning to output the correct answer without proper reasoning) and improve trace correctness by ~15 % (a toy reward sketch follows this list).
- Mid‑training shines: With the same compute budget, a short supervised phase on the target reasoning patterns yields larger jumps than RL alone, suggesting that teaching the shape of the problem first is crucial.
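As a hedged illustration of the process‑reward finding above (a toy sketch under assumed trace and weighting conventions, not the paper's exact scheme): an outcome‑only reward scores just the final answer, while a process‑level reward also checks each generated step against the ground‑truth trace, so a trajectory that guesses the right answer with wrong steps no longer earns full credit.

```python
# Toy comparison of outcome-only vs. process-level rewards; the equality-based
# step check and the 50/50 weighting are illustrative assumptions.

def outcome_reward(pred_answer: int, gold_answer: int) -> float:
    """Score only the final answer; susceptible to reward hacking."""
    return 1.0 if pred_answer == gold_answer else 0.0


def process_reward(pred_steps: list[str], pred_answer: int,
                   gold_steps: list[str], gold_answer: int,
                   step_weight: float = 0.5) -> float:
    """Blend final-answer correctness with the fraction of intermediate
    steps that match the ground-truth trace."""
    matched = sum(p == g for p, g in zip(pred_steps, gold_steps))
    step_score = matched / max(len(gold_steps), 1)
    return step_weight * step_score + (1.0 - step_weight) * outcome_reward(
        pred_answer, gold_answer)


gold_trace = ["add(5, 3) = 8", "mul(8, 2) = 16"]
hacked_trace = ["add(5, 3) = 9", "mul(9, 2) = 16"]   # wrong steps, lucky answer
print(outcome_reward(16, 16))                         # 1.0
print(process_reward(hacked_trace, 16, gold_trace, 16))  # 0.5: hacking penalized
```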
Practical Implications
- Designing RL pipelines: Before launching expensive RL fine‑tuning, verify that the base model still has headroom on the target task. Use a curriculum that presents RL data right at the competence boundary rather than far beyond it (a sketch of such a filter follows this list).
- Mid‑training as a cheap boost: Insert a short, supervised “mid‑training” stage that focuses on the core reasoning primitives of your downstream task (e.g., code analysis, math, logical inference). This can be far cheaper than RL and yields comparable or better gains.
- Reward engineering: Incorporate intermediate step verification (e.g., unit tests, symbolic checks) into the RL reward to enforce faithful reasoning, which is especially relevant for safety‑critical applications like automated theorem proving or financial decision support.
- Transfer across contexts: Minimal exposure to diverse surface forms during pre‑training (or a quick “contextual fine‑tuning” pass) is enough for RL to generalize reasoning to new phrasings, reducing the need for exhaustive data augmentation.
- Compute budgeting: For a fixed compute budget, allocate a portion to mid‑training before RL; the paper shows this yields higher overall performance than spending it all on RL.
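One way to act on the competence‑boundary advice above is sketched below; the helper `sample_answers`, the sampling budget, and the pass‑rate thresholds are assumptions for illustration, not part of the paper, and `problem.answer` follows the toy Problem structure sketched in the Methodology section. The idea is to keep only RL prompts that the base model sometimes, but not always, solves, i.e. prompts where measurable headroom remains.

```python
# Hypothetical curriculum filter that keeps RL prompts near the base model's
# competence boundary. `sample_answers` stands in for whatever generation API
# is available; it is an assumed helper, not a real library call.
from typing import Callable, Sequence


def pass_rate(problem, sample_answers: Callable[..., Sequence[int]],
              k: int = 16) -> float:
    """Fraction of k sampled completions whose final answer is correct."""
    answers = sample_answers(problem, k)
    return sum(a == problem.answer for a in answers) / k


def boundary_curriculum(problems, sample_answers,
                        low: float = 0.1, high: float = 0.7):
    """Drop prompts the base model never solves (too hard to learn from yet)
    or almost always solves (no headroom left for RL to add capability)."""
    return [p for p in problems
            if low <= pass_rate(p, sample_answers) <= high]
```

The thresholds and sampling budget would need tuning per task; the point is only that RL data selection is driven by the base model's measured headroom, matching the boundary‑condition finding.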
Limitations & Future Work
- Synthetic domain: The benchmark uses toy tasks with clean, deterministic operations; real‑world reasoning (e.g., commonsense, code synthesis) is messier and may not follow the same patterns.
- Scale: Experiments are run on modest‑size models (≈125 M parameters). It remains open how the findings translate to multi‑billion‑parameter LMs.
- Reward design complexity: Process‑level rewards require a way to automatically verify intermediate steps, which may be non‑trivial for unstructured domains.
- Future directions: Extending the framework to semi‑synthetic or natural‑language reasoning datasets, exploring automated curriculum generation for RL boundary data, and testing the interplay on larger models and multi‑modal inputs.
Authors
- Charlie Zhang
- Graham Neubig
- Xiang Yue
Paper Information
- arXiv ID: 2512.07783v1
- Categories: cs.CL
- Published: December 8, 2025