[Paper] Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Published: May 7, 2026 at 01:48 PM EDT

Source: arXiv - 2605.06638v1

Overview

The paper investigates how reinforcement learning (RL) can teach large language models (LLMs) to perform long‑horizon logical reasoning. By introducing a controllable synthetic environment called ScaleLogic, the authors systematically explore how training effort scales with the depth of reasoning required and with the expressive power of the underlying logic. Their findings reveal a clear power‑law relationship and show that training on more expressive logics yields stronger, more compute‑efficient transfer to real‑world reasoning tasks.

Key Contributions

ScaleLogic framework: a synthetic benchmark that lets researchers vary (1) proof depth (how many reasoning steps are needed) and (2) logical expressiveness (from simple “if‑then” to full first‑order logic with ∧, ∨, ¬, ∀).
Empirical scaling law: RL compute $T$ grows as a power of reasoning depth $D$ ($T \propto D^{\gamma}$) with $R^{2} > 0.99$. The exponent $\gamma$ rises from ~1.0 for trivial logics to ~2.6 for richer logics.
Transfer benefits: Models trained on more expressive settings achieve up to +10.66 points on downstream math and reasoning benchmarks and require less compute to reach the same performance compared to models trained on simpler logics.
Method‑agnostic scaling: The power‑law holds across several RL algorithms (e.g., PPO, A2C), indicating the phenomenon is not tied to a specific optimizer.
Curriculum learning boost: Introducing a curriculum that gradually increases depth dramatically improves scaling efficiency, reducing the required compute for a given performance level.

Methodology

Synthetic environment design – ScaleLogic generates random logical statements and corresponding proofs. The user selects a logic family (implication‑only, conjunction‑enabled, full first‑order) and a depth $D$ that dictates how many inference steps a correct proof must contain.
LLM + RL loop – An LLM (e.g., GPT‑2/3‑size) proposes a proof step; an RL reward signal evaluates correctness (0/1) and optionally gives shaping rewards for partial progress. The policy is updated with standard policy‑gradient methods.
Scaling experiments – For each logic family, the authors train models across a range of depths (e.g., $D=2$ to $D=20$) and record total compute (GPU‑hours). They fit a power‑law curve $T = a D^{\gamma}$.
Curriculum schedule – A separate set of runs starts with shallow proofs and progressively increases $D$ once performance plateaus, mimicking “easy‑to‑hard” learning.
Downstream evaluation – After RL fine‑tuning, the same LLM is tested on public reasoning datasets (MATH, GSM‑8K, LogicalDeduction) without further task‑specific training to measure transfer.
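
The environment and the 0/1 reward described in the steps above can be illustrated with a toy sketch. This is not ScaleLogic's actual generator: it covers only an implication-only setting, and the names `make_chain_problem` and `binary_reward`, the atom naming scheme, and the distractor rules are all hypothetical choices made for illustration.

```python
import random

def make_chain_problem(depth, num_distractors=3, seed=None):
    """Build a toy implication-only problem whose only valid proof has `depth` steps."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    rng = random.Random(seed)
    # Atoms p0 -> p1 -> ... -> p_depth form the true proof chain.
    chain = [f"p{i}" for i in range(depth + 1)]
    rules = [(chain[i], chain[i + 1]) for i in range(depth)]
    # Distractor rules start from atoms q_j that are never derivable,
    # so they cannot be used to shorten the proof.
    for j in range(num_distractors):
        rules.append((f"q{j}", rng.choice(chain[1:])))
    rng.shuffle(rules)
    return {"fact": chain[0], "goal": chain[-1], "rules": rules}

def binary_reward(problem, proof_steps):
    """Return 1 if every step applies a given rule to a known atom and the goal is reached, else 0."""
    known = {problem["fact"]}
    rules = set(problem["rules"])
    for premise, conclusion in proof_steps:
        if (premise, conclusion) not in rules or premise not in known:
            return 0
        known.add(conclusion)
    return 1 if problem["goal"] in known else 0

problem = make_chain_problem(depth=4, seed=0)
gold = [(f"p{i}", f"p{i + 1}") for i in range(4)]
print(binary_reward(problem, gold))      # full proof
print(binary_reward(problem, gold[1:]))  # skips the first step, so p1 is used before it is derived
```

In this sketch the depth parameter directly controls the number of required inference steps, which is the knob the scaling experiments vary. A real RL loop would replace `gold` with steps parsed from the model's output.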

Results & Findings

Logic family	Scaling exponent $\gamma$	Compute to reach 70% depth‑accuracy	Transfer gain (Δ points)
Implication‑only	1.04	12 GPU‑hrs	+2.3
Conjunction‑enabled	1.68	38 GPU‑hrs	+5.7
Full first‑order (∧,∨,¬,∀)	2.60	112 GPU‑hrs	+10.66
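
An exponent like those in the table can be estimated with an ordinary least-squares fit in log-log space, since $T = a D^{\gamma}$ becomes $\log T = \log a + \gamma \log D$. The sketch below is one standard way to do this, not necessarily the authors' fitting procedure. The function name `fit_power_law` is hypothetical, and the input data are made up (generated from a known power law) rather than taken from the paper.

```python
import math

def fit_power_law(depths, compute):
    """Least-squares fit of log T = log a + gamma * log D; returns (a, gamma, r2)."""
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    gamma = sxy / sxx            # slope of the log-log line
    log_a = my - gamma * mx      # intercept of the log-log line
    ss_res = sum((y - (log_a + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return math.exp(log_a), gamma, 1.0 - ss_res / ss_tot

# Made-up measurements generated from T = 0.5 * D^1.68, NOT data from the paper.
depths = [2, 4, 8, 12, 16, 20]
compute = [0.5 * d ** 1.68 for d in depths]
a, gamma, r2 = fit_power_law(depths, compute)
print(round(a, 3), round(gamma, 3), round(r2, 4))
```

Note that the $R^{2}$ returned here is computed on the log-transformed values. On noise-free synthetic data the fit recovers the generating parameters up to floating-point error; real measurements would scatter around the line.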

Power‑law fit: $R^{2} > 0.99$ across all families, confirming a predictable scaling pattern.
Expressiveness matters: Training on more expressive logics yields larger absolute gains on downstream tasks and also improves compute efficiency; with a curriculum, the same performance is reached with ~30% less compute.
Algorithm robustness: PPO, A2C, and REINFORCE all exhibit the same exponent trends, suggesting the scaling law is intrinsic to the reasoning problem rather than the optimizer.
Curriculum effect: Curriculum‑trained models achieve the same final accuracy with roughly half the compute compared to a naïve “train at max depth from day one” baseline.
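
The easy-to-hard schedule can be sketched as a small controller that raises the training depth once accuracy stops improving. This is a minimal illustration under assumed settings, not the schedule used in the paper: the class name `DepthCurriculum`, the window size, and the `min_gain` plateau threshold are all hypothetical.

```python
from collections import deque

class DepthCurriculum:
    """Raise the training depth when windowed accuracy stops improving."""

    def __init__(self, start_depth=2, max_depth=20, window=100, min_gain=0.01):
        self.depth = start_depth
        self.max_depth = max_depth
        self.window = window
        self.min_gain = min_gain          # smallest gain that still counts as progress
        self.recent = deque(maxlen=window)
        self.prev_mean = None             # mean accuracy of the previous window

    def update(self, reward):
        """Record one 0/1 episode reward; return the depth to train on next."""
        self.recent.append(reward)
        if len(self.recent) < self.window:
            return self.depth
        mean = sum(self.recent) / self.window
        plateaued = (self.prev_mean is not None
                     and mean - self.prev_mean < self.min_gain)
        self.prev_mean = mean
        self.recent.clear()
        if plateaued and self.depth < self.max_depth:
            self.depth += 1
            self.prev_mean = None         # restart plateau detection at the new depth
        return self.depth

curriculum = DepthCurriculum(start_depth=2, max_depth=3, window=2)
history = [curriculum.update(1) for _ in range(8)]
print(history)
```

A pure plateau rule also advances when accuracy is flat at a low value, so a practical version would likely add a minimum-accuracy condition before increasing the depth.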

Practical Implications

LLM fine‑tuning pipelines: Teams can adopt a curriculum‑based RL fine‑tuning stage that first teaches shallow logical steps before moving to deeper proofs, dramatically cutting training costs.
Benchmark design: The ScaleLogic methodology offers a reproducible way to stress‑test reasoning capabilities of new LLMs before deploying them on costly real‑world datasets.
Productivity tools: Applications that rely on multi‑step reasoning (e.g., code synthesis assistants, automated theorem provers, data‑pipeline planners) can benefit from RL‑enhanced LLMs trained on richer logical forms, leading to more reliable step‑by‑step suggestions.
Compute budgeting: Knowing that compute scales as $D^{\gamma}$ lets engineers estimate resources needed for a target reasoning horizon, making project planning more transparent.
Cross‑domain transfer: Since expressive training improves performance on unrelated math and logic tasks, organizations can invest in a single, well‑designed RL curriculum rather than task‑specific fine‑tuning for each downstream problem.
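
The compute-budgeting point can be turned into a back-of-the-envelope calculation. If $T = a D^{\gamma}$ holds, a single measured cost at one depth can be scaled to another depth by the factor $(D_{\text{target}} / D_{\text{ref}})^{\gamma}$, without knowing $a$. The helper name `extrapolate_compute` and the reference point of 10 GPU-hours at depth 5 are hypothetical; only the three exponents come from the results table.

```python
def extrapolate_compute(ref_compute, ref_depth, target_depth, gamma):
    """Scale a measured compute cost to a new depth, assuming T = a * D**gamma."""
    if ref_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    return ref_compute * (target_depth / ref_depth) ** gamma

# Hypothetical reference point: 10 GPU-hours measured at depth 5.
# Exponents are the three values reported in the results table.
estimates = {
    gamma: extrapolate_compute(10.0, 5, 10, gamma)
    for gamma in (1.04, 1.68, 2.60)
}
for gamma, hours in estimates.items():
    print(f"gamma={gamma}: {hours:.1f} GPU-hours at depth 10")
```

Under these assumptions, doubling the depth roughly doubles the cost for the implication-only exponent but multiplies it about sixfold for the full first-order exponent. Such extrapolation is only as reliable as the fitted exponent and should be treated with caution outside the depth range where it was measured.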

Limitations & Future Work

Synthetic vs. real data: ScaleLogic, while controllable, may not capture the full messiness of natural language reasoning (ambiguities, implicit premises).
Model size scope: Experiments focus on mid‑size LLMs; scaling behavior for billion‑parameter models remains an open question.
Reward sparsity: The binary correctness reward can be sparse and noisy for very deep proofs; exploring denser shaping rewards or hierarchical RL could further improve efficiency.
Generalization to non‑logical tasks: Extending the curriculum approach to domains like planning, debugging, or multi‑modal reasoning is a promising direction.

By exposing a clear scaling law and demonstrating the outsized benefit of expressive logical training, this work provides a practical roadmap for developers looking to endow LLMs with robust, long‑horizon reasoning abilities.

Authors

Tianle Wang
Zhaoyang Wang
Guangchen Lan
Xinpeng Wei
Sipeng Zhang
Guanwen Qiu
Abulhair Saparov

Paper Information

arXiv ID: 2605.06638v1
Categories: cs.AI, cs.CL
Published: May 7, 2026
