[Paper] Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Published: May 7, 2026 at 01:48 PM EDT
Source: arXiv - 2605.06638v1

Overview

The paper investigates how reinforcement learning (RL) can teach large language models (LLMs) to perform long‑horizon logical reasoning. By introducing a controllable synthetic environment called ScaleLogic, the authors systematically explore how training effort scales with the depth of reasoning required and with the expressive power of the underlying logic. Their findings reveal a clear power‑law relationship and show that training on more expressive logics yields stronger, more compute‑efficient transfer to real‑world reasoning tasks.

Key Contributions

  • ScaleLogic framework: a synthetic benchmark that lets researchers vary (1) proof depth (how many reasoning steps are needed) and (2) logical expressiveness (from simple “if‑then” to full first‑order logic with ∧, ∨, ¬, ∀).
  • Empirical scaling law: RL compute \(T\) grows as a power of reasoning depth \(D\), \(T \propto D^{\gamma}\), with \(R^{2} > 0.99\). The exponent \(\gamma\) rises from ~1.0 for trivial logics to ~2.6 for richer logics.
  • Transfer benefits: Models trained on more expressive settings achieve up to +10.66 points on downstream math and reasoning benchmarks and require less compute to reach the same performance compared to models trained on simpler logics.
  • Method‑agnostic scaling: The power‑law holds across several RL algorithms (e.g., PPO, A2C), indicating the phenomenon is not tied to a specific optimizer.
  • Curriculum learning boost: Introducing a curriculum that gradually increases depth dramatically improves scaling efficiency, reducing the required compute for a given performance level.
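
To make the scaling-law claim concrete, here is a minimal sketch of one common way to estimate such an exponent: ordinary least squares on the log-log form, log T = log a + γ log D. The function name `fit_power_law` and the data points are hypothetical (the data is generated from a known curve, not taken from the paper), and the summary does not say which fitting procedure the authors used.

```python
import math

def fit_power_law(depths, computes):
    """Least-squares fit of log T = log a + gamma * log D.

    Returns (a, gamma, r2), where r2 is computed in log space.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in computes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope of the log-log line
    log_a = mean_y - gamma * mean_x        # intercept of the log-log line
    ss_res = sum((y - (log_a + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(log_a), gamma, r2

# Made-up data generated from T = 0.5 * D^2.0, NOT measurements from the paper.
depths = [2, 4, 8, 16]
computes = [0.5 * d ** 2.0 for d in depths]
a, gamma, r2 = fit_power_law(depths, computes)
print(f"a={a:.3f}, gamma={gamma:.3f}, R^2={r2:.4f}")
```

Because the toy data lies exactly on a power law, the fit recovers the generating exponent; with real measurements the R² value indicates how well a single exponent describes the whole depth range.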

Methodology

  1. Synthetic environment design – ScaleLogic generates random logical statements and corresponding proofs. The user selects a logic family (implication‑only, conjunction‑enabled, full first‑order) and a depth \(D\) that dictates how many inference steps a correct proof must contain.
  2. LLM + RL loop – An LLM (e.g., GPT‑2/3‑size) proposes a proof step; an RL reward signal evaluates correctness (0/1) and optionally gives shaping rewards for partial progress. The policy is updated with standard policy‑gradient methods.
  3. Scaling experiments – For each logic family, the authors train models across a range of depths (e.g., \(D=2\) to \(D=20\)) and record total compute (GPU‑hours). They fit a power‑law curve \(T = a D^{\gamma}\).
  4. Curriculum schedule – A separate set of runs starts with shallow proofs and progressively increases \(D\) once performance plateaus, mimicking “easy‑to‑hard” learning.
  5. Downstream evaluation – After RL fine‑tuning, the same LLM is tested on public reasoning datasets (MATH, GSM‑8K, LogicalDeduction) without further task‑specific training to measure transfer.
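
As a rough illustration of steps 1 and 2, the sketch below builds a toy implication-only problem whose shortest proof needs exactly `depth` modus ponens steps, and scores a candidate proof with a binary reward. This is not the ScaleLogic implementation: `make_problem`, `reward`, the atom names, and the distractor scheme are all assumptions made for the example.

```python
import random

def make_problem(depth, n_distractors=2, seed=0):
    """Build a toy implication-only problem needing `depth` modus ponens steps."""
    rng = random.Random(seed)
    chain = [f"p{i}" for i in range(depth + 1)]
    # Rules p0 -> p1 -> ... -> p_depth form the only valid proof path.
    rules = [(chain[i], chain[i + 1]) for i in range(depth)]
    # Distractor rules start from atoms q_i that can never be derived.
    rules += [(f"q{i}", rng.choice(chain[1:])) for i in range(n_distractors)]
    rng.shuffle(rules)
    return {"facts": {chain[0]}, "rules": rules, "goal": chain[-1]}

def reward(problem, proof):
    """Binary reward: 1 if every step is a valid rule application and the goal is derived."""
    known = set(problem["facts"])
    for premise, conclusion in proof:
        if (premise, conclusion) not in problem["rules"] or premise not in known:
            return 0
        known.add(conclusion)
    return int(problem["goal"] in known)

problem = make_problem(depth=3)
gold_proof = [(f"p{i}", f"p{i + 1}") for i in range(3)]
print(reward(problem, gold_proof))       # complete, valid 3-step proof
print(reward(problem, gold_proof[1:]))   # skips the first step, so it is rejected
```

In an actual RL loop the proof steps would come from the model's sampled output rather than a hand-written list, and the reward would feed a policy-gradient update.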

Results & Findings

Logic family                | Scaling exponent \(\gamma\) | Compute to reach 70% depth‑accuracy | Transfer gain (Δ points)
Implication‑only            | 1.04                        | 12 GPU‑hrs                          | +2.3
Conjunction‑enabled         | 1.68                        | 38 GPU‑hrs                          | +5.7
Full first‑order (∧,∨,¬,∀)  | 2.60                        | 112 GPU‑hrs                         | +10.66

  • Power‑law fit: \(R^{2} > 0.99\) across all families, confirming a predictable scaling pattern.
  • Expressiveness matters: Higher‑expressiveness training yields larger absolute gains on downstream tasks and also improves compute efficiency; with a curriculum, the same performance is reached with ~30% less compute.
  • Algorithm robustness: PPO, A2C, and REINFORCE all exhibit the same exponent trends, suggesting the scaling law is intrinsic to the reasoning problem rather than the optimizer.
  • Curriculum effect: Curriculum‑trained models achieve the same final accuracy with roughly half the compute compared to a naïve “train at max depth from day one” baseline.
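
The curriculum runs increase depth once performance plateaus. A minimal sketch of such a schedule is shown below, assuming a plateau means the last `window` evaluation accuracies differ by less than `min_gain`. The function `run_curriculum`, its parameters, and the accuracy values are hypothetical; the paper's actual plateau criterion is not given in this summary.

```python
def run_curriculum(accuracy_stream, start_depth=2, max_depth=5, window=3, min_gain=0.01):
    """Return (schedule, final_depth) for a plateau-triggered easy-to-hard curriculum.

    schedule[i] is the proof depth used for the i-th evaluation period.
    """
    depth = start_depth
    history = []    # accuracies observed at the current depth
    schedule = []
    for acc in accuracy_stream:
        schedule.append(depth)
        history.append(acc)
        last = history[-window:]
        plateaued = len(last) == window and max(last) - min(last) < min_gain
        if plateaued and depth < max_depth:
            depth += 1
            history = []    # restart plateau detection at the new depth
    return schedule, depth

# Made-up evaluation accuracies, for illustration only.
accs = [0.2, 0.5, 0.7, 0.8, 0.805, 0.806, 0.3, 0.6, 0.6, 0.6]
schedule, final_depth = run_curriculum(accs)
print(schedule, final_depth)
```

Resetting the history after each depth increase prevents the accuracy drop that usually follows a harder task from being compared against results at the easier depth.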

Practical Implications

  • LLM fine‑tuning pipelines: Teams can adopt a curriculum‑based RL fine‑tuning stage that first teaches shallow logical steps before moving to deeper proofs; in the reported experiments this roughly halved the compute needed to reach the same final accuracy.
  • Benchmark design: The ScaleLogic methodology offers a reproducible way to stress‑test reasoning capabilities of new LLMs before deploying them on costly real‑world datasets.
  • Productivity tools: Applications that rely on multi‑step reasoning (e.g., code synthesis assistants, automated theorem provers, data‑pipeline planners) can benefit from RL‑enhanced LLMs trained on richer logical forms, leading to more reliable step‑by‑step suggestions.
  • Compute budgeting: Knowing that compute scales as \(D^{\gamma}\) lets engineers estimate resources needed for a target reasoning horizon, making project planning more transparent.
  • Cross‑domain transfer: Since expressive training improves performance on unrelated math and logic tasks, organizations can invest in a single, well‑designed RL curriculum rather than task‑specific fine‑tuning for each downstream problem.
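
The compute-budgeting point can be sketched as a back-of-the-envelope calculation. If T = a·D^γ holds, a single measured run at depth D₀ fixes the constant, so T(D) = T₀·(D/D₀)^γ. The example below uses the exponents from the results table; the reference run (10 GPU-hours at depth 5) and the helper `extrapolate_compute` are hypothetical, and the estimate is only meaningful inside the depth range where the fit was validated.

```python
def extrapolate_compute(ref_depth, ref_compute, target_depth, gamma):
    """Scale a measured compute cost to a new depth, assuming T = a * D^gamma."""
    return ref_compute * (target_depth / ref_depth) ** gamma

# Exponents as reported in the results table above.
GAMMAS = {
    "implication-only": 1.04,
    "conjunction-enabled": 1.68,
    "full first-order": 2.60,
}

# Hypothetical reference run: 10 GPU-hours at depth 5 (not a figure from the paper).
for family, gamma in GAMMAS.items():
    est = extrapolate_compute(ref_depth=5, ref_compute=10.0, target_depth=10, gamma=gamma)
    print(f"{family}: about {est:.1f} GPU-hours at depth 10")
```

Doubling the depth multiplies compute by 2^γ, so the same doubling costs roughly twice as much for implication-only logic but about six times as much for full first-order logic.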

Limitations & Future Work

  • Synthetic vs. real data: ScaleLogic, while controllable, may not capture the full messiness of natural language reasoning (ambiguities, implicit premises).
  • Model size scope: Experiments focus on mid‑size LLMs; scaling behavior for billion‑parameter models remains an open question.
  • Reward sparsity: The binary correctness reward can be noisy for very deep proofs; exploring denser shaping rewards or hierarchical RL could further improve efficiency.
  • Generalization to non‑logical tasks: Extending the curriculum approach to domains like planning, debugging, or multi‑modal reasoning is a promising direction.

By exposing a clear scaling law and demonstrating the outsized benefit of expressive logical training, this work provides a practical roadmap for developers looking to endow LLMs with robust, long‑horizon reasoning abilities.

Authors

  • Tianle Wang
  • Zhaoyang Wang
  • Guangchen Lan
  • Xinpeng Wei
  • Sipeng Zhang
  • Guanwen Qiu
  • Abulhair Saparov

Paper Information

  • arXiv ID: 2605.06638v1
  • Categories: cs.AI, cs.CL
  • Published: May 7, 2026