[Paper] Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning
Source: arXiv - 2511.21581v1
Overview
The paper “Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning” proposes a way to make large language models (LLMs) reason more efficiently. By letting the model decide on‑the‑fly how many latent reasoning steps to take, it can cut the amount of computation needed without sacrificing answer quality—an attractive prospect for anyone deploying LLMs at scale.
Key Contributions
- Adaptive‑length latent reasoning: Introduces a reinforcement‑learning (RL) controller that learns to stop the reasoning chain once sufficient information has been gathered.
- Post‑SFT RL fine‑tuning: Applies RL after standard supervised fine‑tuning (SFT) to directly optimize the trade‑off between reasoning length and task accuracy.
- Empirical gains on a 1B Llama 3.2 model: Demonstrates a 52 % reduction in total reasoning tokens on the GSM8K‑Aug benchmark with no drop in correctness.
- Open‑source release: Provides code, training scripts, and pretrained weights, enabling reproducibility and rapid adoption.
Methodology
Latent Reasoning Backbone
The authors start from a standard Transformer that, instead of emitting human‑readable “chain‑of‑thought” tokens, passes an internal latent state from one reasoning step to the next. This sidesteps the need to decode every intermediate thought into discrete language tokens.
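To make this concrete, here is a minimal sketch of such a latent reasoning loop. It is not the paper's released code: the class and method names are invented, and a GRU cell stands in for the Transformer backbone purely to keep the example short.

```python
# Minimal sketch of latent reasoning: the hidden state of each step is fed back
# as the next "input embedding" instead of being decoded into a language token.
# Illustrative only; a GRUCell stands in for the Transformer backbone.
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRUCell(d_model, d_model)   # stand-in for the Transformer
        self.answer_head = nn.Linear(d_model, vocab_size)

    def forward(self, question_ids, num_latent_steps: int):
        # Encode the question into an initial latent state.
        h = self.embed(question_ids).mean(dim=1)       # (batch, d_model)
        latent_input = torch.zeros_like(h)
        for _ in range(num_latent_steps):
            # Latent step: the state itself is the next input, so no discrete
            # token is ever emitted between reasoning steps.
            h = self.backbone(latent_input, h)
            latent_input = h
        return self.answer_head(h)                     # final-answer logits

model = LatentReasoner()
question = torch.randint(0, 1000, (2, 16))             # toy batch of token ids
print(model(question, num_latent_steps=4).shape)       # torch.Size([2, 1000])
```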
RL Controller
A lightweight policy network observes the current latent state and decides whether to:
- Continue: Run another latent reasoning iteration, or
- Stop: Emit the final answer.
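Concretely, such a controller can be as small as a two‑layer MLP over the latent state. The sketch below is a hedged illustration; the layer sizes, the sample‑at‑training / threshold‑at‑inference split, and all names are assumptions rather than the paper's implementation.

```python
# Illustrative stop/continue controller: a small policy head reads the current
# latent state and outputs a Bernoulli "stop" probability.
import torch
import torch.nn as nn

class StopPolicy(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, latent_state):
        stop_prob = torch.sigmoid(self.net(latent_state).squeeze(-1))  # (batch,)
        # Sample during training so the policy can explore; at inference one
        # would typically threshold instead (e.g. stop if stop_prob > 0.5).
        stop = torch.bernoulli(stop_prob)
        return stop, stop_prob
```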
The policy is trained with a reward that balances two objectives:
- Accuracy reward (positive if the final answer matches the ground truth).
- Efficiency penalty proportional to the number of latent steps taken.
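In scalar form, a reward with this shape might look like the sketch below; the penalty coefficient (here called lambda_steps) is a hypothetical hyper‑parameter, not a value reported in the paper.

```python
# Hedged sketch of the reward: +1 for a correct final answer, minus a penalty
# proportional to the number of latent steps used.
def reward(correct: bool, num_latent_steps: int, lambda_steps: float = 0.05) -> float:
    accuracy_reward = 1.0 if correct else 0.0
    efficiency_penalty = lambda_steps * num_latent_steps
    return accuracy_reward - efficiency_penalty

print(reward(correct=True, num_latent_steps=4))   # 0.8: correct, but 4 steps cost 0.2
```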
Training Pipeline
- Stage 1: Supervised fine‑tuning (SFT) on the GSM8K‑Aug dataset to teach the model basic problem‑solving.
- Stage 2: Post‑SFT RL fine‑tuning where the controller learns to truncate reasoning when possible.
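The summary does not name the exact RL algorithm, so the following sketch uses plain REINFORCE as one plausible instantiation of Stage 2. It reuses the LatentReasoner and StopPolicy sketches from above; answer_checker is a placeholder for GSM8K‑Aug answer matching, and per‑example early stopping is not masked, to keep the example short.

```python
# One possible Stage-2 update: roll out latent steps until the controller stops
# (or a hard cap is hit), score the episode with the accuracy/efficiency reward,
# and apply a REINFORCE policy-gradient step on the stop/continue decisions.
import torch

def rl_step(model, policy, optimizer, question_ids, answer_checker,
            max_steps=16, lambda_steps=0.05):
    h = model.embed(question_ids).mean(dim=1)
    latent_input = torch.zeros_like(h)
    log_probs, steps = [], 0
    for _ in range(max_steps):
        h = model.backbone(latent_input, h)
        latent_input = h
        steps += 1
        stop, stop_prob = policy(h)
        # Log-probability of the sampled stop/continue action.
        action_prob = torch.where(stop.bool(), stop_prob, 1.0 - stop_prob)
        log_probs.append(torch.log(action_prob + 1e-8))
        if stop.all():                                  # whole batch chose to stop
            break
    logits = model.answer_head(h)
    correct = answer_checker(logits)                    # bool tensor, shape (batch,)
    r = correct.float() - lambda_steps * steps          # accuracy minus step penalty
    # REINFORCE: maximize E[r] by minimizing -r * sum_t log pi(a_t); the optimizer
    # should cover the policy parameters (and optionally the backbone).
    loss = -(r.detach() * torch.stack(log_probs).sum(dim=0)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```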
Evaluation
The authors measure both reasoning length (total latent tokens generated) and task accuracy on the benchmark, comparing the adaptive model against a fixed‑length baseline.
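A minimal harness for those two metrics might look like the sketch below; run_fixed, run_adaptive, and the dataset object are placeholders, and each run function is assumed to return the predicted answer together with the number of latent steps it used.

```python
# Illustrative harness for the two reported metrics: average latent steps per
# problem and final-answer accuracy.
def evaluate(run_fn, dataset):
    total_steps, total_correct = 0, 0
    for question, gold in dataset:
        answer, steps_used = run_fn(question)
        total_steps += steps_used
        total_correct += int(answer == gold)
    n = len(dataset)
    return {"avg_latent_steps": total_steps / n, "accuracy": total_correct / n}

# metrics_fixed    = evaluate(run_fixed, gsm8k_aug_test)
# metrics_adaptive = evaluate(run_adaptive, gsm8k_aug_test)
```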
Results & Findings
| Metric | Fixed‑length baseline | Adaptive latent reasoning |
|---|---|---|
| Average reasoning length (relative units) | 1.84 × | 0.88 × (0.88 / 1.84 ≈ 0.48, i.e. ≈ 52 % fewer latent tokens than the fixed‑length model) |
| Accuracy (GSM8K‑Aug) | 78.3 % | 78.4 % (no statistically significant difference) |
| Inference compute (FLOPs) | 1.0 × baseline | ≈ 0.55 × baseline |
What this means: The RL controller learns to stop early on “easy” problems while still taking enough steps on harder ones, achieving near‑identical accuracy with roughly half the compute budget.
Practical Implications
- Cost savings for production LLM services – Halving the number of latent steps translates directly into lower GPU usage and faster response times, which is critical for SaaS APIs and on‑device inference.
- Dynamic inference budgets – Developers can set a maximum allowable latency or energy budget, and the adaptive controller will respect it by cutting reasoning short whenever possible (a minimal sketch of such a budget cap follows this list).
- Scalable reasoning for edge devices – The approach is model‑agnostic; applying it to smaller, quantized models could make sophisticated reasoning feasible on phones or IoT hardware.
- Simplified pipeline – Because the RL fine‑tuning occurs after standard SFT, existing fine‑tuned models can be upgraded without retraining from scratch.
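As a rough illustration of the budget idea mentioned above, the sketch below layers a hard step cap on top of the learned stopping policy at inference time; the 0.5 threshold and the default budget are illustrative choices, not values from the paper.

```python
# Inference with a compute budget: the controller may stop early, but a hard
# max_steps cap (derived from a latency or energy target) is always enforced.
import torch

@torch.no_grad()
def answer_with_budget(model, policy, question_ids, max_steps=8):
    h = model.embed(question_ids).mean(dim=1)
    latent_input = torch.zeros_like(h)
    for step in range(max_steps):
        h = model.backbone(latent_input, h)
        latent_input = h
        _, stop_prob = policy(h)
        if (stop_prob > 0.5).all():     # learned early stop
            break
    return model.answer_head(h).argmax(dim=-1), step + 1
```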
Limitations & Future Work
- Model size & dataset scope – Experiments are limited to a 1 B‑parameter Llama 3.2 model and a single math‑oriented dataset (GSM8K‑Aug). Results may differ on larger models or more diverse tasks.
- Reward design sensitivity – The balance between accuracy and efficiency hinges on hyper‑parameters; sub‑optimal settings could either over‑trim reasoning or waste compute.
- Interpretability – Latent reasoning steps are not human‑readable, making debugging or auditing more challenging.
Future directions outlined by the authors include extending the method to other LLM families, exploring different RL reward formulations, testing architectural variants (e.g., deeper latent modules), and integrating knowledge‑distillation pipelines to further compress reasoning capabilities.
Authors
- Alex Ning
- Yen-Ling Kuo
- Gabe Gomes
Paper Information
- arXiv ID: 2511.21581v1
- Categories: cs.LG
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21581v1