[Paper] InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning
Source: arXiv - 2602.06960v1
Overview
The paper InftyThink+ tackles a core bottleneck of large language models (LLMs) that reason through long, multi‑step problems: the cost of keeping every intermediate “thought” in the prompt grows quadratically, hits context‑length limits, and often leads to the model “forgetting” earlier steps. By framing the reasoning process as a controllable, iterative loop—where the model decides when to compress its current chain of thought into a summary and then continue—the authors show how reinforcement learning (RL) can teach LLMs to reason effectively and efficiently over an infinite horizon.
Key Contributions
- RL‑driven iterative reasoning: Introduces a reinforcement‑learning framework that jointly learns when to summarize, what to keep, and how to resume reasoning, rather than relying on fixed heuristics or supervised checkpoints.
- Two‑stage training pipeline: Starts with a supervised “cold‑start” to give the model a basic reasoning ability, then fine‑tunes the entire reasoning trajectory with trajectory‑level RL for strategic summarization.
- Model‑controlled iteration boundaries: The policy learns to place iteration boundaries dynamically, enabling flexible chain‑of‑thought lengths tailored to each problem.
- Empirical gains on challenging math benchmarks: Using the DeepSeek‑R1‑Distill‑Qwen‑1.5B backbone, InftyThink+ lifts accuracy by 21 percentage points on AIME‑24 (38 % → 59 %) and consistently beats standard long chain‑of‑thought RL baselines.
- Efficiency improvements: Demonstrates up to ~30 % reduction in inference latency and faster RL convergence, showing that smarter summarization also speeds up training.
- Better out‑of‑distribution robustness: The learned summarization policy generalizes to unseen reasoning tasks better than static heuristics.
Methodology
Iterative Reasoning Loop
- The LLM generates a segment of reasoning steps (a “thought chunk”).
- A summarizer compresses this chunk into a concise representation (a short text summary).
- The compressed summary is appended to the prompt, and the model continues generating the next chunk.
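The three steps above can be sketched as a simple loop. Here `generate_chunk` and `summarize` are hypothetical stand-ins for calls into the backbone LLM and the fine-tuned summarizer; the paper's actual prompting and stopping criteria are not specified at this level of detail.

```python
def generate_chunk(prompt: str) -> tuple[str, bool]:
    """Produce the next reasoning chunk; returns (text, is_final_answer).

    Placeholder: in practice this would call the backbone LLM.
    """
    return "chunk", True

def summarize(chunk: str, max_tokens: int = 30) -> str:
    """Compress a chunk into a short abstract (<= max_tokens whitespace tokens).

    Placeholder: in practice this would call the fine-tuned summarizer.
    """
    return " ".join(chunk.split()[:max_tokens])

def infinite_horizon_reason(question: str, max_iters: int = 8) -> str:
    """Iterate generate -> summarize -> continue until an answer is reached."""
    prompt = question
    chunk = ""
    for _ in range(max_iters):
        chunk, done = generate_chunk(prompt)
        if done:
            return chunk  # final answer reached
        # Replace the raw chunk with its summary so prompt length stays bounded.
        prompt = question + "\n" + summarize(chunk)
    return chunk
```

The key property is that the prompt carries only the question plus a short summary, so its length stays roughly constant per iteration instead of growing with the full chain of thought.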
Two‑Stage Training
- Stage 1 – Supervised Warm‑up: The model is trained on human‑written chain‑of‑thought data, learning to produce correct intermediate steps and reasonable summaries.
- Stage 2 – Trajectory‑Level RL: The entire loop (generation → summarization → continuation) is treated as a single RL episode.
- State: Current prompt (including accumulated summaries).
- Action: Decide how many steps to generate before summarizing and what summarization strategy to use.
- Reward: Composite signal combining final answer correctness, inference latency, and a penalty for excessive prompt length.
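The composite reward can be sketched as a weighted sum; the specific weights and the linear penalty form below are assumptions, since the paper only states that correctness, latency, and prompt length are combined.

```python
def composite_reward(correct: bool, latency_s: float, prompt_tokens: int,
                     w_acc: float = 1.0, w_lat: float = 0.01,
                     w_len: float = 0.001, max_tokens: int = 4096) -> float:
    """Illustrative trajectory-level reward: accuracy minus cost terms."""
    reward = w_acc * (1.0 if correct else 0.0)
    reward -= w_lat * latency_s                            # discourage slow rollouts
    reward -= w_len * max(0, prompt_tokens - max_tokens)   # penalize overlong prompts
    return reward
```

In practice the weights would be tuned per domain, which is exactly the reward-engineering burden the limitations section notes.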
Policy Architecture
- A lightweight controller (e.g., a small transformer) sits on top of the backbone LLM and outputs a distribution over possible iteration lengths and summarization modes.
- The controller is updated with Proximal Policy Optimization (PPO), while the backbone LLM’s parameters are fine‑tuned jointly to align generation with the policy’s decisions.
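The PPO update applied to the controller uses the standard clipped surrogate objective; the sketch below is the textbook per-sample form, with the clipping range `eps` an assumed hyperparameter rather than a value reported in the paper.

```python
import math

def ppo_clip_loss(logp_new: float, logp_old: float, advantage: float,
                  eps: float = 0.2) -> float:
    """Per-sample PPO-clip surrogate loss (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)          # importance ratio pi_new / pi_old
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -min(unclipped, clipped)                # negate: maximize the surrogate
```

Clipping caps how far a single update can push the controller's distribution over iteration lengths and summarization modes, which matters here because trajectory-level rewards are sparse and noisy.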
Implementation Details
- Backbone: DeepSeek‑R1‑Distill‑Qwen‑1.5B (≈1.5 B parameters).
- Summarizer: Same backbone, fine‑tuned to produce ≤ 30‑token abstracts of the preceding chunk.
- Training budget: ~48 GPU‑hours for supervised warm‑up + ~72 GPU‑hours for RL fine‑tuning.
Results & Findings
| Benchmark | Baseline (long CoT) | InftyThink+ | Δ Accuracy (pts) | Latency Reduction |
|---|---|---|---|---|
| AIME‑24 | 38 % | 59 % | +21 | ~30 % |
| MATH (OOD) | 45 % | 52 % | +7 | ~25 % |
| GSM‑8K (OOD) | 71 % | 75 % | +4 | ~20 % |
- Strategic summarization reduces prompt length without sacrificing the logical flow, leading to faster inference.
- RL fine‑tuning converges in ≈½ the wall‑clock time compared to a vanilla long‑CoT RL baseline, thanks to shorter trajectories and clearer reward signals.
- Ablation studies show that learning when to summarize contributes the most to accuracy gains, while learning what to preserve mainly drives latency improvements.
Practical Implications
- Scalable Reasoning Services: Cloud APIs that expose LLM reasoning (e.g., code‑generation assistants, math tutoring bots) can adopt InftyThink+ to cut latency and cost, especially for queries that would otherwise require thousands of tokens.
- Memory‑Constrained Deployments: Edge devices or on‑premise inference servers with limited context windows can now handle deeper reasoning by summarizing on the fly.
- Improved RL‑based Alignment: The trajectory‑level RL formulation offers a template for other alignment tasks where the process (not just the final answer) matters—e.g., multi‑turn dialogue planning or step‑wise debugging.
- Tooling Integration: Existing chain‑of‑thought pipelines can be retrofitted with a lightweight summarizer and a policy head, reusing the same backbone model, making adoption relatively low‑cost.
Limitations & Future Work
- Summarizer Quality Dependency: If the summarizer drops crucial logical details, downstream steps can go off‑track. The current approach relies on the same LLM for summarization, which may inherit its own biases.
- Reward Engineering: The composite reward balances accuracy vs. latency; tuning these weights for different domains (e.g., legal reasoning vs. math) may require manual effort.
- Scalability to Larger Models: Experiments were limited to a 1.5 B‑parameter model; it remains to be seen how the approach scales to 30 B‑plus models where policy learning could become more unstable.
- Generalization Beyond Math: While OOD math benchmarks showed gains, broader reasoning domains (e.g., scientific literature synthesis) need dedicated evaluation.
Future directions include hierarchical summarization (multi‑level abstracts), meta‑learning of reward weights for domain adaptation, and extending the RL loop to incorporate external tools (e.g., calculators or code interpreters) for truly open‑ended problem solving.
Authors
- Yuchen Yan
- Liang Jiang
- Jin Jiang
- Shuaicheng Li
- Zujie Wen
- Zhiqiang Zhang
- Jun Zhou
- Jian Shao
- Yueting Zhuang
- Yongliang Shen
Paper Information
- arXiv ID: 2602.06960v1
- Categories: cs.CL, cs.AI
- Published: February 6, 2026