[Paper] InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning
Source: arXiv - 2602.06960v1
Overview
The paper InftyThink+ tackles a core bottleneck of large language models (LLMs) that reason through long, multi‑step problems: the cost of keeping every intermediate “thought” in the prompt grows quadratically, hits context‑length limits, and often leads to the model “forgetting” earlier steps. By framing the reasoning process as a controllable, iterative loop—where the model decides when to compress its current chain of thought into a summary and then continue—the authors show how reinforcement learning (RL) can teach LLMs to reason effectively and efficiently over an infinite horizon.
Key Contributions
- RL‑driven iterative reasoning: Introduces a reinforcement‑learning framework that jointly learns when to summarize, what to keep, and how to resume reasoning, rather than relying on fixed heuristics or supervised checkpoints.
- Two‑stage training pipeline: Starts with a supervised “cold‑start” to give the model a basic reasoning ability, then fine‑tunes the entire reasoning trajectory with trajectory‑level RL for strategic summarization.
- Model‑controlled iteration boundaries: The policy learns to place iteration boundaries dynamically, enabling flexible chain‑of‑thought lengths tailored to each problem.
- Empirical gains on challenging math benchmarks: Using the DeepSeek‑R1‑Distill‑Qwen‑1.5B backbone, InftyThink+ lifts accuracy by 21 percentage points on AIME‑24 (38 % → 59 %) and consistently beats standard long chain‑of‑thought RL baselines.
- Efficiency improvements: Demonstrates up to ~30 % reduction in inference latency and faster RL convergence, showing that smarter summarization also speeds up training.
- Better out‑of‑distribution robustness: The learned summarization policy generalizes to unseen reasoning tasks better than static heuristics.
Methodology
Iterative Reasoning Loop
- The LLM generates a segment of reasoning steps (a “thought chunk”).
- A summarizer compresses this chunk into a concise representation (a short text summary).
- The compressed summary is appended to the prompt, and the model continues generating the next chunk.
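The three steps above can be sketched as a simple loop. Here `generate_chunk` and `summarize` are hypothetical stand-ins for calls into the backbone LLM and the fine-tuned summarizer; the paper's actual prompting and stopping criteria are not specified at this level of detail.

```python
def generate_chunk(prompt: str) -> tuple[str, bool]:
    """Produce the next reasoning chunk; returns (text, is_final_answer).

    Placeholder: in practice this would call the backbone LLM.
    """
    return "chunk", True

def summarize(chunk: str, max_tokens: int = 30) -> str:
    """Compress a chunk into a short abstract (<= max_tokens whitespace tokens).

    Placeholder: in practice this would call the fine-tuned summarizer.
    """
    return " ".join(chunk.split()[:max_tokens])

def infinite_horizon_reason(question: str, max_iters: int = 8) -> str:
    """Iterate generate -> summarize -> continue until an answer is reached."""
    prompt = question
    chunk = ""
    for _ in range(max_iters):
        chunk, done = generate_chunk(prompt)
        if done:
            return chunk  # final answer reached
        # Replace the raw chunk with its summary so prompt length stays bounded.
        prompt = question + "\n" + summarize(chunk)
    return chunk
```

The key property is that the prompt carries only the question plus a short summary, so its length stays roughly constant per iteration instead of growing with the full chain of thought.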
Two‑Stage Training
- Stage 1 – Supervised Warm‑up: The model is trained on human‑written chain‑of‑thought data, learning to produce correct intermediate steps and reasonable summaries.
- Stage 2 – Trajectory‑Level RL: The entire loop (generation → summarization → continuation) is treated as a single RL episode.
- State: Current prompt (including accumulated summaries).
- Action: Decide how many steps to generate before summarizing and what summarization strategy to use.
- Reward: Composite signal combining final answer correctness, inference latency, and a penalty for excessive prompt length.
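The composite reward can be sketched as a weighted sum; the specific weights and the linear penalty form below are assumptions, since the paper only states that correctness, latency, and prompt length are combined.

```python
def composite_reward(correct: bool, latency_s: float, prompt_tokens: int,
                     w_acc: float = 1.0, w_lat: float = 0.01,
                     w_len: float = 0.001, max_tokens: int = 4096) -> float:
    """Illustrative trajectory-level reward: accuracy minus cost terms."""
    reward = w_acc * (1.0 if correct else 0.0)
    reward -= w_lat * latency_s                            # discourage slow rollouts
    reward -= w_len * max(0, prompt_tokens - max_tokens)   # penalize overlong prompts
    return reward
```

In practice the weights would be tuned per domain, which is exactly the reward-engineering burden the limitations section notes.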
Policy Architecture
- A lightweight controller (e.g., a small transformer) sits on top of the backbone LLM and outputs a distribution over possible iteration lengths and summarization modes.
- The controller is updated with Proximal Policy Optimization (PPO), while the backbone LLM’s parameters are fine‑tuned jointly to align generation with the policy’s decisions.
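The PPO update applied to the controller uses the standard clipped surrogate objective; the sketch below is the textbook per-sample form, with the clipping range `eps` an assumed hyperparameter rather than a value reported in the paper.

```python
import math

def ppo_clip_loss(logp_new: float, logp_old: float, advantage: float,
                  eps: float = 0.2) -> float:
    """Per-sample PPO-clip surrogate loss (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)          # importance ratio pi_new / pi_old
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -min(unclipped, clipped)                # negate: maximize the surrogate
```

Clipping caps how far a single update can push the controller's distribution over iteration lengths and summarization modes, which matters here because trajectory-level rewards are sparse and noisy.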
Implementation Details
- Backbone: DeepSeek‑R1‑Distill‑Qwen‑1.5B (≈1.5 B parameters).
- Summarizer: Same backbone, fine‑tuned to produce ≤ 30‑token abstracts of the preceding chunk.
- Training budget: ~48 GPU‑hours for supervised warm‑up + ~72 GPU‑hours for RL fine‑tuning.
Results & Findings
| Benchmark | Baseline (long CoT) | InftyThink+ | Δ Accuracy (pts) | Latency Reduction |
|---|---|---|---|---|
| AIME‑24 | 38 % | 59 % | +21 | ~30 % |
| MATH (OOD) | 45 % | 52 % | +7 | ~25 % |
| GSM‑8K (OOD) | 71 % | 75 % | +4 | ~20 % |
- Strategic summarization reduces prompt length without sacrificing the logical flow, leading to faster inference.
- RL fine‑tuning converges in ≈½ the wall‑clock time compared to a vanilla long‑CoT RL baseline, thanks to shorter trajectories and clearer reward signals.
- Ablation studies show that learning when to summarize contributes the most to accuracy gains, while learning what to preserve mainly drives latency improvements.
Practical Implications
- Scalable Reasoning Services: Cloud APIs that expose LLM reasoning (e.g., code‑generation assistants, math tutoring bots) can adopt InftyThink+ to cut latency and cost, especially for queries that would otherwise require thousands of tokens.
- Memory‑Constrained Deployments: Edge devices or on‑premise inference servers with limited context windows can now handle deeper reasoning by summarizing on the fly.
- Improved RL‑based Alignment: The trajectory‑level RL formulation offers a template for other alignment tasks where the process (not just the final answer) matters—e.g., multi‑turn dialogue planning or step‑wise debugging.
- Tooling Integration: Existing chain‑of‑thought pipelines can be retrofitted with a lightweight summarizer and a policy head, reusing the same backbone model, making adoption relatively low‑cost.
Limitations & Future Work
- Summarizer Quality Dependency: If the summarizer drops crucial logical details, downstream steps can go off‑track. The current approach relies on the same LLM for summarization, which may inherit its own biases.
- Reward Engineering: The composite reward balances accuracy vs. latency; tuning these weights for different domains (e.g., legal reasoning vs. math) may require manual effort.
- Scalability to Larger Models: Experiments were limited to a 1.5 B‑parameter model; it remains to be seen how the approach scales to 30 B‑plus models where policy learning could become more unstable.
- Generalization Beyond Math: While OOD math benchmarks showed gains, broader reasoning domains (e.g., scientific literature synthesis) need dedicated evaluation.
Future directions include hierarchical summarization (multi‑level abstracts), meta‑learning of reward weights for domain adaptation, and extending the RL loop to incorporate external tools (e.g., calculators or code interpreters) for truly open‑ended problem solving.
Authors
- Yuchen Yan
- Liang Jiang
- Jin Jiang
- Shuaicheng Li
- Zujie Wen
- Zhiqiang Zhang
- Jun Zhou
- Jian Shao
- Yueting Zhuang
- Yongliang Shen
Paper Information
- arXiv ID: 2602.06960v1
- Categories: cs.CL, cs.AI
- Published: February 6, 2026