[Paper] Recursive Models for Long-Horizon Reasoning

Published: March 2, 2026 at 12:37 PM EST
5 min read
Source: arXiv

Overview

The paper “Recursive Models for Long‑Horizon Reasoning” tackles a fundamental bottleneck of today’s large language models (LLMs): they can only reason within a fixed‑size context window. By letting a model call itself recursively to solve smaller sub‑problems in isolated contexts, the authors show how to break the “one‑shot” limitation and enable truly long‑horizon reasoning.

Key Contributions

  • Recursive model architecture: Introduces a minimal yet powerful framework where an LLM can invoke itself as a sub‑routine, passing a reduced context for each subtask.
  • Theoretical guarantees: Proves that any computable problem can be decomposed recursively so that each subtask needs only an exponentially smaller active context than a standard autoregressive pass.
  • Optimality within agentic systems: Extends the theory to more general “agentic” architectures (arbitrary context processing and control flow) and shows recursive models achieve the maximal possible reasoning power in that class.
  • Empirical validation: Trains a 3‑billion‑parameter model to operate recursively and demonstrates a large performance jump on Boolean satisfiability (SAT)—a classic long‑horizon combinatorial search problem—over state‑of‑the‑art LLMs.
  • Comparison to summarization‑based context management: Shows that recursion strictly outperforms any single‑sequence approach (e.g., summarizing earlier text) because it can keep the active window tiny while still solving the whole problem.

Methodology

  1. Recursive Decomposition

    • The original task is split into a tree of subtasks.
    • Each node in the tree is solved by the same base language model, but only with the local context needed for that sub‑problem (e.g., a clause of a SAT formula).
  2. Model Invocation Protocol

    • The parent call packages a prompt describing the subtask and a short “scratchpad” of relevant variables.
    • The child model returns a concise answer (e.g., “satisfiable” or a partial assignment).
    • The parent aggregates child answers, possibly spawning further recursive calls.
  3. Training Regime

    • A 3B‑parameter transformer is fine‑tuned on synthetic recursive tasks (nested arithmetic, tree‑structured reasoning) to learn the “call‑itself” pattern.
    • Curriculum learning gradually increases recursion depth, encouraging the model to keep each call’s context minimal.
  4. Evaluation on SAT

    • Instances of 3‑SAT with up to 100 variables are generated.
    • The recursive model receives the full CNF formula, recursively solves sub‑clauses, and assembles a global assignment.
    • Baselines include GPT‑4, Claude, and a strong fine‑tuned 7B model that uses a single large context (no recursion).
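The decomposition-and-aggregation loop in steps 1–2 can be sketched in plain Python. The names below are illustrative, and the splitting rule is a hand-written DPLL-style branch rather than the paper's model-driven protocol; the point is the control flow: each child call receives only a simplified sub-formula as its local context, and the parent aggregates the children's answers.

```python
# Sketch of the recursive parent/child loop (hypothetical helper names).
# Clauses are lists of signed integers: 1 means x1, -1 means NOT x1.

def simplify(clauses, var, value):
    """Assign var=value; drop satisfied clauses, remove falsified literals."""
    out = []
    for clause in clauses:
        if (var if value else -var) in clause:
            continue                      # clause satisfied, drop it
        reduced = [lit for lit in clause if lit != (-var if value else var)]
        if not reduced:
            return None                   # empty clause: this branch fails
        out.append(reduced)
    return out

def solve(clauses, assignment=None):
    """Parent call: split on one variable, spawn two child calls, aggregate."""
    assignment = dict(assignment or {})
    if not clauses:
        return assignment                 # all clauses satisfied
    var = abs(clauses[0][0])              # pick a branching variable
    for value in (True, False):
        sub = simplify(clauses, var, value)   # the child's reduced context
        if sub is not None:
            result = solve(sub, {**assignment, var: value})
            if result is not None:
                return result
    return None                           # "unsatisfiable" bubbles up

# Usage: (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
formula = [[1, 2], [-1, 3], [-2, -3]]
print(solve(formula))                     # → {1: True, 3: True, 2: False}
```

Note that each recursive call only ever sees the simplified `sub` formula, mirroring the paper's claim that the active context per call stays far smaller than the full problem.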

Results & Findings

| System | Avg. SAT Success Rate (100-var) | Avg. Tokens per Call |
| --- | --- | --- |
| Recursive 3B (this work) | 78 % | ~30 (per subtask) |
| GPT-4 (single-pass) | 45 % | 8 k (full context) |
| Claude 2 (single-pass) | 42 % | |
| Fine-tuned 7B (no recursion) | 48 % | |
  • Context reduction: The deepest recursive call needed only ~30 tokens, an exponential reduction compared to the ~8 k tokens required to hold the whole formula.
  • Scalability: As problem size grows, the gap widens—recursive 3B maintains >70 % success on 200‑var SAT, while single‑pass models drop below 30 %.
  • Generalization: The same recursive policy transfers to other combinatorial tasks (graph coloring, subset‑sum) with modest fine‑tuning, hinting at a reusable reasoning primitive.

Practical Implications

  • Agentic AI pipelines: Developers building autonomous agents (e.g., code‑assistants, planning bots) can embed a recursive call interface to keep each reasoning step lightweight, avoiding costly context windows.
  • Edge deployment: Small models (in the 3B range) can now tackle problems that previously required much larger models, simply by breaking the problem down recursively. This opens the door to on‑device reasoning in constrained environments.
  • Tool‑use integration: Recursive calls map naturally onto existing tool‑use APIs (e.g., calling a SAT solver as a sub‑task). The paper’s framework provides a principled way to decide when to offload to a tool versus solving internally.
  • Debuggability: Because each sub‑task is isolated, developers can inspect intermediate prompts and outputs, making it easier to trace failures in long‑chain reasoning.
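As a rough illustration of such a recursive call interface, the sketch below (all names hypothetical; `call_llm` is a stub standing in for any chat-completion endpoint) shows how a parent call can spawn child calls with small, isolated prompts while recording a per-call trace, which is what makes the debuggability point above concrete:

```python
import json

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would hit a model endpoint here.
    return f"answer-to:{prompt[:20]}"

def recursive_solve(task: str, depth: int = 0, max_depth: int = 2, trace=None):
    """Solve `task`: answer directly at max depth, else split and recurse."""
    trace = trace if trace is not None else []
    if depth >= max_depth:
        answer = call_llm(task)           # leaf: one small, isolated context
    else:
        subtasks = [f"{task}/part-{i}" for i in (1, 2)]  # toy decomposition
        parts = [recursive_solve(t, depth + 1, max_depth, trace)
                 for t in subtasks]
        answer = call_llm("combine: " + json.dumps(parts))
    trace.append({"depth": depth, "task": task, "answer": answer})
    return answer

# Usage: every subtask prompt/answer pair is inspectable after the run.
trace = []
recursive_solve("plan-trip", trace=trace)
for entry in trace:
    print(entry["depth"], entry["task"])
```

Because each entry in `trace` is a self-contained prompt/answer pair, a failure deep in the tree can be replayed in isolation without re-running the whole chain.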

Limitations & Future Work

  • Recursion overhead: The current implementation incurs latency due to many sequential model invocations; batching strategies or parallel tree traversal are needed for production‑grade speed.
  • Learning the decomposition: The paper assumes a hand‑crafted or externally provided task split. Automating the discovery of optimal recursive decompositions remains an open challenge.
  • Memory of global state: While context per call is tiny, maintaining a coherent global state across many calls can be brittle; richer state‑passing mechanisms are a promising direction.
  • Scaling to richer domains: Experiments focus on Boolean SAT and synthetic combinatorial tasks. Extending recursion to natural‑language planning, code synthesis, or multi‑modal reasoning will test the framework’s generality.
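One mitigation for the recursion-overhead point above is batching: sibling subtasks in the recursion tree share no state, so their model invocations can be dispatched concurrently. A minimal sketch, with `solve_subtask` as a hypothetical stand-in for a child model call:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def solve_subtask(subtask: str) -> str:
    time.sleep(0.05)                      # simulate model-call latency
    return subtask.upper()

def solve_level(subtasks):
    """Dispatch all independent sibling calls as one parallel batch."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map preserves input order, so results line up with subtasks.
        return list(pool.map(solve_subtask, subtasks))

results = solve_level(["a", "b", "c", "d"])
```

With four 50 ms "calls" running in parallel, the level completes in roughly one call's latency instead of four, and the same pattern applies level by level down the tree.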

Bottom line: By letting language models call themselves on smaller pieces of a problem, the authors demonstrate a clean, theoretically backed path to break the context‑size ceiling that has long limited LLM reasoning. For developers building next‑generation AI agents, recursion offers a practical recipe to achieve deep, multi‑step reasoning without inflating model size or context windows.

Authors

  • Chenxiao Yang
  • Nathan Srebro
  • Zhiyuan Li

Paper Information

  • arXiv ID: 2603.02112v1
  • Categories: cs.LG, cs.CL
  • Published: March 2, 2026