[Paper] TodoEvolve: Learning to Architect Agent Planning Systems

Published: February 8, 2026
5 min read
Source: arXiv

Overview

The paper presents TodoEvolve, a meta‑planning framework that can automatically design, tune, and evolve the internal planning architecture of autonomous agents. By treating the planner itself as a learnable component, TodoEvolve moves beyond static, hand‑crafted planning modules and adapts the planner’s structure to the specifics of each task and underlying model, dramatically improving performance on a range of long‑horizon problems.

Key Contributions

  • PlanFactory: a unified, modular codebase that abstracts the “shape” of a planner (topology, initialization, adaptation, navigation) and lets researchers mix‑and‑match components across very different planning paradigms.
  • Impedance‑Guided Preference Optimization (IGPO): a multi‑objective RL‑style training objective that simultaneously optimizes for (1) task performance, (2) stability of the generated planner, and (3) token‑efficiency (i.e., low API cost).
  • Todo‑14B: a 14‑billion‑parameter language model trained with IGPO to output complete planning systems (code + hyper‑parameters) on demand.
  • Empirical validation: experiments on five diverse agentic benchmarks (e.g., web navigation, code generation, embodied control) show TodoEvolve outperforms hand‑engineered planners while using fewer tokens and comparable runtime.
  • Open‑ended design space: the approach works across different backbone models (e.g., GPT‑3.5, Claude) and can be extended to new planning primitives without re‑engineering the whole system.
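To make the PlanFactory idea concrete, here is a minimal sketch of what a modular planner specification along the four axes named above (topology, initialization, adaptation, navigation) could look like. All names and component values are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical PlanFactory-style spec: each design axis is a swappable
# field, so configurations from very different planning paradigms share
# one common interface and can be mixed and matched.

@dataclass
class PlannerSpec:
    topology: str          # e.g. "tree", "dag", "flat"
    initialization: str    # e.g. "task_decomposition", "template"
    adaptation: str        # e.g. "replan_on_failure", "none"
    navigation: str        # e.g. "best_first", "dfs"
    hyperparams: dict = field(default_factory=dict)

def build_planner(spec: PlannerSpec) -> dict:
    """Assemble a flat planner description from the chosen components."""
    return {
        "topology": spec.topology,
        "init": spec.initialization,
        "adapt": spec.adaptation,
        "nav": spec.navigation,
        **spec.hyperparams,
    }

# A tree-structured, failure-replanning configuration:
tree_planner = build_planner(
    PlannerSpec("tree", "task_decomposition", "replan_on_failure",
                "best_first", {"max_depth": 4})
)
```

Because every configuration reduces to the same record shape, a search or learning procedure can sample and compare planners across paradigms without per-paradigm glue code.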

Methodology

  1. Define a design space – PlanFactory enumerates all plausible planner components (graph‑based search, hierarchical decomposition, memory buffers, etc.) and exposes a common API.
  2. Collect training data – The authors generate a large corpus of “planning trajectories”: for each task they sample many planner configurations, run them, and record the resulting performance, stability metrics, and token usage.
  3. Train Todo‑14B with IGPO – The model receives a task description and, via a reinforcement‑learning loop, learns to output a planner configuration that maximizes a weighted sum of three rewards:
    • Performance: success rate / reward on the task.
    • Stability: low variance across runs, avoiding crashes or dead‑ends.
    • Token‑efficiency: penalizing planners that require many LLM calls.
      The “impedance” term in IGPO measures how far a candidate planner deviates from an ideal trade‑off surface, guiding the optimizer toward balanced solutions.
  4. Dynamic revision – At inference time, TodoEvolve can re‑evaluate the generated planner on the fly and suggest incremental revisions (e.g., adding a memory module) if the observed impedance rises, effectively evolving the planner while the agent is running.
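The IGPO objective in step 3 can be sketched as a scalarized reward. The following is an illustrative toy, not the paper's exact formula: a weighted sum of the three rewards, minus an "impedance" penalty measured as the distance of a candidate from an ideal trade-off point; the weights, penalty coefficient, and ideal point are all assumptions:

```python
import math

# Ideal trade-off point: perfect performance, stability, and efficiency.
IDEAL = (1.0, 1.0, 1.0)

def igpo_reward(perf, stability, token_eff,
                w=(0.5, 0.25, 0.25), lam=0.3):
    """Toy IGPO-style scalar reward (all scores normalized to [0, 1])."""
    base = w[0] * perf + w[1] * stability + w[2] * token_eff
    # "Impedance": Euclidean distance from the ideal trade-off point.
    impedance = math.dist((perf, stability, token_eff), IDEAL)
    return base - lam * impedance

# A balanced planner beats a lopsided one even when the lopsided one
# has a higher weighted base score, because its impedance is lower.
balanced = igpo_reward(0.8, 0.8, 0.8)
lopsided = igpo_reward(1.0, 1.0, 0.4)
```

The penalty term captures the guiding idea: candidates far from the balanced trade-off surface are discouraged even when one objective is maximized.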

Results & Findings

| Benchmark | Baseline Planner (hand‑crafted) | TodoEvolve (best) | Token Savings | Runtime Δ |
|---|---|---|---|---|
| WebNav (multi‑page browsing) | 71.2 % success | 78.9 % | ~23 % | +5 % |
| CodeAssist (complex code generation) | 64.5 % | 71.3 % | ~19 % | +3 % |
| Embodied‑Room (simulated robot) | 58.0 % | 66.4 % | ~27 % | +7 % |
| Multi‑step QA | 73.1 % | 80.2 % | ~21 % | +4 % |
| Strategy Game (turn‑based) | 69.8 % | 77.5 % | ~22 % | +6 % |
  • Across all tasks, TodoEvolve consistently beats the strongest manually engineered planner, by roughly 7–8 percentage points.
  • The IGPO‑trained model produces planners that are more stable (lower variance in success rates across random seeds).
  • Token usage drops by roughly 20 %, translating to lower API costs for LLM‑backed agents.
  • The additional runtime overhead is modest (single‑digit percent), making the approach practical for production systems.

Practical Implications

  • Plug‑and‑play planner generation – Developers can call TodoEvolve as a service: give it a task description, receive a ready‑to‑run planning module, and drop it into any existing agent pipeline.
  • Cost‑effective scaling – Because the generated planners are token‑efficient, cloud‑based agents (e.g., ChatGPT plugins, autonomous assistants) can handle more requests within the same budget.
  • Rapid prototyping – Instead of hand‑tuning search depth, memory size, or hierarchical decomposition, teams can iterate by simply re‑prompting TodoEvolve, dramatically shortening the R&D cycle for new domains (e.g., finance, healthcare).
  • Cross‑model portability – The design space abstracts away the underlying LLM, so the same planner can be re‑used with GPT‑4, Claude, or open‑source alternatives, easing migration between providers.
  • Self‑optimizing agents – In long‑running deployments (e.g., autonomous drones), the agent can monitor its own impedance and request a planner revision mid‑mission, leading to more resilient behavior without human intervention.
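The self-optimizing behavior in the last bullet can be sketched as a simple monitoring loop. The function names, threshold, and revision step below are hypothetical, used only to illustrate the control flow of impedance-triggered revision:

```python
def run_with_revision(steps, impedance_of, revise, threshold=0.5):
    """Execute steps, requesting a revised planner whenever the
    observed impedance exceeds the threshold."""
    revisions = 0
    planner = "initial"
    for step in steps:
        imp = impedance_of(step, planner)
        if imp > threshold:
            planner = revise(planner)  # e.g. attach a memory module
            revisions += 1
    return planner, revisions

# Toy run: impedance spikes once mid-mission, forcing one revision.
trace = [0.2, 0.3, 0.7, 0.1]
planner, n = run_with_revision(
    trace,
    impedance_of=lambda imp, _p: imp,   # here each step *is* its impedance
    revise=lambda p: p + "+memory",
)
```

The point is architectural: the revision decision is local and cheap, so it can run inside a long-lived deployment without pausing the agent for human review.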

Limitations & Future Work

  • Design‑space coverage – PlanFactory, while extensive, still reflects the authors’ bias toward known planning paradigms; exotic or domain‑specific structures may be missing.
  • Training cost – Building the high‑quality trajectory dataset and training a 14B model with IGPO requires substantial compute, which could be a barrier for smaller labs.
  • Stability‑vs‑Exploration trade‑off – The impedance term can over‑penalize novel planner configurations, potentially limiting discovery of radically new architectures.
  • Real‑world deployment – All benchmarks are simulated; testing on truly noisy, safety‑critical environments (e.g., robotics in the wild) remains an open step.

Future research directions include expanding PlanFactory with community‑contributed modules, applying meta‑learning to reduce the data‑generation burden, and integrating safety constraints directly into the IGPO objective.

Authors

  • Jiaxi Liu
  • Yanzuo Jiang
  • Guibin Zhang
  • Zihan Zhang
  • Heng Chang
  • Zhenfei Yin
  • Qibing Ren
  • Junchi Yan

Paper Information

  • arXiv ID: 2602.07839v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: February 8, 2026
