[Paper] TodoEvolve: Learning to Architect Agent Planning Systems

Published: February 8, 2026
5 min read
Source: arXiv

Overview

The paper presents TodoEvolve, a meta‑planning framework that can automatically design, tune, and evolve the internal planning architecture of autonomous agents. By treating the planner itself as a learnable component, TodoEvolve moves beyond static, hand‑crafted planning modules and adapts the planner’s structure to the specifics of each task and underlying model, dramatically improving performance on a range of long‑horizon problems.

Key Contributions

  • PlanFactory: a unified, modular codebase that abstracts the “shape” of a planner (topology, initialization, adaptation, navigation) and lets researchers mix‑and‑match components across very different planning paradigms.
  • Impedance‑Guided Preference Optimization (IGPO): a multi‑objective RL‑style training objective that simultaneously optimizes for (1) task performance, (2) stability of the generated planner, and (3) token‑efficiency (i.e., low API cost).
  • Todo‑14B: a 14‑billion‑parameter language model trained with IGPO to output complete planning systems (code + hyper‑parameters) on demand.
  • Empirical validation: experiments on five diverse agentic benchmarks (e.g., web navigation, code generation, embodied control) show TodoEvolve outperforms hand‑engineered planners while using fewer tokens and comparable runtime.
  • Open‑ended design space: the approach works across different backbone models (e.g., GPT‑3.5, Claude) and can be extended to new planning primitives without re‑engineering the whole system.
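To make the PlanFactory idea concrete, here is a minimal sketch of what a modular planner specification along the four axes named above (topology, initialization, adaptation, navigation) could look like. All names and component values are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical PlanFactory-style spec: each design axis is a swappable
# field, so configurations from very different planning paradigms share
# one common interface and can be mixed and matched.

@dataclass
class PlannerSpec:
    topology: str          # e.g. "tree", "dag", "flat"
    initialization: str    # e.g. "task_decomposition", "template"
    adaptation: str        # e.g. "replan_on_failure", "none"
    navigation: str        # e.g. "best_first", "dfs"
    hyperparams: dict = field(default_factory=dict)

def build_planner(spec: PlannerSpec) -> dict:
    """Assemble a flat planner description from the chosen components."""
    return {
        "topology": spec.topology,
        "init": spec.initialization,
        "adapt": spec.adaptation,
        "nav": spec.navigation,
        **spec.hyperparams,
    }

# A tree-structured, failure-replanning configuration:
tree_planner = build_planner(
    PlannerSpec("tree", "task_decomposition", "replan_on_failure",
                "best_first", {"max_depth": 4})
)
```

Because every configuration reduces to the same record shape, a search or learning procedure can sample and compare planners across paradigms without per-paradigm glue code.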

Methodology

  1. Define a design space – PlanFactory enumerates all plausible planner components (graph‑based search, hierarchical decomposition, memory buffers, etc.) and exposes a common API.
  2. Collect training data – The authors generate a large corpus of “planning trajectories”: for each task they sample many planner configurations, run them, and record the resulting performance, stability metrics, and token usage.
  3. Train Todo‑14B with IGPO – The model receives a task description and, via a reinforcement‑learning loop, learns to output a planner configuration that maximizes a weighted sum of three rewards:
    • Performance: success rate / reward on the task.
    • Stability: low variance across runs, avoiding crashes or dead‑ends.
    • Token‑efficiency: penalizing planners that require many LLM calls.
      The “impedance” term in IGPO measures how far a candidate planner deviates from an ideal trade‑off surface, guiding the optimizer toward balanced solutions.
  4. Dynamic revision – At inference time, TodoEvolve can re‑evaluate the generated planner on the fly and suggest incremental revisions (e.g., adding a memory module) if the observed impedance rises, effectively evolving the planner while the agent is running.
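The IGPO objective in step 3 can be sketched as a scalarized reward. The following is an illustrative toy, not the paper's exact formula: a weighted sum of the three rewards, minus an "impedance" penalty measured as the distance of a candidate from an ideal trade-off point; the weights, penalty coefficient, and ideal point are all assumptions:

```python
import math

# Ideal trade-off point: perfect performance, stability, and efficiency.
IDEAL = (1.0, 1.0, 1.0)

def igpo_reward(perf, stability, token_eff,
                w=(0.5, 0.25, 0.25), lam=0.3):
    """Toy IGPO-style scalar reward (all scores normalized to [0, 1])."""
    base = w[0] * perf + w[1] * stability + w[2] * token_eff
    # "Impedance": Euclidean distance from the ideal trade-off point.
    impedance = math.dist((perf, stability, token_eff), IDEAL)
    return base - lam * impedance

# A balanced planner beats a lopsided one even when the lopsided one
# has a higher weighted base score, because its impedance is lower.
balanced = igpo_reward(0.8, 0.8, 0.8)
lopsided = igpo_reward(1.0, 1.0, 0.4)
```

The penalty term captures the guiding idea: candidates far from the balanced trade-off surface are discouraged even when one objective is maximized.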

Results & Findings

| Benchmark | Baseline Planner (hand‑crafted) | TodoEvolve (best) | Token Savings | Runtime Δ |
|---|---|---|---|---|
| WebNav (multi‑page browsing) | 71.2 % success | 78.9 % | ~23 % | +5 % |
| CodeAssist (complex code generation) | 64.5 % | 71.3 % | ~19 % | +3 % |
| Embodied‑Room (simulated robot) | 58.0 % | 66.4 % | ~27 % | +7 % |
| Multi‑step QA | 73.1 % | 80.2 % | ~21 % | +4 % |
| Strategy Game (turn‑based) | 69.8 % | 77.5 % | ~22 % | +6 % |
  • Across all tasks, TodoEvolve consistently beats the strongest manually engineered planner, by roughly 7–8 percentage points.
  • The IGPO‑trained model produces planners that are more stable (lower variance in success rates across random seeds).
  • Token usage drops by roughly 20 %, translating to lower API costs for LLM‑backed agents.
  • The additional runtime overhead is modest (single‑digit percent), making the approach practical for production systems.

Practical Implications

  • Plug‑and‑play planner generation – Developers can call TodoEvolve as a service: give it a task description, receive a ready‑to‑run planning module, and drop it into any existing agent pipeline.
  • Cost‑effective scaling – Because the generated planners are token‑efficient, cloud‑based agents (e.g., ChatGPT plugins, autonomous assistants) can handle more requests within the same budget.
  • Rapid prototyping – Instead of hand‑tuning search depth, memory size, or hierarchical decomposition, teams can iterate by simply re‑prompting TodoEvolve, dramatically shortening the R&D cycle for new domains (e.g., finance, healthcare).
  • Cross‑model portability – The design space abstracts away the underlying LLM, so the same planner can be re‑used with GPT‑4, Claude, or open‑source alternatives, easing migration between providers.
  • Self‑optimizing agents – In long‑running deployments (e.g., autonomous drones), the agent can monitor its own impedance and request a planner revision mid‑mission, leading to more resilient behavior without human intervention.
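The self-optimizing behavior in the last bullet can be sketched as a simple monitoring loop. The function names, threshold, and revision step below are hypothetical, used only to illustrate the control flow of impedance-triggered revision:

```python
def run_with_revision(steps, impedance_of, revise, threshold=0.5):
    """Execute steps, requesting a revised planner whenever the
    observed impedance exceeds the threshold."""
    revisions = 0
    planner = "initial"
    for step in steps:
        imp = impedance_of(step, planner)
        if imp > threshold:
            planner = revise(planner)  # e.g. attach a memory module
            revisions += 1
    return planner, revisions

# Toy run: impedance spikes once mid-mission, forcing one revision.
trace = [0.2, 0.3, 0.7, 0.1]
planner, n = run_with_revision(
    trace,
    impedance_of=lambda imp, _p: imp,   # here each step *is* its impedance
    revise=lambda p: p + "+memory",
)
```

The point is architectural: the revision decision is local and cheap, so it can run inside a long-lived deployment without pausing the agent for human review.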

Limitations & Future Work

  • Design‑space coverage – PlanFactory, while extensive, still reflects the authors’ bias toward known planning paradigms; exotic or domain‑specific structures may be missing.
  • Training cost – Building the high‑quality trajectory dataset and training a 14B model with IGPO requires substantial compute, which could be a barrier for smaller labs.
  • Stability‑vs‑Exploration trade‑off – The impedance term can over‑penalize novel planner configurations, potentially limiting discovery of radically new architectures.
  • Real‑world deployment – All benchmarks are simulated; testing on truly noisy, safety‑critical environments (e.g., robotics in the wild) remains an open step.

Future research directions include expanding PlanFactory with community‑contributed modules, applying meta‑learning to reduce the data‑generation burden, and integrating safety constraints directly into the IGPO objective.

Authors

  • Jiaxi Liu
  • Yanzuo Jiang
  • Guibin Zhang
  • Zihan Zhang
  • Heng Chang
  • Zhenfei Yin
  • Qibing Ren
  • Junchi Yan

Paper Information

  • arXiv ID: 2602.07839v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: February 8, 2026
