[Paper] AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse
Source: arXiv - 2603.18000v1
Overview
The paper introduces AgentFactory, a novel framework that lets large‑language‑model (LLM) agents self‑evolve by distilling successful task executions into reusable, executable Python subagents. Instead of merely storing textual prompts or reflections, AgentFactory captures working code, refines it with feedback, and builds a growing library that can be drawn on for future problems, much like a personal code‑reuse toolbox for AI agents.
Key Contributions
- Executable Subagent Accumulation – Successful LLM‑driven solutions are saved as clean, documented Python modules rather than raw text.
- Continuous Refinement Loop – Each subagent is re‑executed on new tasks; runtime feedback (success/failure, performance metrics) automatically triggers incremental improvements.
- Standardized Interface & Documentation – All subagents follow a common API and include auto‑generated docstrings, making them plug‑and‑play across any Python environment.
- Demonstrated Capability Growth – Empirical evaluation shows the library’s size and effectiveness increase over time, reducing the number of calls to the underlying LLM for similar tasks.
- Open‑source Implementation – Full codebase and demo video are publicly released, encouraging community extensions and real‑world testing.
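To make the "standardized interface and documentation" idea concrete, a stored subagent might look like the module below. This is a minimal sketch: the `run(payload)` entry point, the `METADATA` dict, and the docstring layout are illustrative assumptions, not the paper's actual API.

```python
"""Subagent: normalize_dates
Converts heterogeneous date strings to ISO-8601 format.
The description/version fields below stand in for the auto-generated
documentation the framework would attach at registration time.
"""
from datetime import datetime

# Metadata the library could index for retrieval (illustrative schema).
METADATA = {
    "name": "normalize_dates",
    "description": "Normalize date strings to ISO-8601 (YYYY-MM-DD).",
    "version": 1,
}

_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def run(payload: list) -> list:
    """Common entry point every subagent exposes (assumed convention)."""
    out = []
    for raw in payload:
        for fmt in _FORMATS:
            try:
                out.append(datetime.strptime(raw, fmt).strftime("%Y-%m-%d"))
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"Unrecognized date format: {raw!r}")
    return out
```

Because the module carries no hidden state, it can be imported and called directly from any Python environment, which is what makes such subagents plug‑and‑play.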
Methodology
- Task Execution & Observation – An LLM agent receives a user request, generates a Python script (the subagent), and runs it in a sandboxed environment. Execution logs (outputs, errors, runtime) are captured.
- Success Detection – Simple heuristics (e.g., test‑case passing, absence of exceptions, or user‑provided validation) label each run as successful or as needing improvement.
- Subagent Registration – Successful scripts are stored in a versioned repository with metadata: input description, performance stats, and a generated docstring describing the function signature and purpose.
- Reuse & Composition – When a new request arrives, the system first searches the subagent library for a matching capability (using semantic similarity on the request text and stored metadata). If a match is found, the existing subagent is invoked directly or composed with others.
- Feedback‑Driven Refinement – If a reused subagent fails or performs sub‑optimally, the LLM is prompted to patch the code based on the observed error, creating a new version that replaces the older one in the library.
- Evaluation Loop – Over a sequence of benchmark tasks, the authors track how often the system can answer using stored subagents versus generating fresh code, measuring both success rate and LLM token consumption.
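The reuse-first lookup in the steps above can be sketched with a toy registry. Note the hedges: the `SubagentRecord` schema is an assumption, and semantic similarity is approximated here with a simple token-overlap (Jaccard) score rather than the learned or embedding-based retrieval a real deployment would use.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubagentRecord:
    """Registry entry: searchable metadata for one stored subagent (assumed schema)."""
    name: str
    description: str
    version: int = 1
    success_count: int = 0
    failure_count: int = 0

def _tokens(text: str) -> set:
    return set(text.lower().split())

def similarity(request: str, record: SubagentRecord) -> float:
    """Jaccard overlap between the request and a record's description.
    A crude stand-in for the paper's semantic similarity search."""
    a, b = _tokens(request), _tokens(record.description)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_subagent(request: str, registry: list,
                  threshold: float = 0.3) -> Optional[SubagentRecord]:
    """Return the best-matching stored subagent, or None to signal
    that fresh LLM generation is needed."""
    best = max(registry, key=lambda r: similarity(request, r), default=None)
    if best is not None and similarity(request, best) >= threshold:
        return best
    return None
```

A request like "merge csv files on key" would hit a record described as "merge two csv files on a shared key", while an unrelated request falls below the threshold and falls back to fresh generation.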
The entire pipeline is built on pure Python, leveraging existing LLM APIs (e.g., OpenAI, Anthropic) for code generation and using standard tools like pytest for automated validation.
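The execute-and-observe step can be approximated with the standard library alone. In this sketch, a child-interpreter subprocess with a timeout is a lightweight stand-in for the paper's sandbox, and the "clean exit, no traceback" check mirrors the kind of simple success heuristic described above; neither is the authors' exact implementation.

```python
import subprocess
import sys
import tempfile

def execute_subagent(source: str, timeout: float = 10.0) -> dict:
    """Run generated code in a child interpreter and capture execution logs.
    A subprocess is only a lightweight stand-in for a real sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            # Success heuristic: clean exit and no traceback on stderr.
            "success": proc.returncode == 0 and "Traceback" not in proc.stderr,
        }
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timeout", "success": False}
```

The returned log dict is exactly the kind of runtime feedback (outputs, errors, success flag) that drives the registration and refinement steps.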
Results & Findings
| Metric | Baseline (fresh LLM generation) | AgentFactory (reuse‑first) |
|---|---|---|
| Success Rate | 78 % | 86 % (after 5k tasks) |
| Average LLM Tokens per Task | 1,200 | 540 (≈55 % reduction) |
| Time to Solve Repeated Tasks | 12 s | 4 s (subagent call) |
| Library Size after 10k tasks | – | 1,342 distinct subagents |
Key observations:
- Capability Accumulation – The library grew steadily, and the proportion of tasks solved by reusing existing subagents rose from ~20 % early on to >70 % after several thousand tasks.
- Robustness Gains – Re‑executed subagents showed fewer runtime errors after each refinement cycle, indicating that the feedback loop effectively “debugs” itself.
- Portability – Because subagents are plain Python modules with no hidden state, they can be exported to other projects or deployed on edge devices without needing the original LLM.
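The self-debugging behavior noted above follows the refinement loop from the methodology: run, observe the error, ask the LLM for a patch, retry. A hedged sketch of that control flow, where `llm_patch` is a hypothetical stub standing in for a real LLM API call:

```python
def llm_patch(source: str, error: str) -> str:
    """Hypothetical LLM call: given failing code and its error, return a
    patched version. A real system would prompt an LLM API here; this
    toy stub just comments out failing raise statements to show the flow."""
    return "\n".join("# " + line if "raise" in line else line
                     for line in source.splitlines())

def refine(source: str, run_fn, max_rounds: int = 3):
    """Feedback-driven refinement: re-run, patch on failure, and keep the
    version that finally succeeds (it would replace the old library entry)."""
    for _ in range(max_rounds):
        ok, error = run_fn(source)
        if ok:
            return source, True
        source = llm_patch(source, error)
    return source, False
```

Here `run_fn` is any callable returning `(success, error_message)`, e.g. a wrapper around sandboxed execution; bounding the rounds keeps a stubborn failure from looping forever.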
Practical Implications
- Reduced Cloud Costs – By cutting the number of LLM calls, organizations can lower API usage fees, especially in high‑throughput automation pipelines (e.g., data cleaning, report generation).
- Faster Turnaround for Repetitive Tasks – Developers can treat the subagent library like a personal SDK; invoking a stored function is orders of magnitude quicker than prompting an LLM each time.
- Improved Reliability – Continuous self‑debugging means the system becomes more stable over time, which is valuable for production‑grade agents that must meet SLAs.
- Easier Auditing & Compliance – Since each subagent is documented code, teams can review, test, and certify the exact logic that will run—something that’s hard to do with opaque prompt‑only memories.
- Plug‑and‑Play Across Projects – The standardized API lets different teams share subagents, fostering cross‑project knowledge transfer without re‑training or fine‑tuning models.
Limitations & Future Work
- Scope of Reusability – The current similarity search works best for tasks with clear, deterministic inputs/outputs; highly creative or context‑heavy requests still rely on fresh LLM generation.
- Safety Guarantees – While sandboxing limits damage, automatically executing generated code poses security risks; the authors suggest tighter static analysis and permission‑scoped runtimes as next steps.
- Scalability of the Library – As the number of subagents grows, indexing and retrieval latency could become a bottleneck; future work may explore hierarchical clustering or learned retrieval models.
- Generalization Beyond Python – Extending the paradigm to other runtimes (JavaScript, Rust) would broaden applicability, but would require language‑specific validation pipelines.
Overall, AgentFactory offers a compelling blueprint for turning LLM‑driven agents into self‑improving, code‑centric assistants—bridging the gap between generative AI and traditional software engineering practices.
Authors
- Zhang Zhang
- Shuqi Lu
- Hongjin Qian
- Di He
- Zheng Liu
Paper Information
- arXiv ID: 2603.18000v1
- Categories: cs.AI
- Published: March 18, 2026