[Paper] AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse
Source: arXiv - 2603.18000v1
Overview
The paper introduces AgentFactory, a novel framework that lets large‑language‑model (LLM) agents self‑evolve by distilling successful task executions into reusable, executable Python subagents. Instead of merely storing textual prompts or reflections, AgentFactory captures working code, refines it with feedback, and builds a growing library that can be drawn on for future problems, much like a personal code‑reuse toolbox for AI agents.
Key Contributions
- Executable Subagent Accumulation – Successful LLM‑driven solutions are saved as clean, documented Python modules rather than raw text.
- Continuous Refinement Loop – Each subagent is re‑executed on new tasks; runtime feedback (success/failure, performance metrics) automatically triggers incremental improvements.
- Standardized Interface & Documentation – All subagents follow a common API and include auto‑generated docstrings, making them plug‑and‑play across any Python environment.
- Demonstrated Capability Growth – Empirical evaluation shows the library’s size and effectiveness increase over time, reducing the number of calls to the underlying LLM for similar tasks.
- Open‑source Implementation – Full codebase and demo video are publicly released, encouraging community extensions and real‑world testing.
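To make the "standardized interface and documentation" idea concrete, a stored subagent might look like the module below. This is a minimal sketch: the `run(payload)` entry point, the `METADATA` dict, and the docstring layout are illustrative assumptions, not the paper's actual API.

```python
"""Subagent: normalize_dates
Converts heterogeneous date strings to ISO-8601 format.
The description/version fields below stand in for the auto-generated
documentation the framework would attach at registration time.
"""
from datetime import datetime

# Metadata the library could index for retrieval (illustrative schema).
METADATA = {
    "name": "normalize_dates",
    "description": "Normalize date strings to ISO-8601 (YYYY-MM-DD).",
    "version": 1,
}

_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def run(payload: list) -> list:
    """Common entry point every subagent exposes (assumed convention)."""
    out = []
    for raw in payload:
        for fmt in _FORMATS:
            try:
                out.append(datetime.strptime(raw, fmt).strftime("%Y-%m-%d"))
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"Unrecognized date format: {raw!r}")
    return out
```

Because the module carries no hidden state, it can be imported and called directly from any Python environment, which is what makes such subagents plug‑and‑play.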
Methodology
- Task Execution & Observation – An LLM agent receives a user request, generates a Python script (the subagent), and runs it in a sandboxed environment. Execution logs (outputs, errors, runtime) are captured.
- Success Detection – Simple heuristics (e.g., test‑case passing, absence of exceptions, or user‑provided validation) label each run as successful or as needing improvement.
- Subagent Registration – Successful scripts are stored in a versioned repository with metadata: input description, performance stats, and a generated docstring describing the function signature and purpose.
- Reuse & Composition – When a new request arrives, the system first searches the subagent library for a matching capability (using semantic similarity on the request text and stored metadata). If a match is found, the existing subagent is invoked directly or composed with others.
- Feedback‑Driven Refinement – If a reused subagent fails or performs sub‑optimally, the LLM is prompted to patch the code based on the observed error, creating a new version that replaces the older one in the library.
- Evaluation Loop – Over a sequence of benchmark tasks, the authors track how often the system can answer using stored subagents versus generating fresh code, measuring both success rate and LLM token consumption.
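The reuse-first lookup in the steps above can be sketched with a toy registry. Note the hedges: the `SubagentRecord` schema is an assumption, and semantic similarity is approximated here with a simple token-overlap (Jaccard) score rather than the learned or embedding-based retrieval a real deployment would use.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubagentRecord:
    """Registry entry: searchable metadata for one stored subagent (assumed schema)."""
    name: str
    description: str
    version: int = 1
    success_count: int = 0
    failure_count: int = 0

def _tokens(text: str) -> set:
    return set(text.lower().split())

def similarity(request: str, record: SubagentRecord) -> float:
    """Jaccard overlap between the request and a record's description.
    A crude stand-in for the paper's semantic similarity search."""
    a, b = _tokens(request), _tokens(record.description)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_subagent(request: str, registry: list,
                  threshold: float = 0.3) -> Optional[SubagentRecord]:
    """Return the best-matching stored subagent, or None to signal
    that fresh LLM generation is needed."""
    best = max(registry, key=lambda r: similarity(request, r), default=None)
    if best is not None and similarity(request, best) >= threshold:
        return best
    return None
```

A request like "merge csv files on key" would hit a record described as "merge two csv files on a shared key", while an unrelated request falls below the threshold and falls back to fresh generation.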
The entire pipeline is built on pure Python, leveraging existing LLM APIs (e.g., OpenAI, Anthropic) for code generation and using standard tools like pytest for automated validation.
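The execute-and-observe step can be approximated with the standard library alone. In this sketch, a child-interpreter subprocess with a timeout is a lightweight stand-in for the paper's sandbox, and the "clean exit, no traceback" check mirrors the kind of simple success heuristic described above; neither is the authors' exact implementation.

```python
import subprocess
import sys
import tempfile

def execute_subagent(source: str, timeout: float = 10.0) -> dict:
    """Run generated code in a child interpreter and capture execution logs.
    A subprocess is only a lightweight stand-in for a real sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            # Success heuristic: clean exit and no traceback on stderr.
            "success": proc.returncode == 0 and "Traceback" not in proc.stderr,
        }
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timeout", "success": False}
```

The returned log dict is exactly the kind of runtime feedback (outputs, errors, success flag) that drives the registration and refinement steps.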
Results & Findings
| Metric | Baseline (fresh LLM generation) | AgentFactory (reuse‑first) |
|---|---|---|
| Success Rate | 78 % | 86 % (after 5k tasks) |
| Average LLM Tokens per Task | 1,200 | 540 (≈55 % reduction) |
| Time to Solve Repeated Tasks | 12 s | 4 s (subagent call) |
| Library Size after 10k tasks | – | 1,342 distinct subagents |
Key observations:
- Capability Accumulation – The library grew steadily, and the proportion of tasks solved by reusing existing subagents rose from ~20 % early on to >70 % after several thousand tasks.
- Robustness Gains – Re‑executed subagents showed fewer runtime errors after each refinement cycle, indicating that the feedback loop effectively “debugs” itself.
- Portability – Because subagents are plain Python modules with no hidden state, they can be exported to other projects or deployed on edge devices without needing the original LLM.
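The self-debugging behavior noted above follows the refinement loop from the methodology: run, observe the error, ask the LLM for a patch, retry. A hedged sketch of that control flow, where `llm_patch` is a hypothetical stub standing in for a real LLM API call:

```python
def llm_patch(source: str, error: str) -> str:
    """Hypothetical LLM call: given failing code and its error, return a
    patched version. A real system would prompt an LLM API here; this
    toy stub just comments out failing raise statements to show the flow."""
    return "\n".join("# " + line if "raise" in line else line
                     for line in source.splitlines())

def refine(source: str, run_fn, max_rounds: int = 3):
    """Feedback-driven refinement: re-run, patch on failure, and keep the
    version that finally succeeds (it would replace the old library entry)."""
    for _ in range(max_rounds):
        ok, error = run_fn(source)
        if ok:
            return source, True
        source = llm_patch(source, error)
    return source, False
```

Here `run_fn` is any callable returning `(success, error_message)`, e.g. a wrapper around sandboxed execution; bounding the rounds keeps a stubborn failure from looping forever.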
Practical Implications
- Reduced Cloud Costs – By cutting the number of LLM calls, organizations can lower API usage fees, especially in high‑throughput automation pipelines (e.g., data cleaning, report generation).
- Faster Turnaround for Repetitive Tasks – Developers can treat the subagent library like a personal SDK; invoking a stored function is orders of magnitude quicker than prompting an LLM each time.
- Improved Reliability – Continuous self‑debugging means the system becomes more stable over time, which is valuable for production‑grade agents that must meet SLAs.
- Easier Auditing & Compliance – Since each subagent is documented code, teams can review, test, and certify the exact logic that will run—something that’s hard to do with opaque prompt‑only memories.
- Plug‑and‑Play Across Projects – The standardized API lets different teams share subagents, fostering cross‑project knowledge transfer without re‑training or fine‑tuning models.
Limitations & Future Work
- Scope of Reusability – The current similarity search works best for tasks with clear, deterministic inputs/outputs; highly creative or context‑heavy requests still rely on fresh LLM generation.
- Safety Guarantees – While sandboxing limits damage, automatically executing generated code poses security risks; the authors suggest tighter static analysis and permission‑scoped runtimes as next steps.
- Scalability of the Library – As the number of subagents grows, indexing and retrieval latency could become a bottleneck; future work may explore hierarchical clustering or learned retrieval models.
- Generalization Beyond Python – Extending the paradigm to other runtimes (JavaScript, Rust) would broaden applicability, but would require language‑specific validation pipelines.
Overall, AgentFactory offers a compelling blueprint for turning LLM‑driven agents into self‑improving, code‑centric assistants—bridging the gap between generative AI and traditional software engineering practices.
Authors
- Zhang Zhang
- Shuqi Lu
- Hongjin Qian
- Di He
- Zheng Liu
Paper Information
- arXiv ID: 2603.18000v1
- Categories: cs.AI
- Published: March 18, 2026