[Paper] Evolving Excellence: Automated Optimization of LLM-based Agents

Published: December 9, 2025 at 03:48 PM EST
4 min read
Source: arXiv - 2512.09108v1

Overview

The paper introduces ARTEMIS, a no‑code, evolutionary‑search platform that automatically tunes the many moving parts of large‑language‑model (LLM) agents—prompts, tool descriptions, temperature, etc.—to boost real‑world performance. By treating an agent’s configuration as a genome and evolving it with semantically‑aware genetic operators, ARTEMIS can turn a “bare‑bones” agent into a high‑performing system with only a benchmark script and natural‑language goals as input.

Key Contributions

  • Joint, end‑to‑end optimization of all configurable components of an LLM agent (prompts, tool specs, hyper‑parameters) rather than optimizing each in isolation.
  • Semantically‑aware genetic operators that respect the structure of prompts and tool descriptions, enabling meaningful mutations and crossovers.
  • No‑code workflow: users supply a benchmark script and goal description; ARTEMIS discovers configurable knobs, extracts performance signals from logs, and runs the evolutionary loop automatically.
  • Broad empirical validation on four diverse agents (competitive programming, code optimization, cost‑aware reasoning, and a teaching bot) showing relative improvements of roughly 10 % to over 35 %.
  • Model‑agnostic capability: works with both commercial APIs (e.g., GPT‑4) and locally‑run open‑source models (Qwen2.5‑7B).

Methodology

  1. Configuration Discovery – ARTEMIS parses the supplied agent code to locate all user‑exposed parameters (prompt templates, tool schemas, temperature, max‑tokens, etc.).
  2. Fitness Extraction – Each agent run produces a log; domain‑specific metrics (acceptance rate, execution time, token usage, accuracy) are automatically extracted to serve as the fitness score.
  3. Evolutionary Loop
    • Population Initialization – Randomly sample values for each configurable knob within sensible bounds.
    • Selection – Keep the top‑performing configurations (elitism) and probabilistically select others for breeding.
    • Semantically‑Aware Mutation – Replace words/phrases in prompts with synonyms, reorder tool arguments, or tweak numeric hyper‑parameters while preserving syntactic validity.
    • Crossover – Combine two parent configurations by swapping whole prompt blocks or tool definitions, ensuring the offspring remain executable.
    • Evaluation – Run the agent on the benchmark script, collect fitness, and repeat for a fixed number of generations or until convergence.
  4. Result Export – The best‑found configuration is emitted as a ready‑to‑use YAML/JSON file that can replace the original agent’s defaults without any code changes.
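The evolutionary loop described above can be sketched in a few dozen lines of Python. Everything below is a toy illustration under assumed knobs (temperature, max tokens, a prompt choice) and a stand‑in fitness function, not the actual ARTEMIS implementation:

```python
import copy
import random

POP_SIZE = 8
GENERATIONS = 30
ELITE = 2  # top configurations carried over unchanged

def random_config():
    """Step 1/3a: sample each discovered knob within sensible bounds."""
    return {
        "temperature": round(random.uniform(0.0, 1.2), 2),
        "max_tokens": random.choice([256, 512, 1024]),
        "prompt": random.choice(["Solve step by step.", "Answer concisely."]),
    }

def fitness(config):
    """Step 2: stand-in for running the benchmark and parsing its log.
    A real system would execute the agent and extract metrics here."""
    score = 1.0 - abs(config["temperature"] - 0.3)  # toy objective
    score += 0.1 if "step" in config["prompt"] else 0.0
    return score

def mutate(config):
    """Mutation: tweak one knob while preserving validity (bounds, choices)."""
    child = copy.deepcopy(config)
    knob = random.choice(list(child))
    if knob == "temperature":
        child[knob] = round(min(1.2, max(0.0, child[knob] + random.gauss(0, 0.1))), 2)
    elif knob == "max_tokens":
        child[knob] = random.choice([256, 512, 1024])
    else:
        child[knob] = random.choice(["Solve step by step.", "Answer concisely."])
    return child

def crossover(a, b):
    """Crossover: swap whole fields between parents; offspring stay valid."""
    return {k: random.choice([a[k], b[k]]) for k in a}

def evolve():
    population = [random_config() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:ELITE]          # elitism
        parents = ranked[: POP_SIZE // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(POP_SIZE - ELITE)]
        population = elites + children
    return max(population, key=fitness)

best = evolve()
```

In the real system the mutation operators are semantically aware (synonym substitution inside prompts, tool‑schema edits) rather than the simple choice swaps shown here, and the best configuration is exported as YAML/JSON rather than returned in memory.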

Results & Findings

| Agent (Task) | Baseline Metric | ARTEMIS‑Improved Metric | Relative Gain |
| --- | --- | --- | --- |
| ALE Agent (AtCoder Heuristic Contest) | 62 % acceptance | 70.5 % acceptance | +13.6 % |
| Mini‑SWE Agent (SWE‑Perf code optimization) | 1.23× speed‑up | 1.35× speed‑up | +10.1 % (p < 0.01) |
| CrewAI Agent (Math Odyssey cost‑aware reasoning) | 1,200 tokens per query | 760 tokens per query | −36.9 % token usage (p < 0.01) |
| MathTales‑Teacher (GSM8K with Qwen2.5‑7B) | 48 % accuracy | 58.6 % accuracy | +22 % |

Key takeaways

  • Joint optimization yields larger gains than tweaking prompts or hyper‑parameters alone.
  • Even modest‑size open‑source models benefit dramatically, indicating that ARTEMIS is not limited to “big‑API” LLMs.
  • The evolutionary process converges within a few dozen generations (≈ 30 – 50), requiring only a few hours of compute on a single GPU for most benchmarks.

Practical Implications

  • Rapid prototyping – Development teams can spin up a new LLM‑agent, point ARTEMIS at a representative test suite, and obtain a production‑ready configuration in hours instead of weeks.
  • Cost reduction – By minimizing token consumption (as shown with CrewAI), organizations can cut API bills substantially, especially for high‑throughput services.
  • Model‑agnostic deployment – Companies that prefer on‑premise models can still reap performance boosts without rewriting agents for each model’s quirks.
  • Continuous improvement pipelines – ARTEMIS can be integrated into CI/CD workflows: each new version of an agent or underlying LLM triggers an automated evolutionary run, guaranteeing that regressions are caught early.
  • Cross‑domain applicability – The same platform optimized agents for competitive programming, code refactoring, and educational tutoring, suggesting it can be applied to any LLM‑driven workflow (e.g., automated ticket triage, data extraction, UI generation).
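For the continuous‑improvement use case, the CI/CD hook amounts to a regression gate: re‑score the evolved configuration on the benchmark and block the release if it falls below a recorded baseline. A minimal sketch, assuming a scalar benchmark score (the function name and tolerance are illustrative, not part of ARTEMIS):

```python
def gate(new_score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass the gate if the new score is within `tolerance` of the baseline.

    `new_score` would come from re-running the agent's benchmark script
    after an agent or model update; `baseline` is the recorded score of
    the last released configuration.
    """
    return new_score >= baseline - tolerance

# Example: a re-tuned agent scoring 0.70 against a 0.705 baseline passes,
# while a drop to 0.60 fails and should block the release.
ok = gate(0.70, 0.705)
bad = gate(0.60, 0.705)
```

In practice a failing gate would trigger a fresh evolutionary run before the new agent or model version ships.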

Limitations & Future Work

  • Search cost – Although far cheaper than manual tuning, the evolutionary process still requires many agent executions, which may be prohibitive for extremely expensive API calls or latency‑critical systems.
  • Fitness signal quality – ARTEMIS relies on well‑defined performance metrics; ambiguous or multi‑objective goals (e.g., balancing speed vs. correctness) need more sophisticated fitness aggregation.
  • Semantic mutation scope – Current operators use synonym dictionaries and simple template swaps; richer language‑model‑guided mutations could explore a larger design space.
  • Scalability to massive configuration spaces – Agents with hundreds of knobs may suffer from premature convergence; future work could incorporate surrogate models or Bayesian optimization hybrids.
  • Human interpretability – The evolved prompts can become unintuitive; providing tools to visualize and explain why a particular wording works better would aid trust and adoption.
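On the multi‑objective point: a common (if blunt) way to fold several metrics into the single fitness score an evolutionary loop needs is a weighted sum over normalized values. This is a generic sketch of that idea, not the aggregation ARTEMIS itself uses; the weights and bounds are illustrative:

```python
def aggregate_fitness(metrics, weights, bounds):
    """Scalarize multi-objective metrics into one fitness value.

    metrics: measured values, e.g. {"accuracy": 0.586, "tokens": 760}
    weights: positive weight -> maximize, negative -> minimize
    bounds:  per-metric (min, max) used to normalize to [0, 1]
    """
    score = 0.0
    for name, value in metrics.items():
        lo, hi = bounds[name]
        norm = (value - lo) / (hi - lo)  # scale each metric to [0, 1]
        score += weights[name] * norm
    return score

# Reward accuracy, penalize token usage at half the weight.
s = aggregate_fitness(
    {"accuracy": 0.586, "tokens": 760},
    weights={"accuracy": 1.0, "tokens": -0.5},
    bounds={"accuracy": (0.0, 1.0), "tokens": (0, 2000)},
)
```

Weighted sums force the user to pick trade‑offs up front; Pareto‑based selection (e.g. NSGA‑II‑style ranking) is the usual alternative when the speed‑vs‑correctness balance should be left open.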

Bottom line: ARTEMIS demonstrates that automated, evolutionary tuning can turn “good enough” LLM agents into high‑performing, cost‑effective tools, opening the door for wider, faster adoption of agentic AI in production environments.

Authors

  • Paul Brookes
  • Vardan Voskanyan
  • Rafail Giavrimis
  • Matthew Truscott
  • Mina Ilieva
  • Chrystalla Pavlou
  • Alexandru Staicu
  • Manal Adham
  • Will Evers‑Hood
  • Jingzhi Gong
  • Kejia Zhang
  • Matvey Fedoseev
  • Vishal Sharma
  • Roman Bauer
  • Zheng Wang
  • Hema Nair
  • Wei Jie
  • Tianhua Xu
  • Aurora Constantin
  • Leslie Kanthan
  • Michail Basios

Paper Information

  • arXiv ID: 2512.09108v1
  • Categories: cs.SE, cs.AI
  • Published: December 9, 2025