[Paper] Evolving Excellence: Automated Optimization of LLM-based Agents

Published: December 9, 2025 at 03:48 PM EST
4 min read
Source: arXiv - 2512.09108v1

Overview

The paper introduces ARTEMIS, a no‑code, evolutionary‑search platform that automatically tunes the many moving parts of large‑language‑model (LLM) agents—prompts, tool descriptions, temperature, etc.—to boost real‑world performance. By treating an agent’s configuration as a genome and evolving it with semantically‑aware genetic operators, ARTEMIS can turn a “bare‑bones” agent into a high‑performing system with only a benchmark script and natural‑language goals as input.

Key Contributions

  • Joint, end‑to‑end optimization of all configurable components of an LLM agent (prompts, tool specs, hyper‑parameters) rather than optimizing each in isolation.
  • Semantically‑aware genetic operators that respect the structure of prompts and tool descriptions, enabling meaningful mutations and crossovers.
  • No‑code workflow: users supply a benchmark script and goal description; ARTEMIS discovers configurable knobs, extracts performance signals from logs, and runs the evolutionary loop automatically.
  • Broad empirical validation on four diverse agents (competitive programming, code optimization, cost‑aware reasoning, and a teaching bot) showing relative improvements of roughly 10 % to over 35 %.
  • Model‑agnostic capability: works with both commercial APIs (e.g., GPT‑4) and locally‑run open‑source models (Qwen2.5‑7B).

Methodology

  1. Configuration Discovery – ARTEMIS parses the supplied agent code to locate all user‑exposed parameters (prompt templates, tool schemas, temperature, max‑tokens, etc.).
  2. Fitness Extraction – Each agent run produces a log; domain‑specific metrics (acceptance rate, execution time, token usage, accuracy) are automatically extracted to serve as the fitness score.
  3. Evolutionary Loop
    • Population Initialization – Randomly sample values for each configurable knob within sensible bounds.
    • Selection – Keep the top‑performing configurations (elitism) and probabilistically select others for breeding.
    • Semantically‑Aware Mutation – Replace words/phrases in prompts with synonyms, reorder tool arguments, or tweak numeric hyper‑parameters while preserving syntactic validity.
    • Crossover – Combine two parent configurations by swapping whole prompt blocks or tool definitions, ensuring the offspring remain executable.
    • Evaluation – Run the agent on the benchmark script, collect fitness, and repeat for a fixed number of generations or until convergence.
  4. Result Export – The best‑found configuration is emitted as a ready‑to‑use YAML/JSON file that can replace the original agent’s defaults without any code changes.
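The evolutionary loop described above can be sketched in a few dozen lines of Python. Everything below is a toy illustration under assumed knobs (temperature, max tokens, a prompt choice) and a stand‑in fitness function, not the actual ARTEMIS implementation:

```python
import copy
import random

POP_SIZE = 8
GENERATIONS = 30
ELITE = 2  # top configurations carried over unchanged

def random_config():
    """Step 1/3a: sample each discovered knob within sensible bounds."""
    return {
        "temperature": round(random.uniform(0.0, 1.2), 2),
        "max_tokens": random.choice([256, 512, 1024]),
        "prompt": random.choice(["Solve step by step.", "Answer concisely."]),
    }

def fitness(config):
    """Step 2: stand-in for running the benchmark and parsing its log.
    A real system would execute the agent and extract metrics here."""
    score = 1.0 - abs(config["temperature"] - 0.3)  # toy objective
    score += 0.1 if "step" in config["prompt"] else 0.0
    return score

def mutate(config):
    """Mutation: tweak one knob while preserving validity (bounds, choices)."""
    child = copy.deepcopy(config)
    knob = random.choice(list(child))
    if knob == "temperature":
        child[knob] = round(min(1.2, max(0.0, child[knob] + random.gauss(0, 0.1))), 2)
    elif knob == "max_tokens":
        child[knob] = random.choice([256, 512, 1024])
    else:
        child[knob] = random.choice(["Solve step by step.", "Answer concisely."])
    return child

def crossover(a, b):
    """Crossover: swap whole fields between parents; offspring stay valid."""
    return {k: random.choice([a[k], b[k]]) for k in a}

def evolve():
    population = [random_config() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:ELITE]          # elitism
        parents = ranked[: POP_SIZE // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(POP_SIZE - ELITE)]
        population = elites + children
    return max(population, key=fitness)

best = evolve()
```

In the real system the mutation operators are semantically aware (synonym substitution inside prompts, tool‑schema edits) rather than the simple choice swaps shown here, and the best configuration is exported as YAML/JSON rather than returned in memory.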

Results & Findings

| Agent (Task) | Baseline Metric | ARTEMIS‑Improved Metric | Relative Gain |
| --- | --- | --- | --- |
| ALE Agent (AtCoder Heuristic Contest) | 62 % acceptance | 70.5 % acceptance | +13.6 % |
| Mini‑SWE Agent (SWE‑Perf code optimization) | 1.23× speed‑up | 1.35× speed‑up | +10.1 % (p < 0.01) |
| CrewAI Agent (Math Odyssey cost‑aware reasoning) | 1,200 tokens per query | 760 tokens per query | −36.9 % token usage (p < 0.01) |
| MathTales‑Teacher (GSM8K with Qwen2.5‑7B) | 48 % accuracy | 58.6 % accuracy | +22 % |

Key takeaways

  • Joint optimization yields larger gains than tweaking prompts or hyper‑parameters alone.
  • Even modest‑size open‑source models benefit dramatically, indicating that ARTEMIS is not limited to “big‑API” LLMs.
  • The evolutionary process converges within a few dozen generations (≈ 30 – 50), requiring only a few hours of compute on a single GPU for most benchmarks.

Practical Implications

  • Rapid prototyping – Development teams can spin up a new LLM‑agent, point ARTEMIS at a representative test suite, and obtain a production‑ready configuration in hours instead of weeks.
  • Cost reduction – By minimizing token consumption (as shown with CrewAI), organizations can cut API bills substantially, especially for high‑throughput services.
  • Model‑agnostic deployment – Companies that prefer on‑premise models can still reap performance boosts without rewriting agents for each model’s quirks.
  • Continuous improvement pipelines – ARTEMIS can be integrated into CI/CD workflows: each new version of an agent or underlying LLM triggers an automated evolutionary run, guaranteeing that regressions are caught early.
  • Cross‑domain applicability – The same platform optimized agents for competitive programming, code refactoring, and educational tutoring, suggesting it can be applied to any LLM‑driven workflow (e.g., automated ticket triage, data extraction, UI generation).
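For the continuous‑improvement use case, the CI/CD hook amounts to a regression gate: re‑score the evolved configuration on the benchmark and block the release if it falls below a recorded baseline. A minimal sketch, assuming a scalar benchmark score (the function name and tolerance are illustrative, not part of ARTEMIS):

```python
def gate(new_score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass the gate if the new score is within `tolerance` of the baseline.

    `new_score` would come from re-running the agent's benchmark script
    after an agent or model update; `baseline` is the recorded score of
    the last released configuration.
    """
    return new_score >= baseline - tolerance

# Example: a re-tuned agent scoring 0.70 against a 0.705 baseline passes,
# while a drop to 0.60 fails and should block the release.
ok = gate(0.70, 0.705)
bad = gate(0.60, 0.705)
```

In practice a failing gate would trigger a fresh evolutionary run before the new agent or model version ships.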

Limitations & Future Work

  • Search cost – Although far cheaper than manual tuning, the evolutionary process still requires many agent executions, which may be prohibitive for extremely expensive API calls or latency‑critical systems.
  • Fitness signal quality – ARTEMIS relies on well‑defined performance metrics; ambiguous or multi‑objective goals (e.g., balancing speed vs. correctness) need more sophisticated fitness aggregation.
  • Semantic mutation scope – Current operators use synonym dictionaries and simple template swaps; richer language‑model‑guided mutations could explore a larger design space.
  • Scalability to massive configuration spaces – Agents with hundreds of knobs may suffer from premature convergence; future work could incorporate surrogate models or Bayesian optimization hybrids.
  • Human interpretability – The evolved prompts can become unintuitive; providing tools to visualize and explain why a particular wording works better would aid trust and adoption.
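On the multi‑objective point: a common (if blunt) way to fold several metrics into the single fitness score an evolutionary loop needs is a weighted sum over normalized values. This is a generic sketch of that idea, not the aggregation ARTEMIS itself uses; the weights and bounds are illustrative:

```python
def aggregate_fitness(metrics, weights, bounds):
    """Scalarize multi-objective metrics into one fitness value.

    metrics: measured values, e.g. {"accuracy": 0.586, "tokens": 760}
    weights: positive weight -> maximize, negative -> minimize
    bounds:  per-metric (min, max) used to normalize to [0, 1]
    """
    score = 0.0
    for name, value in metrics.items():
        lo, hi = bounds[name]
        norm = (value - lo) / (hi - lo)  # scale each metric to [0, 1]
        score += weights[name] * norm
    return score

# Reward accuracy, penalize token usage at half the weight.
s = aggregate_fitness(
    {"accuracy": 0.586, "tokens": 760},
    weights={"accuracy": 1.0, "tokens": -0.5},
    bounds={"accuracy": (0.0, 1.0), "tokens": (0, 2000)},
)
```

Weighted sums force the user to pick trade‑offs up front; Pareto‑based selection (e.g. NSGA‑II‑style ranking) is the usual alternative when the speed‑vs‑correctness balance should be left open.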

Bottom line: ARTEMIS demonstrates that automated, evolutionary tuning can turn “good enough” LLM agents into high‑performing, cost‑effective tools, opening the door for wider, faster adoption of agentic AI in production environments.

Authors

  • Paul Brookes
  • Vardan Voskanyan
  • Rafail Giavrimis
  • Matthew Truscott
  • Mina Ilieva
  • Chrystalla Pavlou
  • Alexandru Staicu
  • Manal Adham
  • Will Evers‑Hood
  • Jingzhi Gong
  • Kejia Zhang
  • Matvey Fedoseev
  • Vishal Sharma
  • Roman Bauer
  • Zheng Wang
  • Hema Nair
  • Wei Jie
  • Tianhua Xu
  • Aurora Constantin
  • Leslie Kanthan
  • Michail Basios

Paper Information

  • arXiv ID: 2512.09108v1
  • Categories: cs.SE, cs.AI
  • Published: December 9, 2025