[Paper] Evolution of Benchmark: Black-Box Optimization Benchmark Design through Large Language Model

Published: January 29, 2026 at 10:45 AM EST
4 min read
Source: arXiv - 2601.21877v1

Overview

The paper introduces Evolution of Benchmark (EoB), a system that automatically generates black‑box optimization (BBO) test functions using a large language model (LLM). By treating benchmark creation as an optimization problem itself, EoB produces diverse, unbiased problem landscapes that better differentiate between solvers—opening the door to more reliable algorithm evaluation and data‑driven optimizer design.

Key Contributions

  • LLM‑driven benchmark synthesis: Leverages the generative and program‑evolution abilities of modern LLMs to create executable benchmark functions without human hand‑crafting.
  • Bi‑objective formulation: Simultaneously maximizes (i) landscape diversity and (ii) the ability of a benchmark set to distinguish (differentiate) among a portfolio of BBO algorithms.
  • Co‑evolutionary loop: Introduces a reflection‑based scheme where candidate benchmark programs and their resulting landscapes evolve together, guided by feedback from solver performance.
  • Multi‑purpose utility: Demonstrates that the generated benchmarks are effective for (1) standard algorithm benchmarking, (2) training/testing learning‑assisted BBO methods, and (3) serving as proxies for expensive real‑world optimization problems.
  • Extensive empirical validation: Shows that EoB‑generated suites rival or surpass classic human‑designed benchmark collections across several evaluation criteria.
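To make the "executable benchmark function" idea concrete, here is a hypothetical example of the kind of short Python program EoB is described as synthesizing (this specific function is illustrative and not taken from the paper): a quadratic bowl combined with a sinusoidal term that adds modality and ruggedness.

```python
import numpy as np

def benchmark_f(x) -> float:
    """Hypothetical generated benchmark: a rugged, multimodal landscape
    built from a unimodal bowl plus a high-frequency ripple term."""
    x = np.asarray(x, dtype=float)
    bowl = np.sum((x - 0.5) ** 2)                   # unimodal base component
    ripple = np.sum(np.sin(8.0 * np.pi * x) ** 2)   # adds modality/ruggedness
    return float(bowl + 0.1 * ripple)
```

Because each benchmark is just a function from a decision vector to a scalar, any BBO solver can be run against it unchanged.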

Methodology

  1. Problem encoding: Each benchmark is represented as a short Python (or similar) program that maps a vector of decision variables to a scalar fitness value.
  2. Population initialization: An LLM is prompted with a template and a few seed examples to produce an initial pool of benchmark programs.
  3. Landscape evaluation: For every candidate program, a set of representative BBO solvers (e.g., CMA‑ES, DE, PSO) is run. Two metrics are extracted:
    • Diversity – statistical spread of landscape features (e.g., modality, ruggedness).
    • Differentiation – variance in solver performance rankings on that landscape.
  4. Bi‑objective optimization: Using a multi‑objective evolutionary algorithm (e.g., NSGA‑II), the system selects programs that jointly improve diversity and differentiation.
  5. Reflection‑based prompting: The LLM receives feedback (the “reflection”) about which aspects of the current programs performed well or poorly, and is asked to generate mutated or entirely new programs accordingly.
  6. Iterative co‑evolution: Steps 3‑5 repeat until convergence or a budget limit, yielding a final benchmark suite that balances the two objectives.
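The differentiation signal in step 3 can be sketched as the spread of solver rankings on a candidate landscape. The following is a minimal illustration under that assumption; the paper's exact metric may differ:

```python
import numpy as np

def differentiation_score(solver_losses: np.ndarray) -> float:
    """Hypothetical differentiation metric: spread of average solver ranks.

    solver_losses has shape (n_runs, n_solvers); each row holds the best
    loss each solver reached on one independent run of the landscape.
    A larger spread of ranks means the landscape separates solvers better.
    """
    # Rank solvers within each run (0 = best loss in that run).
    ranks = np.argsort(np.argsort(solver_losses, axis=1), axis=1)
    # Average rank per solver across runs, then the variance of those averages.
    mean_ranks = ranks.mean(axis=0)
    return float(np.var(mean_ranks))
```

A landscape on which one solver consistently wins and another consistently loses scores high; one where rankings shuffle randomly between runs scores low.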

The whole pipeline runs automatically once the initial prompts and solver portfolio are defined, requiring minimal human oversight.
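The loop can be sketched in a few lines of Python. Everything here is a stand-in: `propose_programs` stubs the LLM call, `evaluate` stubs the solver portfolio, and the scalarized selection replaces the paper's multi-objective selection (e.g., NSGA-II) purely for brevity.

```python
import random

def propose_programs(feedback: str, n: int) -> list[str]:
    """Stand-in for the LLM call; in practice this would prompt a model
    with the reflection text and seed examples."""
    return [f"lambda x: sum(v**2 for v in x) + {random.random():.3f}"
            for _ in range(n)]

def evaluate(program: str) -> tuple[float, float]:
    """Stand-in for running the solver portfolio on one candidate;
    returns (diversity, differentiation) scores."""
    random.seed(hash(program) % (2**32))
    return random.random(), random.random()

def eob_loop(generations: int = 3, pop_size: int = 4) -> list[str]:
    population = propose_programs("initial seed prompt", pop_size)
    for _ in range(generations):
        scored = [(evaluate(p), p) for p in population]
        # Scalarized stand-in for the paper's multi-objective selection.
        scored.sort(key=lambda sp: sp[0][0] + sp[0][1], reverse=True)
        survivors = [p for _, p in scored[: pop_size // 2]]
        reflection = f"best (diversity, differentiation): {scored[0][0]}"
        population = survivors + propose_programs(
            reflection, pop_size - len(survivors))
    return population
```

The key structural point is that the LLM sees a reflection string describing how the current population scored, so generation quality and selection pressure co-evolve.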

Results & Findings

| Evaluation | Human‑crafted suites (e.g., BBOB) | EoB‑generated suites |
| --- | --- | --- |
| Landscape diversity (feature spread) | Moderate | Higher (≈30 % increase) |
| Algorithm differentiation (ranking variance) | Low‑to‑moderate | Significantly higher (≈45 % increase) |
| Predictive power for learning‑assisted optimizers | Baseline | Improved test‑set performance (≈10 % lower regret) |
| Proxy quality for expensive real‑world problems | Limited transfer | Better correlation with real‑world objective values (R² ↑ 0.12) |

Key Takeaways

  • EoB’s benchmarks expose strengths and weaknesses of solvers more clearly than traditional suites.
  • When used to train surrogate‑based or reinforcement‑learning BBO methods, the generated problems lead to models that generalize better to unseen tasks.
  • The automatically created proxy functions can replace costly simulations in early‑stage algorithm development, cutting computational budgets by up to 40 %.

Practical Implications

  • Accelerated algorithm development: Teams can spin up a custom benchmark suite in minutes, tailored to the specific solvers they care about, without waiting for community‑curated collections.
  • More trustworthy benchmarking: By reducing human bias in problem design, performance claims become harder to overfit, fostering fairer competition among BBO libraries (e.g., Nevergrad, PyGMO).
  • Data‑driven optimizer training: Researchers building learning‑assisted optimizers (meta‑learners, neural surrogates) gain a richer, automatically refreshed training set, improving robustness.
  • Rapid prototyping for expensive domains: Industries such as aerospace design, drug discovery, or finance can use EoB‑generated proxies to evaluate algorithmic ideas before committing to expensive simulations or wet‑lab experiments.
  • Open‑source integration: Because EoB operates through standard LLM APIs and produces plain Python functions, it can be wrapped into CI pipelines or benchmark‑as‑a‑service platforms.
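Since the generated benchmarks are plain Python functions, a CI gate can sanity-check them before they enter a suite. This is a hypothetical harness, not something the paper specifies: it compiles a generated source string and verifies it defines a callable `benchmark` that returns a finite float on a probe point.

```python
import math

def validate_benchmark(source: str, dim: int = 3) -> bool:
    """Hypothetical CI gate: compile a generated benchmark program and
    check it returns a finite float on a probe point."""
    namespace: dict = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    f = namespace.get("benchmark")
    if not callable(f):
        return False
    value = f([0.0] * dim)
    return isinstance(value, float) and math.isfinite(value)

# A well-formed generated program passes; source missing the function fails.
generated = "def benchmark(x):\n    return float(sum(v * v for v in x))\n"
```

A real pipeline would also sandbox the `exec` call and add timeouts, since the source comes from an LLM.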

Limitations & Future Work

  • LLM dependency: The quality of generated benchmarks hinges on the underlying LLM’s code synthesis capabilities; outdated or smaller models may produce syntactically correct but mathematically trivial functions.
  • Computational cost of evaluation: Running multiple solvers on each candidate landscape is still expensive for high‑dimensional problems; smarter surrogate‑based evaluation could reduce this overhead.
  • Scope of problem domains: Current experiments focus on continuous, unconstrained spaces; extending EoB to combinatorial, constrained, or multi‑objective settings remains an open challenge.
  • Explainability: While the benchmarks are executable code, understanding why a particular landscape yields high differentiation is non‑trivial; future work could incorporate feature‑level introspection or symbolic analysis.

Overall, the paper demonstrates that large language models can move beyond code completion to become creative designers of scientific artifacts—here, the very testbeds that drive progress in black‑box optimization.

Authors

  • Chen Wang
  • Sijie Ma
  • Zeyuan Ma
  • Yue‑Jiao Gong

Paper Information

  • arXiv ID: 2601.21877v1
  • Categories: cs.NE
  • Published: January 29, 2026