[Paper] Evolution of Benchmark: Black-Box Optimization Benchmark Design through Large Language Model

Published: January 29, 2026 at 10:45 AM EST
4 min read
Source: arXiv - 2601.21877v1

Overview

The paper introduces Evolution of Benchmark (EoB), a system that automatically generates black‑box optimization (BBO) test functions using a large language model (LLM). By treating benchmark creation as an optimization problem itself, EoB produces diverse, unbiased problem landscapes that better differentiate between solvers—opening the door to more reliable algorithm evaluation and data‑driven optimizer design.

Key Contributions

  • LLM‑driven benchmark synthesis: Leverages the generative and program‑evolution abilities of modern LLMs to create executable benchmark functions without human hand‑crafting.
  • Bi‑objective formulation: Simultaneously maximizes (i) landscape diversity and (ii) the ability of a benchmark set to distinguish (differentiate) among a portfolio of BBO algorithms.
  • Co‑evolutionary loop: Introduces a reflection‑based scheme where candidate benchmark programs and their resulting landscapes evolve together, guided by feedback from solver performance.
  • Multi‑purpose utility: Demonstrates that the generated benchmarks are effective for (1) standard algorithm benchmarking, (2) training/testing learning‑assisted BBO methods, and (3) serving as proxies for expensive real‑world optimization problems.
  • Extensive empirical validation: Shows that EoB‑generated suites rival or surpass classic human‑designed benchmark collections across several evaluation criteria.
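To make the "executable benchmark function" idea concrete, here is a hypothetical example of the kind of short Python program EoB is described as synthesizing (this specific function is illustrative and not taken from the paper): a quadratic bowl combined with a sinusoidal term that adds modality and ruggedness.

```python
import numpy as np

def benchmark_f(x) -> float:
    """Hypothetical generated benchmark: a rugged, multimodal landscape
    built from a unimodal bowl plus a high-frequency ripple term."""
    x = np.asarray(x, dtype=float)
    bowl = np.sum((x - 0.5) ** 2)                   # unimodal base component
    ripple = np.sum(np.sin(8.0 * np.pi * x) ** 2)   # adds modality/ruggedness
    return float(bowl + 0.1 * ripple)
```

Because each benchmark is just a function from a decision vector to a scalar, any BBO solver can be run against it unchanged.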

Methodology

  1. Problem encoding: Each benchmark is represented as a short Python (or similar) program that maps a vector of decision variables to a scalar fitness value.
  2. Population initialization: An LLM is prompted with a template and a few seed examples to produce an initial pool of benchmark programs.
  3. Landscape evaluation: For every candidate program, a set of representative BBO solvers (e.g., CMA‑ES, DE, PSO) is run. Two metrics are extracted:
    • Diversity – statistical spread of landscape features (e.g., modality, ruggedness).
    • Differentiation – variance in solver performance rankings on that landscape.
  4. Bi‑objective optimization: Using a multi‑objective evolutionary algorithm (e.g., NSGA‑II), the system selects programs that jointly improve diversity and differentiation.
  5. Reflection‑based prompting: The LLM receives feedback (the “reflection”) about which aspects of the current programs performed well or poorly, and is asked to generate mutated or entirely new programs accordingly.
  6. Iterative co‑evolution: Steps 3‑5 repeat until convergence or a budget limit, yielding a final benchmark suite that balances the two objectives.
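The differentiation signal in step 3 can be sketched as the spread of solver rankings on a candidate landscape. The following is a minimal illustration under that assumption; the paper's exact metric may differ:

```python
import numpy as np

def differentiation_score(solver_losses: np.ndarray) -> float:
    """Hypothetical differentiation metric: spread of average solver ranks.

    solver_losses has shape (n_runs, n_solvers); each row holds the best
    loss each solver reached on one independent run of the landscape.
    A larger spread of ranks means the landscape separates solvers better.
    """
    # Rank solvers within each run (0 = best loss in that run).
    ranks = np.argsort(np.argsort(solver_losses, axis=1), axis=1)
    # Average rank per solver across runs, then the variance of those averages.
    mean_ranks = ranks.mean(axis=0)
    return float(np.var(mean_ranks))
```

A landscape on which one solver consistently wins and another consistently loses scores high; one where rankings shuffle randomly between runs scores low.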

The whole pipeline runs automatically once the initial prompts and solver portfolio are defined, requiring minimal human oversight.
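The loop can be sketched in a few lines of Python. Everything here is a stand-in: `propose_programs` stubs the LLM call, `evaluate` stubs the solver portfolio, and the scalarized selection replaces the paper's multi-objective selection (e.g., NSGA-II) purely for brevity.

```python
import random

def propose_programs(feedback: str, n: int) -> list[str]:
    """Stand-in for the LLM call; in practice this would prompt a model
    with the reflection text and seed examples."""
    return [f"lambda x: sum(v**2 for v in x) + {random.random():.3f}"
            for _ in range(n)]

def evaluate(program: str) -> tuple[float, float]:
    """Stand-in for running the solver portfolio on one candidate;
    returns (diversity, differentiation) scores."""
    random.seed(hash(program) % (2**32))
    return random.random(), random.random()

def eob_loop(generations: int = 3, pop_size: int = 4) -> list[str]:
    population = propose_programs("initial seed prompt", pop_size)
    for _ in range(generations):
        scored = [(evaluate(p), p) for p in population]
        # Scalarized stand-in for the paper's multi-objective selection.
        scored.sort(key=lambda sp: sp[0][0] + sp[0][1], reverse=True)
        survivors = [p for _, p in scored[: pop_size // 2]]
        reflection = f"best (diversity, differentiation): {scored[0][0]}"
        population = survivors + propose_programs(
            reflection, pop_size - len(survivors))
    return population
```

The key structural point is that the LLM sees a reflection string describing how the current population scored, so generation quality and selection pressure co-evolve.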

Results & Findings

| Evaluation | Human‑crafted suites (e.g., BBOB) | EoB‑generated suites |
| --- | --- | --- |
| Landscape diversity (feature spread) | Moderate | Higher (≈30 % increase) |
| Algorithm differentiation (ranking variance) | Low‑to‑moderate | Significantly higher (≈45 % increase) |
| Predictive power for learning‑assisted optimizers | Baseline | Improved test‑set performance (≈10 % lower regret) |
| Proxy quality for expensive real‑world problems | Limited transfer | Better correlation with real‑world objective values (R² ↑ 0.12) |

Key Takeaways

  • EoB’s benchmarks expose strengths and weaknesses of solvers more clearly than traditional suites.
  • When used to train surrogate‑based or reinforcement‑learning BBO methods, the generated problems lead to models that generalize better to unseen tasks.
  • The automatically created proxy functions can replace costly simulations in early‑stage algorithm development, cutting computational budgets by up to 40 %.

Practical Implications

  • Accelerated algorithm development: Teams can spin up a custom benchmark suite in minutes, tailored to the specific solvers they care about, without waiting for community‑curated collections.
  • More trustworthy benchmarking: By reducing human bias in problem design, performance claims become harder to overfit, fostering fairer competition among BBO libraries (e.g., Nevergrad, PyGMO).
  • Data‑driven optimizer training: Researchers building learning‑assisted optimizers (meta‑learners, neural surrogates) gain a richer, automatically refreshed training set, improving robustness.
  • Rapid prototyping for expensive domains: Industries such as aerospace design, drug discovery, or finance can use EoB‑generated proxies to evaluate algorithmic ideas before committing to expensive simulations or wet‑lab experiments.
  • Open‑source integration: Because EoB operates through standard LLM APIs and produces plain Python functions, it can be wrapped into CI pipelines or benchmark‑as‑a‑service platforms.
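Since the generated benchmarks are plain Python functions, a CI gate can sanity-check them before they enter a suite. This is a hypothetical harness, not something the paper specifies: it compiles a generated source string and verifies it defines a callable `benchmark` that returns a finite float on a probe point.

```python
import math

def validate_benchmark(source: str, dim: int = 3) -> bool:
    """Hypothetical CI gate: compile a generated benchmark program and
    check it returns a finite float on a probe point."""
    namespace: dict = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    f = namespace.get("benchmark")
    if not callable(f):
        return False
    value = f([0.0] * dim)
    return isinstance(value, float) and math.isfinite(value)

# A well-formed generated program passes; source missing the function fails.
generated = "def benchmark(x):\n    return float(sum(v * v for v in x))\n"
```

A real pipeline would also sandbox the `exec` call and add timeouts, since the source comes from an LLM.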

Limitations & Future Work

  • LLM dependency: The quality of generated benchmarks hinges on the underlying LLM’s code synthesis capabilities; outdated or smaller models may produce syntactically correct but mathematically trivial functions.
  • Computational cost of evaluation: Running multiple solvers on each candidate landscape is still expensive for high‑dimensional problems; smarter surrogate‑based evaluation could reduce this overhead.
  • Scope of problem domains: Current experiments focus on continuous, unconstrained spaces; extending EoB to combinatorial, constrained, or multi‑objective settings remains an open challenge.
  • Explainability: While the benchmarks are executable code, understanding why a particular landscape yields high differentiation is non‑trivial; future work could incorporate feature‑level introspection or symbolic analysis.

Overall, the paper demonstrates that large language models can move beyond code completion to become creative designers of scientific artifacts—here, the very testbeds that drive progress in black‑box optimization.

Authors

  • Chen Wang
  • Sijie Ma
  • Zeyuan Ma
  • Yue‑Jiao Gong

Paper Information

  • arXiv ID: 2601.21877v1
  • Categories: cs.NE
  • Published: January 29, 2026