[Paper] Code World Models for Parameter Control in Evolutionary Algorithms
Source: arXiv - 2602.22260v1
Overview
The paper Code World Models for Parameter Control in Evolutionary Algorithms explores whether large language models (LLMs) can learn the inner workings of an optimizer and then steer it to better performance. By teaching an LLM to synthesize a tiny Python “world model” that predicts how a simple evolutionary algorithm will behave, the authors show that greedy planning on this model can automatically pick the right mutation strength—without ever seeing an optimal‑policy run.
Key Contributions
- Extension of Code World Models (CWMs) from deterministic games to stochastic combinatorial optimization problems.
- LLM‑generated simulators of the (1+1)-RLS_k optimizer that predict state transitions given a mutation strength k.
- Greedy planning over the learned simulator to select k on the fly, achieving near‑optimal control policies.
- Empirical validation on classic benchmarks (LeadingOnes, OneMax, Jump_k, NK‑Landscapes) showing:
  - ≤ 6 % performance gap to the theoretical optimum on deterministic problems.
  - 100 % success on the deceptive Jump_k problem where adaptive baselines completely fail.
  - Statistically significant improvements over state‑of‑the‑art baselines (including DQN) on NK‑Landscapes.
- Sample‑efficiency and robustness: only 200 offline trajectories needed versus 500 online episodes for DQN; consistent synthesis across multiple random seeds.
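For readers unfamiliar with the controlled algorithm: (1+1)-RLS_k keeps a single solution, flips exactly k distinct bits per iteration, and accepts the offspring only if it is at least as fit. A minimal sketch on the OneMax benchmark (the function names and loop structure here are illustrative, not the paper's code):

```python
import random

def onemax(x):
    """OneMax fitness: the number of ones in the bitstring."""
    return sum(x)

def rls_k_step(x, k, fitness=onemax, rng=random):
    """One (1+1)-RLS_k iteration: flip k distinct bits, keep the
    offspring only if it is at least as fit as the parent (elitism)."""
    offspring = x[:]
    for i in rng.sample(range(len(x)), k):
        offspring[i] = 1 - offspring[i]
    return offspring if fitness(offspring) >= fitness(x) else x

def run_rls_k(n=50, k=1, max_iters=10_000, seed=0):
    """Run until the optimum is found; return the iteration count."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    for t in range(max_iters):
        if onemax(x) == n:
            return t
        x = rls_k_step(x, k, rng=rng)
    return max_iters
```

Parameter control asks which k to use at each step; the fixed-k runner above is the sub‑optimal baseline that such a controller improves on.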
Methodology
- Data Collection – Run a sub‑optimal version of the (1+1)-RLS_k algorithm on a target problem and record short trajectories (state, chosen k, resulting fitness).
- Prompt Engineering – Feed these trajectories to a large language model (e.g., GPT‑4) together with a short natural‑language description of the task. The prompt asks the model to write a Python function that, given a current solution and a candidate k, predicts the next solution’s fitness.
- World Model Synthesis – The LLM outputs executable Python code (the “Code World Model”). The code is automatically tested on a held‑out set of trajectories; if it passes, it becomes the simulator.
- Greedy Planning – At each iteration of the real optimizer, the simulator is queried for each admissible k. The algorithm picks the k that the simulator predicts will give the highest immediate fitness gain (a one‑step look‑ahead).
- Evaluation – Compare the CWM‑guided optimizer against:
- Theoretical optimal policies (where known).
- Classic adaptive schemes (e.g., self‑adjusting mutation rates).
- Model‑free reinforcement learning (DQN).
All steps are fully offline: the LLM never sees the optimal policy, only the noisy sub‑optimal data.
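The greedy planning step can be illustrated with a hand‑written stand‑in for the synthesized simulator. In the paper the simulator body is LLM‑generated Python; `cwm_predict` below is a hypothetical toy model of expected OneMax progress, used only to show how the one‑step look‑ahead consumes a world model:

```python
def cwm_predict(fitness, k, n):
    """Stand-in for an LLM-generated world model: predict the fitness
    after one k-bit-flip step on OneMax of length n. A flipped bit helps
    if it was a zero; this toy model uses the expected net gain and,
    mirroring elitism, never predicts a fitness loss."""
    p_zero = (n - fitness) / n                 # chance a flipped bit is a zero
    expected_gain = k * p_zero - k * (1 - p_zero)
    return fitness + max(expected_gain, 0.0)

def greedy_k(fitness, n, k_values=(1, 2, 3, 4)):
    """One-step look-ahead: pick the k with the best predicted fitness."""
    return max(k_values, key=lambda k: cwm_predict(fitness, k, n))
```

Early in a run (many zeros left) the toy model favors aggressive mutation; near the optimum it predicts no gain for any k and the planner falls back to the smallest candidate.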
Results & Findings
| Benchmark | Baseline (best) | CWM‑Greedy | Gap to Optimum |
|---|---|---|---|
| LeadingOnes | Adaptive k ≈ 0.94 | 0.96 | ≤ 6 % |
| OneMax | Adaptive k ≈ 0.98 | 0.99 | ≤ 6 % |
| Jump_k (deceptive) | 0 % success (all adaptive) | 100 % | — |
| NK‑Landscape (15 instances) | Avg. fitness 36.32 | 36.94 (p < 0.001) | — |
Additional observations:
- Sample efficiency – CWM needed only 200 offline trajectories to outperform DQN trained on 500 online episodes.
- Generalization – A model trained on k = 3 transferred to unseen k values with 78 % success, whereas DQN failed completely.
- Stability – Re‑running the whole pipeline five times produced virtually identical world models and performance curves.
Practical Implications
- Plug‑and‑play optimizer tuning – Developers can hand a few dozen runs of a baseline evolutionary algorithm to an LLM and receive a ready‑to‑use Python module that automatically selects mutation strengths in real time.
- Reduced need for hand‑crafted heuristics – Traditional parameter‑control strategies often require problem‑specific analysis; CWMs learn directly from data, lowering engineering effort.
- Accelerated prototyping – Because the approach works offline, teams can experiment with new fitness landscapes (e.g., custom hardware design spaces, hyper‑parameter search for ML models) without costly online RL training loops.
- Safety and interpretability – The generated code is human‑readable Python, allowing engineers to audit the model’s assumptions before deployment—unlike opaque neural policies.
- Potential for broader meta‑optimization – The same pipeline could be adapted to control other algorithmic knobs (population size, crossover rates, cooling schedules) across a wide range of stochastic optimization frameworks.
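As a concrete illustration of the auditability point: because the world model is plain Python, a team can gate deployment on a held‑out replay check. Everything below is a hypothetical sketch, not the paper's harness: `predict_next_fitness` stands in for the LLM‑generated function, and the trajectory tuple format is an assumption.

```python
def predict_next_fitness(fitness, k, n):
    """Placeholder simulator; in practice this body is LLM-generated."""
    return min(fitness + 1, n) if k == 1 else fitness

def accept_world_model(predict, holdout, tolerance=1.0):
    """holdout: list of (fitness, k, n, observed_next_fitness) tuples
    from recorded runs. Accept the synthesized model only if its mean
    absolute prediction error stays within tolerance."""
    errors = [abs(predict(f, k, n) - nxt) for f, k, n, nxt in holdout]
    return sum(errors) / len(errors) <= tolerance

holdout = [(10, 1, 20, 11), (11, 1, 20, 12), (12, 2, 20, 12)]
ok = accept_world_model(predict_next_fitness, holdout)
```

A rejected model simply triggers re‑prompting; nothing opaque ships, and the acceptance threshold itself is something an engineer can read and argue about.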
Limitations & Future Work
- Dependence on LLM quality – The fidelity of the world model hinges on the underlying language model; smaller or older models may produce buggy simulators.
- Scalability to high‑dimensional control spaces – The current greedy planner evaluates a modest set of discrete k values; extending to continuous or multi‑dimensional parameter spaces may require more sophisticated planning (e.g., Monte‑Carlo Tree Search).
- Assumption of stationary dynamics – The method presumes that the optimizer’s transition dynamics do not change dramatically during the run; highly non‑stationary problems could degrade performance.
- Benchmark coverage – Experiments focus on classic synthetic problems; real‑world industrial benchmarks (e.g., circuit layout, neural architecture search) remain to be tested.
Future research directions suggested by the authors include: integrating uncertainty quantification into the generated code, exploring hierarchical world models for multi‑level algorithm control, and coupling CWMs with online fine‑tuning to handle drifting problem instances.
Authors
- Camilo Chacón Sartori
- Guillem Rodríguez Corominas
Paper Information
- arXiv ID: 2602.22260v1
- Categories: cs.LG, cs.NE
- Published: February 25, 2026