[Paper] READY: Reward Discovery for Meta-Black-Box Optimization
Source: arXiv - 2601.21847v1
Overview
Meta‑Black‑Box Optimization (MetaBBO) aims to let reinforcement‑learning agents automatically design optimization algorithms that work well across many problems. So far, the reward signals guiding these agents have been hand‑crafted, which can inject bias and even enable “reward hacking.” This paper introduces READY, a framework that leverages large language models (LLMs) to discover reward functions automatically, improving both the effectiveness and efficiency of MetaBBO pipelines.
Key Contributions
- LLM‑driven reward discovery – Uses generative LLMs to propose, evaluate, and refine reward functions without human‑written specifications.
- Evolutionary search for rewards – Adapts the classic "evolution of heuristics" idea to iteratively improve reward programs; elitist selection ensures the best reward found so far never regresses.
- Multi‑task evolution architecture – Enables parallel discovery of rewards for several MetaBBO variants, allowing cross‑task knowledge transfer and faster convergence.
- Empirical validation – Demonstrates that rewards discovered by READY consistently boost the performance of existing MetaBBO methods on standard benchmark suites.
- Open‑source release – Provides a ready‑to‑run implementation (anonymous link) for reproducibility and community extension.
Methodology
- Prompt‑based reward generation – An LLM (e.g., GPT‑4) receives a description of the MetaBBO setting and a set of design constraints, then outputs candidate Python‑style reward functions.
- Evaluation loop – Each candidate reward is plugged into a MetaBBO training loop; the resulting optimizer’s performance on a validation set serves as the fitness score.
- Evolutionary refinement – The top‑k reward candidates are mutated (e.g., tweaking constants, replacing sub‑expressions) and recombined to form a new generation, mirroring genetic algorithms. This "evolution of heuristics" continues until performance plateaus.
- Multi‑task parallelism – Several MetaBBO tasks (different base optimizers, problem families) run their own evolutionary streams, but periodically exchange high‑performing reward snippets. This sharing accelerates learning by reusing useful sub‑components across tasks.
- Stopping criteria – The process halts when improvements fall below a threshold or a maximum number of generations is reached.
The pipeline is fully automated: developers only need to specify the problem domain and computational budget; READY handles reward synthesis, testing, and evolution.
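The generate–evaluate–refine loop above can be sketched in a few lines of Python. Everything here is illustrative: READY evolves full LLM-written reward programs, whereas this stand-in evolves only the constants of a fixed-shape reward (weighted improvement, diversity, and stagnation terms) against a synthetic fitness, so the loop runs without an LLM or a real MetaBBO training run.

```python
import random

def make_reward(w_improve, w_diverse, w_stall):
    # Shape of a candidate reward: a weighted mix of per-step signals.
    # In READY the whole function body is LLM-generated; here only the
    # constants evolve so the example stays self-contained.
    def reward(improvement, diversity, stalled_steps):
        return w_improve * improvement + w_diverse * diversity - w_stall * stalled_steps
    return reward

def fitness(weights):
    # Stand-in for "train a MetaBBO agent with this reward and score it on
    # a validation suite": a synthetic optimum keeps the loop cheap.
    target = (1.0, 0.3, 0.1)
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def mutate(weights, rng, scale=0.1):
    # The "tweak constants" mutation from the evolutionary refinement step.
    return tuple(w + rng.gauss(0.0, scale) for w in weights)

def evolve(generations=30, pop_size=12, top_k=4, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.uniform(0.0, 2.0) for _ in range(3)) for _ in range(pop_size)]
    for _ in range(generations):
        elites = sorted(pop, key=fitness, reverse=True)[:top_k]
        pop = elites + [mutate(rng.choice(elites), rng) for _ in range(pop_size - top_k)]
    # Elites are carried over unchanged, so the best candidate never regresses.
    return max(pop, key=fitness)

best_weights = evolve()
reward_fn = make_reward(*best_weights)
```

READY's stopping criteria (improvement threshold or generation cap) map onto the `generations` bound here; a real deployment would replace `fitness` with a full training-and-validation run per candidate.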
Results & Findings
- Performance uplift – Across three widely used MetaBBO baselines (e.g., RL‑based optimizer design, neural architecture search, hyper‑parameter tuning), READY‑generated rewards improved final objective values by 8–15 % on average compared to the hand‑crafted baselines.
- Convergence speed – Multi‑task evolution reduced the number of generations needed to reach a given performance level by roughly 30 %, thanks to cross‑task knowledge transfer.
- Robustness to bias – The discovered rewards exhibited less susceptibility to “reward hacking” (i.e., exploiting loopholes) because the evolutionary pressure directly optimizes downstream performance rather than proxy metrics.
- Ablation studies – Removing the evolutionary refinement step caused a drop of ~5 % in performance, confirming that iterative improvement is crucial. Disabling multi‑task sharing slowed convergence and yielded more variable results.
Practical Implications
- Faster optimizer prototyping – Developers can let READY auto‑design reward signals for new black‑box problems (e.g., tuning compiler flags, neural architecture search) instead of hand‑crafting them, cutting weeks of trial‑and‑error.
- Reduced human bias – By delegating reward creation to an LLM‑guided evolutionary loop, teams avoid unintentionally steering the RL agent toward suboptimal or unsafe behaviors.
- Plug‑and‑play integration – READY outputs standard Python functions, making them straightforward to drop into existing RL‑based MetaBBO pipelines (e.g., Ray Tune, Optuna).
- Scalable across domains – The multi‑task architecture means a single READY deployment can serve multiple product teams (e.g., cloud resource allocation, automated A/B testing) while sharing learned reward components.
- Potential for “reward marketplaces” – Companies could host repositories of high‑quality, LLM‑discovered rewards for specific industries, fostering community‑driven optimization improvements.
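To make the plug‑and‑play point concrete: a discovered reward is an ordinary Python callable, so integration amounts to passing it wherever the hand‑crafted reward used to go. The reward body, the trainer interface, and the trajectory below are all illustrative sketches, not code from READY or any named library.

```python
from typing import Callable

def discovered_reward(prev_best: float, new_best: float, budget_used: float) -> float:
    # Example shape of an LLM-discovered reward (constants are illustrative):
    # normalized improvement on a minimization problem, discounted by the
    # fraction of the evaluation budget already consumed.
    improvement = max(prev_best - new_best, 0.0)
    return improvement / (abs(prev_best) + 1e-12) * (1.0 - 0.5 * budget_used)

def train_metabbo_agent(reward_fn: Callable[[float, float, float], float],
                        steps: int = 5) -> float:
    # Stand-in training loop: feed a synthetic optimizer trajectory of
    # best-so-far objective values through the reward and accumulate the
    # return the RL agent would see.
    trajectory = [10.0, 6.0, 6.0, 3.5, 3.4, 3.4]
    total = 0.0
    for t in range(steps):
        total += reward_fn(trajectory[t], trajectory[t + 1], (t + 1) / steps)
    return total

ret = train_metabbo_agent(discovered_reward)
```

Swapping rewards is then a one-argument change: `train_metabbo_agent(other_reward)`.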
Limitations & Future Work
- LLM dependence – The quality of initial reward candidates hinges on the underlying LLM; smaller or less‑capable models may generate noisy or unsafe code.
- Compute cost – Running full MetaBBO training loops for each candidate reward is expensive; the authors mitigate this with parallelism but the approach still demands substantial GPU/CPU resources.
- Generalization – While cross‑task sharing helps, rewards discovered on one benchmark suite may not transfer perfectly to radically different problem families (e.g., discrete combinatorial vs. continuous control).
- Safety checks – The current pipeline lacks formal verification of generated reward code, leaving room for runtime errors or unintended side effects.
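On the missing safety checks, a pragmatic first line of defense (our sketch, not part of READY) is to load each candidate in a stripped-down namespace and smoke-test it on representative inputs before spending any training budget on it. This rejects candidates that crash or return the wrong type, though it is a smoke test, not a real sandbox or formal verification.

```python
# Hypothetical LLM output: a reward program as a string of Python source.
CANDIDATE_SRC = """
def reward(improvement, diversity, stalled_steps):
    return 2.0 * improvement + 0.1 * diversity - 0.05 * stalled_steps
"""

def load_candidate(src):
    # Expose only a few harmless builtins to the candidate's namespace.
    # Note: restricting __builtins__ deters accidents, not determined abuse.
    namespace = {"__builtins__": {"abs": abs, "max": max, "min": min}}
    try:
        exec(compile(src, "<candidate>", "exec"), namespace)
        fn = namespace.get("reward")
        if not callable(fn):
            return None
        # Smoke test: must return a number on representative dummy inputs.
        out = fn(1.0, 0.5, 3)
        if not isinstance(out, (int, float)):
            return None
        return fn
    except Exception:
        return None  # reject candidates that crash at load or call time

fn = load_candidate(CANDIDATE_SRC)
```

A candidate that references an undefined name, raises, or defines no `reward` function is silently filtered out instead of poisoning the evaluation loop.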
Future directions include integrating lightweight surrogate models to estimate reward fitness, incorporating formal program analysis for safety, and extending READY to co‑evolve both the optimizer policy and its reward simultaneously.
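The surrogate-fitness direction can be sketched simply: cache (features, measured fitness) pairs from full evaluations and predict a new candidate's fitness from its nearest cached neighbors, reserving expensive MetaBBO training runs for candidates the surrogate ranks highly. Encoding candidates as numeric feature vectors and using k-NN are our assumptions for illustration, not choices from the paper.

```python
import math

class KNNSurrogate:
    """Predicts candidate-reward fitness from previously evaluated neighbors.

    A cheap stand-in for a learned surrogate; candidates are assumed to be
    encoded as numeric feature vectors (an illustrative assumption).
    """

    def __init__(self, k=3):
        self.k = k
        self.memory = []  # (feature_vector, measured_fitness) pairs

    def observe(self, features, fitness):
        # Record the outcome of a full (expensive) MetaBBO evaluation.
        self.memory.append((tuple(features), float(fitness)))

    def predict(self, features):
        # Average the fitness of the k nearest evaluated candidates.
        if not self.memory:
            return 0.0
        by_distance = sorted(
            (math.dist(features, cached), fit) for cached, fit in self.memory
        )
        nearest = by_distance[: self.k]
        return sum(fit for _, fit in nearest) / len(nearest)

def worth_full_evaluation(surrogate, features, threshold):
    # Gate: only spend a real training run on candidates the surrogate likes.
    return surrogate.predict(features) >= threshold
```

The gate trades a little selection accuracy for a large cut in training runs, directly targeting the compute-cost limitation noted above.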
Authors
- Zechuan Huang
- Zhiguang Cao
- Hongshu Guo
- Yue‑Jiao Gong
- Zeyuan Ma
Paper Information
- arXiv ID: 2601.21847v1
- Categories: cs.LG, cs.NE
- Published: January 29, 2026