[Paper] READY: Reward Discovery for Meta-Black-Box Optimization
Source: arXiv - 2601.21847v1
Overview
Meta‑Black‑Box Optimization (MetaBBO) aims to let reinforcement‑learning agents automatically design optimization algorithms that work well across many problems. So far, the reward signals guiding these agents have been hand‑crafted, which can inject bias and even enable “reward hacking.” This paper introduces READY, a framework that leverages large language models (LLMs) to discover reward functions automatically, improving both the effectiveness and efficiency of MetaBBO pipelines.
Key Contributions
- LLM‑driven reward discovery – Uses generative LLMs to propose, evaluate, and refine reward functions without human‑written specifications.
- Evolutionary search for rewards – Adapts the classic "evolution of heuristics" idea to iteratively improve reward programs; elitist selection ensures the best reward found so far never regresses.
- Multi‑task evolution architecture – Enables parallel discovery of rewards for several MetaBBO variants, allowing cross‑task knowledge transfer and faster convergence.
- Empirical validation – Demonstrates that rewards discovered by READY consistently boost the performance of existing MetaBBO methods on standard benchmark suites.
- Open‑source release – Provides a ready‑to‑run implementation (anonymous link) for reproducibility and community extension.
Methodology
- Prompt‑based reward generation – An LLM (e.g., GPT‑4) receives a description of the MetaBBO setting and a set of design constraints, then outputs candidate Python‑style reward functions.
- Evaluation loop – Each candidate reward is plugged into a MetaBBO training loop; the resulting optimizer’s performance on a validation set serves as the fitness score.
- Evolutionary refinement – The top‑k reward candidates are mutated (e.g., tweaking constants, replacing sub‑expressions) and recombined to form a new generation, mirroring genetic algorithms. This "evolution of heuristics" continues until performance plateaus.
- Multi‑task parallelism – Several MetaBBO tasks (different base optimizers, problem families) run their own evolutionary streams, but periodically exchange high‑performing reward snippets. This sharing accelerates learning by reusing useful sub‑components across tasks.
- Stopping criteria – The process halts when improvements fall below a threshold or a maximum number of generations is reached.
The pipeline is fully automated: developers only need to specify the problem domain and computational budget; READY handles reward synthesis, testing, and evolution.
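The generate–evaluate–refine loop above can be sketched in a few lines of Python. Everything here is illustrative: READY evolves full LLM-written reward programs, whereas this stand-in evolves only the constants of a fixed-shape reward (weighted improvement, diversity, and stagnation terms) against a synthetic fitness, so the loop runs without an LLM or a real MetaBBO training run.

```python
import random

def make_reward(w_improve, w_diverse, w_stall):
    # Shape of a candidate reward: a weighted mix of per-step signals.
    # In READY the whole function body is LLM-generated; here only the
    # constants evolve so the example stays self-contained.
    def reward(improvement, diversity, stalled_steps):
        return w_improve * improvement + w_diverse * diversity - w_stall * stalled_steps
    return reward

def fitness(weights):
    # Stand-in for "train a MetaBBO agent with this reward and score it on
    # a validation suite": a synthetic optimum keeps the loop cheap.
    target = (1.0, 0.3, 0.1)
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def mutate(weights, rng, scale=0.1):
    # The "tweak constants" mutation from the evolutionary refinement step.
    return tuple(w + rng.gauss(0.0, scale) for w in weights)

def evolve(generations=30, pop_size=12, top_k=4, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.uniform(0.0, 2.0) for _ in range(3)) for _ in range(pop_size)]
    for _ in range(generations):
        elites = sorted(pop, key=fitness, reverse=True)[:top_k]
        pop = elites + [mutate(rng.choice(elites), rng) for _ in range(pop_size - top_k)]
    # Elites are carried over unchanged, so the best candidate never regresses.
    return max(pop, key=fitness)

best_weights = evolve()
reward_fn = make_reward(*best_weights)
```

READY's stopping criteria (improvement threshold or generation cap) map onto the `generations` bound here; a real deployment would replace `fitness` with a full training-and-validation run per candidate.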
Results & Findings
- Performance uplift – Across three widely used MetaBBO baselines (e.g., RL‑based optimizer design, neural architecture search, hyper‑parameter tuning), READY‑generated rewards improved final objective values by 8–15 % on average compared to the hand‑crafted baselines.
- Convergence speed – Multi‑task evolution reduced the number of generations needed to reach a given performance level by roughly 30 %, thanks to cross‑task knowledge transfer.
- Robustness to bias – The discovered rewards exhibited less susceptibility to “reward hacking” (i.e., exploiting loopholes) because the evolutionary pressure directly optimizes downstream performance rather than proxy metrics.
- Ablation studies – Removing the evolutionary refinement step caused a drop of ~5 % in performance, confirming that iterative improvement is crucial. Disabling multi‑task sharing slowed convergence and yielded more variable results.
Practical Implications
- Faster optimizer prototyping – Developers can let READY auto‑design reward signals for new black‑box problems (e.g., tuning compiler flags, neural architecture search) instead of hand‑crafting them, cutting weeks of trial‑and‑error.
- Reduced human bias – By delegating reward creation to an LLM‑guided evolutionary loop, teams avoid unintentionally steering the RL agent toward suboptimal or unsafe behaviors.
- Plug‑and‑play integration – READY outputs standard Python functions, making them straightforward to drop into existing RL‑based MetaBBO pipelines (e.g., Ray Tune, Optuna).
- Scalable across domains – The multi‑task architecture means a single READY deployment can serve multiple product teams (e.g., cloud resource allocation, automated A/B testing) while sharing learned reward components.
- Potential for “reward marketplaces” – Companies could host repositories of high‑quality, LLM‑discovered rewards for specific industries, fostering community‑driven optimization improvements.
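To make the plug‑and‑play point concrete: a discovered reward is an ordinary Python callable, so integration amounts to passing it wherever the hand‑crafted reward used to go. The reward body, the trainer interface, and the trajectory below are all illustrative sketches, not code from READY or any named library.

```python
from typing import Callable

def discovered_reward(prev_best: float, new_best: float, budget_used: float) -> float:
    # Example shape of an LLM-discovered reward (constants are illustrative):
    # normalized improvement on a minimization problem, discounted by the
    # fraction of the evaluation budget already consumed.
    improvement = max(prev_best - new_best, 0.0)
    return improvement / (abs(prev_best) + 1e-12) * (1.0 - 0.5 * budget_used)

def train_metabbo_agent(reward_fn: Callable[[float, float, float], float],
                        steps: int = 5) -> float:
    # Stand-in training loop: feed a synthetic optimizer trajectory of
    # best-so-far objective values through the reward and accumulate the
    # return the RL agent would see.
    trajectory = [10.0, 6.0, 6.0, 3.5, 3.4, 3.4]
    total = 0.0
    for t in range(steps):
        total += reward_fn(trajectory[t], trajectory[t + 1], (t + 1) / steps)
    return total

ret = train_metabbo_agent(discovered_reward)
```

Swapping rewards is then a one-argument change: `train_metabbo_agent(other_reward)`.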
Limitations & Future Work
- LLM dependence – The quality of initial reward candidates hinges on the underlying LLM; smaller or less‑capable models may generate noisy or unsafe code.
- Compute cost – Running full MetaBBO training loops for each candidate reward is expensive; the authors mitigate this with parallelism but the approach still demands substantial GPU/CPU resources.
- Generalization – While cross‑task sharing helps, rewards discovered on one benchmark suite may not transfer perfectly to radically different problem families (e.g., discrete combinatorial vs. continuous control).
- Safety checks – The current pipeline lacks formal verification of generated reward code, leaving room for runtime errors or unintended side effects.
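On the missing safety checks, a pragmatic first line of defense (our sketch, not part of READY) is to load each candidate in a stripped-down namespace and smoke-test it on representative inputs before spending any training budget on it. This rejects candidates that crash or return the wrong type, though it is a smoke test, not a real sandbox or formal verification.

```python
# Hypothetical LLM output: a reward program as a string of Python source.
CANDIDATE_SRC = """
def reward(improvement, diversity, stalled_steps):
    return 2.0 * improvement + 0.1 * diversity - 0.05 * stalled_steps
"""

def load_candidate(src):
    # Expose only a few harmless builtins to the candidate's namespace.
    # Note: restricting __builtins__ deters accidents, not determined abuse.
    namespace = {"__builtins__": {"abs": abs, "max": max, "min": min}}
    try:
        exec(compile(src, "<candidate>", "exec"), namespace)
        fn = namespace.get("reward")
        if not callable(fn):
            return None
        # Smoke test: must return a number on representative dummy inputs.
        out = fn(1.0, 0.5, 3)
        if not isinstance(out, (int, float)):
            return None
        return fn
    except Exception:
        return None  # reject candidates that crash at load or call time

fn = load_candidate(CANDIDATE_SRC)
```

A candidate that references an undefined name, raises, or defines no `reward` function is silently filtered out instead of poisoning the evaluation loop.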
Future directions include integrating lightweight surrogate models to estimate reward fitness, incorporating formal program analysis for safety, and extending READY to co‑evolve both the optimizer policy and its reward simultaneously.
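The surrogate-fitness direction can be sketched simply: cache (features, measured fitness) pairs from full evaluations and predict a new candidate's fitness from its nearest cached neighbors, reserving expensive MetaBBO training runs for candidates the surrogate ranks highly. Encoding candidates as numeric feature vectors and using k-NN are our assumptions for illustration, not choices from the paper.

```python
import math

class KNNSurrogate:
    """Predicts candidate-reward fitness from previously evaluated neighbors.

    A cheap stand-in for a learned surrogate; candidates are assumed to be
    encoded as numeric feature vectors (an illustrative assumption).
    """

    def __init__(self, k=3):
        self.k = k
        self.memory = []  # (feature_vector, measured_fitness) pairs

    def observe(self, features, fitness):
        # Record the outcome of a full (expensive) MetaBBO evaluation.
        self.memory.append((tuple(features), float(fitness)))

    def predict(self, features):
        # Average the fitness of the k nearest evaluated candidates.
        if not self.memory:
            return 0.0
        by_distance = sorted(
            (math.dist(features, cached), fit) for cached, fit in self.memory
        )
        nearest = by_distance[: self.k]
        return sum(fit for _, fit in nearest) / len(nearest)

def worth_full_evaluation(surrogate, features, threshold):
    # Gate: only spend a real training run on candidates the surrogate likes.
    return surrogate.predict(features) >= threshold
```

The gate trades a little selection accuracy for a large cut in training runs, directly targeting the compute-cost limitation noted above.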
Authors
- Zechuan Huang
- Zhiguang Cao
- Hongshu Guo
- Yue‑Jiao Gong
- Zeyuan Ma
Paper Information
- arXiv ID: 2601.21847v1
- Categories: cs.LG, cs.NE
- Published: January 29, 2026