[Paper] SMOG: Scalable Meta-Learning for Multi-Objective Bayesian Optimization
Source: arXiv - 2601.22131v1
Overview
The paper introduces SMOG, a new meta‑learning framework that equips multi‑objective Bayesian optimization (MOBO) with a scalable, data‑driven prior. By leveraging historical data from related optimization problems, SMOG can “warm‑start” the search for Pareto‑optimal solutions, dramatically cutting the number of expensive black‑box evaluations needed in real‑world engineering and ML pipelines.
Key Contributions
- Unified meta‑learning + MOBO model – First method that learns a joint Gaussian‑process (GP) prior across many past tasks and multiple objectives simultaneously.
- Correlation‑aware multi‑output GP – Explicitly captures statistical dependencies between objectives, improving surrogate fidelity on the target problem.
- Closed‑form target prior with residual kernel – After conditioning on task metadata, SMOG produces an analytically tractable prior plus a flexible residual kernel that adapts to the new task.
- Scalable hierarchical training – Meta‑task GPs are trained once, cached, and reused, giving linear time‑complexity in the number of meta‑tasks.
- Plug‑and‑play with existing MOBO acquisition functions – No custom acquisition is required; SMOG’s surrogate can be dropped into standard tools such as Expected Hypervolume Improvement (EHVI).
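The correlation‑aware multi‑output kernel is only summarised at a high level here; a common construction for correlated objectives, which the paper's kernel may resemble, is the intrinsic coregionalization model (a sketch, not the paper's exact formula):

```latex
k\big((\mathbf{x}, i),\, (\mathbf{x}', j)\big) = B_{ij}\, k_{\mathrm{input}}(\mathbf{x}, \mathbf{x}'),
\qquad B = W W^{\top} + \operatorname{diag}(\mathbf{v}),
```

where $i, j$ index objectives and the positive semi‑definite matrix $B$ encodes cross‑objective correlations (e.g., a strong accuracy–latency coupling would show up as a large off‑diagonal entry).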
Methodology
- Meta‑task collection – Gather a set of related optimization problems (e.g., tuning hyper‑parameters for different datasets). Each meta‑task provides a small set of input‑output pairs for all objectives.
- Multi‑output GP construction – Build a joint GP that models all objectives together, using a kernel that factorises into:
- A metadata kernel that ties together tasks sharing similar descriptors (e.g., dataset size, hardware specs).
- A residual multi‑output kernel that captures task‑specific nuances not explained by metadata.
- Conditioning on metadata – When a new target task arrives, its metadata is plugged into the GP. The model analytically integrates out uncertainty over the metadata, yielding a closed‑form prior for the target surrogate.
- Hierarchical training –
- Stage 1: Fit independent GPs for each meta‑task (parallelizable).
- Stage 2: Learn the hyper‑parameters of the metadata and residual kernels jointly, using the cached stage‑1 posteriors. This step scales linearly with the number of meta‑tasks.
- Optimization loop – Use the resulting surrogate inside any standard MOBO acquisition function (e.g., EHVI, Pareto‑frontier entropy). The acquisition selects the next black‑box evaluation, the data are added to the surrogate, and the loop repeats.
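The steps above can be sketched end to end in plain NumPy. The factorised kernel (metadata kernel × input kernel + residual), the GP posterior, and the acquisition loop below are hypothetical stand‑ins with made‑up lengthscales; in particular, a random‑scalarised UCB score replaces EHVI purely to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, ls):
    # Squared-exponential kernel between row-stacked inputs A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) / ls) ** 2
    return np.exp(-0.5 * d2.sum(-1))

def smog_kernel(X1, M1, X2, M2):
    # Factorised structure from the paper: a metadata kernel ties tasks
    # together, a residual input kernel captures task-specific structure.
    # Lengthscales and the 0.1 residual weight are illustrative only.
    return rbf(M1, M2, ls=1.0) * rbf(X1, X2, ls=0.5) + 0.1 * rbf(X1, X2, ls=0.2)

def gp_posterior(Xtr, Mtr, ytr, Xte, Mte, noise=1e-4):
    # Standard GP conditioning under the factorised kernel.
    K = smog_kernel(Xtr, Mtr, Xtr, Mtr) + noise * np.eye(len(Xtr))
    Ks = smog_kernel(Xte, Mte, Xtr, Mtr)
    Kss = smog_kernel(Xte, Mte, Xte, Mte)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    v = np.linalg.solve(L, Ks.T)
    var = np.clip(np.diag(Kss) - (v ** 2).sum(0), 1e-12, None)
    return Ks @ alpha, var

# Two toy objectives to minimise on [0, 1]; constant metadata for the target task.
f1 = lambda x: np.sin(3 * x[:, 0])
f2 = lambda x: (x[:, 0] - 0.6) ** 2
X = rng.uniform(size=(5, 1)); M = np.ones((5, 1))
Y = np.stack([f1(X), f2(X)], axis=1)

cand = rng.uniform(size=(256, 1)); Mc = np.ones((256, 1))
for step in range(5):
    # Random-scalarised UCB as a simple stand-in for EHVI.
    w = rng.dirichlet([1.0, 1.0])
    score = np.zeros(len(cand))
    for j in range(2):
        mu, var = gp_posterior(X, M, Y[:, j], cand, Mc)
        score += w[j] * (-mu + 2.0 * np.sqrt(var))  # minimise both objectives
    nxt = cand[[np.argmax(score)]]
    X = np.vstack([X, nxt]); M = np.vstack([M, [[1.0]]])
    Y = np.vstack([Y, [[f1(nxt)[0], f2(nxt)[0]]]])

print(X.shape)  # (10, 1) after five acquisitions
```

Because the surrogate only exposes a posterior mean and variance, the acquisition line is the single place where a proper EHVI implementation would slot in, which is the plug‑and‑play property the paper emphasises.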
Results & Findings
| Experiment | Baseline | SMOG (meta‑learned) | Reported gain |
|---|---|---|---|
| Synthetic 2‑objective benchmark (30 meta‑tasks) | Standard MOBO (no prior) | SMOG‑augmented MOBO | ~2.5× fewer evaluations to reach 90 % hypervolume |
| Hyper‑parameter tuning of a multi‑objective NN (accuracy vs. latency) across 10 datasets | Random search + MOBO | SMOG‑MOBO | 40 % reduction in total GPU hours |
| Real‑world engineering design (weight vs. strength) with 5 historic designs | Evolutionary MOEA | SMOG‑MOBO | Converged to Pareto front in half the budget |
Key take‑aways
- Meta‑learning the prior consistently reduces the number of expensive evaluations needed to approximate the Pareto front.
- The correlation‑aware kernel improves surrogate accuracy, especially when objectives are strongly coupled (e.g., accuracy vs. latency).
- Training time grows linearly with the number of meta‑tasks, confirming the claimed scalability.
Practical Implications
- Faster hyper‑parameter sweeps for multi‑objective ML models (e.g., balancing accuracy, inference time, and memory).
- Accelerated engineering design cycles where simulations are costly (CFD, structural analysis) and multiple performance metrics must be optimized.
- Continuous improvement pipelines: as new tasks are solved, their data automatically enrich the meta‑learning pool, making future optimizations progressively cheaper.
- Easy integration: Since SMOG outputs a standard GP posterior, existing BO libraries (BoTorch, GPyOpt, Emukit) can consume it without code changes.
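To illustrate that plug‑in property: any surrogate exposing a standard Gaussian posterior can back the same acquisition code. The sketch below uses a hypothetical `posterior(x)` interface and closed‑form expected improvement rather than any specific library's API:

```python
import math

# Hypothetical surrogate interface: any model (SMOG or a plain GP)
# that exposes a Gaussian posterior can back the same acquisition code.
class Surrogate:
    def __init__(self, mean, std):
        self._mean, self._std = mean, std
    def posterior(self, x):
        return self._mean(x), self._std(x)

def expected_improvement(model, x, best, xi=0.01):
    # Closed-form EI for minimisation under a Gaussian posterior.
    mu, sigma = model.posterior(x)
    if sigma < 1e-12:
        return 0.0
    z = (best - mu - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu - xi) * cdf + sigma * pdf

# Toy posterior: mean rises with x, uncertainty constant, so for
# minimisation the smallest x should score highest.
model = Surrogate(mean=lambda x: x, std=lambda x: 0.5)
scores = {x: expected_improvement(model, x, best=0.3) for x in (0.0, 0.5, 1.0)}
print(max(scores, key=scores.get))  # 0.0
```

Swapping in a SMOG surrogate would only change how `posterior` is computed, not the acquisition code, which is why existing BO libraries can consume it.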
Limitations & Future Work
- Metadata quality dependence – The approach assumes informative, low‑dimensional descriptors for each task; poor metadata can degrade the prior.
- Gaussian‑process scalability – Although meta‑training is linear, each GP still incurs cubic cost in its own data size; extremely large per‑task datasets may need sparse GP approximations.
- Limited empirical scope – Experiments focus on up to ~30 meta‑tasks; scaling to hundreds or thousands remains to be demonstrated.
- Future directions suggested by the authors include: extending SMOG to non‑Gaussian likelihoods (e.g., classification), exploring deep kernel learning for richer representations, and applying the framework to reinforcement‑learning policy search where objectives like reward and safety conflict.
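The sparse‑GP remedy mentioned for the cubic‑cost limitation is standard; a minimal Nyström‑style sketch (not from the paper) shows the idea of approximating an n × n kernel matrix with m ≪ n inducing points:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(A, B, ls=0.5):
    # Squared-exponential kernel between row-stacked inputs A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) / ls) ** 2
    return np.exp(-0.5 * d2.sum(-1))

# Exact GP inference on n points costs O(n^3); the Nystrom approximation
# K ~= K_nm K_mm^{-1} K_mn works with m << n inducing points instead.
n, m = 500, 50
X = rng.uniform(size=(n, 1))
Z = X[rng.choice(n, size=m, replace=False)]  # inducing inputs

K = rbf(X, X)
Knm = rbf(X, Z)
Kmm = rbf(Z, Z) + 1e-6 * np.eye(m)  # jitter for numerical stability
K_approx = Knm @ np.linalg.solve(Kmm, Knm.T)

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(f"relative error: {rel_err:.2e}")
```

For a smooth kernel in low dimensions the approximation error is typically small, while downstream solves drop from O(n³) to O(nm²), which is the kind of trade‑off a scaled‑up SMOG would rely on.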
Authors
- Leonard Papenmeier
- Petru Tighineanu
Paper Information
- arXiv ID: 2601.22131v1
- Categories: cs.LG
- Published: January 29, 2026