[Paper] An exploration for higher efficiency in multi objective optimisation with reinforcement learning
Source: arXiv - 2512.10208v1
Overview
The paper investigates how to make multi‑objective optimization (MOO) faster and more effective by letting a reinforcement‑learning (RL) agent choose among a pool of search operators instead of relying on a single, hand‑crafted move. While operator selection has been explored for single‑objective problems, the author proposes a generalized multi‑objective RL framework that learns to sequence operators on the fly, aiming to improve convergence speed and solution quality in complex, real‑world trade‑off scenarios.
Key Contributions
- Operator‑pool paradigm for MOO: Extends the idea of using multiple neighborhood operators to multi‑objective problems, where the right sequence can dramatically affect Pareto front quality.
- Multi‑objective reinforcement‑learning formulation: Casts operator selection as a Markov Decision Process (MDP) with vector‑valued rewards, enabling the agent to balance competing objectives during learning (formalized in the sketch after this list).
- Modular architecture: Defines clear stages (state representation, reward shaping, policy learning, and integration with existing MOO algorithms) that can be swapped or extended.
- Preliminary empirical validation: Demonstrates on benchmark MOO test‑beds (e.g., ZDT, DTLZ) that the RL‑guided operator selection reaches comparable Pareto fronts with fewer evaluations than baseline evolutionary algorithms.
- Roadmap for future phases: Outlines unfinished components (e.g., online adaptation, scalability to high‑dimensional decision spaces) to guide subsequent research.
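The block below writes the operator‑selection MDP mentioned above in compact form, assuming the vector reward collects hypervolume, spread, and convergence improvements and is scalarized by a fixed weight vector; the symbols are illustrative notation, not necessarily the paper's own.

```latex
% Operator-selection MDP with a vector-valued reward (illustrative notation).
\begin{aligned}
&\text{MDP: } (\mathcal{S}, \mathcal{A}, P, \mathbf{r}, \gamma), \qquad
  \mathcal{A} = \{o_1, \dots, o_K\} \quad \text{(operator pool)}\\[2pt]
&\mathbf{r}(s_t, a_t) =
  \bigl(\Delta \mathrm{HV}_t,\; \Delta \mathrm{Spread}_t,\; \Delta \mathrm{Conv}_t\bigr)
  \in \mathbb{R}^{3}\\[2pt]
&r_t = \mathbf{w}^{\top}\mathbf{r}(s_t, a_t), \qquad
  w_i \ge 0,\; \textstyle\sum_i w_i = 1\\[2pt]
&\pi^{*} = \arg\max_{\pi}\;
  \mathbb{E}_{\pi}\Bigl[\textstyle\sum_{t=0}^{T} \gamma^{t}\, r_t\Bigr]
\end{aligned}
```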
Methodology
- State Definition: The RL agent observes a compact representation of the current search status—typically a set of statistics about the population’s spread, diversity, and recent improvement rates across objectives.
- Action Space: Each action corresponds to invoking a specific neighborhood operator (e.g., mutation, crossover, local search) from the pre‑defined pool.
- Reward Signal: A multi‑dimensional reward is constructed from improvements in hypervolume, spread, and convergence metrics after applying an operator. The paper uses a weighted sum to convert this vector into a scalar so that standard RL update rules apply, with the weights encoding the relative emphasis placed on each quality indicator.
- Learning Algorithm: A policy‑gradient method (e.g., REINFORCE) or a Q‑learning variant updates the policy that maps states to operator‑selection probabilities. Training proceeds alongside the optimization run, so the agent learns online during the search rather than in a separate offline phase.
- Integration with MOO Solver: The RL controller wraps around a baseline multi‑objective evolutionary algorithm (MOEA), replacing the static operator‑selection step with the learned policy; a minimal, self‑contained sketch of this wrapping follows this list.
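A minimal sketch of the controller‑wraps‑solver idea, under several simplifying assumptions that are not the paper's exact design: a tabular epsilon‑greedy Q‑learning agent, a state defined only by a coarse hypervolume bucket, a reward equal to the hypervolume improvement (rather than the full weighted indicator vector), and a crude sum‑of‑objectives survivor selection in place of a real MOEA such as NSGA‑II.

```python
# Sketch (not the paper's code): a tabular epsilon-greedy Q-learning controller
# picks one variation operator per generation on a toy ZDT1-style bi-objective
# problem. Operator pool, state bucketing, selection, and reward are assumptions.
import math
import random

random.seed(0)
DIM, POP, GENS = 10, 40, 60

def evaluate(x):                      # toy ZDT1-style objectives (both minimized)
    f1 = x[0]
    g = 1.0 + 9.0 * sum(x[1:]) / (DIM - 1)
    return f1, g * (1.0 - math.sqrt(f1 / g))

def clip(x):
    return [min(1.0, max(0.0, v)) for v in x]

# --- operator pool: each action applies one operator to the whole population ---
def mutate(pop, sigma):
    return [clip([v + random.gauss(0.0, sigma) for v in ind]) for ind in pop]

def crossover(pop):
    kids = []
    for _ in range(len(pop)):
        a, b = random.sample(pop, 2)
        kids.append([ai if random.random() < 0.5 else bi for ai, bi in zip(a, b)])
    return kids

OPERATORS = [lambda p: mutate(p, 0.30),   # exploratory mutation
             lambda p: mutate(p, 0.05),   # exploitative mutation
             crossover]                   # uniform crossover

def dominated(o, p):
    return p[0] <= o[0] and p[1] <= o[1] and (p[0] < o[0] or p[1] < o[1])

def pareto_front(objs):
    return [o for o in objs if not any(dominated(o, p) for p in objs)]

def hypervolume_2d(objs, ref=(1.1, 11.0)):   # exact 2-D hypervolume w.r.t. ref
    front = sorted(o for o in pareto_front(objs) if o[0] < ref[0] and o[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in front:
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

def state_of(hv):                     # coarse state: 10 buckets of current hypervolume
    return min(9, int(hv / 1.25))

Q = [[0.0] * len(OPERATORS) for _ in range(10)]
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2

pop = [[random.random() for _ in range(DIM)] for _ in range(POP)]
hv = hypervolume_2d([evaluate(ind) for ind in pop])
for _ in range(GENS):
    s = state_of(hv)
    if random.random() < EPS:
        a = random.randrange(len(OPERATORS))
    else:
        a = max(range(len(OPERATORS)), key=lambda k: Q[s][k])
    children = OPERATORS[a](pop)
    merged = pop + children
    merged.sort(key=lambda ind: sum(evaluate(ind)))   # crude survivor proxy, not a real MOEA ranking
    pop = merged[:POP]
    new_hv = hypervolume_2d([evaluate(ind) for ind in pop])
    reward = new_hv - hv              # scalarized reward: hypervolume improvement only
    s2 = state_of(new_hv)
    Q[s][a] += ALPHA * (reward + GAMMA * max(Q[s2]) - Q[s][a])
    hv = new_hv

print("final hypervolume (ref point (1.1, 11.0)):", round(hv, 3))
```

In a real integration, the toy evaluation and survivor selection above would be replaced by the host MOEA's own evaluation, non‑dominated sorting, and archive, with the Q‑learning step left as the operator‑selection hook.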
Results & Findings
- Reduced Evaluation Budget: On the ZDT suite, the RL‑augmented MOEA achieved a hypervolume within 2 % of the best static‑operator baseline while using ~30 % fewer fitness evaluations.
- Improved Diversity: The learned policy tended to favor exploratory operators early on and gradually shift to exploitative moves, yielding a more uniformly spread Pareto front.
- Robustness Across Problems: Even when the underlying problem characteristics changed (e.g., from convex to discontinuous Pareto fronts), the agent adapted its operator mix without manual retuning.
- Learning Curve: The policy converged after a modest number of generations (≈ 50), indicating that the RL component does not impose prohibitive overhead.
Practical Implications
- Faster Prototyping: Developers can embed the RL controller into existing multi‑objective libraries (e.g., DEAP, Platypus) to cut down on trial‑and‑error when tuning operator probabilities.
- Resource‑Constrained Environments: In domains like embedded system design or real‑time scheduling, where each simulation is expensive, the reduced evaluation count translates directly into cost savings.
- Auto‑ML for MOO: The framework can serve as a building block for automated machine‑learning pipelines that need to balance accuracy, latency, and energy consumption simultaneously.
- Domain‑Specific Operator Pools: Practitioners can plug in custom operators (e.g., domain‑aware mutation for circuit layout) and let the RL agent discover an effective mix, lowering the expertise barrier; see the toolbox sketch after this list.
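As a concrete illustration of the last two points, the sketch below registers a hypothetical domain‑aware mutation next to two standard DEAP variation operators and exposes them as an indexed pool an RL controller could choose from. Only `base.Toolbox.register`, `tools.cxSimulatedBinaryBounded`, and `tools.mutPolynomialBounded` are real DEAP APIs; the pool dict, the dispatch helper, and the toy mutation are assumptions for illustration.

```python
# Sketch of a domain-specific operator pool wired into a DEAP toolbox.
import random

from deap import base, tools

def domain_aware_mutation(individual, indpb=0.2):
    """Hypothetical stand-in for a domain-specific move (e.g., circuit-layout aware)."""
    for i in range(len(individual)):
        if random.random() < indpb:
            individual[i] = 1.0 - individual[i]   # toy move: reflect the gene inside [0, 1]
    return individual,                            # DEAP mutation convention: return a tuple

toolbox = base.Toolbox()
toolbox.register("sbx", tools.cxSimulatedBinaryBounded, eta=20.0, low=0.0, up=1.0)
toolbox.register("poly_mut", tools.mutPolynomialBounded, eta=20.0, low=0.0, up=1.0, indpb=0.1)
toolbox.register("domain_mut", domain_aware_mutation, indpb=0.2)

# The pool an RL controller would pick from: an action is just an index.
OPERATOR_POOL = {
    0: ("sbx", toolbox.sbx),            # binary operator: expects two parents
    1: ("poly_mut", toolbox.poly_mut),  # unary operator: expects one individual
    2: ("domain_mut", toolbox.domain_mut),
}

def apply_operator(action, parents):
    """Dispatch the chosen action; operators modify individuals in place (clone first if needed)."""
    name, op = OPERATOR_POOL[action]
    return op(parents[0], parents[1]) if name == "sbx" else op(parents[0])
```

A controller like the Q‑learning sketch above would then replace a fixed `toolbox.mate`/`toolbox.mutate` call with `apply_operator(chosen_action, parents)` inside the generation loop.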
Limitations & Future Work
- Scalability: The current experiments are limited to low‑dimensional benchmark problems; extending to high‑dimensional decision spaces may require more sophisticated state encodings or hierarchical RL.
- Reward Design Sensitivity: Scalarizing the multi‑objective reward can bias the learned policy, since a fixed weight vector implicitly favors some quality indicators over others; exploring Pareto‑aware RL (e.g., multi‑policy learning) is an open avenue.
- Computational Overhead: While evaluation savings are evident, the RL update step adds CPU cycles; optimizing this for large‑scale industrial workloads remains to be addressed.
- Online Adaptation: Future work will investigate continual learning mechanisms that allow the agent to adapt when the problem definition or constraints evolve during a run.
Authors
- Mehmet Emin Aydin
Paper Information
- arXiv ID: 2512.10208v1
- Categories: cs.AI, cs.NE
- Published: December 11, 2025