[Paper] Improved Regret Guarantees for Online Mirror Descent using a Portfolio of Mirror Maps

Published: February 13, 2026 at 01:37 PM EST
5 min read
Source: arXiv - 2602.13177v1

Overview

The paper Improved Regret Guarantees for Online Mirror Descent using a Portfolio of Mirror Maps shows that the classic Online Mirror Descent (OMD) algorithm can be made dramatically more efficient on high‑dimensional problems with sparse loss gradients. By letting the algorithm switch among a carefully designed family of “block‑norm” mirror maps, the authors prove polynomial‑in‑dimension reductions in regret compared with the standard OMD instantiations that use either pure (L_1) (Online Exponentiated Gradient) or pure (L_2) (Online Projected Gradient Descent) geometries.

Key Contributions

  • Block‑norm mirror maps: Introduce a new class of mirror maps based on block‑structured norms that interpolate between (L_1) and (L_2) more effectively than the usual (L_p) family.
  • Polynomial regret improvement: Construct explicit OCO instances where these block‑norm maps achieve a regret gain of (\Omega(d^{\alpha})) (for some (\alpha>0)) over both OEG and OPGD when losses are sparse.
  • Adaptive meta‑algorithm: Propose a multiplicative‑weights‑based “portfolio” method that dynamically selects the best block‑norm map online, without knowing the sparsity level in advance.
  • Negative result for naive switching: Prove that simply alternating between OEG and OPGD can lead to linear regret, highlighting the need for a principled selection strategy.
  • Theoretical guarantees: Provide adaptive regret bounds that automatically match the best fixed block‑norm in hindsight, effectively “tuning” OMD to the unknown sparsity pattern.
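
The paper's precise block-norm construction is more refined than any single formula here; as a rough illustration only (the contiguous equal-size grouping and the L1-of-L2 mixing are assumptions, not the authors' exact definition), a uniform block norm of block size (k) can be sketched as:

```python
import numpy as np

def block_norm(x, k):
    """Sketch of a uniform block norm: split x into contiguous blocks of
    size k, take the L2 norm inside each block, then sum (L1) across blocks.
    k = 1 recovers the L1 norm; k = len(x) recovers the L2 norm."""
    x = np.asarray(x, dtype=float)
    blocks = np.array_split(x, max(len(x) // k, 1))
    return float(sum(np.linalg.norm(b) for b in blocks))

v = np.array([3.0, 4.0, 0.0, 0.0])
print(block_norm(v, 1))  # 7.0 — L1 norm (blocks of one coordinate)
print(block_norm(v, 4))  # 5.0 — L2 norm (one block spanning everything)
```

Small blocks behave like (L_1) (favoring sparsity), large blocks like (L_2) (favoring smoothness), which is the interpolation the contributions above exploit.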

Methodology

  1. Mirror‑map design

    • The authors define a family of uniform block norms (\|\cdot\|_{(k)}) that treat groups of coordinates as blocks of size (k).
    • The associated Bregman divergences serve as mirror maps for OMD, blending the sparsity‑favoring behavior of (L_1) (small blocks) with the smoothness of (L_2) (large blocks).
  2. Hard‑instance construction

    • They build a synthetic OCO problem in (\mathbb{R}^d) where each loss vector has only a few non‑zero entries (sparsity (s \ll d)).
    • By analyzing the geometry of the block‑norm Bregman divergences, they show that the regret of OMD with a block‑norm mirror map scales like (\tilde O(\sqrt{sT})) instead of (\tilde O(\sqrt{dT})) for OEG/OPGD.
  3. Meta‑learning via multiplicative weights

    • Treat each block‑norm choice as an “expert”.
    • Run a standard multiplicative‑weights (MW) algorithm over these experts, feeding the loss incurred by each OMD instance back to MW.
    • The MW update automatically up‑weights the block‑norm that best matches the observed sparsity, while keeping a regret overhead of only (\tilde O(\sqrt{T\log K})) where (K) is the number of block‑norm candidates.
  4. Analysis

    • Combine OMD regret bounds for each mirror map with the MW regret guarantee to obtain an overall adaptive regret bound that is within a small factor of the best possible block‑norm in hindsight.
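
The portfolio step (item 3) can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: each OMD instance is reduced to a stub that reports a per-round loss in [0, 1], and the learning rate and loss values are illustrative assumptions.

```python
import numpy as np

def mw_portfolio(expert_losses, eta=0.1):
    """Multiplicative-weights selection over K experts (mirror maps).
    expert_losses: array of shape (T, K), the loss each OMD instance
    incurs per round, assumed to lie in [0, 1]. Returns the cumulative
    loss of the portfolio and of the best fixed expert in hindsight."""
    T, K = expert_losses.shape
    w = np.ones(K) / K
    portfolio_loss = 0.0
    for t in range(T):
        p = w / w.sum()                       # play the current distribution
        portfolio_loss += p @ expert_losses[t]
        w *= np.exp(-eta * expert_losses[t])  # down-weight lossy experts
    best_fixed = expert_losses.sum(axis=0).min()
    return portfolio_loss, best_fixed

rng = np.random.default_rng(0)
losses = rng.uniform(0, 1, size=(1000, 4))
losses[:, 2] *= 0.5  # expert 2 "matches the sparsity pattern" best
port, best = mw_portfolio(losses)
print(port - best)   # small overhead, on the order of sqrt(T log K)
```

The weight update concentrates on the expert with the smallest cumulative loss, which is how the meta-algorithm "tunes" OMD to the unknown sparsity without ever observing it directly.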

Results & Findings

| Setting | Regret (previous OMD) | Regret (block‑norm OMD) | Improvement |
| --- | --- | --- | --- |
| Sparse losses, known sparsity (s) | (\tilde O(\sqrt{dT})) (OEG/OPGD) | (\tilde O(\sqrt{sT})) | Polynomial in (d/s) |
| Sparse losses, unknown sparsity | Naïve switching can incur linear regret | (\tilde O(\sqrt{sT} + \sqrt{T\log K})) (meta‑algorithm) | Adaptivity without prior knowledge of (s) |
  • Polynomial gain: For example, with (s = d^{1/2}), the regret drops from (\tilde O(d^{1/2}\sqrt{T})) to (\tilde O(d^{1/4}\sqrt{T})).
  • Robustness: The meta‑algorithm never performs worse than the best single block‑norm by more than the MW overhead, which is negligible for large (T).
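
To make the polynomial gain concrete, here is a back-of-the-envelope comparison (the values of (d) and (T) are chosen purely for illustration):

```python
import math

d, T = 10**6, 10**5
s = math.isqrt(d)  # sparsity s = d^(1/2) = 1000

oeg_opgd = math.sqrt(d * T)  # ~ sqrt(dT), pure L1/L2 geometry
block    = math.sqrt(s * T)  # ~ sqrt(sT), matched block norm

print(oeg_opgd / block)      # ratio ~ d^(1/4) ≈ 31.6
```

The ratio (\sqrt{d/s} = d^{1/4}) is exactly the (\tilde O(d^{1/2}\sqrt{T})) vs. (\tilde O(d^{1/4}\sqrt{T})) gap quoted above.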

Practical Implications

  1. High‑dimensional ML pipelines – Many online learning tasks (e.g., click‑through‑rate prediction, recommendation systems) involve feature vectors with millions of dimensions but only a handful active per round. Using block‑norm OMD can cut the regret (and thus the cumulative loss) dramatically, translating to better predictive performance.

  2. Sparse gradient descent in deep learning – When training large models with sparse updates (e.g., embedding tables, pruning), a block‑norm mirror map can be plugged into existing OMD‑style optimizers (AdaGrad, Adam) to better respect the underlying sparsity structure.

  3. Automated hyper‑parameter tuning – The multiplicative‑weights portfolio removes the need for a practitioner to hand‑pick the “right” norm or learning‑rate schedule; the algorithm self‑adjusts based on observed gradients.

  4. Systems with limited memory – Block‑norm Bregman divergences often admit efficient projection or proximal steps (they decompose across blocks), making the approach practical for real‑time services.

  5. Extension to bandits & reinforcement learning – The same portfolio idea can be applied where the loss feedback is partial, offering a pathway to sparsity‑aware exploration strategies.
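
The block-decomposability mentioned in point 4 is visible in group soft-thresholding, the standard proximal operator for a sum of per-block (L_2) norms (a group-lasso-style penalty used here as a stand-in for the paper's exact mirror maps): each block is shrunk independently, so the step parallelizes trivially.

```python
import numpy as np

def block_soft_threshold(x, k, lam):
    """Proximal step for lam * (sum of per-block L2 norms).
    Each size-k block is shrunk toward zero independently, so the cost
    is linear in the dimension and parallel across blocks."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for start in range(0, len(x), k):
        b = x[start:start + k]
        nrm = np.linalg.norm(b)
        scale = max(0.0, 1.0 - lam / nrm) if nrm > 0 else 0.0
        out[start:start + k] = scale * b  # small blocks are zeroed out
    return out

print(block_soft_threshold([3.0, 4.0, 0.1, -0.1], 2, 1.0))
# blocks below the threshold collapse to exactly zero
```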

Limitations & Future Work

  • Assumption of uniform block size: The theory focuses on equal‑sized blocks; heterogeneous or data‑driven block partitions may yield further gains but are not analyzed.
  • Computational overhead of the portfolio: Maintaining multiple OMD instances in parallel incurs extra CPU/memory cost; scaling to thousands of experts needs smarter pruning or hierarchical selection.
  • Empirical validation: The paper provides rigorous proofs and synthetic experiments; real‑world benchmarks (e.g., ad‑click logs, NLP embeddings) are left for future studies.
  • Beyond convex losses: Extending the block‑norm mirror‑map selection to non‑convex or stochastic settings (e.g., deep learning) remains an open challenge.

Authors

  • Swati Gupta
  • Jai Moondra
  • Mohit Singh

Paper Information

  • arXiv ID: 2602.13177v1
  • Categories: math.OC, cs.DS, cs.LG
  • Published: February 13, 2026