[Paper] Improved Regret Guarantees for Online Mirror Descent using a Portfolio of Mirror Maps

Published: February 13, 2026 at 01:37 PM EST
5 min read
Source: arXiv - 2602.13177v1

Overview

The paper Improved Regret Guarantees for Online Mirror Descent using a Portfolio of Mirror Maps shows that the classic Online Mirror Descent (OMD) algorithm can be made dramatically more efficient on high‑dimensional problems with sparse loss gradients. By letting the algorithm switch among a carefully designed family of “block‑norm” mirror maps, the authors prove polynomial‑in‑dimension reductions in regret compared with the standard OMD instantiations that use either pure (L_1) (Online Exponentiated Gradient) or pure (L_2) (Online Projected Gradient Descent) geometries.

Key Contributions

  • Block‑norm mirror maps: Introduce a new class of mirror maps based on block‑structured norms that interpolate between (L_1) and (L_2) more effectively than the usual (L_p) family.
  • Polynomial regret improvement: Construct explicit OCO instances where these block‑norm maps achieve a regret gain of (\Omega(d^{\alpha})) (for some (\alpha>0)) over both OEG and OPGD when losses are sparse.
  • Adaptive meta‑algorithm: Propose a multiplicative‑weights‑based “portfolio” method that dynamically selects the best block‑norm map online, without knowing the sparsity level in advance.
  • Negative result for naive switching: Prove that simply alternating between OEG and OPGD can lead to linear regret, highlighting the need for a principled selection strategy.
  • Theoretical guarantees: Provide adaptive regret bounds that automatically match the best fixed block‑norm in hindsight, effectively “tuning” OMD to the unknown sparsity pattern.
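
The paper's precise block-norm construction is more refined than any single formula here; as a rough illustration only (the contiguous equal-size grouping and the L1-of-L2 mixing are assumptions, not the authors' exact definition), a uniform block norm of block size (k) can be sketched as:

```python
import numpy as np

def block_norm(x, k):
    """Sketch of a uniform block norm: split x into contiguous blocks of
    size k, take the L2 norm inside each block, then sum (L1) across blocks.
    k = 1 recovers the L1 norm; k = len(x) recovers the L2 norm."""
    x = np.asarray(x, dtype=float)
    blocks = np.array_split(x, max(len(x) // k, 1))
    return float(sum(np.linalg.norm(b) for b in blocks))

v = np.array([3.0, 4.0, 0.0, 0.0])
print(block_norm(v, 1))  # 7.0 — L1 norm (blocks of one coordinate)
print(block_norm(v, 4))  # 5.0 — L2 norm (one block spanning everything)
```

Small blocks behave like (L_1) (favoring sparsity), large blocks like (L_2) (favoring smoothness), which is the interpolation the contributions above exploit.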

Methodology

  1. Mirror‑map design

    • The authors define a family of uniform block norms (\|\cdot\|_{(k)}) that treat groups of coordinates as blocks of size (k).
    • The associated Bregman divergences serve as mirror maps for OMD, blending the sparsity‑favoring behavior of (L_1) (small blocks) with the smoothness of (L_2) (large blocks).
  2. Hard‑instance construction

    • They build a synthetic OCO problem in (\mathbb{R}^d) where each loss vector has only a few non‑zero entries (sparsity (s \ll d)).
    • By analyzing the geometry of the block‑norm Bregman divergences, they show that the regret of OMD with a block‑norm mirror map scales like (\tilde O(\sqrt{sT})) instead of (\tilde O(\sqrt{dT})) for OEG/OPGD.
  3. Meta‑learning via multiplicative weights

    • Treat each block‑norm choice as an “expert”.
    • Run a standard multiplicative‑weights (MW) algorithm over these experts, feeding the loss incurred by each OMD instance back to MW.
    • The MW update automatically up‑weights the block‑norm that best matches the observed sparsity, while keeping a regret overhead of only (\tilde O(\sqrt{T\log K})) where (K) is the number of block‑norm candidates.
  4. Analysis

    • Combine OMD regret bounds for each mirror map with the MW regret guarantee to obtain an overall adaptive regret bound that is within a small factor of the best possible block‑norm in hindsight.
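
The portfolio step (item 3) can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: each OMD instance is reduced to a stub that reports a per-round loss in [0, 1], and the learning rate and loss values are illustrative assumptions.

```python
import numpy as np

def mw_portfolio(expert_losses, eta=0.1):
    """Multiplicative-weights selection over K experts (mirror maps).
    expert_losses: array of shape (T, K), the loss each OMD instance
    incurs per round, assumed to lie in [0, 1]. Returns the cumulative
    loss of the portfolio and of the best fixed expert in hindsight."""
    T, K = expert_losses.shape
    w = np.ones(K) / K
    portfolio_loss = 0.0
    for t in range(T):
        p = w / w.sum()                       # play the current distribution
        portfolio_loss += p @ expert_losses[t]
        w *= np.exp(-eta * expert_losses[t])  # down-weight lossy experts
    best_fixed = expert_losses.sum(axis=0).min()
    return portfolio_loss, best_fixed

rng = np.random.default_rng(0)
losses = rng.uniform(0, 1, size=(1000, 4))
losses[:, 2] *= 0.5  # expert 2 "matches the sparsity pattern" best
port, best = mw_portfolio(losses)
print(port - best)   # small overhead, on the order of sqrt(T log K)
```

The weight update concentrates on the expert with the smallest cumulative loss, which is how the meta-algorithm "tunes" OMD to the unknown sparsity without ever observing it directly.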

Results & Findings

| Setting | Regret (previous OMD) | Regret (block‑norm OMD) | Improvement |
| --- | --- | --- | --- |
| Sparse losses, known sparsity (s) | (\tilde O(\sqrt{dT})) (OEG/OPGD) | (\tilde O(\sqrt{sT})) | Polynomial in (d/s) |
| Sparse losses, unknown sparsity | Naïve switching can incur linear regret | (\tilde O(\sqrt{sT} + \sqrt{T\log K})) (meta‑algorithm) | Adaptivity without prior knowledge of (s) |
  • Polynomial gain: For example, with (s = d^{1/2}), the regret drops from (\tilde O(d^{1/2}\sqrt{T})) to (\tilde O(d^{1/4}\sqrt{T})).
  • Robustness: The meta‑algorithm never performs worse than the best single block‑norm by more than the MW overhead, which is negligible for large (T).
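
To make the polynomial gain concrete, here is a back-of-the-envelope comparison (the values of (d) and (T) are chosen purely for illustration):

```python
import math

d, T = 10**6, 10**5
s = math.isqrt(d)  # sparsity s = d^(1/2) = 1000

oeg_opgd = math.sqrt(d * T)  # ~ sqrt(dT), pure L1/L2 geometry
block    = math.sqrt(s * T)  # ~ sqrt(sT), matched block norm

print(oeg_opgd / block)      # ratio ~ d^(1/4) ≈ 31.6
```

The ratio (\sqrt{d/s} = d^{1/4}) is exactly the (\tilde O(d^{1/2}\sqrt{T})) vs. (\tilde O(d^{1/4}\sqrt{T})) gap quoted above.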

Practical Implications

  1. High‑dimensional ML pipelines – Many online learning tasks (e.g., click‑through‑rate prediction, recommendation systems) involve feature vectors with millions of dimensions but only a handful active per round. Using block‑norm OMD can cut the regret (and thus the cumulative loss) dramatically, translating to better predictive performance.

  2. Sparse gradient descent in deep learning – When training large models with sparse updates (e.g., embedding tables, pruning), a block‑norm mirror map can be plugged into existing OMD‑style optimizers (AdaGrad, Adam) to better respect the underlying sparsity structure.

  3. Automated hyper‑parameter tuning – The multiplicative‑weights portfolio removes the need for a practitioner to hand‑pick the “right” norm or learning‑rate schedule; the algorithm self‑adjusts based on observed gradients.

  4. Systems with limited memory – Block‑norm Bregman divergences often admit efficient projection or proximal steps (they decompose across blocks), making the approach practical for real‑time services.

  5. Extension to bandits & reinforcement learning – The same portfolio idea can be applied where the loss feedback is partial, offering a pathway to sparsity‑aware exploration strategies.
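
The block-decomposability mentioned in point 4 is visible in group soft-thresholding, the standard proximal operator for a sum of per-block (L_2) norms (a group-lasso-style penalty used here as a stand-in for the paper's exact mirror maps): each block is shrunk independently, so the step parallelizes trivially.

```python
import numpy as np

def block_soft_threshold(x, k, lam):
    """Proximal step for lam * (sum of per-block L2 norms).
    Each size-k block is shrunk toward zero independently, so the cost
    is linear in the dimension and parallel across blocks."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for start in range(0, len(x), k):
        b = x[start:start + k]
        nrm = np.linalg.norm(b)
        scale = max(0.0, 1.0 - lam / nrm) if nrm > 0 else 0.0
        out[start:start + k] = scale * b  # small blocks are zeroed out
    return out

print(block_soft_threshold([3.0, 4.0, 0.1, -0.1], 2, 1.0))
# blocks below the threshold collapse to exactly zero
```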

Limitations & Future Work

  • Assumption of uniform block size: The theory focuses on equal‑sized blocks; heterogeneous or data‑driven block partitions may yield further gains but are not analyzed.
  • Computational overhead of the portfolio: Maintaining multiple OMD instances in parallel incurs extra CPU/memory cost; scaling to thousands of experts needs smarter pruning or hierarchical selection.
  • Empirical validation: The paper provides rigorous proofs and synthetic experiments; real‑world benchmarks (e.g., ad‑click logs, NLP embeddings) are left for future studies.
  • Beyond convex losses: Extending the block‑norm mirror‑map selection to non‑convex or stochastic settings (e.g., deep learning) remains an open challenge.

Authors

  • Swati Gupta
  • Jai Moondra
  • Mohit Singh

Paper Information

  • arXiv ID: 2602.13177v1
  • Categories: math.OC, cs.DS, cs.LG
  • Published: February 13, 2026