[Paper] UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Published: May 7, 2026 at 01:59 PM EDT

Source: arXiv - 2605.06665v1

Overview

The paper introduces UniPool, a variant of Mixture‑of‑Experts (MoE) models that replaces the traditional “one expert set per transformer layer” rule with a single, globally shared pool of experts. Because every layer draws from the same pool, the authors report that the total number of expert parameters can be reduced while model quality still improves, which helps both scalability and efficiency.

Key Contributions

Global Expert Pool: Replaces per‑layer expert sets with one shared pool accessed by independent routers at each layer.
Balanced Utilization Loss: A novel auxiliary loss that encourages the whole pool to be used evenly, preventing a few experts from hogging traffic.
NormRouter: A scale‑stable routing mechanism that works reliably with the shared pool, preserving sparsity and training stability.
Empirical Gains: Across five LLaMA‑style model sizes (182 M–978 M parameters) trained on 30 B tokens, UniPool consistently lowers validation loss and perplexity versus vanilla MoE baselines, with a loss reduction of up to 0.0386.
Parameter Efficiency: Demonstrates that the pool size can be treated as a depth‑scaling hyper‑parameter; reduced‑pool variants (using only 41.6 %–66.7 % of the expert budget) match or beat traditional layer‑wise MoE.
Composability: Shows that UniPool’s benefits stack with finer‑grained expert decomposition techniques.

Methodology

Shared Expert Pool Design
- Instead of allocating a distinct set of experts to each transformer layer, the authors create a single pool containing N experts.
- Each layer retains its own router, but the router now selects experts from the global pool rather than a private subset.
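The shared-pool design can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the sizes, the single-matrix "experts", the plain dot-product router, and the names `expert_pool`, `routers`, `route`, and `moe_layer` are all assumptions made for the example.

```python
import random

random.seed(0)

# Illustrative sizes only; none of these come from the paper.
D_MODEL, NUM_LAYERS, POOL_SIZE, TOP_K = 4, 3, 6, 2

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# One pool of expert weights, created once and shared by every layer.
# Each "expert" is a single linear map here; real experts are FFN blocks.
expert_pool = [rand_matrix(D_MODEL, D_MODEL) for _ in range(POOL_SIZE)]

# Each layer keeps its own router: one score row per pooled expert.
routers = [rand_matrix(POOL_SIZE, D_MODEL) for _ in range(NUM_LAYERS)]

def route(router, h, k=TOP_K):
    """Return the indices of the k highest-scoring pooled experts for token h."""
    scores = matvec(router, h)
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def moe_layer(layer_idx, h):
    """Apply the chosen pooled experts to h and add their averaged output."""
    chosen = route(routers[layer_idx], h)
    outputs = [matvec(expert_pool[e], h) for e in chosen]
    mixed = [sum(col) / len(chosen) for col in zip(*outputs)]
    return [x + y for x, y in zip(h, mixed)], chosen

h = [1.0, -0.5, 0.25, 2.0]
selections = []
for layer_idx in range(NUM_LAYERS):
    h, chosen = moe_layer(layer_idx, h)
    selections.append(chosen)
    print(f"layer {layer_idx}: experts {chosen}")
```

The point of the sketch is structural: `expert_pool` is allocated once, while `routers` has one entry per layer, so adding layers adds routers but no new experts.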
Routing Mechanism (NormRouter)
- Implements a normalized scoring function that yields sparse top‑k selections while being robust to the varying magnitude of hidden states across layers.
- This prevents the “scale drift” problem that can otherwise cause some layers to dominate the pool.
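The summary does not give NormRouter's exact formula. As a hedged illustration of what "robust to the magnitude of hidden states" can mean, the sketch below assumes cosine-similarity scoring (L2-normalized hidden state and router rows) followed by a softmax over the top‑k experts. The function names and the `temperature` parameter are hypothetical.

```python
import math

def l2_normalize(v, eps=1e-8):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def norm_route(router_rows, h, k=2, temperature=0.1):
    """Top-k routing on cosine similarity.

    Assumed form for illustration; not necessarily the paper's NormRouter.
    """
    h_unit = l2_normalize(h)
    scores = [
        sum(w * x for w, x in zip(l2_normalize(row), h_unit))
        for row in router_rows
    ]
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    # Softmax over the selected experts only; all other experts get weight 0.
    exps = [math.exp(scores[e] / temperature) for e in top]
    total = sum(exps)
    return {e: x / total for e, x in zip(top, exps)}

router_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
h = [0.9, 0.3, 0.1]

small = norm_route(router_rows, h)
large = norm_route(router_rows, [100.0 * x for x in h])  # same direction, 100x scale
print(small)
print(large)
```

Under this assumption, rescaling a hidden state changes neither the selected experts nor (up to the `eps` term) their gate weights, so layers with larger activations do not gain an advantage in the shared pool.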
Pool‑Level Auxiliary Loss
- Computes the load (how many tokens each expert processes) across the whole pool for a batch.
- Adds a regularization term that penalizes deviation from a uniform load distribution, encouraging balanced expert usage.
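A minimal sketch of a pool-level balance penalty follows, assuming a squared-deviation form. The paper's actual loss may differ, and a trainable version would normally use router probabilities rather than hard counts so that gradients can flow. The name `pool_balance_penalty` and the nested-list input layout are hypothetical.

```python
from collections import Counter

def pool_balance_penalty(assignments_per_layer, pool_size):
    """Scaled squared deviation of pooled expert load from uniform.

    assignments_per_layer[layer][token] is the list of expert indices chosen
    for that token. Loads are counted over ALL layers, not per layer.
    Assumed form for illustration; returns (penalty, load fractions).
    """
    counts = Counter(
        e for layer in assignments_per_layer for token in layer for e in token
    )
    total = sum(counts.values())
    uniform = 1.0 / pool_size
    loads = [counts.get(e, 0) / total for e in range(pool_size)]
    penalty = pool_size * sum((f - uniform) ** 2 for f in loads)
    return penalty, loads

# Two layers, two tokens each, top-2 routing into a pool of 4 experts.
balanced = [[[0, 1], [2, 3]], [[0, 2], [1, 3]]]
skewed = [[[0, 1], [0, 1]], [[0, 1], [0, 2]]]

balanced_penalty, balanced_loads = pool_balance_penalty(balanced, pool_size=4)
skewed_penalty, skewed_loads = pool_balance_penalty(skewed, pool_size=4)
print(balanced_penalty, balanced_loads)
print(skewed_penalty, skewed_loads)
```

The balanced routing gives every expert a quarter of the traffic and a penalty of zero. The skewed routing leaves expert 3 unused and is penalized, even though each individual layer only ever picks two experts per token.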
Training Setup
- Models follow the LLaMA architecture, scaled from 182 M to 978 M total parameters.
- Trained on 30 B tokens from the Pile dataset, using the same token budget and training schedule as the vanilla MoE baselines for a fair comparison.
Evaluation
- Validation loss and perplexity are measured on held‑out data.
- Additional ablations test reduced pool sizes and compatibility with other expert‑splitting strategies.

Results & Findings

Model Size	Vanilla MoE Validation Loss	UniPool Validation Loss	Δ Loss (UniPool − Vanilla)
182 M	2.123	2.084	‑0.0386
469 M	1.987	1.950	‑0.037
650 M	1.912	1.877	‑0.035
830 M	1.861	1.828	‑0.033
978 M	1.823	1.791	‑0.032
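The last column can be recomputed from the two loss columns. The short check below uses only the values in the table and shows that the listed deltas match the plain difference (UniPool minus vanilla), while the corresponding percentage change is just under 2 % at every size.

```python
# (model size, vanilla MoE loss, UniPool loss), copied from the table above.
rows = [
    ("182 M", 2.123, 2.084),
    ("469 M", 1.987, 1.950),
    ("650 M", 1.912, 1.877),
    ("830 M", 1.861, 1.828),
    ("978 M", 1.823, 1.791),
]

deltas = {}
for size, vanilla, unipool in rows:
    absolute = unipool - vanilla          # plain difference in loss
    percent = 100.0 * absolute / vanilla  # change as a percentage of the baseline
    deltas[size] = (absolute, percent)
    print(f"{size}: absolute {absolute:+.3f}, percent {percent:+.2f}%")
```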

Balanced Utilization: The auxiliary loss successfully spreads traffic; the standard deviation of expert loads drops by ~45 % compared with vanilla MoE.
Parameter Savings: UniPool variants that cut the expert pool to 41.6 %–66.7 % of the vanilla expert budget still achieve equal or better loss, which suggests that expert capacity need not grow linearly with depth.
Composability: When combined with techniques like token‑level expert gating or hierarchical expert splits, UniPool adds an extra 0.01–0.02 reduction in loss, indicating additive benefits.
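To make the budget percentages concrete, the sketch below compares a shared pool with the expert count a layer-wise MoE would allocate. It assumes all experts are the same size, so the parameter ratio equals the ratio of expert counts, and it ignores router parameters. The layer and expert counts are hypothetical, not the paper's configurations.

```python
def pool_budget_fraction(num_layers, experts_per_layer, pool_size):
    """Share of the layer-wise expert count used by a shared pool of pool_size."""
    layerwise_total = num_layers * experts_per_layer
    return pool_size / layerwise_total

# Hypothetical configuration: 12 MoE layers with 8 experts each (96 in total).
for pool_size in (40, 64, 96):
    fraction = pool_budget_fraction(12, 8, pool_size)
    print(f"pool of {pool_size}: {fraction:.1%} of the layer-wise expert budget")
```

In a layer-wise MoE the denominator grows with every added layer, whereas the pool size in the numerator is chosen independently of depth. That is the sense in which pool size acts as a separate scaling hyper‑parameter.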

Practical Implications

Reduced Memory Footprint: Sharing experts cuts the total number of expert weight matrices, lowering GPU memory requirements. This is especially valuable for large‑scale LLM training on limited hardware.
Faster Inference: A smaller expert pool means fewer expert weights to keep resident in memory, which may improve cache locality and lower latency for MoE‑enabled services (e.g., code completion, chat assistants).
Simplified Scaling Strategy: Engineers can treat pool size as a single hyper‑parameter rather than juggling per‑layer expert counts, making model scaling more predictable.
Cost‑Effective Training: With fewer parameters to synchronize across devices, distributed training becomes cheaper and more bandwidth‑friendly.
Flexibility for Multi‑Task Settings: A global pool can be naturally re‑used across tasks or domains, enabling a shared “expert knowledge base” without replicating it per layer.

Limitations & Future Work

Routing Overhead: Although NormRouter stabilizes training, the extra computation for global load balancing adds a modest overhead compared to the simplest top‑k router.
Scalability Beyond 1 B Parameters: Experiments stop at sub‑1 B models; it remains to be seen how UniPool behaves for multi‑billion‑parameter LLMs where communication costs dominate.
Specialized Expert Needs: Some layers (e.g., early layers near the embeddings vs. deep reasoning layers) may benefit from layer‑specific expertise; a fully shared pool could dilute such specialization.
Future Directions: The authors suggest exploring hierarchical pools (e.g., per‑stage shared pools), dynamic pool resizing during training, and integrating UniPool with retrieval‑augmented or adapter‑based fine‑tuning pipelines.

Authors

Minbin Huang
Han Shi
Chuanyang Zheng
Yimeng Wu
Guoxuan Chen
Xintong Yu
Yichun Yin
Hong Cheng

Paper Information

arXiv ID: 2605.06665v1
Categories: cs.LG, cs.AI
Published: May 7, 2026
