[Paper] UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
Source: arXiv - 2605.06665v1
Overview
The paper introduces UniPool, a variant of Mixture-of-Experts (MoE) that replaces the conventional design, in which every transformer layer owns its own set of experts, with a single, globally shared expert pool. Because every layer draws from the same pool, the authors report that the total number of expert parameters can be reduced while model quality still improves, which benefits both scalability and efficiency.
Key Contributions
- Global Expert Pool: Replaces per‑layer expert sets with one shared pool accessed by independent routers at each layer.
- Balanced Utilization Loss: A novel auxiliary loss that encourages the whole pool to be used evenly, preventing a few experts from hogging traffic.
- NormRouter: A scale‑stable routing mechanism that works reliably with the shared pool, keeping sparsity and training stability.
- Empirical Gains: Across five LLaMA-style model sizes (182 M–978 M parameters) trained on 30 B tokens, UniPool consistently lowers validation loss (by up to 0.0386 relative to the vanilla MoE baseline) and perplexity.
- Parameter Efficiency: Demonstrates that the pool size can be treated as a depth‑scaling hyper‑parameter; reduced‑pool variants (using only 41.6 %–66.7 % of the expert budget) match or beat traditional layer‑wise MoE.
- Composability: Shows that UniPool’s benefits stack with finer‑grained expert decomposition techniques.
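The "Parameter Efficiency" item compares the pool size against the layer-wise expert budget. The bookkeeping behind that comparison can be sketched as follows, assuming SwiGLU-style experts with three bias-free projection matrices (as in LLaMA feed-forward blocks). The function names and every dimension below are hypothetical and are not taken from the paper.

```python
def swiglu_expert_params(d_model: int, d_ff: int) -> int:
    """Parameters in one expert: gate, up, and down projections, no biases (assumed)."""
    return 3 * d_model * d_ff


def layerwise_expert_params(num_layers: int, experts_per_layer: int,
                            d_model: int, d_ff: int) -> int:
    """Expert budget when every layer owns a private expert set."""
    return num_layers * experts_per_layer * swiglu_expert_params(d_model, d_ff)


def pooled_expert_params(pool_size: int, d_model: int, d_ff: int) -> int:
    """Expert budget when all layers share one pool; independent of depth."""
    return pool_size * swiglu_expert_params(d_model, d_ff)


# Hypothetical configuration: 12 layers x 8 experts versus a shared pool of 48.
baseline = layerwise_expert_params(num_layers=12, experts_per_layer=8,
                                   d_model=768, d_ff=2048)
pooled = pooled_expert_params(pool_size=48, d_model=768, d_ff=2048)
print(f"pool uses {pooled / baseline:.1%} of the layer-wise expert budget")
```

In this toy setup the layer-wise budget grows with the number of layers, while the pooled budget changes only when the pool size is changed.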
Methodology
Shared Expert Pool Design
- Instead of allocating a distinct set of experts to each transformer layer, the authors create a single pool containing N experts.
- Each layer retains its own router, but the router now selects experts from the global pool rather than a private subset.
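A minimal sketch of this arrangement, using only the Python standard library: one global list of experts plus one router per layer. The class name, the toy "expert" (a per-dimension scale vector), and the uniform gate are illustrative assumptions, not the authors' implementation.

```python
import random


class SharedPoolMoE:
    """Toy model: a single global expert pool with an independent router per layer."""

    def __init__(self, num_layers, pool_size, dim, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Global pool: each toy expert is a per-dimension scale vector.
        self.pool = [[rng.uniform(0.5, 1.5) for _ in range(dim)]
                     for _ in range(pool_size)]
        # One router per layer: a [pool_size][dim] weight matrix scoring every expert.
        self.routers = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(pool_size)]
            for _ in range(num_layers)
        ]

    def route(self, layer, x):
        """Return the indices of the top-k pool experts for token x at this layer."""
        scores = [sum(w * v for w, v in zip(row, x)) for row in self.routers[layer]]
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        return ranked[: self.top_k]

    def layer_forward(self, layer, x):
        """Average the outputs of the selected experts (uniform gate for simplicity)."""
        chosen = self.route(layer, x)
        out = [0.0] * len(x)
        for e in chosen:
            for i, v in enumerate(x):
                out[i] += self.pool[e][i] * v / len(chosen)
        return out, chosen


model = SharedPoolMoE(num_layers=4, pool_size=6, dim=8)
hidden = [1.0] * 8
for layer in range(4):
    hidden, chosen = model.layer_forward(layer, hidden)
    print(f"layer {layer} routed to pool experts {chosen}")
```

Every layer indexes into the same `self.pool`, so adding layers adds routers but no new expert weights.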
Routing Mechanism (NormRouter)
- Implements a normalized scoring function that yields sparse top‑k selections while being robust to the varying magnitude of hidden states across layers.
- This prevents the “scale drift” problem that can otherwise cause some layers to dominate the pool.
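The summary does not give NormRouter's exact formula, so the sketch below only illustrates the general idea of scale-robust scoring. It assumes cosine-similarity scores (L2-normalizing both the hidden state and each router weight row) followed by top-k selection; this is a stand-in, not necessarily the paper's formulation.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def normalized_scores(hidden, router_weights):
    """Cosine similarity between the hidden state and each expert's router row."""
    h = l2_normalize(hidden)
    return [sum(a * b for a, b in zip(l2_normalize(row), h))
            for row in router_weights]


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical router with four experts and a 3-dimensional hidden state.
weights = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
hidden = [0.3, -1.2, 0.8]
scaled = [100.0 * v for v in hidden]  # same direction, much larger magnitude

print(top_k(normalized_scores(hidden, weights), k=2))
print(top_k(normalized_scores(scaled, weights), k=2))
```

Because the scores depend only on direction, a layer whose hidden states have a large norm gets the same routing decision as one with a small norm, which is the property the bullets above describe.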
Pool‑Level Auxiliary Loss
- Computes the load (how many tokens each expert processes) across the whole pool for a batch.
- Adds a regularization term that penalizes deviation from a uniform load distribution, encouraging balanced expert usage.
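One simple way to express "deviation from a uniform load distribution" is a squared-error penalty on per-expert load fractions, pooled over all layers. The sketch below assumes that form. The paper's actual loss may differ, and a trainable version would need a differentiable surrogate (for example, router probabilities) rather than hard counts.

```python
def pool_load_fractions(assignments, pool_size):
    """Fraction of routed tokens handled by each pool expert.

    `assignments` is a list with one entry per layer; each entry lists the
    pool-expert indices chosen for the tokens in the batch at that layer.
    """
    counts = [0] * pool_size
    for layer_assignments in assignments:
        for expert in layer_assignments:
            counts[expert] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no tokens were routed")
    return [c / total for c in counts]


def uniformity_penalty(fractions):
    """N * sum((f_e - 1/N)^2): zero for a perfectly even load (assumed form)."""
    n = len(fractions)
    return n * sum((f - 1.0 / n) ** 2 for f in fractions)


# Hypothetical routing decisions for two layers sharing a pool of four experts.
balanced = [[0, 1], [2, 3]]
skewed = [[0, 0], [0, 1]]
print(uniformity_penalty(pool_load_fractions(balanced, pool_size=4)))
print(uniformity_penalty(pool_load_fractions(skewed, pool_size=4)))
```

Counting across layers before computing the penalty is what makes this a pool-level term: an expert ignored by one layer but heavily used by another can still look balanced.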
Training Setup
- Models follow the LLaMA architecture, scaled from 182 M to 978 M total parameters.
- Trained on 30 B tokens from the Pile dataset, using the same token budget and training schedule as the vanilla MoE baselines for a fair comparison.
Evaluation
- Validation loss and perplexity are measured on held‑out data.
- Additional ablations test reduced pool sizes and compatibility with other expert‑splitting strategies.
Results & Findings
| Model Size | Vanilla MoE Validation Loss | UniPool Validation Loss | Δ Loss (UniPool − Vanilla, absolute) |
|---|---|---|---|
| 182 M | 2.123 | 2.084 | ‑0.0386 |
| 469 M | 1.987 | 1.950 | ‑0.037 |
| 650 M | 1.912 | 1.877 | ‑0.035 |
| 830 M | 1.861 | 1.828 | ‑0.033 |
| 978 M | 1.823 | 1.791 | ‑0.032 |
- Balanced Utilization: The auxiliary loss successfully spreads traffic; the standard deviation of expert loads drops by ~45 % compared with vanilla MoE.
- Parameter Savings: UniPool variants that cut the expert pool to ~50 % of the original size still achieve equal or better loss, suggesting that expert capacity need not grow linearly with depth.
- Composability: When combined with techniques like token‑level expert gating or hierarchical expert splits, UniPool adds an extra 0.01–0.02 reduction in loss, indicating additive benefits.
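To put the loss gaps in the table into perspective, the snippet below recomputes the absolute difference, the percentage change, and the implied perplexity ratio for each row. It assumes the validation loss is a mean token-level cross-entropy in nats, so that perplexity equals exp(loss); the summary does not state this explicitly.

```python
import math

# (model size, vanilla MoE loss, UniPool loss) copied from the table above.
rows = [
    ("182 M", 2.123, 2.084),
    ("469 M", 1.987, 1.950),
    ("650 M", 1.912, 1.877),
    ("830 M", 1.861, 1.828),
    ("978 M", 1.823, 1.791),
]


def compare(vanilla, unipool):
    """Absolute and percentage loss change, plus the perplexity ratio exp(delta)."""
    delta = unipool - vanilla
    return {
        "delta": delta,
        "relative_pct": 100.0 * delta / vanilla,
        "ppl_ratio": math.exp(delta),  # assumes loss is in nats per token
    }


for size, vanilla, unipool in rows:
    r = compare(vanilla, unipool)
    print(f"{size}: delta={r['delta']:+.3f}  "
          f"relative={r['relative_pct']:+.2f}%  ppl_ratio={r['ppl_ratio']:.4f}")
```

Recomputing from the three-decimal losses gives -0.039 for the first row rather than the listed -0.0386, presumably because the table's loss values are rounded.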
Practical Implications
- Reduced Memory Footprint: Sharing experts cuts the total number of weight matrices, lowering GPU memory requirements—especially valuable for large‑scale LLM training on limited hardware.
- Faster Inference: A smaller expert pool means fewer memory accesses and better cache locality, translating to lower latency for MoE‑enabled services (e.g., code completion, chat assistants).
- Simplified Scaling Strategy: Engineers can treat pool size as a single hyper‑parameter rather than juggling per‑layer expert counts, making model scaling more predictable.
- Cost‑Effective Training: With fewer parameters to synchronize across devices, distributed training becomes cheaper and more bandwidth‑friendly.
- Flexibility for Multi‑Task Settings: A global pool can be naturally re‑used across tasks or domains, enabling a shared “expert knowledge base” without replicating it per layer.
Limitations & Future Work
- Routing Overhead: Although NormRouter stabilizes training, the extra computation for global load balancing adds a modest overhead compared to the simplest top‑k router.
- Scalability Beyond 1 B Parameters: Experiments stop at sub‑1 B models; it remains to be seen how UniPool behaves for multi‑billion‑parameter LLMs where communication costs dominate.
- Specialized Expert Needs: Some layers (e.g., early embedding layers vs. deep reasoning layers) may benefit from layer‑specific expertise; a fully shared pool could dilute such specialization.
- Future Directions: The authors suggest exploring hierarchical pools (e.g., per‑stage shared pools), dynamic pool resizing during training, and integrating UniPool with retrieval‑augmented or adapter‑based fine‑tuning pipelines.
Authors
- Minbin Huang
- Han Shi
- Chuanyang Zheng
- Yimeng Wu
- Guoxuan Chen
- Xintong Yu
- Yichun Yin
- Hong Cheng
Paper Information
- arXiv ID: 2605.06665v1
- Categories: cs.LG, cs.AI
- Published: May 7, 2026