[Paper] DirMoE: Dirichlet-routed Mixture of Experts

Published: February 9, 2026 at 01:45 PM EST
4 min read
Source: arXiv - 2602.09001v1

Overview

The paper introduces DirMoE, a new routing mechanism for Mixture‑of‑Experts (MoE) models that replaces the widely used non‑differentiable Top‑k + Softmax with a fully differentiable, probabilistic approach. By separating which experts to activate from how much each selected expert should contribute, DirMoE achieves better performance, tighter control over sparsity, and stronger expert specialization—key factors for scaling language models in production.

Key Contributions

  • Dirichlet‑based routing: A novel, end‑to‑end differentiable router built on a Dirichlet variational auto‑encoder (VAE) that decouples expert selection (Bernoulli) from contribution weighting (Dirichlet).
  • Gumbel‑Sigmoid + implicit reparameterization: Enables gradient flow through the binary selection of experts and the continuous Dirichlet weights without resorting to hard Top‑k tricks.
  • Variational ELBO with explicit sparsity term: Directly penalizes the expected number of active experts, giving precise control over model sparsity during training.
  • Curriculum‑style hyper‑parameter schedule: Guides the router from an exploratory phase (many experts active) to a committed phase (few, well‑chosen experts), improving convergence stability.
  • Empirical gains: Matches or outperforms state‑of‑the‑art MoE routers on benchmark language modeling tasks while showing higher expert specialization.

Methodology

  1. Problem decomposition

    • Expert selection: Modeled as a set of independent Bernoulli variables indicating whether each expert is turned on.
    • Contribution allocation: Conditioned on the selected experts, a Dirichlet distribution assigns a probability simplex over their contributions.
  2. Variational formulation

    • The router is treated as a latent variable model. The encoder (a lightweight feed‑forward network) predicts the parameters of the Bernoulli and Dirichlet distributions for each token.
    • Training maximizes the Evidence Lower Bound (ELBO): a reconstruction term (standard MoE loss) plus KL divergences that regularize the latent distributions and a sparsity penalty that directly controls the expected number of active experts.
  3. Differentiable sampling

    • Gumbel‑Sigmoid relaxation approximates the Bernoulli draws, providing a smooth, differentiable proxy for the binary on/off decisions.
    • Implicit reparameterization gradients are used for the Dirichlet draws, allowing back‑propagation through the contribution weights without explicit reparameterization.
  4. Training schedule

    • Early epochs use a high temperature for the Gumbel‑Sigmoid and a low concentration for the Dirichlet, encouraging exploration of many experts.
    • Over time, temperature is annealed and concentration increased, forcing the router to commit to a sparse, deterministic set of experts.
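
To make steps 1 and 3 above concrete, here is a minimal PyTorch sketch of a DirMoE‑style router forward pass. It illustrates the mechanism as described, not the authors' implementation: the class name `DirMoERouter`, the encoder layout, and deriving the Dirichlet concentration from a softplus of encoder outputs are assumptions. PyTorch's `Dirichlet.rsample` already provides implicitly reparameterized gradients, which is the estimator the paper relies on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Dirichlet


class DirMoERouter(nn.Module):
    """Sketch of a Dirichlet-routed MoE router (names and shapes are assumptions)."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        # Lightweight encoder: per-token Bernoulli selection logits
        # and Dirichlet concentration logits, one value per expert.
        self.select_logits = nn.Linear(d_model, num_experts)
        self.concentration_logits = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor, temperature: float = 1.0):
        # x: (num_tokens, d_model) token representations
        logits = self.select_logits(x)

        # Gumbel-Sigmoid relaxation of the binary on/off decisions:
        # logistic noise added to the logits, squashed at a temperature.
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        gates = torch.sigmoid((logits + logistic_noise) / temperature)  # soft selections in (0, 1)

        # Dirichlet contribution weights; softplus keeps concentrations positive.
        alpha = F.softplus(self.concentration_logits(x)) + 1e-4
        weights = Dirichlet(alpha).rsample()  # implicit reparameterization gradients

        # Mask contributions by the (soft) selections and renormalize so the
        # weights of the active experts form a probability simplex.
        combined = gates * weights
        combined = combined / combined.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return combined, gates, alpha
```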
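
Steps 2 and 4 can be sketched in the same spirit. The formulation below is an assumed, simplified stand‑in for the paper's objective: the KL terms use fixed priors (a sparse Bernoulli prior and a symmetric Dirichlet prior), the sparsity penalty drives the expected number of active gates toward a user‑chosen budget, and the annealing constants are placeholders.

```python
import torch
from torch.distributions import Bernoulli, Dirichlet, kl_divergence


def dirmoe_regularizers(logits, alpha, target_active=2.0, prior_p=0.1, prior_alpha=1.0):
    """Regularization terms added to the standard MoE loss (assumed, simplified form)."""
    # Sparsity term: expected number of active experts per token,
    # pushed toward the budget `target_active` (e.g. k ≈ 2 or k ≈ 1.5).
    expected_active = torch.sigmoid(logits).sum(dim=-1)
    sparsity = (expected_active - target_active).pow(2).mean()

    # KL between the selection posterior and a sparse Bernoulli prior.
    kl_select = kl_divergence(
        Bernoulli(logits=logits),
        Bernoulli(probs=torch.full_like(logits, prior_p)),
    ).sum(dim=-1).mean()

    # KL between the contribution posterior and a symmetric Dirichlet prior.
    kl_weights = kl_divergence(
        Dirichlet(alpha),
        Dirichlet(torch.full_like(alpha, prior_alpha)),
    ).mean()

    return sparsity, kl_select, kl_weights


def curriculum(step: int, total_steps: int):
    """Anneal temperature down (exploration -> near-binary gates) and concentration up."""
    progress = min(step / max(total_steps, 1), 1.0)
    temperature = 1.0 - 0.9 * progress          # 1.0 -> 0.1
    concentration_scale = 1.0 + 9.0 * progress  # 1.0 -> 10.0
    return temperature, concentration_scale
```

The full objective would then be the usual language‑modeling loss plus weighted versions of these terms; the scheduled `temperature` is passed to the router, and `concentration_scale` would be folded into the Dirichlet parameters so contributions sharpen as training proceeds.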

Results & Findings

| Model / Router | Perplexity (Wiki‑103) | FLOPs (per token) | Expert Utilization (avg.) |
|---|---|---|---|
| Standard Top‑k + Softmax (k=2) | 18.7 | 1.0× | 2 |
| Switch Transformer (k=1) | 19.3 | 0.9× | 1 |
| DirMoE (k≈2 in expectation) | 17.9 | 1.0× | 2 |
| DirMoE (k≈1.5) | 18.1 | 0.95× | 1.5 |

  • Performance: DirMoE consistently reduces perplexity compared with the baseline routers, even when matching the same FLOP budget.
  • Sparsity control: The explicit sparsity term lets practitioners target a precise average number of active experts, something that is only approximate with Top‑k.
  • Expert specialization: Analysis of activation patterns shows higher mutual information between tokens and their selected experts, indicating that experts learn more distinct linguistic sub‑tasks.
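
As a rough illustration of how such a specialization analysis can be run, the snippet below estimates the mutual information between a discrete token category (e.g., a coarse part‑of‑speech tag or vocabulary bucket) and the index of each token's most‑activated expert, computed from co‑occurrence counts. The binning scheme and function name are assumptions; the paper's exact analysis may differ.

```python
import numpy as np


def expert_token_mutual_information(token_labels: np.ndarray,
                                    expert_ids: np.ndarray,
                                    num_labels: int,
                                    num_experts: int) -> float:
    """MI (in nats) between token categories and selected experts, from counts."""
    joint = np.zeros((num_labels, num_experts))
    np.add.at(joint, (token_labels, expert_ids), 1.0)   # co-occurrence counts
    joint /= joint.sum()                                 # joint distribution p(label, expert)
    p_label = joint.sum(axis=1, keepdims=True)
    p_expert = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(joint > 0, joint / (p_label * p_expert), 1.0)
    return float(np.sum(joint * np.log(ratio)))
```

A higher value indicates that particular experts are consistently tied to particular token categories, i.e., stronger specialization.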

Practical Implications

  • Scalable deployment: Because routing is fully differentiable, DirMoE can be trained with standard optimizers and mixed‑precision pipelines, simplifying integration into existing ML stacks (e.g., PyTorch, JAX).
  • Fine‑grained resource budgeting: The ELBO‑based sparsity penalty gives operators a knob to meet latency or memory constraints without post‑hoc pruning or heuristic Top‑k tuning.
  • Better expert reuse: Stronger specialization reduces redundant computation across experts, which can translate into fewer GPU‑hours and lower cost in large‑scale language model serving.
  • Transferability: The routing framework is model‑agnostic; it can replace the router in any MoE‑augmented architecture—vision transformers, speech models, or multimodal systems—potentially yielding similar gains.
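
To give a sense of the "drop‑in replacement" claim, the fragment below wires the earlier `DirMoERouter` sketch into a generic PyTorch feed‑forward MoE layer. The dense combine over all experts is a simplification for readability; a real deployment would dispatch tokens sparsely to the selected experts.

```python
import torch
import torch.nn as nn


class DirMoELayer(nn.Module):
    """Toy MoE layer using the DirMoERouter sketch above (dense combine for clarity)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = DirMoERouter(d_model, num_experts)  # from the earlier sketch
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        # x: (tokens, d_model); weights: (tokens, num_experts)
        weights, _, _ = self.router(x, temperature)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (tokens, d_model, E)
        return torch.einsum("tde,te->td", expert_out, weights)
```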

Limitations & Future Work

  • Training overhead: The Gumbel‑Sigmoid relaxation and Dirichlet KL terms add modest compute and memory cost during training, though inference remains unchanged.
  • Hyper‑parameter sensitivity: The annealing schedule for temperature and Dirichlet concentration requires careful tuning; the authors note that sub‑optimal schedules can lead to either overly dense routing or premature expert collapse.
  • Limited benchmark scope: Experiments focus on language modeling; broader evaluation on downstream tasks (e.g., translation, code generation) and on non‑text modalities would strengthen the claim of universal applicability.
  • Future directions: The authors suggest exploring hierarchical Dirichlet routers for deeper expert hierarchies, integrating reinforcement‑learning‑style reward signals for task‑specific routing, and applying DirMoE to sparsely‑gated vision transformers.

Authors

  • Amirhossein Vahidi
  • Hesam Asadollahzadeh
  • Navid Akhavan Attar
  • Marie Moullet
  • Kevin Ly
  • Xingyi Yang
  • Mohammad Lotfollahi

Paper Information

  • arXiv ID: 2602.09001v1
  • Categories: cs.LG
  • Published: February 9, 2026
  • PDF: Download PDF