[Paper] EMO: Pretraining Mixture of Experts for Emergent Modularity

Published: May 7, 2026 at 01:59 PM EDT

Source: arXiv - 2605.06663v1

Overview

The paper EMO: Pretraining Mixture of Experts for Emergent Modularity tackles a long‑standing pain point of large language models (LLMs): they are monolithic, so the entire model must be loaded even when a downstream task needs only a narrow slice of its knowledge (e.g., code generation or math reasoning). EMO proposes a Mixture‑of‑Experts (MoE) pre‑training recipe under which experts group into coherent, domain‑specific modules, so developers can load and run only the relevant subset at inference time with a small loss in quality (about 1 % absolute accuracy when 25 % of the experts are kept).

Key Contributions

Emergent Modularity without Hand‑crafted Priors – Introduces a simple training constraint that forces tokens from the same document to draw experts from a shared pool, letting domain‑level expert groups form organically.
Scalable Pre‑training – Trains an MoE with 1 B active parameters (14 B total) on 1 T tokens; with all experts in use, it matches the performance of conventional MoEs.
Selective Expert Activation – Demonstrates that keeping only 25 % of the experts costs about 1 % absolute accuracy, and keeping 12.5 % costs about 3 %, whereas standard MoEs degrade sharply under the same pruning.
Semantic Expert Specialization – Shows that EMO’s expert subsets specialize in high‑level semantic domains (e.g., mathematics, programming) rather than the low‑level syntactic patterns typical of classic MoEs.
Memory‑Efficient Deployment Blueprint – Provides a concrete path for deploying massive sparse models on devices with limited RAM by loading only the needed expert pool.

Methodology

Document‑Level Expert Pooling – During pre‑training, each input document is assigned a shared expert pool (a small random subset of all experts). All tokens in that document can only route to experts inside this pool. Different documents receive different pools, encouraging the model to discover which experts are best suited for a given domain.
Standard MoE Routing + Pool Constraint – The usual top‑k routing (e.g., top‑2 experts per token) is kept, but the candidate list is intersected with the document’s pool. This adds negligible overhead while enforcing the grouping bias.
Training Regime – The model is trained on 1 T tokens using the same objectives as typical language‑model pre‑training (next‑token prediction). No extra supervision about domains or tasks is required; the document boundaries act as the only signal.
Inference Flexibility – At test time, a user can either (a) run the full model, (b) specify a domain and load only the corresponding expert pool, or (c) arbitrarily prune a percentage of experts. The routing mechanism automatically falls back to the available experts.
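
The pool constraint described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (sample_pool, route_token), the pool size, and the choice to renormalize the softmax over the selected experts are all assumptions made for the example. The same routing function also shows the fallback behavior at inference: if fewer than k experts are available, it uses however many remain.

```python
import math
import random


def sample_pool(num_experts, pool_size, rng):
    """Draw a random expert pool for one document (hypothetical helper)."""
    return sorted(rng.sample(range(num_experts), pool_size))


def route_token(logits, pool, k=2):
    """Pick the top-k experts for one token, restricted to `pool`.

    Returns (expert_ids, weights). The weights are a softmax over the
    selected experts' logits only (an assumption of this sketch).
    """
    if not pool:
        raise ValueError("expert pool must not be empty")
    k = min(k, len(pool))  # fall back when fewer than k experts are available
    # Rank only the experts inside the pool by their router logit.
    chosen = sorted(pool, key=lambda e: logits[e], reverse=True)[:k]
    # Numerically stable softmax over the chosen logits.
    top = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - top) for e in chosen]
    total = sum(exps)
    return chosen, [x / total for x in exps]


rng = random.Random(0)
num_experts = 16
# One pool per document; every token of the document routes inside it.
pool = sample_pool(num_experts, pool_size=4, rng=rng)
# Random stand-ins for the router logits of a 5-token document.
doc_logits = [[rng.gauss(0.0, 1.0) for _ in range(num_experts)] for _ in range(5)]
routes = [route_token(tok, pool, k=2) for tok in doc_logits]
```

Because the restriction only filters the candidate list before the usual top‑k step, the router itself is unchanged, which is consistent with the small overhead the summary reports.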

Results & Findings

Setting	Average accuracy on standard LM benchmarks (change vs. full model)	Assessment
Full EMO (all experts)	≈ baseline MoE performance	–
25 % experts kept	≈ 1 % absolute loss	Minimal
12.5 % experts kept	≈ 3 % absolute loss	Still usable
Standard MoE with same pruning	> 10 % loss, often catastrophic	Poor
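
The summary does not say how the retained experts are chosen for a given domain. One plausible approach, shown here purely as an assumption, is to run a small sample of domain text through the router, count how often each expert is selected, and keep the most frequently used fraction. The function name select_experts and the toy routing trace are hypothetical.

```python
from collections import Counter


def select_experts(routed_ids, num_experts, keep_fraction):
    """Keep the most frequently used experts on a domain sample.

    routed_ids: iterable of per-token lists of expert ids chosen by the router.
    Returns a sorted list of the expert ids to keep.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    counts = Counter(e for token in routed_ids for e in token)
    n_keep = max(1, round(num_experts * keep_fraction))
    # Rank all experts by usage; ties go to the lower id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:n_keep])


# Toy trace: 8 experts, tokens from a "code" sample mostly hit experts 2 and 5.
trace = [[2, 5], [5, 2], [2, 7], [5, 2], [1, 5]]
kept = select_experts(trace, num_experts=8, keep_fraction=0.25)
print(kept)
```

If experts specialize by domain as the results suggest, the usage counts on a domain sample should be concentrated on a few experts, so a frequency cut of this kind discards little that the domain needs.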

Additional observations

Semantic clustering – Probing the learned experts reveals clear clusters aligned with high‑level topics (math, code, scientific text).
Stability – The emergent modularity appears early in training (after ~200 B tokens) and persists, indicating the constraint is robust.
Compute overhead – The pool constraint adds < 2 % extra FLOPs compared to a vanilla MoE.

Practical Implications

Deploy on Edge / Low‑Memory Servers – Companies can ship a single 14 B‑parameter MoE model but load only the 3–4 B‑parameter expert pool relevant to a SaaS feature (e.g., code completion), cutting RAM usage by roughly 75 %.
Domain‑Specific Fine‑Tuning Becomes Cheaper – Instead of fine‑tuning a full model for each niche, developers can fine‑tune just the expert pool that already specializes in that domain, accelerating iteration cycles.
Composable AI Services – Multiple expert pools can be combined on‑the‑fly to build multi‑domain pipelines (e.g., a chatbot that needs both math reasoning and code generation) without re‑loading the entire model.
Cost‑Effective Inference – Cloud providers can charge per expert used, offering tiered pricing (basic vs. premium domains) while keeping the memory footprint low because fewer experts need to be loaded per request.
Simplified Model Management – A single checkpoint replaces a zoo of task‑specific models, reducing versioning headaches and storage overhead.
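
The memory savings mentioned above can be checked with back‑of‑the‑envelope arithmetic. The sketch below assumes that non‑expert layers (attention, embeddings) always stay loaded and that only expert weights scale with the fraction kept. The 14 B total comes from the summary; the 0.5 B shared‑parameter figure and the 2 bytes per parameter (bf16) are illustrative assumptions, not numbers from the paper.

```python
def loaded_params(total, shared, keep_fraction):
    """Parameters resident in memory when only a fraction of experts is loaded."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    expert = total - shared  # parameters that live inside expert layers
    return shared + keep_fraction * expert


TOTAL = 14e9          # total parameters, from the summary
SHARED = 0.5e9        # hypothetical size of the always-loaded non-expert layers
BYTES_PER_PARAM = 2   # assumes bf16 weights

for frac in (1.0, 0.25, 0.125):
    n = loaded_params(TOTAL, SHARED, frac)
    gb = n * BYTES_PER_PARAM / 1e9
    print(f"{frac:6.1%} of experts: {n / 1e9:.2f} B params, {gb:.2f} GB")
```

Under these assumed numbers, keeping 25 % of the experts leaves about 3.9 B parameters resident, which falls in the 3–4 B range quoted above. The saving comes out slightly under 75 % because the shared layers are loaded regardless of how many experts are kept.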

Limitations & Future Work

Document Boundary Assumption – EMO relies on the notion that tokens within a document share a domain; highly heterogeneous documents may dilute expert specialization.
Static Expert Pools – The pool is chosen randomly at training time and stays fixed; dynamic pool selection based on input content could further improve efficiency.
Scalability to Hundreds of Billions of Parameters – Experiments stop at 14 B total parameters; it remains an open question how the emergent modularity behaves at the scale of 100 B+ models.
Evaluation on Downstream Tasks – The paper focuses on language‑model benchmarks; real‑world downstream evaluations (e.g., code generation APIs, retrieval‑augmented QA) would solidify the practical claims.
Security & Fairness – Partitioning experts may unintentionally isolate bias mitigation mechanisms; future work should explore how modularity interacts with responsible AI safeguards.

Authors

Ryan Wang
Akshita Bhagia
Sewon Min

Paper Information

arXiv ID: 2605.06663v1
Categories: cs.CL
Published: May 7, 2026
