[Paper] Sparsely gated tiny linear experts

Published: June 5, 2026 at 12:06 PM EDT

Source: arXiv - 2606.07414v1

Overview

Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonstrate that further increasing sparsity by shrinking each expert to consist of a single neuron and selecting a tiny fraction of many available neurons can improve compute efficiency and interpretability. Counterintuitively, the key to achieving both is removing the nonlinearity typically applied to the experts, resulting in a network of sparsely gated linear neurons (sgatlin). In an isoflop comparison, we find that replacing all transformer feedforward layers with sgatlin improves perplexity in language models across different compute budgets. At the same time, the sparsity and linearity of the resulting feedforward circuits present new opportunities for model interpretability. In a small-scale case study, we demonstrate that feedforward circuits in sgatlin can be interpreted without having to train additional replacement models. We find that they form semantically structured clusters and are causally implicated in factual recall. Our findings paint a possible path towards compute-efficient and interpretable transformer feedforward layers.
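To make the idea of single-neuron linear experts concrete, here is a minimal sketch in plain Python. The abstract does not specify the gating rule, so everything about it is an assumption made for illustration: a linear router, top-k selection, and a softmax over the selected logits. The names `sparse_linear_layer`, `router`, `w_in`, `w_out`, and `k` are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sparse_linear_layer(x, router, w_in, w_out, k):
    """Illustrative layer of sparsely gated single-neuron linear experts.

    x:      input vector of length d
    router: n rows of length d, one gate-logit vector per neuron (assumed)
    w_in:   n rows of length d, input weights of each neuron
    w_out:  n rows, output weights of each neuron
    k:      number of neurons kept active (assumed top-k rule)
    """
    logits = [dot(r, x) for r in router]
    # Keep only the k neurons with the largest gate logits.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    active = order[:k]
    # Softmax restricted to the selected neurons (an assumption, not the paper's rule).
    top = max(logits[i] for i in active)
    exps = {i: math.exp(logits[i] - top) for i in active}
    total = sum(exps.values())
    y = [0.0] * len(w_out[0])
    for i in active:
        gate = exps[i] / total
        # Linear expert: the pre-activation is used as is, with no nonlinearity.
        h = dot(w_in[i], x)
        for j, w in enumerate(w_out[i]):
            y[j] += gate * h * w
    return y, active


# Two neurons, one kept active: the router selects neuron 0 for this input.
y, active = sparse_linear_layer(
    x=[2.0, 1.0],
    router=[[1.0, 0.0], [0.0, 1.0]],
    w_in=[[1.0, 1.0], [3.0, 3.0]],
    w_out=[[0.5, -1.0], [9.0, 9.0]],
    k=1,
)
print(active, y)  # [0] [1.5, -3.0]
```

In this sketch only `k` of the `n` neurons contribute to the output, and each active neuron adds a gate-weighted, purely linear term. That lets the output be split into per-neuron contributions by inspection, which is the kind of property the abstract associates with interpretability.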

Key Contributions

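The abstract describes the following contributions:

  • Shrinks each mixture-of-experts (MoE) expert to a single neuron and selects only a tiny fraction of the many available neurons.
  • Removes the nonlinearity typically applied to the experts, yielding a network of sparsely gated linear neurons (sgatlin).
  • Reports improved language-model perplexity in an isoflop comparison across different compute budgets when all transformer feedforward layers are replaced with sgatlin.
  • Shows in a small-scale case study that sgatlin feedforward circuits can be interpreted without training additional replacement models, that they form semantically structured clusters, and that they are causally implicated in factual recall.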

Methodology

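The abstract outlines two evaluations: an isoflop comparison in which all transformer feedforward layers are replaced with sgatlin and language-model perplexity is compared across different compute budgets, and a small-scale interpretability case study of the resulting feedforward circuits. Please refer to the full paper for the detailed methodology.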

Practical Implications

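According to the abstract, the findings point to a possible path towards transformer feedforward layers that are both compute-efficient and interpretable, with circuits that can be examined without training additional replacement models.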

Authors

  • Simon Schug

Paper Information

  • arXiv ID: 2606.07414v1
  • Categories: cs.LG, cs.NE
  • Published: June 5, 2026