[Paper] Sparsely gated tiny linear experts

Published: 5 days ago (June 5, 2026 at 12:06 PM EDT)

2 min read

Source: arXiv

Source: arXiv - 2606.07414v1

Overview

Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonstrate that further increasing sparsity by shrinking each expert to consist of a single neuron and selecting a tiny fraction of many available neurons can improve compute efficiency and interpretability. Counterintuitively, the key to achieving both is removing the nonlinearity typically applied to the experts, resulting in a network of sparsely gated linear neurons (sgatlin). In an isoflop comparison, we find that replacing all transformer feedforward layers with sgatlin improves perplexity in language models across different compute budgets. At the same time, the sparsity and linearity of the resulting feedforward circuits present new opportunities for model interpretability. In a small-scale case study, we demonstrate that feedforward circuits in sgatlin can be interpreted without having to train additional replacement models. We find that they form semantically structured clusters and are causally implicated in factual recall. Our findings paint a possible path towards compute-efficient and interpretable transformer feedforward layers.

Key Contributions

This paper presents research in the following areas:

cs.LG
cs.NE

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.LG.

Authors

Simon Schug

Paper Information

arXiv ID: 2606.07414v1
Categories: cs.LG, cs.NE
Published: June 5, 2026
PDF: Download PDF

[Paper] Sparsely gated tiny linear experts

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] How reliable are LLMs when it comes to playing dice?

[Paper] MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

[Paper] Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

[Paper] Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization