Mixtral of Experts

Published: December 26, 2025 at 05:40 PM EST
1 min read
Source: Dev.to

Overview

Mixtral 8x7B is a sparse Mixture‑of‑Experts (SMoE) language model: each layer contains eight feed‑forward "expert" blocks, and a small router network selects two of them to process every token. Because the selected pair can change at each layer and each time step, the model gains the capacity of a much larger network while only running a fraction of its parameters per token, which is how it stays both fast and capable.

Architecture

  • Sparse Mixture of Experts: The model holds roughly 47 B parameters in total, but each token only uses about 13 B active parameters during inference, keeping computational cost and latency low.
  • Routing: A small router dynamically picks two experts per token, allowing the model to adapt its computation on the fly (see the sketch below).
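The routing logic is easy to picture in code. Below is a minimal, illustrative sketch of a top‑2 sparse MoE layer in PyTorch; the class name, dimensions, and the simple feed‑forward experts are placeholders (Mixtral's real experts use SwiGLU blocks), so this is a toy version of the idea rather than the model's actual implementation. A router scores all eight experts for each token, only the two highest‑scoring experts run, and their outputs are mixed with the renormalized router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoELayer(nn.Module):
    """Toy sparse Mixture-of-Experts layer: score all experts per token,
    run only the top-2, and combine their outputs with router weights."""

    def __init__(self, d_model: int = 64, d_ff: int = 256,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward block per expert (simplified stand-in).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router is a single linear layer producing one logit per expert.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model) -- batch/sequence dims flattened for simplicity.
        logits = self.router(x)                                   # (n_tokens, n_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)  # pick the 2 best experts
        weights = F.softmax(weights, dim=-1)                       # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Toy usage: 10 tokens, each routed to 2 of 8 experts.
tokens = torch.randn(10, 64)
layer = Top2MoELayer()
print(layer(tokens).shape)  # torch.Size([10, 64])
```

In the full model the same layer structure repeats at every depth, so the pair of experts a token visits can differ from layer to layer and from step to step.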

Training and Performance

  • Trained for very long contexts, handling up to 32 k tokens.
  • Matches or outperforms much larger models such as Llama 2 70B and GPT‑3.5 on most benchmarks, with especially strong results in math, coding, and multilingual tasks.
  • The instruction‑tuned variant, Mixtral 8x7B – Instruct, surpasses several popular chat models in human evaluations.

Both the base and instruction‑tuned versions are released under the Apache 2.0 license, enabling the community to experiment with them.

Further Reading

Mixtral of Experts – comprehensive review on Paperium.net.

This analysis and review was primarily generated and structured by an AI. The content is provided for informational and quick‑review purposes.
