Mixtral and Mixture of Experts: Activating 13B Parameters to Deliver 70B Quality
Mistral AI's 2024 MoE paper: how sparse expert routing lets a model hold 47B parameters but use only 13B per token, matching GPT-3.5 on most benchmarks at a fraction of a comparable dense model's inference cost.
A 70B parameter model is expensive to serve. Every token generation activates all 70 billion parameters. What if you could have 70B-worth of quality but only activate a fraction of those parameters per token?
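In December 2023, Mistral AI released Mixtral 8x7B (the accompanying paper followed in January 2024): about 47B total parameters, only about 13B active per token. Mistral reports that it matches or outperforms both Llama 2 70B and GPT-3.5 on most benchmarks while spending only a fraction of a dense 70B model's compute per token. GPT-4 is widely rumoured, though not confirmed by OpenAI, to use a similar Mixture of Experts architecture internally.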
How mixture of experts works
In a standard dense Transformer, each layer has one feed-forward network (FFN) that processes every token. In an MoE layer, there are multiple FFNs (experts), and a small router network decides which experts to activate for each token.
Standard: token → single FFN → output
MoE:      token → router → [select top-2 of 8 experts]
                              ↓           ↓
                           Expert 3  +  Expert 7 → weighted sum → output
          [Experts 1, 2, 4, 5, 6, 8 NOT activated for this token]
Mixtral 8x7B: 47B total params, ~13B active per token
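The routing step in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not Mixtral's actual implementation: the function names (`route_top_k`, `moe_layer`, `make_toy_expert`), the 4-dimensional token, and the toy experts that simply scale their input are all assumptions made for the example. The one detail taken from the Mixtral paper is that the gate weights come from a softmax over the logits of the selected top-2 experts.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalise their gate weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    gates = softmax([router_logits[i] for i in top])
    return top, gates

def moe_layer(token, router_weights, experts, k=2):
    """Run one token through a sparse MoE layer: only k experts are evaluated."""
    # The router is a single linear map: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen, gates = route_top_k(logits, k)
    output = [0.0] * len(token)
    for idx, gate in zip(chosen, gates):
        expert_out = experts[idx](token)  # the other experts never run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen, gates

def make_toy_expert(scale):
    """Hypothetical stand-in for an expert FFN: scales the input vector."""
    return lambda vec: [scale * v for v in vec]

# Fixed logits for 8 experts: indices 2 and 6 (Expert 3 and Expert 7) win.
top, gates = route_top_k([0.1, -0.5, 2.0, 0.3, -1.2, 0.0, 1.0, 0.4])
print(top, [round(g, 3) for g in gates])  # [2, 6] [0.731, 0.269]

# Full layer with a randomly initialised toy router.
random.seed(0)
dim, n_experts = 4, 8
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [make_toy_expert(i + 1) for i in range(n_experts)]
output, chosen, layer_gates = moe_layer([0.5, -1.0, 0.25, 2.0], router_weights, experts)
print(chosen, layer_gates)
```

The point to notice is that the cost of the layer depends on `k`, not on the number of experts: adding more experts grows the parameter count without growing the work done per token.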
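Think of Mixtral as 47B parameters of knowledge capacity with 13B parameters of compute per token. The 8x7B name is misleading: it is not 8 complete 7B models. Each MoE layer holds 8 expert FFNs, while the attention layers and embeddings are shared across experts, which is why the total comes to about 47B rather than 8 × 7B = 56B.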
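The 47B and 13B figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the architecture hyperparameters published for Mixtral 8x7B (model dimension 4096, FFN hidden dimension 14336, 32 layers, 32 query heads and 8 key/value heads of size 128, vocabulary 32000, 8 experts, top-2 routing), a three-matrix SwiGLU-style expert FFN, and untied input and output embeddings. The function name `mixtral_param_counts` is made up for this example.

```python
def mixtral_param_counts(dim=4096, hidden_dim=14336, n_layers=32,
                         n_heads=32, n_kv_heads=8, head_dim=128,
                         vocab_size=32000, n_experts=8, top_k=2):
    """Estimate total and per-token active parameters for a Mixtral-style model."""
    # One expert FFN: three weight matrices (gate, up, down projections).
    expert = 3 * dim * hidden_dim
    # Attention: query and output projections use all heads,
    # key and value projections use the smaller set of KV heads.
    attention = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim
    router = dim * n_experts          # one logit per expert
    norms = 2 * dim                   # two normalisation layers per block
    shared_per_layer = attention + router + norms
    # Input embedding, output head, and the final normalisation layer.
    embeddings = 2 * vocab_size * dim + dim

    total = n_layers * (shared_per_layer + n_experts * expert) + embeddings
    active = n_layers * (shared_per_layer + top_k * expert) + embeddings
    return total, active

total, active = mixtral_param_counts()
print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Almost all of the parameters sit in the expert FFNs, which is why routing to 2 of 8 experts cuts the active count so sharply while the shared attention and embedding weights barely register.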
Expert specialisation
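Experts receive no explicit signal about what to specialise in; whatever division of labour exists emerges purely from the router's learned behaviour. It is tempting to picture one expert for code, another for languages, and another for factual knowledge, but the routing analysis in the Mixtral paper found no obvious specialisation by topic. The patterns it did find were mostly syntactic: for example, indentation tokens in code tended to go to the same experts, and consecutive tokens were often routed to the same expert.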
MoE tradeoffs in production
| Aspect | Dense 13B | Mixtral 8x7B (~13B active) |
|---|---|---|
| Weight memory (fp16) | ~26GB | ~93GB, because all experts must fit in memory |
| Compute per token | 13B params | ~13B params (similar) |
| Knowledge capacity | 13B-equivalent | 47B-equivalent |
| Fine-tuning complexity | Straightforward | Needs load balancing across experts |
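Key tradeoff: MoE models need all expert weights in memory even though only two experts per layer are used for any given token, so they are memory-hungry despite being compute-efficient. At fp16, Mixtral's weights alone exceed a single 80GB GPU, so serving requires multiple GPUs or quantization even for the 'efficient' inference case.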
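The memory side of this tradeoff is simple arithmetic: weight memory is the total parameter count times the bytes per parameter, regardless of how many parameters are active. A minimal sketch follows, assuming a hypothetical helper `weight_memory_gb`, decimal gigabytes, and a total of 46.7B parameters for Mixtral. It counts weights only; the KV cache and activations add to these numbers.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, dtype="fp16"):
    """Memory for model weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

models = {"Dense 13B": 13e9, "Mixtral 8x7B": 46.7e9}
for name, n_params in models.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(n_params, dtype):.1f}GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name}: {sizes}")
```

Under these assumptions the fp16 figure for Mixtral lands above 80GB, while 8-bit or 4-bit weights bring it within reach of a single large accelerator, at whatever quality cost the chosen quantization scheme carries.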