
Mixtral and Mixture of Experts: Activating 13B Parameters to Deliver 70B Quality

Mistral AI's 2024 MoE paper. How sparse expert routing lets models have 47B parameters but only use 13B per token — matching GPT-3.5 at a fraction of the inference cost.

A dense 70B-parameter model is expensive to serve: generating each token runs all 70 billion parameters. What if you could have 70B-worth of quality but only activate a fraction of those parameters per token?

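Mistral AI released Mixtral 8x7B in December 2023, and the accompanying paper followed in January 2024. The model has about 47B total parameters, of which only about 13B are active per token. Mistral reports that it matches or exceeds Llama 2 70B and GPT-3.5 on most benchmarks while using far less compute per token than a dense 70B model. GPT-4 is widely rumoured, though not confirmed, to use a similar Mixture of Experts architecture internally.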

How mixture of experts works

In a standard dense Transformer, each layer has one feed-forward network (FFN) that processes every token. In an MoE layer, there are multiple FFNs (the experts), and a small router network decides which experts to activate for each token.

Standard: token → single FFN → output

MoE: token → router → [select top-2 of 8 experts]
        ↓         ↓
  Expert 3  +  Expert 7  → weighted sum → output
  [Experts 1,2,4,5,6,8 NOT activated for this token]

Mixtral 8x7B: 47B total params, ~13B active per token
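The routing step in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not Mixtral's implementation: the "experts" here are hypothetical toy functions that just scale their input, the router weights are random, and the names top_k_route and moe_layer are made up for this example. The one detail taken from the Mixtral paper is that the softmax is applied over the selected top-2 logits, so the two gate weights sum to 1.

```python
import math
import random

def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k gate weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_weights, experts, k=2):
    """Apply a sparse MoE layer to one token vector x (a list of floats)."""
    # Router: one linear projection from the hidden vector to one logit per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    out = [0.0] * len(x)
    for idx, gate in top_k_route(logits, k):
        y = experts[idx](x)  # only the selected experts are ever evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Toy setup (all values are illustrative assumptions).
random.seed(0)
dim, n_experts = 4, 8
# Hypothetical experts: expert i simply multiplies its input by (i + 1).
experts = [lambda v, s=s: [s * vi for vi in v] for s in range(1, n_experts + 1)]
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]

logits = [sum(w * xi for w, xi in zip(row, token)) for row in router_weights]
routes = top_k_route(logits)
output = moe_layer(token, router_weights, experts)
print("selected experts and gate weights:", routes)
```

Six of the eight experts are never called for this token, which is where the compute saving comes from. In a real model each expert is a full FFN and routing is done per token at every MoE layer.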

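Think of Mixtral as: 47B parameters of knowledge capacity, 13B parameters of compute per token. The 8x7B name is misleading: it is not eight complete 7B models. Each MoE layer holds eight expert FFNs, while the attention layers and embeddings are shared across experts, which is why the total is about 47B rather than 56B.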
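The 47B and 13B figures can be reproduced with some back-of-the-envelope arithmetic. The sketch below uses architecture values from the published Mixtral 8x7B configuration (hidden size 4096, 32 layers, FFN hidden size 14336, 32 query heads and 8 key/value heads of size 128, vocabulary 32000); those numbers are not stated in this article, and the function name is made up for the example. It counts embeddings as "active", which is the usual convention.

```python
# Architecture values from the published Mixtral 8x7B configuration
# (assumed here; they do not appear in the article itself).
DIM = 4096          # hidden size
N_LAYERS = 32
HEAD_DIM = 128
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # key/value heads (grouped-query attention)
FFN_HIDDEN = 14336
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

def mixtral_param_counts():
    """Return (total_params, active_params_per_token)."""
    # Attention: Q and O projections are full size; K and V are smaller under GQA.
    attn = 2 * DIM * N_HEADS * HEAD_DIM + 2 * DIM * N_KV_HEADS * HEAD_DIM
    expert = 3 * DIM * FFN_HIDDEN   # SwiGLU FFN: three weight matrices
    router = DIM * N_EXPERTS        # one linear layer producing 8 logits
    norms = 2 * DIM                 # two RMSNorm weight vectors per layer
    shared_per_layer = attn + router + norms
    # Input embeddings, output head, and the final norm.
    embeddings = 2 * VOCAB * DIM + DIM
    total = N_LAYERS * (shared_per_layer + N_EXPERTS * expert) + embeddings
    active = N_LAYERS * (shared_per_layer + TOP_K * expert) + embeddings
    return total, active

total, active = mixtral_param_counts()
print(f"total:  {total / 1e9:.1f}B")    # total:  46.7B
print(f"active: {active / 1e9:.1f}B")   # active: 12.9B
```

Almost all of the parameters sit in the expert FFNs, so activating 2 of 8 experts cuts the per-token count from roughly 47B to roughly 13B while the shared attention and embedding weights are always used.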

Expert specialisation

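Nothing in training tells an expert what to specialise in: the router and experts are trained end to end, and any division of labour emerges from the router's learned behaviour. It is tempting to assume experts split up by domain (one for code, one for languages, one for factual knowledge), but the routing analysis in the Mixtral paper found no obvious pattern of that kind. Expert assignment looked very similar across sources such as arXiv papers, biology abstracts, and philosophy texts, with only a marginal difference for mathematics. The structure it did find was largely syntactic: indentation tokens in code tended to go to the same experts, and consecutive tokens were often routed to the same expert.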
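One way to check for specialisation yourself is to log which experts the router picks for tokens from different sources and compare the selection frequencies. The sketch below assumes you already have such a log; the function name, the log format, and the tiny example log are all made up for illustration and do not come from any real model.

```python
from collections import Counter

def expert_usage_by_source(assignments, n_experts=8):
    """Map each source label to the fraction of selections going to each expert.

    assignments: iterable of (source_label, [expert indices chosen for one token]).
    """
    counts = {}
    for source, chosen in assignments:
        counts.setdefault(source, Counter()).update(chosen)
    usage = {}
    for source, counter in counts.items():
        total = sum(counter.values())
        usage[source] = [counter[e] / total for e in range(n_experts)]
    return usage

# Made-up routing log: (source, top-2 experts chosen for one token).
log = [
    ("code", [3, 7]),
    ("code", [3, 1]),
    ("prose", [0, 7]),
    ("prose", [2, 5]),
]
usage = expert_usage_by_source(log)
for source, fractions in usage.items():
    print(source, [round(f, 2) for f in fractions])
```

With perfectly balanced routing each expert would receive 1/8 = 0.125 of the selections for every source. Strong domain specialisation would show up as per-source distributions that differ sharply from each other, which needs far more tokens than this toy log to judge.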

MoE tradeoffs in production

Aspect	Dense 13B	Mixtral 8x7B (~13B active)
Total VRAM needed	~26GB (fp16)	~90GB — all experts must fit in memory
Compute per token	13B params	~13B params (similar)
Knowledge capacity	13B-equivalent	47B-equivalent
Fine-tuning complexity	Straightforward	Needs load balancing across experts

Key tradeoff: MoE models need all expert weights in memory even though only a few experts are used per token, so they are memory-hungry despite being compute-efficient. At fp16 the weights alone exceed a single 80GB GPU, so serving typically means multiple GPUs or quantisation, even for the 'efficient' inference case.
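The memory figures in the table follow from parameter count times bytes per parameter. The sketch below is a rough estimate for weights only: it ignores the KV cache, activations, and framework overhead, and the 80 GB GPU size and the helper names are assumptions chosen for illustration.

```python
import math

def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def min_gpus_for_weights(n_params, bytes_per_param, gpu_gb=80):
    """Lower bound on GPU count; real deployments need extra headroom."""
    return math.ceil(weight_memory_gb(n_params, bytes_per_param) / gpu_gb)

models = [("dense 13B", 13e9), ("Mixtral 8x7B", 47e9)]
precisions = [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]

for name, params in models:
    for precision, nbytes in precisions:
        gb = weight_memory_gb(params, nbytes)
        gpus = min_gpus_for_weights(params, nbytes)
        print(f"{name:13s} {precision:5s} {gb:6.1f} GB  >= {gpus} x 80 GB GPU")
```

At fp16 the dense 13B model needs about 26 GB and Mixtral about 94 GB for weights, even though both do a similar amount of compute per token. Quantising Mixtral to 4 bits brings the weights down to roughly 24 GB, at some cost in quality.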

