Mixture of Experts: Getting 671B Quality at 37B Cost
How MoE gives you a model with 671B parameters that needs roughly the per-token compute of a 37B dense model. Sparse activation, expert routing, EPLB load balancing, and the failure modes that bite production teams.
Why MoE Exists
Scaling dense transformers is straightforward but expensive: double the parameters, double the compute per token. Mixture of Experts breaks this coupling. In an MoE model, the feedforward network (FFN) in each transformer layer is replaced by N expert networks. A learned router sends each token to only the top-K experts. Result: a model with the parameters of a large model but the inference compute of a small one.
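DeepSeek-V3 has 671B parameters. Its per-token inference compute matches a ~37B dense model: every token activates ~37B parameters, and the rest stay dormant for that token. Memory is a different story, because all 671B parameters still have to be loaded.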
The Routing Mechanism
The router is a small linear layer that takes each token's hidden state and outputs a probability distribution over all N experts. Top-K selection (K=1 or K=2 in many models, K=8 in DeepSeek-V3) determines which experts process each token, and their outputs are combined using the router probabilities as weights. Experts that win the routing competition get the token; others don't. That is the entire mechanism: simple in description, subtle in training.
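# Simplified MoE routing for one token (NumPy; toy sizes, random weights)
import numpy as np
rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2
token_emb = rng.normal(size=d_model)                # hidden state of one token
W_router = rng.normal(size=(d_model, n_experts))
expert_W = rng.normal(size=(n_experts, d_model, d_model))
experts = [lambda x, W=W: np.tanh(x @ W) for W in expert_W]  # stand-in FFNs
router_logits = token_emb @ W_router                # [n_experts]
router_probs = np.exp(router_logits - router_logits.max())
router_probs /= router_probs.sum()                  # softmax: expert probabilities
top_k_idx = np.argsort(router_probs)[-k:]           # select top-k experts
top_k_weights = router_probs[top_k_idx]             # weights for weighted sum
# Expert computation: only the selected experts run
output = np.zeros(d_model)
for i, weight in zip(top_k_idx, top_k_weights):
    output += weight * experts[i](token_emb)        # weighted combination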
What Makes DeepSeek-V3 Different
DeepSeek-V3 pushes MoE to an extreme: 256 routed experts plus 1 shared expert per MoE layer. The shared expert always activates and handles common knowledge. For each token, 8 routed experts are selected from the 256 (top-8 of 256). This creates extreme specialisation, with routed experts learning very narrow domains while the shared expert handles universal patterns.
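The shared-plus-routed layout can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DeepSeek-V3's implementation: `make_expert` and `moe_layer` are hypothetical helpers, the sizes are toy values, and the softmax gate with renormalisation over the selected experts is one common choice (production models differ in their gating details).

```python
import numpy as np

def make_expert(rng, d_model, d_hidden):
    """Hypothetical stand-in for an expert: a tiny two-layer ReLU FFN."""
    w_in = rng.normal(scale=0.1, size=(d_model, d_hidden))
    w_out = rng.normal(scale=0.1, size=(d_hidden, d_model))
    return lambda x: np.maximum(x @ w_in, 0.0) @ w_out

def moe_layer(x, w_router, routed, shared, k):
    """x: [d_model] hidden state of one token. Returns (output, chosen expert ids)."""
    logits = x @ w_router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]              # ids of the k highest-scoring routed experts
    gates = probs[top] / probs[top].sum()     # assumption: renormalise over the selected experts
    out = np.zeros_like(x)
    for expert in shared:                     # shared experts run for every token, ungated
        out = out + expert(x)
    for idx, gate in zip(top, gates):         # routed experts run only if selected
        out = out + gate * routed[idx](x)
    return out, top

rng = np.random.default_rng(0)
d_model, d_hidden = 32, 64
n_routed, n_shared, k = 16, 1, 4              # toy sizes, far smaller than any real model
routed = [make_expert(rng, d_model, d_hidden) for _ in range(n_routed)]
shared = [make_expert(rng, d_model, d_hidden) for _ in range(n_shared)]
w_router = rng.normal(size=(d_model, n_routed))
x = rng.normal(size=d_model)
y, chosen = moe_layer(x, w_router, routed, shared, k)
print(y.shape, sorted(chosen.tolist()))
```

Only `k + n_shared` expert FFNs execute per token, which is where the gap between total and active parameters comes from.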
| Model | Total Params | Active Params | Expert Config | Context |
|---|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B | 8 experts, top-2 | 32K |
| DeepSeek-V3 | 671B | 37B | 256 routed + 1 shared, top-8 | 128K |
| Gemma 4 26B | 26B | ~7B | 8 experts, top-2 | 128K |
| Grok-1 | 314B | 86B | 8 experts, top-2 | 8K |
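To make the sparsity in the table concrete, the short script below computes the fraction of parameters active per token, using only the total and active figures from the table. The helper name `active_fraction` is just for illustration.

```python
# (total params, active params) in billions, copied from the table above
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671.0, 37.0),
    "Gemma 4 26B": (26.0, 7.0),   # active figure is approximate in the table
    "Grok-1": (314.0, 86.0),
}

def active_fraction(total_b, active_b):
    """Share of the model's parameters that participate in one token's forward pass."""
    return active_b / total_b

for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} of parameters active per token")
```

DeepSeek-V3 comes out at roughly 5.5%, while the top-2-of-8 models sit near 27%, which is why it is the outlier in the table.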
Production Failure Modes
- Expert collapse: router sends >80% of tokens to 1-2 experts. Fix: auxiliary load-balancing loss during training. (Router z-loss is a separate term that keeps router logits small for stability; it does not balance load by itself.)
- Load imbalance: uneven expert utilization causes GPU compute stragglers. Fix: EPLB (Expert Parallelism Load Balancer), supported in vLLM, replicates hot experts dynamically.
- Token dropping: expert capacity exceeded, tokens dropped silently. Fix: capacity factor ≥1.25, log dropped token rate.
- Router oscillation: training instability where token-to-expert assignments flip rapidly. Fix: lower router learning rate, noisy top-k gating.
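For the training-side fixes, here is a minimal sketch of two commonly used router losses, assuming top-1 routing. `load_balance_loss` follows the Switch Transformer form N · Σ f_i · P_i, and `router_z_loss` is the mean squared log-sum-exp of the router logits. The function names and the synthetic logits are illustrative, and real training code would compute these on framework tensors so gradients flow.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def load_balance_loss(logits):
    """Switch-style auxiliary loss for top-1 routing: N * sum_i f_i * P_i (about 1.0 when balanced)."""
    n_tokens, n_experts = logits.shape
    probs = softmax(logits)
    assigned = probs.argmax(axis=-1)
    f = np.bincount(assigned, minlength=n_experts) / n_tokens  # fraction of tokens sent to each expert
    p = probs.mean(axis=0)                                     # mean router probability per expert
    return n_experts * float(np.sum(f * p))

def router_z_loss(logits):
    """Mean squared log-sum-exp of the logits; penalises large router logits."""
    m = logits.max(axis=-1)
    lse = m + np.log(np.exp(logits - m[:, None]).sum(axis=-1))
    return float(np.mean(lse ** 2))

rng = np.random.default_rng(0)
balanced = rng.normal(size=(4096, 8))   # synthetic logits with no preferred expert
collapsed = balanced.copy()
collapsed[:, 0] += 5.0                  # bias every token toward expert 0

print("balanced :", round(load_balance_loss(balanced), 3))
print("collapsed:", round(load_balance_loss(collapsed), 3))
print("z-loss   :", round(router_z_loss(balanced), 3))
```

The balanced case lands near the minimum of 1.0, and the collapsed case is several times larger, which is the signal the auxiliary term adds to the training objective.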
Serving MoE in Production
Three things change when you serve MoE vs. dense: memory layout (all expert weights must fit in GPU memory — for very large MoE this means expert parallelism across GPUs), batching sensitivity (MoE needs large batches to keep experts busy — continuous batching is essential), and load balancing (EPLB, now in vLLM, dynamically replicates overloaded experts and redistributes routing at serve time).
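The idea of replicating overloaded experts can be illustrated with a toy greedy heuristic. This is not the algorithm used by vLLM or by any EPLB implementation; `plan_replicas`, the load numbers, and the slot count are all assumptions made for the sketch.

```python
import heapq

def plan_replicas(expert_load, n_slots):
    """Greedy allocation: every expert gets one slot, then each spare slot goes to
    the expert whose current load per replica is highest."""
    n_experts = len(expert_load)
    if n_slots < n_experts:
        raise ValueError("need at least one slot per expert")
    replicas = [1] * n_experts
    heap = [(-load, i) for i, load in enumerate(expert_load)]  # max-heap via negated load
    heapq.heapify(heap)
    for _ in range(n_slots - n_experts):
        _, i = heapq.heappop(heap)
        replicas[i] += 1
        heapq.heappush(heap, (-expert_load[i] / replicas[i], i))
    return replicas

# Hypothetical tokens-per-expert counts observed over a serving window
load = [900, 100, 120, 80, 400, 90, 110, 100]
replicas = plan_replicas(load, n_slots=12)
per_replica = [l / r for l, r in zip(load, replicas)]
print(replicas)          # [4, 1, 1, 1, 2, 1, 1, 1]
print(max(per_replica))  # 225.0
```

With four spare slots, the hottest expert's per-replica load drops from 900 to 225, which is the straggler reduction that replication is after.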
Health metrics to monitor: per-expert utilization (target: no expert >40% of tokens), routing entropy (high = healthy), dropped token rate (target: <1%).
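These three metrics are cheap to compute from a log of routing decisions. The sketch below assumes you can export one expert id per routed (token, slot) pair; `routing_health` is a hypothetical helper, and the dropped-token figure assumes a fixed per-expert capacity derived from a capacity factor.

```python
import numpy as np

def routing_health(assignments, n_experts, capacity=None):
    """assignments: 1-D int array with one expert id per routed (token, slot) pair."""
    counts = np.bincount(assignments, minlength=n_experts)
    share = counts / counts.sum()
    nonzero = share[share > 0]
    entropy = float(-(nonzero * np.log(nonzero)).sum())
    dropped = 0.0
    if capacity is not None:
        # tokens beyond an expert's capacity would be dropped by a capacity-limited dispatcher
        dropped = float(np.maximum(counts - capacity, 0).sum() / counts.sum())
    return {
        "max_share": float(share.max()),                               # busiest expert's share of tokens
        "normalised_entropy": float(entropy / np.log(n_experts)),      # 1.0 means perfectly uniform
        "dropped_rate": dropped,
    }

# Skewed example: expert 0 receives half of all tokens
assignments = np.array([0] * 50 + [1] * 20 + [2] * 20 + [3] * 10)
capacity = int(np.ceil(1.25 * len(assignments) / 4))   # capacity factor 1.25 gives 32 per expert
report = routing_health(assignments, n_experts=4, capacity=capacity)
print(report)
```

In this example the busiest expert takes 50% of tokens and 18% of tokens would be dropped, so both the utilization and dropped-token targets above would fire.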
When to Use MoE
- Need >30B-quality responses but <30B inference cost → MoE is the answer
- Running on a single GPU → use dense (MoE needs memory for all expert weights even if most are dormant per token)
- Latency-critical real-time API → dense has less routing overhead
- Training from scratch at scale → MoE is likely worth the engineering complexity at 70B+ parameter targets
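The memory point in the list above is easy to check with back-of-the-envelope arithmetic. The sketch below counts weights only, and the 80 GB GPU size is an assumption for illustration; KV cache, activations, and framework overhead all add to the real footprint.

```python
import math

def weight_memory_gb(total_params_b, bytes_per_param):
    """Weights-only memory in GB (1e9 bytes) for a model with total_params_b billion parameters."""
    return total_params_b * bytes_per_param

def min_gpus_for_weights(total_params_b, bytes_per_param, gpu_memory_gb):
    """Lower bound on GPU count needed just to hold the weights."""
    return math.ceil(weight_memory_gb(total_params_b, bytes_per_param) / gpu_memory_gb)

GPU_MEMORY_GB = 80  # assumption: one 80 GB accelerator

for name, total_b in [("DeepSeek-V3", 671.0), ("Mixtral 8x7B", 46.7)]:
    for fmt, nbytes in [("FP8", 1), ("BF16", 2)]:
        gb = weight_memory_gb(total_b, nbytes)
        gpus = min_gpus_for_weights(total_b, nbytes, GPU_MEMORY_GB)
        print(f"{name} {fmt}: {gb:.0f} GB of weights, at least {gpus} GPU(s)")
```

Even though only ~37B parameters are active per token, DeepSeek-V3's 671B weights need at least 9 such GPUs at one byte per parameter, which is why the active-parameter count tells you about compute but not about hardware footprint.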
→ Interactive: The MoE Architecture module in Systems Lab has interactive routing diagrams and failure mode walkthroughs.