Mixture of Experts: Getting 671B Quality at 37B Cost

How MoE gives you a model with 671B parameters whose per-token compute is close to that of a 37B dense model. Covers sparse activation, expert routing, EPLB load balancing, and the failure modes that bite production teams.

Why MoE Exists

Scaling dense transformers is straightforward but expensive: double the parameters and you roughly double the compute per token. Mixture of Experts breaks this coupling. In an MoE model, the feedforward network (FFN) in a transformer layer is replaced by N expert networks, and a learned router sends each token to only the top-K of them. The result is a model with the parameter count of a large model but per-token compute closer to that of a small one.

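DeepSeek-V3 has 671B total parameters, but each token activates only about 37B of them, so its per-token compute is close to that of a ~37B dense model. The remaining parameters are idle for that token, although they still have to be held in memory.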
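The gap between compute and memory can be made concrete with back-of-the-envelope arithmetic. The sketch below takes the 671B and 37B figures from the text. The bytes-per-parameter values (1 for FP8, 2 for BF16) and the rule of thumb of about 2 FLOPs per active parameter per token are assumptions for illustration, and the estimate ignores KV cache and activations.

```python
# Back-of-the-envelope numbers for a sparse model (illustrative only).
TOTAL_PARAMS = 671e9    # total parameters, from the text
ACTIVE_PARAMS = 37e9    # parameters activated per token, from the text


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that a single token actually touches."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def forward_flops_per_token(active: float) -> float:
    """Assumed rule of thumb: about 2 FLOPs per active parameter per token."""
    return 2 * active


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
fp8_gb = weight_memory_gb(TOTAL_PARAMS, 1)    # assumes 1 byte per parameter
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)   # assumes 2 bytes per parameter
flops = forward_flops_per_token(ACTIVE_PARAMS)

print(f"active fraction: {fraction:.1%}")
print(f"weights: {fp8_gb:.0f} GB (FP8), {bf16_gb:.0f} GB (BF16)")
print(f"forward compute: {flops:.2e} FLOPs per token")
```

Under these assumptions, compute scales with the 37B active parameters while weight memory scales with all 671B, which is why serving cost is not purely a function of active parameters.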

The Routing Mechanism

The router is a small linear layer that takes a token's hidden state and outputs a probability distribution over all N experts. Top-K selection (commonly K=1 or K=2, larger in fine-grained designs such as DeepSeek-V3) determines which experts process each token. Experts that win the routing competition get the token; the others do no work for it. This is the entire mechanism: simple in description, subtle in training.

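# Simplified MoE routing for a single token (NumPy sketch with toy experts)
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2
token_emb = rng.normal(size=d_model)               # one token's hidden state
W_router = rng.normal(size=(d_model, n_experts))   # router weights
experts = [lambda x, W=rng.normal(size=(d_model, d_model)): np.tanh(x @ W)
           for _ in range(n_experts)]              # toy stand-ins for expert FFNs

router_logits = token_emb @ W_router               # shape: [n_experts]
router_probs = np.exp(router_logits - router_logits.max())
router_probs /= router_probs.sum()                 # softmax over experts
top_k_idx = np.argsort(router_probs)[-k:]          # indices of the top-2 experts
top_k_weights = router_probs[top_k_idx]            # weights for the weighted sum

# Expert computation: weighted combination of the selected experts' outputs
output = np.zeros(d_model)
for i, weight in zip(top_k_idx, top_k_weights):
    output += weight * experts[i](token_emb)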

What Makes DeepSeek-V3 Different

DeepSeek-V3 pushes MoE to an extreme: each MoE layer has 256 routed experts plus 1 shared expert. The shared expert processes every token and handles patterns common to all inputs, while the router selects 8 of the 256 routed experts per token (top-8 of 256). With this many small experts, each routed expert can specialise narrowly while the shared expert covers universal patterns.
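The split between always-on shared experts and selectively routed experts can be sketched in a few lines. This is a minimal illustration, not DeepSeek-V3's implementation: the names `moe_layer` and `scale_expert` are hypothetical, the experts are toy functions that scale their input, and the softmax gate with renormalisation over the selected experts is an assumed simplification of the real gating function.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale_expert(factor: float) -> Expert:
    """Toy stand-in for an expert FFN: multiplies the input by a constant."""
    return lambda x: [factor * v for v in x]


def moe_layer(
    x: Vector,
    router_logits: Sequence[float],
    routed: Sequence[Expert],
    shared: Sequence[Expert],
    k: int,
) -> Tuple[Vector, List[int]]:
    """Sum of all shared experts plus a weighted sum of the top-k routed experts."""
    probs = softmax(router_logits)
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top_k)  # renormalise over the selected experts
    out = [0.0] * len(x)
    for expert in shared:                # shared experts see every token
        out = [o + v for o, v in zip(out, expert(x))]
    for i in top_k:                      # routed experts see only the tokens they win
        weight = probs[i] / norm
        out = [o + weight * v for o, v in zip(out, routed[i](x))]
    return out, top_k


routed_experts = [scale_expert(i + 1) for i in range(4)]
shared_experts = [scale_expert(1.0)]
output, chosen = moe_layer(
    [1.0, 2.0], [0.1, 2.0, 1.0, 0.5], routed_experts, shared_experts, k=2
)
print(chosen)
print(output)
```

In this toy run the router picks experts 1 and 2, the two with the highest logits, and the shared expert contributes regardless of the routing decision.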

Model	Total Params	Active Params	Expert Config	Context
Mixtral 8×7B	46.7B	12.9B	8 experts, top-2	32K
DeepSeek-V3	671B	37B	256 routed + 1 shared, top-8	128K
Gemma 4 26B	26B	~7B	8 experts, top-2	128K
Grok-1	314B	86B	8 experts, top-2	8K

Production Failure Modes

Expert collapse: the router sends most tokens (for example, >80%) to one or two experts. Fix: an auxiliary load-balancing loss during training, optionally combined with a router z-loss, which stabilises the router logits but does not balance load by itself.
Load imbalance: uneven expert utilization turns the busiest GPUs into compute stragglers. Fix: EPLB (Expert Parallelism Load Balancer), supported in vLLM, replicates hot experts dynamically.
Token dropping: when an expert's capacity is exceeded, the overflow tokens are dropped without any error. Fix: use a capacity factor ≥1.25 and log the dropped-token rate.
Router oscillation: a training instability in which token-to-expert assignments flip rapidly. Fix: lower the router learning rate or use noisy top-k gating.
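To make the fix for expert collapse concrete, here is a minimal sketch of a Switch-Transformer-style auxiliary load-balancing loss, N · Σ f_i · P_i, where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability for expert i. A router z-loss (the mean squared log-sum-exp of the logits) is shown alongside it as a separate term. Top-1 routing and the function names are assumptions for illustration; a real training setup would compute these on tensors and scale each term by a small coefficient.

```python
import math
from typing import List, Sequence


def load_balancing_loss(router_probs: Sequence[Sequence[float]]) -> float:
    """N * sum_i f_i * P_i with top-1 routing; equals 1.0 when perfectly balanced."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for probs in router_probs:
        winner = max(range(n_experts), key=lambda i: probs[i])
        counts[winner] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    token_fraction = [c / n_tokens for c in counts]   # f_i
    mean_prob = [s / n_tokens for s in prob_sums]     # P_i
    return n_experts * sum(f * p for f, p in zip(token_fraction, mean_prob))


def router_z_loss(router_logits: Sequence[Sequence[float]]) -> float:
    """Mean over tokens of logsumexp(logits) squared; penalises large logits."""
    total = 0.0
    for logits in router_logits:
        peak = max(logits)
        lse = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += lse ** 2
    return total / len(router_logits)


# Four tokens, four experts: each token prefers a different expert.
balanced: List[List[float]] = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Four tokens that all prefer expert 0.
collapsed: List[List[float]] = [[0.7, 0.1, 0.1, 0.1]] * 4

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
z_loss = router_z_loss([[0.0, 0.0, 0.0, 0.0]])
print(balanced_loss, collapsed_loss, z_loss)
```

The collapsed batch scores higher than the balanced one, so minimising this term pushes the router toward spreading tokens across experts.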

Serving MoE in Production

Three things change when you serve MoE instead of a dense model. Memory layout: all expert weights must be resident in GPU memory, which for very large MoE models means expert parallelism across GPUs. Batching sensitivity: MoE needs large batches to keep experts busy, so continuous batching is essential. Load balancing: EPLB, now in vLLM, dynamically replicates overloaded experts at serve time so that their tokens are spread across the replicas.
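The idea behind replicating hot experts can be illustrated with a simple greedy planner. This is not the algorithm that EPLB or vLLM implements; it is a hypothetical sketch in which `plan_replicas` repeatedly gives one more replica to whichever expert currently has the highest load per replica, and it ignores the placement of replicas across GPUs and nodes.

```python
import heapq
from typing import List


def plan_replicas(expert_load: List[float], extra_slots: int) -> List[int]:
    """Greedily assign extra replica slots to the most loaded experts."""
    replicas = [1] * len(expert_load)
    # Max-heap keyed on load per replica (negated, because heapq is a min-heap).
    heap = [(-load, i) for i, load in enumerate(expert_load)]
    heapq.heapify(heap)
    for _ in range(extra_slots):
        _, hottest = heapq.heappop(heap)
        replicas[hottest] += 1
        heapq.heappush(heap, (-expert_load[hottest] / replicas[hottest], hottest))
    return replicas


def max_load_per_replica(expert_load: List[float], replicas: List[int]) -> float:
    """Load on the busiest replica, assuming tokens split evenly across replicas."""
    return max(load / count for load, count in zip(expert_load, replicas))


# Tokens routed to each expert during some observation window (made-up numbers).
observed_load = [800.0, 100.0, 60.0, 40.0]
plan = plan_replicas(observed_load, extra_slots=2)
before = max_load_per_replica(observed_load, [1, 1, 1, 1])
after = max_load_per_replica(observed_load, plan)
print(plan, before, after)
```

With these made-up loads, both extra slots go to expert 0, and the busiest replica handles about a third of the tokens it handled before.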

Health metrics to monitor: per-expert utilization (target: no expert >40% of tokens), routing entropy (high = healthy), dropped token rate (target: <1%).
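These three metrics can be computed from a log of routing decisions. The sketch below is a simplified illustration: `routing_health` is a hypothetical helper, it assumes top-1 routing with one expert index per token, it measures entropy over the distribution of expert usage (normalised so that 1.0 means perfectly even), and it derives per-expert capacity as ceil(capacity_factor × tokens / experts).

```python
import math
from collections import Counter
from typing import Dict, List


def routing_health(
    assignments: List[int], n_experts: int, capacity_factor: float
) -> Dict[str, float]:
    """Summarise expert utilization, routing entropy, and dropped-token rate."""
    n_tokens = len(assignments)
    counts = Counter(assignments)
    capacity = math.ceil(capacity_factor * n_tokens / n_experts)
    shares = [counts.get(i, 0) / n_tokens for i in range(n_experts)]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    # Tokens beyond an expert's capacity are counted as dropped.
    dropped = sum(max(0, counts.get(i, 0) - capacity) for i in range(n_experts))
    return {
        "max_expert_share": max(shares),
        "normalized_entropy": entropy / math.log(n_experts),
        "dropped_token_rate": dropped / n_tokens,
        "capacity_per_expert": capacity,
    }


skewed = routing_health([0, 0, 0, 0, 0, 1, 2, 3], n_experts=4, capacity_factor=1.25)
even = routing_health([0, 0, 1, 1, 2, 2, 3, 3], n_experts=4, capacity_factor=1.25)
print(skewed)
print(even)
```

In the skewed batch, expert 0 receives 5 of 8 tokens against a capacity of 3, so 2 tokens are dropped and the entropy falls below 1.0; the even batch drops nothing.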

When to Use MoE

Need >30B-quality responses but <30B inference cost → MoE is the answer
Running on a single GPU → use dense (MoE needs memory for all expert weights, even though most are idle for any given token)
Latency-critical real-time API → dense has less routing overhead
Training from scratch at scale → MoE is likely worth the engineering complexity at 70B+ parameter targets
