Mixture of Experts: Getting 671B Quality at 37B Cost

How MoE gives you a model with 671B parameters whose per-token compute is close to that of a 37B dense model. Covers sparse activation, expert routing, EPLB load balancing, and the failure modes that bite production teams.

Why MoE Exists

Scaling dense transformers is straightforward but expensive: double the parameters, double the compute per token. Mixture of Experts breaks this coupling. In an MoE model, the feedforward network (FFN) in each transformer layer is replaced by N expert networks. A learned router sends each token to only the top-K experts. Result: a model with the parameters of a large model but the inference compute of a small one.

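DeepSeek-V3 has 671B total parameters, but each token activates only ~37B of them, so its per-token compute is roughly that of a ~37B dense model. The remaining parameters are dormant for that token, although all of them must still be held in memory.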
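To make the gap between total and active parameters concrete, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass, and the bytes-per-parameter values are illustrative; neither assumption comes from this article, and the function names are hypothetical.

```python
# Back-of-the-envelope MoE arithmetic (rule-of-thumb assumptions, not measurements).
TOTAL_PARAMS = 671e9    # DeepSeek-V3 total parameters (from the text)
ACTIVE_PARAMS = 37e9    # parameters activated per token (from the text)


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that a single token touches."""
    return active / total


def forward_flops_per_token(active: float) -> float:
    """Assumed rule of thumb: about 2 FLOPs per active parameter per token."""
    return 2.0 * active


def weight_memory_gb(total: float, bytes_per_param: float) -> float:
    """Memory to hold ALL weights; dormant experts still occupy memory."""
    return total * bytes_per_param / 1e9


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"active fraction: {frac:.1%}")
print(f"forward FLOPs per token: {forward_flops_per_token(ACTIVE_PARAMS):.2e}")
print(f"weights at 1 byte/param: {weight_memory_gb(TOTAL_PARAMS, 1):.0f} GB")
print(f"weights at 2 bytes/param: {weight_memory_gb(TOTAL_PARAMS, 2):.0f} GB")
```

The point of the sketch is the asymmetry: compute scales with the active parameters, while memory scales with the total.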

The Routing Mechanism

The router is a small linear layer that takes each token's hidden state and outputs a probability distribution over all N experts. Top-K selection (classically K=1 or K=2, though DeepSeek-V3 uses K=8) determines which experts process each token. Experts that win the routing competition get the token; the others do no work for it. This is the entire mechanism: simple in description, subtle in training.

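# Simplified MoE routing for one token (NumPy; toy sizes and random weights)
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
token_emb = rng.standard_normal(d_model)               # one token's hidden state
W_router = rng.standard_normal((d_model, n_experts))   # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]  # toy linear experts

router_logits = token_emb @ W_router                   # [n_experts]
router_probs = np.exp(router_logits - router_logits.max())
router_probs /= router_probs.sum()                     # softmax: expert probabilities
top_k_idx = np.argsort(router_probs)[-2:]              # select top-2 experts
top_k_weights = router_probs[top_k_idx]                # weights for weighted sum

# Expert computation
output = np.zeros(d_model)
for i, weight in zip(top_k_idx, top_k_weights):
    output += weight * (token_emb @ experts[i])        # weighted combination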
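The per-token loop above extends to a whole batch by grouping tokens by expert, so each expert runs once on all the tokens routed to it. The following is a minimal NumPy sketch under toy assumptions: experts are plain linear maps, and `moe_forward` and its argument names are hypothetical, not taken from any library.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def moe_forward(tokens, w_router, expert_weights, k=2):
    """tokens: [n_tokens, d]; w_router: [d, n_experts]; expert_weights: [n_experts, d, d]."""
    probs = softmax(tokens @ w_router)                    # [n_tokens, n_experts]
    top_idx = np.argsort(probs, axis=-1)[:, -k:]          # [n_tokens, k] chosen experts
    top_w = np.take_along_axis(probs, top_idx, axis=-1)   # [n_tokens, k] their weights
    out = np.zeros_like(tokens)
    for e in range(expert_weights.shape[0]):
        # Rows of top_idx that contain expert e, and the slot where it appears.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue                                      # expert idle for this batch
        expert_out = tokens[token_ids] @ expert_weights[e]
        out[token_ids] += top_w[token_ids, slot][:, None] * expert_out
    return out, top_idx


rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
w_router = rng.standard_normal((4, 5))
expert_weights = rng.standard_normal((5, 4, 4))
out, top_idx = moe_forward(tokens, w_router, expert_weights, k=2)
print(out.shape, top_idx.shape)
```

Grouping by expert is what makes batch size matter: with few tokens in flight, many iterations of the expert loop find nothing to do.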

What Makes DeepSeek-V3 Different

DeepSeek-V3 pushes MoE to an extreme: each MoE layer has 256 routed experts plus 1 shared expert. The shared expert is always active and handles knowledge common to all tokens, while the router selects 8 of the 256 routed experts per token (top-8 of 256). Because the routed experts are so fine-grained, they can specialise narrowly, leaving universal patterns to the shared expert.
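The split between always-on shared experts and selectively routed experts can be sketched as follows. This is a toy illustration, not DeepSeek's implementation: experts are plain linear maps, the residual connection is omitted, the number of shared experts is a free parameter, and the gating (sigmoid scores normalised over the selected experts) is an assumption about the general approach. All names are hypothetical.

```python
import numpy as np


def shared_plus_routed(x, shared_w, routed_w, w_router, k=8):
    """x: [d]; shared_w: [n_shared, d, d]; routed_w: [n_routed, d, d]; w_router: [d, n_routed]."""
    # Shared experts process every token; no routing decision is involved.
    shared_out = sum(x @ w for w in shared_w)
    # Routed experts: score all of them, keep the top-k, normalise gates over that set.
    scores = 1.0 / (1.0 + np.exp(-(x @ w_router)))   # sigmoid affinity (assumed gating)
    top_idx = np.argsort(scores)[-k:]
    gates = scores[top_idx] / scores[top_idx].sum()
    routed_out = sum(g * (x @ routed_w[i]) for i, g in zip(top_idx, gates))
    return shared_out + routed_out, top_idx


rng = np.random.default_rng(0)
d, n_routed, n_shared = 8, 256, 1                    # d and n_shared are toy values
x = rng.standard_normal(d)
shared_w = rng.standard_normal((n_shared, d, d))
routed_w = rng.standard_normal((n_routed, d, d))
w_router = rng.standard_normal((d, n_routed))

y, chosen = shared_plus_routed(x, shared_w, routed_w, w_router, k=8)
print(y.shape, sorted(chosen.tolist()))
```

Only 8 of the 256 routed weight matrices are touched per token here, which is where the gap between total and active parameters comes from.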

| Model | Total Params | Active Params | Expert Config | Context |
|---|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B | 8 experts, top-2 | 32K |
| DeepSeek-V3 | 671B | 37B | 256 routed + 1 shared, top-8 | 128K |
| Gemma 4 26B | 26B | ~7B | 8 experts, top-2 | 128K |
| Grok-1 | 314B | 86B | 8 experts, top-2 | 8K |

Production Failure Modes

Serving MoE in Production

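Three things change when you serve MoE instead of a dense model. First, memory layout: all expert weights must be resident in GPU memory even though only a few experts run per token, and for very large MoE models this means expert parallelism, with experts sharded across GPUs. Second, batching sensitivity: each expert sees only a fraction of the tokens in a batch, so MoE needs large batches to keep experts busy, and continuous batching is essential. Third, load balancing: EPLB (Expert Parallelism Load Balancer), now available in vLLM, replicates heavily loaded experts and rebalances expert placement across GPUs at serve time.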
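The idea behind replicating overloaded experts can be illustrated with a toy greedy balancer. This is a sketch of the general technique only, not the algorithm used by EPLB or vLLM: it ignores per-GPU slot limits and may place two replicas of the same expert on one GPU. The function names and the load numbers are hypothetical.

```python
import heapq


def replicate_hot_experts(loads, n_redundant):
    """Return replica counts per expert, giving extra copies to the hottest experts."""
    replicas = [1] * len(loads)
    for _ in range(n_redundant):
        # Pick the expert whose per-replica load is currently highest.
        hottest = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hottest] += 1
    return replicas


def place_on_gpus(loads, replicas, n_gpus):
    """Greedy packing: heaviest replica first, always onto the least-loaded GPU."""
    items = [(loads[e] / replicas[e], e)
             for e in range(len(loads)) for _ in range(replicas[e])]
    items.sort(reverse=True)
    heap = [(0.0, g) for g in range(n_gpus)]          # (current load, gpu id)
    placement = {g: [] for g in range(n_gpus)}
    for load, e in items:
        gpu_load, g = heapq.heappop(heap)
        placement[g].append(e)
        heapq.heappush(heap, (gpu_load + load, g))
    gpu_loads = {g: load for load, g in heap}
    return placement, gpu_loads


# Hypothetical token counts per expert: expert 0 is badly overloaded.
loads = [900, 100, 100, 100, 100, 100, 100, 100]

_, before = place_on_gpus(loads, replicate_hot_experts(loads, 0), n_gpus=4)
_, after = place_on_gpus(loads, replicate_hot_experts(loads, 2), n_gpus=4)
print("max GPU load without replication:", max(before.values()))
print("max GPU load with 2 redundant replicas:", max(after.values()))
```

In this example, two redundant copies of the hot expert bring the busiest GPU down from 900 tokens to 400, which is the even split of 1,600 tokens across 4 GPUs.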

Health metrics to monitor: per-expert utilization (target: no expert >40% of tokens), routing entropy (high = healthy), dropped token rate (target: <1%).
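These three metrics can be computed from logged routing decisions. The sketch below is one possible set of definitions, all of them assumptions: utilization is the share of tokens whose top-K set includes a given expert, entropy is taken over the normalised expert load and scaled to [0, 1], and drops are assignments beyond an optional per-expert capacity. It expects a non-empty batch, and `routing_health` is a hypothetical helper.

```python
import math
from collections import Counter


def routing_health(assignments, n_experts, capacity=None):
    """assignments: one list of expert ids per token (that token's top-K choices)."""
    counts = Counter(e for token in assignments for e in token)
    n_tokens = len(assignments)
    total = sum(counts.values())
    # Utilization: share of tokens routed to each expert.
    utilization = [counts.get(e, 0) / n_tokens for e in range(n_experts)]
    # Entropy of the load distribution; log(n_experts) means perfectly balanced.
    probs = [counts.get(e, 0) / total for e in range(n_experts)]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Dropped assignments: anything beyond the per-expert capacity, if one is set.
    dropped = 0 if capacity is None else sum(max(0, c - capacity) for c in counts.values())
    return {
        "max_utilization": max(utilization),
        "normalized_entropy": entropy / math.log(n_experts),
        "dropped_rate": dropped / total,
    }


balanced = [[0, 1], [2, 3], [0, 2], [1, 3]]       # every expert gets 2 assignments
collapsed = [[0, 1]] * 4                          # experts 2 and 3 never used
print(routing_health(balanced, n_experts=4))
print(routing_health(collapsed, n_experts=4, capacity=3))
```

The `max_utilization` value is what the 40% target above would be checked against, and `dropped_rate` corresponds to the 1% target when the serving stack enforces a capacity.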

When to Use MoE

→ Interactive: The MoE Architecture module in Systems Lab has interactive routing diagrams and failure mode walkthroughs.
