A mixture-of-experts (MoE) model replaces each feed-forward block with N parallel feed-forward networks, the "experts." A small router network scores the experts for every token and sends the token through only the top few. Only the chosen experts compute; the rest sit idle for that token.
You get the benefit of a vast parameter pool, in which different experts can specialise in different patterns, without paying the full compute cost for every token.
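The routing step described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: the function names (`route_top_k`, `moe_layer`), the toy experts, and the choice to renormalise the gate weights over the selected experts are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, gate_weight), ...] for the k best-scoring experts.

    Gate weights are the softmax probabilities of the chosen experts,
    renormalised to sum to 1 (one common convention; an assumption here).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_layer(token, router_logits, experts, k=2):
    """Run only the k selected experts and mix their outputs by gate weight."""
    output = [0.0] * len(token)
    for index, weight in route_top_k(router_logits, k):
        expert_out = experts[index](token)  # unselected experts never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Toy experts: each one just scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

print(route_top_k(logits, k=2))
print(moe_layer([1.0, -1.0], logits, experts))
```

In a real model the router logits come from a learned linear layer applied to the token, and the experts are full feed-forward networks; the control flow is the part this sketch is meant to show.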
Why it works
- Capacity without cost. Total parameter count grows without a proportional increase in per-token compute. DeepSeek-V3 has 671B total parameters but activates only about 37B per token.
- Implicit specialisation. Experts tend to take on different roles during training without anyone designing this. The split is learned, so it does not always line up with human-recognisable domains such as code, math, or particular languages.
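The gap between total and active parameters follows from simple arithmetic. The sketch below assumes a simplified model in which some parameters are shared by every token (attention, embeddings) and the rest sit in equally sized experts; the function name and all the numbers are hypothetical, chosen only to make the ratio easy to see.

```python
def moe_param_counts(shared, per_expert, num_experts, top_k):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared      parameters every token uses (attention, embeddings, ...)
    per_expert  parameters in one expert, summed over all MoE layers
    top_k       experts the router selects per token
    """
    total = shared + num_experts * per_expert
    active = shared + top_k * per_expert
    return total, active

# Hypothetical configuration: 64 experts, 2 active per token.
total, active = moe_param_counts(
    shared=2e9, per_expert=1e9, num_experts=64, top_k=2
)
print(f"total {total / 1e9:.0f}B, active {active / 1e9:.0f}B, "
      f"{active / total:.1%} used per token")
```

Adding experts raises `total` while `active` stays fixed, which is the sense in which capacity grows without a matching rise in per-token compute.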
Where it bites
- Memory. Any token may route to any expert, so all experts must stay resident in GPU memory. Activating only a few saves FLOPs, not VRAM.
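- Routing instability. If the router collapses onto a few favoured experts, the rest go undertrained and their capacity is wasted. Training MoEs is finicky and typically needs extra machinery, such as an auxiliary load-balancing loss, to keep expert usage even.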
- Inference batching. Tokens in the same batch route to different experts, so each expert sees a smaller, uneven slice of the batch and hardware utilisation drops.
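The routing and batching problems above can be made concrete by counting how many tokens each expert receives. The sketch below also computes an auxiliary balance term in the style of the Switch Transformer loss (N times the sum over experts of dispatch fraction times mean router probability). That formulation is brought in here for illustration and is one of several in use; top-1 routing, the helper names, and the `peaked` router outputs are assumptions of the example.

```python
def top1_assignments(router_probs):
    """Pick the highest-probability expert for each token (top-1 routing)."""
    return [max(range(len(p)), key=lambda i: p[i]) for p in router_probs]

def expert_load(assignments, num_experts):
    """Count how many tokens in the batch were sent to each expert."""
    load = [0] * num_experts
    for expert in assignments:
        load[expert] += 1
    return load

def load_balance_loss(router_probs, num_experts):
    """Switch-style auxiliary term: N * sum_i f_i * P_i.

    f_i = fraction of tokens dispatched to expert i
    P_i = mean router probability given to expert i
    The value is 1.0 when both are uniform and rises as tokens and
    probability mass pile onto the same experts.
    """
    num_tokens = len(router_probs)
    load = expert_load(top1_assignments(router_probs), num_experts)
    total = 0.0
    for i in range(num_experts):
        f_i = load[i] / num_tokens
        p_i = sum(p[i] for p in router_probs) / num_tokens
        total += f_i * p_i
    return num_experts * total

def peaked(preferred, num_experts=4):
    """Hypothetical router output: 0.7 on one expert, the rest split evenly."""
    rest = 0.3 / (num_experts - 1)
    return [0.7 if i == preferred else rest for i in range(num_experts)]

balanced = [peaked(t % 4) for t in range(8)]           # 2 tokens per expert
skewed = [peaked(0)] * 6 + [peaked(1), peaked(2)]      # expert 0 is overloaded

for name, batch in (("balanced", balanced), ("skewed", skewed)):
    load = expert_load(top1_assignments(batch), 4)
    print(name, load, round(load_balance_loss(batch, 4), 3))
```

In the skewed batch one expert handles six of the eight tokens while another handles none, which is both the wasted capacity that hurts training and the uneven per-expert batch that hurts inference throughput.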
What to read next
Scaling laws explain why bigger keeps working. Quantization and distillation are the alternative paths to capability-per-dollar.