Mixture of Experts

A mixture-of-experts (MoE) model splits each feed-forward block into N "experts." A small router network scores the experts for each token and sends the token through only the highest-scoring few. Only the chosen experts compute for that token; the rest sit idle.
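
The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the toy experts, the linear router, the choice of 8 experts with top-2 routing, and the renormalisation of gate weights are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, gate_weight) pairs for the k best-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalise so the chosen experts' gate weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def make_expert(scale):
    """Hypothetical toy expert: scales its input (a real one is a small MLP)."""
    return lambda token: [scale * x for x in token]


def moe_forward(token, experts, router_weights, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    # Router: one linear score per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = top_k_route(logits, k)
    output = [0.0] * len(token)
    for index, gate in routes:  # only the chosen experts run
        expert_out = experts[index](token)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, [index for index, _ in routes]


random.seed(0)
DIM, N_EXPERTS, TOP_K = 4, 8, 2
experts = [make_expert(scale) for scale in range(1, N_EXPERTS + 1)]
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, active = moe_forward(token, experts, router_weights, k=TOP_K)
print("active experts:", active)
print("output:", output)
```

Only two of the eight expert functions are called for this token; the other six never execute, which is where the compute saving comes from.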

You get the benefit of a vast parameter pool — different experts can specialise on different patterns — without paying full compute per token.

Every feed-forward block contains N specialist networks; in this example, N = 8. At rest, none are active. When a token arrives, the router activates 2 of them, so all 8 experts exist but only 2 fire. Sparse compute, dense capability.

Why it works

Capacity without cost. Total parameters grow without a proportional increase in per-token compute. DeepSeek-V3 has 671B total parameters but activates only ~37B per token.
Implicit specialisation. Experts tend to end up handling different kinds of input (code, math, multilingual text, and so on) without anyone designing this; the split emerges from training.
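
The capacity-versus-compute gap is easy to check with arithmetic. The sketch below takes roughly 671B total and 37B active parameters as inputs, and assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass (ignoring attention), so the FLOP figures are rough estimates rather than measurements.

```python
# Rough arithmetic for a sparse model: how much of it runs per token?
total_params = 671e9   # all experts plus shared layers
active_params = 37e9   # parameters actually used for one token

active_fraction = active_params / total_params
print(f"active fraction: {active_fraction:.1%}")  # about 5.5%

# Assumption: ~2 FLOPs per active parameter per token (forward pass only).
FLOPS_PER_PARAM = 2
sparse_flops = FLOPS_PER_PARAM * active_params
dense_flops = FLOPS_PER_PARAM * total_params  # if every parameter ran

print(f"sparse forward pass: {sparse_flops / 1e9:.0f} GFLOPs per token")
print(f"dense equivalent:    {dense_flops / 1e9:.0f} GFLOPs per token")
print(f"compute ratio:       {dense_flops / sparse_flops:.1f}x")
```

Under these assumptions, each token touches only about one parameter in eighteen.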

Where it bites

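Memory. All experts must stay loaded, typically in GPU memory; activating only a few saves FLOPs, not VRAM.
Routing instability. If the router sends most tokens to a few experts, the others are undertrained and their capacity is wasted. Training usually needs extra care, such as an auxiliary load-balancing loss.
Inference batching. Tokens in the same batch route to different experts, so each expert sees a small, uneven share of the batch, which hurts hardware efficiency.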
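
The three costs above can be put into numbers with a small sketch. Everything here is illustrative: the 2 bytes per parameter, the random routing used to stand in for a real router, and the helper names are assumptions. The loss shown is one commonly used auxiliary form (number of experts times the sum over experts of load fraction times mean router probability), not necessarily what any given model uses.

```python
import random
from collections import Counter


def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, assuming 2 bytes (16 bits) per parameter."""
    return n_params * bytes_per_param / 1e9


def tokens_per_expert(assignments, n_experts):
    """Count how many tokens each expert receives in one batch."""
    counts = Counter(e for chosen in assignments for e in chosen)
    return [counts[e] for e in range(n_experts)]


def load_balance_loss(router_probs, assignments, n_experts):
    """n_experts * sum_i f_i * P_i: equals 1 when balanced, grows as routing collapses."""
    loads = tokens_per_expert(assignments, n_experts)
    total_slots = sum(loads)
    f = [load / total_slots for load in loads]  # share of routed tokens per expert
    p = [sum(row[e] for row in router_probs) / len(router_probs)
         for e in range(n_experts)]             # mean router probability per expert
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Memory: every parameter must be resident, not just the active ones.
print("all weights (GB):   ", weight_memory_gb(671e9))
print("active weights (GB):", weight_memory_gb(37e9))

# Batching: random top-2 routing of 16 tokens over 8 experts gives uneven loads.
random.seed(0)
N_EXPERTS, TOP_K, BATCH = 8, 2, 16
assignments = [random.sample(range(N_EXPERTS), TOP_K) for _ in range(BATCH)]
loads = tokens_per_expert(assignments, N_EXPERTS)
print("tokens per expert:", loads)

# Routing: compare a perfectly balanced router with a collapsed one (4 experts, top-1).
balanced = load_balance_loss([[0.25] * 4] * 4, [[0], [1], [2], [3]], 4)
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [[0]] * 4, 4)
print("balanced loss:", balanced, "collapsed loss:", collapsed)
```

Adding a term like this to the training loss penalises routers that concentrate tokens on a few experts, at the price of one more hyperparameter to tune.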

What to read next

Scaling laws explain why bigger keeps working. Quantization and distillation are the alternative paths to capability-per-dollar.
