Distillation — AIght

Distillation trains a smaller "student" model on the outputs of a larger "teacher." The teacher generates examples — questions and answers, or in the soft-distillation variant, full probability distributions — and the student learns to reproduce them.

The result is a model that approximates the teacher's behaviour at the student's inference cost.
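The soft-distillation variant mentioned above can be sketched as a loss that compares the two models' probability distributions. This is a minimal illustration, not a full training loop: the temperature value, the function names, and the three-token logits are all assumptions made for the example. It uses the KL divergence from the teacher's temperature-softened distribution to the student's, which is one common choice of objective.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 3-token vocabulary (illustration only).
teacher = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]
mismatched_student = [0.5, 1.0, 4.0]

print(soft_distillation_loss(teacher, matching_student))    # zero loss
print(soft_distillation_loss(teacher, mismatched_student))  # positive loss
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimising it pushes the student toward the teacher's full distribution and not only its top answer.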

The teacher model.

A large frontier model — capable, expensive to run.

Teacher model

Parameters: 70B
Benchmark: 100%
Latency: ~2.4 s
Cost / 1M tokens: $15

Why it works

The teacher does the expensive work of figuring out what good outputs look like for a given input distribution. The student gets to learn from already-processed, high-quality examples — much more efficient than learning from raw web data.
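As a rough sketch of how those already-processed examples are collected in the question-and-answer variant, each prompt is paired with the teacher's answer and the pairs become the student's training set. The `toy_teacher` function here is a hypothetical stand-in for a real teacher model call, and the record field names are assumptions, not any particular library's format.

```python
import json

def build_distillation_set(prompts, teacher):
    """Pair each prompt with the teacher's answer to form student training data."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]

def toy_teacher(prompt):
    # Hypothetical stand-in for a call to a large teacher model.
    return "Teacher answer to: " + prompt

prompts = ["What is distillation?", "What is quantization?"]
records = build_distillation_set(prompts, toy_teacher)

# One JSON object per line is a common layout for fine-tuning data.
for record in records:
    print(json.dumps(record))
```

The student is then fine-tuned on these records, so the quality of the set depends entirely on the teacher's answers.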

What it costs

The student is largely bounded by the teacher. It learns only from what the teacher produces, so it can't surpass the teacher on the teacher's blind spots. It often inherits subtle biases too: if the teacher is prone to a specific failure, the student tends to pick it up.

It also requires running the teacher a lot. For a serious distillation you generate millions of examples — significant API spend if the teacher is a frontier proprietary model.
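A back-of-the-envelope estimate makes the spend concrete. The example count and tokens per example below are assumed figures chosen for illustration; the $15 per 1M tokens is the teacher price quoted earlier on this page.

```python
def teacher_generation_cost(num_examples, tokens_per_example, usd_per_million_tokens):
    """Estimate spend on teacher-generated tokens for a distillation dataset."""
    total_tokens = num_examples * tokens_per_example
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Assumed workload: 5 million examples at 400 generated tokens each.
cost = teacher_generation_cost(
    num_examples=5_000_000,
    tokens_per_example=400,
    usd_per_million_tokens=15.0,
)
print(f"${cost:,.0f}")  # $30,000
```

This counts only the tokens the teacher generates. Input tokens, retries, and filtering out low-quality outputs would add to the total.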

In practice

Most "small" frontier models (GPT-4o mini, Claude Haiku, Gemini Flash) are generally understood to be distilled, at least in part, from their larger siblings. Their pricing reflects the lower inference cost, not the cost of training.

What to read next

Quantization is an orthogonal compression technique. Fine-tuning is the broader process of which distillation is one variant.
