Synthetic Data

Synthetic data is text generated by a model and then used to train another model (or even itself, in iterative refinement). It's now a standard part of frontier training pipelines, not a desperate fallback.


Seed prompts.

A small set of human-curated examples — the starting material.

seed 1: "What is 12 × 15?"

seed 2: "Sort this list in Python."

seed 3: "Explain photosynthesis briefly."

3 seeds → the next step expands each into 10 variations, for 30 prompts in total
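
A minimal sketch of that expansion step, assuming a generator that takes a seed and a variation index. `expand_seeds` and `stub_generate` are hypothetical names used only for illustration; in a real pipeline the stub would be replaced by a call to a model that rewrites or extends the seed.

```python
from typing import Callable, List


def expand_seeds(
    seeds: List[str],
    n_variations: int,
    generate: Callable[[str, int], str],
) -> List[str]:
    """Produce n_variations prompts for every seed prompt."""
    variations = []
    for seed in seeds:
        for i in range(n_variations):
            variations.append(generate(seed, i))
    return variations


def stub_generate(seed: str, i: int) -> str:
    # Placeholder (assumption): a real generator would ask a model
    # to paraphrase the seed or change its numbers, topic, or difficulty.
    return f"{seed} (variation {i + 1})"


seeds = [
    "What is 12 × 15?",
    "Sort this list in Python.",
    "Explain photosynthesis briefly.",
]
expanded = expand_seeds(seeds, 10, stub_generate)
print(len(expanded))
```

Passing the generator in as a parameter keeps the expansion logic independent of whichever model produces the variations.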

Why it works

For verifiable tasks, you can validate synthetic examples before training:

Math. Generate a problem, generate a solution, check the answer. Keep only the verified pairs.
Code. Generate a function, generate tests, run the tests. Keep the ones that pass.
Instruction-following. Generate (instruction, response) pairs with a teacher model; have a stronger model rate them; keep only the highest-rated pairs.
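
The math case above can be sketched as follows. This is an illustration under simple assumptions, not a production verifier: problems are strings of the form "a op b", the claimed answers stand in for model output, and `check_answer` and `filter_verified` are hypothetical helper names.

```python
import operator

# Supported operators for the toy "a op b" problem format (assumption).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def check_answer(problem: str, claimed: int) -> bool:
    """Recompute the answer independently and compare it with the claim."""
    a, op, b = problem.split()
    return OPS[op](int(a), int(b)) == claimed


def filter_verified(pairs):
    """Keep only the (problem, answer) pairs whose answer checks out."""
    return [(p, ans) for p, ans in pairs if check_answer(p, ans)]


# Stand-ins for model-generated (problem, solution) pairs.
candidates = [("12 * 15", 180), ("7 + 8", 16), ("9 - 4", 5)]
verified = filter_verified(candidates)
print(verified)
```

The incorrect pair ("7 + 8", 16) is dropped, so only checked examples would reach the training set.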

For non-verifiable tasks (creative writing, opinions), synthetic data is much riskier: without ground truth you can't reliably filter for quality, so errors compound.

Where it's transformative

The reasoning-model wave (OpenAI's o1, DeepSeek's R1) leans heavily on synthetic data. Math and code problems with verifiable answers are an effectively unlimited source of training examples: generate millions, filter by correctness, and train on the filtered set. Capability rises without needing new human data.
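
For code, the generate-then-filter loop might look like the sketch below. It assumes every candidate defines a function named `solve` and that the test cases are already known; both are assumptions made for this example, and `passes_tests` is a hypothetical helper. A real pipeline would run untrusted candidates in a sandbox with time limits rather than calling `exec` directly.

```python
def passes_tests(source: str, tests) -> bool:
    """Return True if the candidate defines solve() and passes every test."""
    namespace = {}
    try:
        # WARNING: exec on untrusted code is unsafe outside a sandbox.
        exec(source, namespace)
        for args, expected in tests:
            if namespace["solve"](*args) != expected:
                return False
    except Exception:
        # Syntax errors, missing solve(), and runtime errors all count as failures.
        return False
    return True


# Stand-ins for model-generated candidate solutions to "sort this list".
candidates = [
    "def solve(xs):\n    return sorted(xs)",
    "def solve(xs):\n    return xs",         # wrong: does not sort
    "def solve(xs):\n    return sorted(xs",  # broken: syntax error
]
tests = [(([3, 1, 2],), [1, 2, 3]), (([],), [])]

kept = [src for src in candidates if passes_tests(src, tests)]
print(len(kept))
```

Only the first candidate survives; the other two are rejected by the tests rather than by a human reviewer, which is what lets the loop scale.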

Where it bites

Uncurated synthetic data → model collapse. When models are repeatedly trained on their own outputs, the output distribution narrows and rare cases drop out. As the internet fills with AI-generated text, filtering it becomes a major effort for the next generation of training runs.
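
A toy illustration of that narrowing, not a simulation of real training. It assumes the "model" is just a probability table over token ids, that each generation is "trained" by counting a finite sample drawn from the previous one, and that nothing is curated in between. `simulate_collapse` is a hypothetical name.

```python
import random
from collections import Counter


def simulate_collapse(vocab_size=50, samples_per_gen=200, generations=10, seed=0):
    """Return the number of distinct tokens surviving after each generation."""
    rng = random.Random(seed)
    # Generation 0: a long-tailed (Zipf-like) distribution over token ids.
    dist = {tok: 1.0 / (tok + 1) for tok in range(vocab_size)}
    support_sizes = [len(dist)]
    for _ in range(generations):
        tokens = list(dist)
        weights = [dist[t] for t in tokens]
        # "Generate" a finite synthetic corpus from the current model.
        corpus = rng.choices(tokens, weights=weights, k=samples_per_gen)
        # "Train" the next model: empirical frequencies of that corpus.
        counts = Counter(corpus)
        dist = {tok: c / samples_per_gen for tok, c in counts.items()}
        support_sizes.append(len(dist))
    return support_sizes


sizes = simulate_collapse()
print(sizes)
```

A token that is never sampled in one generation has zero probability in every later one, so the number of surviving tokens can only stay flat or fall. Rare tokens are the ones most likely to be missed, which is the tail loss described above.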

What to read next

Model collapse is the failure mode synthetic data produces when uncurated. Training is the broader process.
