Synthetic data is text generated by a model and then used to train another model (or the same model, in iterative refinement). It is now a standard part of most frontier training pipelines, not a desperate fallback.
Why it works
For verifiable tasks, you can validate synthetic examples before training:
- Math. Generate a problem, generate a solution, check the answer. Keep only the verified pairs.
- Code. Generate a function, generate tests, run the tests. Keep the ones that pass.
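- Instruction-following. Generate (instruction, response) pairs with a teacher model; have a stronger model act as judge and rate them; keep only the top-rated pairs. The judge's score is a proxy rather than ground truth, so this filter is weaker than the two above.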
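The code case in the list above can be sketched in a few lines. This is a minimal illustration, not a production filter: the `passes_tests` helper, the `add` example function, and the hard-coded candidate strings are all hypothetical stand-ins for real model output, and a real pipeline would run untrusted code in a sandbox rather than with a bare `exec`.

```python
def passes_tests(source: str, func_name: str, tests: list[tuple[tuple, object]]) -> bool:
    """Return True if the candidate defines func_name and matches every expected output."""
    namespace: dict = {}
    try:
        # Assumption for illustration only: real pipelines sandbox model-written code.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Hypothetical model outputs: one correct, one wrong, one that does not parse.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
    "def add(a, b) return a + b\n",
]

# Each test is (arguments, expected result).
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# Keep only the candidates that pass every test.
kept = [src for src in candidates if passes_tests(src, "add", tests)]
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

The filter is only as good as the tests: if the tests are themselves generated by a model, a wrong function can pass a wrong test, so some pipelines also check the tests against a trusted reference.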
For non-verifiable tasks (creative writing, opinions), synthetic data is much riskier: without ground truth you can't reliably filter for quality, and errors compound.
Where it's transformative
The reasoning-model wave (OpenAI's o1, DeepSeek's R1) leans heavily on synthetic data. Math and code problems with verifiable answers are an effectively unlimited source: generate millions of candidates, filter by correctness, and train on the filtered set. Capability rises without needing new human-written data.
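The generate, filter, train loop can be sketched as rejection sampling against a verifier. The example below is a toy under stated assumptions: `toy_solver` is a hypothetical stand-in for a model that is wrong about 30% of the time, the problems are two-number additions so the ground truth is trivial to compute, and the function names and parameters are illustrative rather than taken from any published pipeline.

```python
import random


def toy_solver(a: int, b: int, rng: random.Random) -> int:
    """Hypothetical stand-in for a model: returns a + b, but is off by one ~30% of the time."""
    answer = a + b
    return answer + 1 if rng.random() < 0.3 else answer


def build_dataset(num_problems: int, samples_per_problem: int, seed: int = 0) -> list[dict]:
    """Sample several attempts per problem and keep the first one the verifier accepts."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_problems):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        truth = a + b  # the verifier: computed independently of the solver
        for _ in range(samples_per_problem):
            proposed = toy_solver(a, b, rng)
            if proposed == truth:
                dataset.append({"problem": f"{a} + {b} = ?", "answer": proposed})
                break  # one verified example per problem is enough here
    return dataset


dataset = build_dataset(num_problems=1000, samples_per_problem=4)
print(f"verified examples: {len(dataset)} of 1000 problems")
```

Every example that reaches the training set has been checked, so the solver's error rate affects how many samples are needed, not the correctness of the kept answers. Drawing several samples per problem raises the yield on harder problems at the cost of more generation.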
Where it bites
Uncurated synthetic data leads to model collapse: repeatedly training on a model's own outputs narrows the distribution, and rare patterns tend to drop out first. As the internet fills with AI-generated text, filtering it out becomes a major effort for the next generation of training runs.
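The narrowing can be seen in a toy simulation. This sketch assumes a "model" that is just a table of token probabilities, refit each generation by counting samples drawn from the previous generation; the 50-token vocabulary, the sample size of 100, and the `next_generation` helper are arbitrary illustrative choices, and real language models collapse in more complicated ways.

```python
import random
from collections import Counter


def next_generation(probs: dict[str, float], num_samples: int, rng: random.Random) -> dict[str, float]:
    """Sample from the current model, then refit by counting. Unsampled tokens disappear."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    counts = Counter(rng.choices(tokens, weights=weights, k=num_samples))
    return {t: c / num_samples for t, c in counts.items()}


rng = random.Random(0)

# Generation 0: a long-tailed distribution over 50 tokens, standing in for human data.
raw = {f"tok{i}": 1.0 / (i + 1) for i in range(50)}
total = sum(raw.values())
probs = {t: w / total for t, w in raw.items()}

# Train each generation only on the previous generation's output.
support_sizes = [len(probs)]
for _ in range(20):
    probs = next_generation(probs, num_samples=100, rng=rng)
    support_sizes.append(len(probs))

print("distinct tokens per generation:", support_sizes)
```

Once a token receives zero samples it has zero probability in every later generation, so the number of distinct tokens can only stay the same or shrink. Mixing fresh, curated data back in at each step is what breaks the loop.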
What to read next
Model collapse is the failure mode synthetic data produces when uncurated. Training is the broader process.