In late 2024, OpenAI released o1 — a model trained to generate long
chains of internal reasoning before its visible answer. DeepSeek
followed with R1, Anthropic with extended thinking modes. These are
called "reasoning models," and on math, code, and multi-step logic
benchmarks they clearly outperform the older generation.
What's different
Standard models generate the answer directly. Reasoning models first generate a chain-of-thought (often thousands of tokens), then a short user-facing answer. The chain is typically hidden from the user (some providers show it, or a summary of it), but it is shaped by training: the model is trained with reinforcement learning (RL) on verifiable problems, where the answer is either right or wrong and no preference judging is needed.
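To make "verifiable" concrete, here is a minimal sketch of the kind of reward check such training could use. It is an illustration, not any lab's actual pipeline: the `Answer:` marker, the function names, and the 0/1 scoring are all assumptions made for this example.

```python
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str, marker: str = "Answer:") -> Optional[str]:
    """Return the text after the last marker, or None if the marker is absent.

    The "Answer:" marker is a hypothetical output convention for this sketch.
    """
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    return completion[idx + len(marker):].strip()


def verifiable_reward(completion: str, expected: str) -> float:
    """Score 1.0 if the final answer matches the known solution, else 0.0.

    Only the final answer is checked; the reasoning before it is not graded.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Compare numerically so that "0.5" and "1/2" count as the same answer.
        return 1.0 if Fraction(answer) == Fraction(expected) else 0.0
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to exact string comparison.
        return 1.0 if answer == expected.strip() else 0.0
```

Because the check is mechanical, no human or learned preference judge is needed to decide whether a sampled completion deserves reward.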
What it costs
- Latency. A reasoning model can take 10–60s per query vs. under 2s for a standard chat model. That makes it a poor fit for latency-sensitive interactive UIs.
- Tokens. You pay for the hidden reasoning tokens. A "simple" query can cost 10× more.
- Diminishing returns. Beyond roughly 10k thinking tokens, accuracy tends to plateau, so capping the thinking budget limits cost and latency with little loss in accuracy.
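The token cost above is simple arithmetic, sketched below. Everything numeric here is hypothetical: the prices, the token counts, and the function name are made up for illustration, and the sketch assumes thinking tokens are billed at the same rate as output tokens, which you should confirm against your provider's pricing.

```python
from typing import Optional


def query_cost(
    prompt_tokens: int,
    thinking_tokens: int,
    answer_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    thinking_cap: Optional[int] = None,
) -> float:
    """Estimate the dollar cost of one query.

    Assumes thinking tokens are billed like output tokens. If thinking_cap
    is set, the billed thinking tokens are limited to that many.
    """
    if thinking_cap is not None:
        thinking_tokens = min(thinking_tokens, thinking_cap)
    billed_output = thinking_tokens + answer_tokens
    total = prompt_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Hypothetical prices: $1 per million input tokens, $10 per million output tokens.
standard = query_cost(200, 0, 300, 1.0, 10.0)
reasoning = query_cost(200, 3000, 300, 1.0, 10.0)
capped = query_cost(200, 3000, 300, 1.0, 10.0, thinking_cap=1000)

print(f"standard:  ${standard:.4f}")
print(f"reasoning: ${reasoning:.4f} ({reasoning / standard:.1f}x)")
print(f"capped:    ${capped:.4f}")
```

With these made-up numbers, 3,000 thinking tokens on a 300-token answer push the cost to roughly ten times that of the direct answer. The sketch models only billing; in practice a cap also cuts the reasoning short, which can change the answer.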
Where it shines
Hard math, competitive programming, complex contract analysis, multi-step diagnosis, anything with a verifiable answer. Less useful for open-ended creative work or simple lookups.
What to read next
Chain-of-thought prompting is the prompt-time precursor. RLHF is the post-training method that taught models to follow instructions; reasoning models extend that with RL on verifiable tasks.