Reasoning Models

In late 2024, OpenAI released o1, a model trained to generate a long chain of internal reasoning before its visible answer. DeepSeek followed with R1, and Anthropic with extended thinking modes. The category is called "reasoning models", and on math, code, and multi-step logic benchmarks these models clearly outperform the older generation.

Graduate-level puzzle

A train leaves city A at 60 km/h. Another leaves city B (300 km away) at 90 km/h heading toward A. Where do they meet? Also: a third train leaves A 30 min later at 90 km/h — which train from A arrives first?

Fast model (GPT-4) · ~1 s · $0.001

180 km from city A. The second train arrives first.

✗ wrong answer

Reasoning model (o3) · ~30 s · $0.15


Same family of models. Different inference-time computation. Different cost. Different correctness.
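
For reference, the puzzle can be checked with a few lines of arithmetic. This is a minimal sketch under assumptions the puzzle does not state outright: the first two trains depart at the same time, the third train travels from A to B, and "arrives" means reaching city B. The constant and variable names are illustrative.

```python
from fractions import Fraction

# Values taken from the puzzle text.
DISTANCE_KM = 300                 # distance between city A and city B
SPEED_FROM_A = 60                 # km/h, first train leaving A
SPEED_FROM_B = 90                 # km/h, train leaving B toward A
SPEED_THIRD = 90                  # km/h, third train leaving A
THIRD_DELAY_H = Fraction(1, 2)    # third train leaves 30 min later

# Assumption: the first two trains depart simultaneously, so the gap
# closes at the sum of their speeds.
meet_time_h = Fraction(DISTANCE_KM, SPEED_FROM_A + SPEED_FROM_B)
meet_from_a_km = SPEED_FROM_A * meet_time_h

# Assumption: "arrives" means reaching city B, measured from the moment
# the first train departs.
first_arrival_h = Fraction(DISTANCE_KM, SPEED_FROM_A)
third_arrival_h = THIRD_DELAY_H + Fraction(DISTANCE_KM, SPEED_THIRD)

print(f"Meeting point: {meet_from_a_km} km from A after {meet_time_h} h")
print(f"First train reaches B after {float(first_arrival_h):.2f} h")
print(f"Third train reaches B after {float(third_arrival_h):.2f} h")
```

Under these assumptions the trains meet 120 km from city A, and the third train reaches B before the first one. The fast model's 180 km is the distance from city B, not from A.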

What's different

Standard models generate the answer directly. Reasoning models first generate a private chain of thought (often thousands of tokens), then a short user-facing answer. The chain is usually hidden from the user, but it is shaped by the training reward: the model is trained with reinforcement learning (RL) on verifiable problems, where the answer is either right or wrong and no preference judging is needed.
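
To make "verifiable" concrete, here is a minimal sketch of the kind of reward check such training could use. It is an illustration, not any lab's actual pipeline: the `<think>...</think>` delimiter, the function names, and the exact-match comparison are all assumptions made for this example.

```python
import re

# Assumption: the model wraps its private reasoning in <think>...</think>
# and writes the user-facing answer after the closing tag.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(raw):
    """Return (reasoning, answer) from a raw model response."""
    match = THINK_PATTERN.search(raw)
    if match is None:
        # No reasoning block found: treat the whole response as the answer.
        return "", raw.strip()
    return match.group(1).strip(), raw[match.end():].strip()


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences pass."""
    return " ".join(text.lower().split())


def verifiable_reward(raw, reference):
    """Binary reward: 1.0 if the visible answer matches the reference, else 0.0.

    Only the final answer is scored; the reasoning is never judged directly.
    """
    _, answer = split_response(raw)
    return 1.0 if normalize(answer) == normalize(reference) else 0.0


sample = "<think>60t + 90t = 300, so t = 2 and 60 * 2 = 120.</think>120 km"
print(verifiable_reward(sample, "120 km"))
print(verifiable_reward("<think>90 * 2 = 180.</think>180 km", "120 km"))
```

The point of the design is that the check needs no human rater: a reference answer (or a test suite, for code) decides the reward automatically. Real checkers are usually more forgiving than exact string matching.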

What it costs

Latency. A reasoning model can take 10–60 s per query, versus under 2 s for a standard chat model. That makes it a poor fit for interactive UIs.
Tokens. You pay for the hidden reasoning tokens, so even a "simple" query can cost 10× more.
Diminishing returns. Beyond roughly 10k thinking tokens, accuracy tends to plateau, so capping the thinking budget helps.
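
The token cost can be estimated with simple arithmetic. The sketch below assumes hidden reasoning tokens are billed at the output-token rate and uses made-up prices per million tokens; the function name, parameters, and numbers are hypothetical and chosen only to illustrate the 10× effect and the impact of a cap.

```python
def billed_cost_usd(prompt_tokens, reasoning_tokens, answer_tokens,
                    input_price_per_m, output_price_per_m, thinking_cap=None):
    """Estimate the cost of one query in USD.

    Assumption: reasoning tokens are billed like output tokens.
    thinking_cap, if set, limits how many reasoning tokens are generated.
    """
    if thinking_cap is not None:
        reasoning_tokens = min(reasoning_tokens, thinking_cap)
    output_tokens = reasoning_tokens + answer_tokens
    total = prompt_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical prices: $1 per million input tokens, $10 per million output tokens.
standard = billed_cost_usd(200, 0, 100, 1.0, 10.0)
reasoning = billed_cost_usd(200, 1080, 100, 1.0, 10.0)
print(f"standard: ${standard:.4f}, reasoning: ${reasoning:.4f}")
print(f"ratio: {reasoning / standard:.1f}x")

# A long-running query with and without a 10k-token thinking cap.
uncapped = billed_cost_usd(200, 20_000, 100, 1.0, 10.0)
capped = billed_cost_usd(200, 20_000, 100, 1.0, 10.0, thinking_cap=10_000)
print(f"uncapped: ${uncapped:.4f}, capped: ${capped:.4f}")
```

With these illustrative numbers, about a thousand reasoning tokens are enough to make a short query cost ten times as much, and the cap roughly halves the cost of the long-running one.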

Where it shines

Hard math, competitive programming, complex contract analysis, multi-step diagnosis, anything with a verifiable answer. Less useful for open-ended creative work or simple lookups.

What to read next

Chain-of-thought is the prompt-time precursor. RLHF is the post-training that taught models to follow instructions; reasoning models extend that with RL on verifiable tasks.
