Chain-of-Thought

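In 2022, a Google paper showed that including a few worked, step-by-step examples in the prompt improved accuracy on the GSM8K math word-problem benchmark from ~18% to ~57%, using the same model (PaLM 540B) with no retraining. A follow-up paper the same year found that simply appending "Let's think step by step" also helps, lifting GSM8K accuracy from ~10% to ~41% on a GPT-3-family model. The trick became known as chain-of-thought prompting, and it's one of the most durable findings in prompt engineering.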

Example: Roger has 5 tennis balls. He buys 2 more cans. Each can has 3 balls. How many does he have now?

Direct answer: ✗ wrong answer (~2 tokens · $0.0000003)

With chain-of-thought: ✓ correct answer (~40 tokens · $0.0000060)

Direct guesses fail predictably on multi-step arithmetic. CoT pays for itself here.
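The token counts and dollar figures in the example above imply a price of roughly $0.15 per million output tokens. That rate is inferred from the numbers shown, not stated anywhere, so treat it as an assumption. A minimal sketch of the cost comparison, with the hypothetical helper `call_cost`:

```python
# Assumed price, back-derived from "~2 tokens · $0.0000003" in the example.
PRICE_PER_MILLION_TOKENS = 0.15  # USD per 1M output tokens (assumption)


def call_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Return the output-token cost in USD for a single call."""
    return tokens * price_per_million / 1_000_000


direct = call_cost(2)   # direct answer, ~2 tokens
cot = call_cost(40)     # chain-of-thought answer, ~40 tokens
print(f"direct: ${direct:.7f}  CoT: ${cot:.7f}  ratio: {cot / direct:.0f}x")
```

At this rate, CoT costs about twenty times more per call, but the absolute amount is still tiny. The trade only becomes a concern at high volume or with much longer reasoning traces.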

How it works

When you ask a model "what's 17 × 24?" it has to commit to an answer immediately, with no intermediate work written out, and frequently flubs it. When you ask "what's 17 × 24? Show your work step by step," it produces:

17 × 24 = 17 × (20 + 4)
       = (17 × 20) + (17 × 4)
       = 340 + 68
       = 408

The model didn't get smarter. It got more space. Each intermediate step is now grounded in the previous tokens, and the final answer is constrained by all of them. Errors that would compound silently in a direct answer get exposed and (often) corrected.
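As an analogy in ordinary code (not a description of what happens inside a model), the worked steps above can be reproduced by a function in which every line is computed from the one before it. The function name `multiply_stepwise` is made up for this sketch.

```python
def multiply_stepwise(a: int, b: int) -> list[str]:
    """Write out a * b by splitting b into tens and ones, one step per line."""
    tens, ones = (b // 10) * 10, b % 10   # 24 -> 20 and 4
    partial_tens = a * tens               # depends on the split above
    partial_ones = a * ones
    total = partial_tens + partial_ones   # depends on both partial products
    return [
        f"{a} × {b} = {a} × ({tens} + {ones})",
        f"= ({a} × {tens}) + ({a} × {ones})",
        f"= {partial_tens} + {partial_ones}",
        f"= {total}",
    ]


steps = multiply_stepwise(17, 24)
print("\n".join(steps))
assert steps[-1] == "= 408"
```

Because each step is small and visible, a mistake in one partial product shows up on its own line instead of being buried in a single final number.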

This tends to work for any task that decomposes: math, multi-hop reasoning, code debugging, legal analysis, complex extraction. It adds little for tasks that are atomic: identification, single-fact recall, classification.

The variants worth knowing

Zero-shot CoT. Just adding "Let's think step by step" is almost free and often gives a 10–30% boost on reasoning tasks, though the gain varies with model and task.
Few-shot CoT. Showing 2–3 examples of step-by-step reasoning before your actual question — biggest gains, costs more tokens.
Self-consistency. Run CoT multiple times with temperature > 0, take the majority answer. Significant accuracy improvement, but you pay for N runs.
Tree-of-thought. Let the model branch into multiple reasoning paths and evaluate them. More expensive, occasionally worth it for truly hard problems.
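Self-consistency from the list above is the easiest variant to sketch. The example below assumes a hypothetical `sample_fn` that sends the prompt to a model at temperature > 0 and returns only the final answer as a string; it is replaced here by a stub that replays canned outputs so the code runs without any API.

```python
from collections import Counter
from typing import Callable, Iterable


def self_consistent_answer(sample_fn: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n answers for the same prompt and return the most common one."""
    answers = [sample_fn(prompt) for _ in range(n)]
    # most_common(1) -> [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def make_stub_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: replays fixed outputs in order."""
    it = iter(outputs)
    return lambda prompt: next(it)


stub = make_stub_sampler(["408", "398", "408", "408", "418"])
prompt = "What's 17 × 24? Let's think step by step."
print(self_consistent_answer(stub, prompt, n=5))
```

In a real setup the sampler would also need to extract the final answer from each reasoning trace before voting. The cost scales linearly with `n`, which is the "you pay for N runs" trade-off.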

When it's expensive theater

For simple lookups, classification, and one-step answers, CoT just adds tokens (and cost). On modern reasoning-tuned models (o1, o3, Claude models with extended thinking), much of CoT is baked in. Adding "think step by step" to those models is usually redundant and can actively hurt, because they're already reasoning internally and the explicit framing can redirect them.

A useful test: does removing the CoT make the answer worse? If not, you're paying for tokens you don't need.
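That test can be run as a small ablation. The sketch below assumes a hypothetical `ask` function that takes a prompt and returns only the final answer string, plus a handful of labeled cases. The 5-point `min_gain` threshold is an arbitrary choice for illustration, not a recommended value.

```python
from typing import Callable, Sequence

COT_SUFFIX = " Let's think step by step."


def accuracy(ask: Callable[[str], str], cases: Sequence[tuple[str, str]], suffix: str = "") -> float:
    """Fraction of cases where ask(question + suffix) matches the expected answer."""
    correct = sum(ask(q + suffix).strip() == expected for q, expected in cases)
    return correct / len(cases)


def cot_is_worth_it(ask: Callable[[str], str], cases: Sequence[tuple[str, str]],
                    min_gain: float = 0.05) -> bool:
    """True if adding the CoT suffix improves accuracy by at least min_gain."""
    base = accuracy(ask, cases)
    with_cot = accuracy(ask, cases, suffix=COT_SUFFIX)
    return with_cot - base >= min_gain


# Stub model for demonstration: only answers correctly when asked to reason.
cases = [("What's 17 × 24?", "408"), ("What's 12 × 11?", "132")]
answers = dict(cases)


def stub_ask(prompt: str) -> str:
    question = prompt.replace(COT_SUFFIX, "")
    return answers[question] if prompt.endswith(COT_SUFFIX) else "0"


print(cot_is_worth_it(stub_ask, cases))
```

With a real model, run this on a sample large enough that the accuracy difference is not just noise.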

Why this matters for your work

If you're building anything that involves multi-step decisions (quoting, diagnosing, debugging, planning), explicit CoT usually helps and is cheap to try. For the new reasoning models, trust them to do it internally — don't over-prompt.

For evaluation, watch your benchmarks: the same model on the same test gives very different scores depending on whether CoT was used, and whether you average over multiple runs. Headline numbers are almost always best-case.

What to read next

Structured output pairs well with CoT — let the model reason in prose, then constrain the final answer to JSON. Prompt engineering is the broader skill of which CoT is one technique. In-context learning is the underlying mechanism that makes CoT examples work.
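One way to combine the two is sketched below, assuming you have instructed the model to reason in prose and then put a single JSON object on the last line. That output convention and the `split_reasoning_and_answer` helper are assumptions for this example, not a standard.

```python
import json


def split_reasoning_and_answer(response: str) -> tuple[str, dict]:
    """Split a response into free-form reasoning and a JSON answer on the last line."""
    reasoning, _, last_line = response.strip().rpartition("\n")
    # json.loads raises json.JSONDecodeError if the model broke the convention.
    return reasoning.strip(), json.loads(last_line)


response = (
    "17 × 24 = 17 × (20 + 4)\n"
    "= 340 + 68\n"
    "= 408\n"
    '{"answer": 408}'
)
reasoning, answer = split_reasoning_and_answer(response)
print(answer["answer"])
```

Keeping the reasoning outside the JSON lets the model think freely while downstream code only ever parses the final line.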
