Temperature & Sampling

At each step of generation, the model produces a probability distribution over the next token. Temperature controls how sharp or flat that distribution is before a token is sampled from it. Low temperature → the model almost always picks the most likely next token (at 0, always). High temperature → the model is more willing to pick a less-likely token.

The knobs

Temperature. A scalar that divides the logits before softmax. At temperature=0 the division is undefined, so implementations treat it as a special case and pick the argmax (greedy decoding). At temperature=1, you sample from the unmodified distribution. Below 1, the distribution sharpens toward the top token. At temperature=2, the distribution gets flatter; even low-probability tokens have a real shot.
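To make the "divide the logits before softmax" step concrete, here is a minimal sketch in plain Python. The function name and the three logits are illustrative assumptions, not taken from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by temperature."""
    if temperature <= 0:
        # temperature=0 cannot be divided by; handle it as greedy argmax elsewhere.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # top token gains probability
plain = softmax_with_temperature(logits, 1.0)  # unmodified distribution
flat = softmax_with_temperature(logits, 2.0)   # tail tokens gain probability
```

With these inputs the top token's probability shrinks as temperature rises, while the least likely token's probability grows.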

Top-p (nucleus sampling). Instead of considering every token in the vocabulary (often tens of thousands or more) weighted by probability, only consider the smallest set of most-likely tokens whose total probability reaches p. top_p=0.9 typically means "from the most likely 10–50 tokens", though the count depends on how confident the model is: one or two tokens when it is sure, many more when it is not. This caps the worst-case randomness without flattening the whole distribution.
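A small sketch of the nucleus cut, assuming the input is already a list of probabilities. The helper name and the two toy distributions are made up for illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize the survivors; every other token is dropped.
    return {i: probs[i] / total for i in kept}

confident = [0.92, 0.05, 0.02, 0.01]      # model is sure of one token
unsure = [0.28, 0.24, 0.20, 0.16, 0.12]   # probability spread out

print(len(top_p_filter(confident, 0.9)))  # 1
print(len(top_p_filter(unsure, 0.9)))     # 5
```

The same top_p keeps one token in the confident case and all five in the unsure case, which is the adaptive behavior described above.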

Top-k. Only consider the k most likely tokens. top_k=40 is common. Simpler than top-p, but less adaptive: it keeps the same number of candidates whether the model is confident or unsure.
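For comparison, a top-k cut can be sketched in a few lines. As before, the function name and the example probabilities are assumptions for illustration only.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]
survivors = top_k_filter(probs, 2)  # only tokens 0 and 1 remain
```

Unlike top-p, the number of survivors here is fixed at k regardless of how the probability is spread.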

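These are usually combined. The order varies by implementation; a common one is to scale the logits by temperature, keep the top-k and/or top-p subset, renormalize, and sample from what is left.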
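One way the knobs might be chained is sketched below. The order used here (temperature, then top-k, then top-p) is an assumption; real inference stacks differ in ordering and details, and the function name is hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token index: temperature, then top-k, then top-p."""
    if temperature == 0:
        # Greedy decoding: no sampling, just the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")

    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        mass = sum(probs[i] for i in ranked)  # renormalize after the top-k cut
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        ranked = kept
    # random.choices treats the weights as relative, so no final renormalization is needed.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]

# Hypothetical logits; a seeded generator keeps the example repeatable.
token = sample_next_token([1.0, 3.0, 2.0], temperature=0.8, top_k=2, top_p=0.9,
                          rng=random.Random(0))
```

Passing a seeded random.Random makes a single run repeatable, which is handy for tests but is not the same thing as the model being deterministic.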

What temperature 0 actually means

temperature=0 makes generation greedy, not deterministic. The same prompt usually gives the same output, but the same prompt with slightly different context (or KV cache state, or hardware) can give different results, because tiny floating-point differences can flip the argmax when two tokens are nearly tied. Most providers will tell you the result is "near-deterministic" in their docs, which is provider-speak for "almost always, but don't build infrastructure on it."
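A toy illustration of how greedy decoding can still vary: floating-point addition is not associative, so summing the same numbers in a different order can nudge a logit just enough to change the argmax. The numbers and names below are invented for the demonstration; real models hit this through things like different reduction orders on different hardware or batch sizes.

```python
def greedy_pick(logits):
    """Return the index of the largest logit (first index wins ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# The same three contributions, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# A hypothetical rival token whose logit is nearly tied with that sum.
rival = 0.6
print(greedy_pick([rival, left_to_right]))  # 1
print(greedy_pick([rival, right_to_left]))  # 0
```

Nothing here is random, yet the chosen token depends on the order of arithmetic, which is why "greedy" and "deterministic" are not the same promise.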

Why this matters for your work

For factual tasks — answering a question, classifying, extracting — lower temperature is almost always better. The most likely token is usually the right one. Random sampling hurts you here.

For creative tasks, the sweet spot depends on the form:

Tightly structured prose (legal, technical writing): 0.3–0.5.
Marketing copy, headlines: 0.7–0.9.
Open-ended brainstorming, fiction first drafts: 0.9–1.1.
"Surprise me" exploration: 1.2+. Treat the output as raw clay.

For generation tasks where you'll pick from many outputs (image generation, song generation), higher temperature with multiple samples is usually better than one well-tuned generation.
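The "many samples, pick one" pattern can be sketched as best-of-n selection. Everything model-related here is a stand-in: toy_generate and toy_score are hypothetical placeholders for a real generation call and a real quality judgment (often a human choosing).

```python
import math
import random

def best_of_n(generate, score, n, temperature):
    """Draw n candidates at the given temperature and return the best-scoring one."""
    candidates = [generate(temperature) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins so the sketch runs without a model.
rng = random.Random(0)
WORDS = ["mat", "sofa", "windowsill", "roof", "keyboard"]

def toy_generate(temperature):
    # Hypothetical generator: higher temperature flattens the choice weights.
    weights = [math.exp(-i / temperature) for i in range(len(WORDS))]
    return rng.choices(WORDS, weights=weights, k=1)[0]

def toy_score(text):
    # Hypothetical scorer: pretend longer completions are more interesting.
    return len(text)

pick = best_of_n(toy_generate, toy_score, n=8, temperature=1.5)
```

The design choice is to spend the randomness budget on variety and let the selection step, rather than a low temperature, do the quality control.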

What to read next

Structured output is the technique for forcing low-temperature determinism even on creative tasks. Prompt engineering is the control surface upstream of all of this. Chain-of-thought is when the model talks itself into a better answer.
