
Concept

Evals

How you measure whether a model is good at the thing you actually care about.

Mankaran Singh · Updated May 17, 2026

Where this idea lives

Prerequisites and related concepts:

  • Prompt Engineering. The craft of talking to a model that will take you exactly as literally as it decides to.
  • RLHF. Humans rate, model learns, weird things happen: the post-training that made models pleasant to talk to.
  • Scaling Laws. Why bigger keeps working, and the question of where it stops.
  • Hallucination & Grounding. Why AI models confidently make things up, and what you can actually do about it.

Tools that show it: ChatGPT, Claude.

Shows up in: Software Engineering, Physics & Engineering, Medicine & Healthcare, Law & Legal.

You might think:

  • Public benchmarks tell you which model is best.
  • A 95% accuracy means 95% of cases work.
  • Eval is something you do once before launch.

Common misconception

“Public benchmarks tell you which model is best for your work.”

They tell you which model performed best on someone else's test set. Public benchmarks get gamed or contaminated (models trained on the test set, sometimes inadvertently), and they often don't match your task distribution. The benchmark a frontier model wins may be irrelevant to your use case. Your own eval, even 50 hand-picked examples, tells you more about your task than a leaderboard does.

Evaluation is the unglamorous backbone of AI work. Three rules nobody follows enough:

  1. Build your own eval set. Collect 50–200 real inputs from your actual use case. Hand-label the expected outputs. That's your truth.
  2. Run it on every change. Prompt tweak, model upgrade, temperature adjustment: re-run. Without this you don't know if you've improved or just shifted the failures.
  3. Track failure modes, not just accuracy. "85% correct" tells you nothing about the 15%. Read them.
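
The three rules above can be sketched in a few lines of Python. This is a minimal illustration, not a real harness: the eval set, the ticket categories, and the `old_system` / `new_system` functions are all hypothetical stand-ins for your own labeled data and model calls. It shows two configurations that score the same accuracy while failing on different examples.

```python
from typing import Callable

# Hypothetical eval set: real inputs with hand-labeled expected outputs.
EVAL_SET = [
    {"id": "t1", "input": "Refund not received after 10 days", "expected": "billing"},
    {"id": "t2", "input": "App crashes when I open settings", "expected": "bug"},
    {"id": "t3", "input": "How do I export my data?", "expected": "how-to"},
    {"id": "t4", "input": "Charged twice this month", "expected": "billing"},
]


def old_system(text: str) -> str:
    """Stand-in for the system before a change (assumption: keyword rules)."""
    t = text.lower()
    if "refund" in t or "charged" in t:
        return "billing"
    if "crash" in t:
        return "bug"
    return "other"


def new_system(text: str) -> str:
    """Stand-in for the system after a change: gains how-to, loses 'charged'."""
    t = text.lower()
    if t.startswith("how "):
        return "how-to"
    if "refund" in t:
        return "billing"
    if "crash" in t:
        return "bug"
    return "other"


def run_eval(system: Callable[[str], str], eval_set: list) -> dict:
    """Map each example id to True (correct) or False (incorrect)."""
    return {ex["id"]: system(ex["input"]) == ex["expected"] for ex in eval_set}


def accuracy(results: dict) -> float:
    return sum(results.values()) / len(results)


def compare_runs(before: dict, after: dict) -> tuple:
    """Return the ids that the change fixed and the ids that it broke."""
    fixed = sorted(i for i in before if not before[i] and after[i])
    regressed = sorted(i for i in before if before[i] and not after[i])
    return fixed, regressed


before = run_eval(old_system, EVAL_SET)
after = run_eval(new_system, EVAL_SET)
fixed, regressed = compare_runs(before, after)

print(f"accuracy before: {accuracy(before):.2f}, after: {accuracy(after):.2f}")
print(f"fixed: {fixed}, regressed: {regressed}")
```

Both runs score 0.75, but the change fixed `t3` and broke `t4`. The headline number alone would hide that, which is why the per-example comparison is worth keeping next to the accuracy.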

Eval types

  • Exact-match. Cheap, brittle, works for classification.
  • LLM-as-judge. Use a stronger model to grade outputs. Fast and decent, but the judge has its own biases, such as favoring longer answers or whichever answer it sees first.
  • Human review. Slow, expensive, the gold standard. Sample 20–50 outputs weekly.
  • Pairwise preference. A vs. B; collect votes. Less noisy than absolute scoring.
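
As a rough sketch of how three of these might look in code: exact-match with light normalization, a pairwise win rate over collected votes, and a judged A-vs-B comparison that asks twice with the order swapped. Everything here is an assumption for illustration. In particular, `judge` is a hypothetical callable standing in for a call to a grading model, and the two stub judges exist only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def exact_match(predicted: str, expected: str) -> bool:
    """Cheap and brittle: any extra word counts as a miss."""
    return normalize(predicted) == normalize(expected)


def win_rate(votes: list) -> Optional[float]:
    """Share of decided votes won by A. Votes are "A", "B", or "tie"."""
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    return counts["A"] / decided if decided else None


def judged_vote(judge: Callable[[str, str, str], str],
                prompt: str, a: str, b: str) -> str:
    """Ask the judge twice with the answers swapped.

    `judge(prompt, x, y)` is assumed to return "first" or "second".
    A vote only counts when both orderings agree; otherwise it is a tie.
    """
    original = judge(prompt, a, b)
    swapped = judge(prompt, b, a)
    if original == "first" and swapped == "second":
        return "A"
    if original == "second" and swapped == "first":
        return "B"
    return "tie"


# Stub judges for the demo (hypothetical, not real grading models).
def length_judge(prompt: str, x: str, y: str) -> str:
    """Consistently prefers the longer answer, whatever the order."""
    return "first" if len(x) >= len(y) else "second"


def first_judge(prompt: str, x: str, y: str) -> str:
    """Always prefers whichever answer is shown first (pure position bias)."""
    return "first"


print(exact_match("  Billing ", "billing"))        # formatting differences pass
print(exact_match("billing issue", "billing"))     # an extra word fails
print(win_rate(["A", "A", "B", "tie", "A"]))
print(judged_vote(length_judge, "q", "short", "a longer answer"))
print(judged_vote(first_judge, "q", "short", "a longer answer"))
```

Swapping the order is one simple guard against a judge that favors position: the position-biased stub produces a tie instead of a false win. It does nothing about a judge that consistently prefers longer answers, so sampling outputs for human review still matters.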

What to read next

RLHF uses preference comparisons as its core training signal. Hallucination is one of the main failure modes evals exist to catch.

