Evals — AIght

Evaluation is the unglamorous backbone of AI work. Three rules nobody follows enough:

Benchmark dataset

Capital of Japan? → Tokyo
sqrt(144)? → 12
"Hamlet" author? → Shakespeare
3 × 7 × 2? → 42
Boiling point of water (°C)? → 100

5 questions, 5 pre-labelled ground-truth answers. Every model sees the same 5.

Build your own eval set. Collect 50–200 real inputs from your actual use case. Hand-label the expected outputs. That's your truth.
Run it on every model change. Prompt tweak, model upgrade, temperature adjustment — re-run. Without this you don't know if you've improved or just shifted the failures.
Track failure modes, not just accuracy. "85% correct" tells you nothing about the failing 15%. Read those cases one by one.
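
A minimal sketch of the three rules in Python, assuming the model can be wrapped as a plain function from question to answer. The names `EVAL_SET`, `normalize`, `run_eval`, and `toy_model` are hypothetical, chosen for illustration; the toy model stands in for a real model call and gets one answer wrong on purpose.

```python
from typing import Callable

# Hand-labelled eval set: the five benchmark questions and their ground truth.
EVAL_SET = [
    ("Capital of Japan?", "Tokyo"),
    ("sqrt(144)?", "12"),
    ('"Hamlet" author?', "Shakespeare"),
    ("3 × 7 × 2?", "42"),
    ("Boiling point of water (°C)?", "100"),
]


def normalize(text: str) -> str:
    """Trim whitespace and a trailing period, then lowercase, so trivial
    formatting differences are not counted as failures."""
    return text.strip().rstrip(".").lower()


def run_eval(model: Callable[[str], str], eval_set=EVAL_SET):
    """Run every question through the model.

    Returns (accuracy, failures), where failures keeps the full record
    of each miss so it can be read, not just counted.
    """
    failures = []
    for question, expected in eval_set:
        got = model(question)
        if normalize(got) != normalize(expected):
            failures.append({"question": question, "expected": expected, "got": got})
    correct = len(eval_set) - len(failures)
    return correct / len(eval_set), failures


def toy_model(question: str) -> str:
    """Stand-in for a real model call (assumption): canned answers, one wrong."""
    canned = {
        "Capital of Japan?": "Tokyo.",
        "sqrt(144)?": "14",  # deliberate mistake
        '"Hamlet" author?': "Shakespeare",
        "3 × 7 × 2?": "42",
        "Boiling point of water (°C)?": "100",
    }
    return canned.get(question, "I don't know")


accuracy, failures = run_eval(toy_model)
print(f"accuracy: {accuracy:.0%}")
for failure in failures:
    print(failure)
```

Re-running `run_eval` after each prompt or model change and comparing the `failures` lists between runs shows whether a change fixed errors or only moved them.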

Eval types

Exact-match. Cheap, brittle, works for classification.
LLM-as-judge. Use a stronger model to grade outputs. Fast, decent, but has its own biases (for example, favoring longer answers or whichever answer it sees first).
Human review. Slow, expensive, the gold standard. Sample 20–50 outputs weekly.
Pairwise preference. A vs. B; collect votes. Less noisy than absolute scoring.
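
As a rough illustration of pairwise preference, the votes can be reduced to a win rate for A over B. This is a sketch under assumptions: the vote labels `"A"`, `"B"`, and `"tie"` and the function name `win_rate` are hypothetical, and ties are simply excluded from the denominator, which is one of several reasonable conventions.

```python
from collections import Counter
from typing import Optional


def win_rate(votes: list[str]) -> Optional[float]:
    """Fraction of decided votes that preferred A.

    Each vote is "A", "B", or "tie". Ties are ignored; if every vote
    is a tie there is nothing to compare, so None is returned.
    """
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        return None
    return counts["A"] / decided


votes = ["A", "A", "B", "tie", "A"]
print(win_rate(votes))  # 0.75
```

With few votes the win rate is noisy, so a result near 0.5 is better read as "no clear preference" than as a small win.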

RLHF uses human preference comparisons, the same signal as pairwise preference evals, as its core training signal. Hallucination is one of the main failure modes evals exist to catch.