Evaluation is the unglamorous backbone of AI work. Three rules nobody follows enough:
- Build your own eval set. Collect 50–200 real inputs from your actual use case. Hand-label the expected outputs. That's your truth.
- Run it on every change. Prompt tweak, model upgrade, temperature adjustment: re-run. Without this you don't know whether you've improved or just shifted the failures.
- Track failure modes, not just accuracy. "85% correct" tells you nothing about the other 15%. Read the failing cases.
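The three rules above can be sketched as a small harness. This is only a minimal illustration under assumptions: `EvalCase`, `run_eval`, `toy_model`, and `classify_failure` are hypothetical names, the four cases stand in for a real hand-labeled set, and the two failure-mode labels are placeholders for whatever categories you find when reading your own failures.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One hand-labeled example: a real input and the output you expect."""
    input: str
    expected: str


def classify_failure(case: EvalCase, output: str) -> str:
    """Assign a coarse failure mode (hypothetical categories for this sketch)."""
    if not output.strip():
        return "empty_output"
    return "wrong_label"


def run_eval(
    cases: list[EvalCase],
    model_fn: Callable[[str], str],
    classify: Callable[[EvalCase, str], str],
) -> tuple[float, Counter, list[tuple[EvalCase, str, str]]]:
    """Run every case, returning accuracy, failure-mode counts, and the failures."""
    failures = []
    modes: Counter = Counter()
    for case in cases:
        output = model_fn(case.input)
        # Normalized exact match; swap in another scorer if your task needs one.
        if output.strip().lower() == case.expected.strip().lower():
            continue
        mode = classify(case, output)
        modes[mode] += 1
        failures.append((case, output, mode))
    accuracy = (len(cases) - len(failures)) / len(cases)
    return accuracy, modes, failures


def toy_model(text: str) -> str:
    """Stand-in for a real model call: a keyword-based ticket router."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if "password" in lowered:
        return "account"
    return ""


cases = [
    EvalCase("I want a refund", "billing"),
    EvalCase("Reset my password", "account"),
    EvalCase("App crashes on launch", "bug"),
    EvalCase("I forgot my password and was charged twice", "billing"),
]

accuracy, modes, failures = run_eval(cases, toy_model, classify_failure)
print(f"accuracy={accuracy:.2f}")
for mode, count in modes.most_common():
    print(f"{mode}: {count}")
for case, output, mode in failures:
    print(f"[{mode}] input={case.input!r} expected={case.expected!r} got={output!r}")
```

Re-running the same script after each prompt or model change and comparing the failure-mode counts, not only the accuracy line, is what shows whether failures were fixed or merely moved.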
Eval types
- Exact-match. Cheap, brittle, works for classification.
- LLM-as-judge. Use a stronger model to grade outputs. Fast and decent, but it has its own biases, such as favoring longer answers or whichever option it sees first.
- Human review. Slow, expensive, the gold standard. Sample 20–50 outputs weekly.
- Pairwise preference. A vs. B; collect votes. Less noisy than absolute scoring.
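For the pairwise case, the aggregation step is small enough to show. The sketch below assumes votes have already been collected as the strings "A", "B", or "tie"; the function name `win_rate` and the sample votes are made up for illustration, and excluding ties from the denominator is one common convention rather than the only option.

```python
from collections import Counter
from typing import Optional


def win_rate(votes: list[str]) -> Optional[float]:
    """Fraction of decided votes that prefer A; ties are excluded.

    Returns None when every vote is a tie, since no rate is defined.
    """
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        return None
    return counts["A"] / decided


# Illustrative votes only, not real data.
votes = ["A", "A", "B", "tie", "A", "B", "A"]
rate = win_rate(votes)
print(f"A preferred in {rate:.0%} of decided comparisons")
```

With only a handful of votes the rate swings widely, so it helps to report the number of decided comparisons alongside it.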
What to read next
RLHF uses preference evals as its core signal. Hallucination is the failure mode evals exist to catch.