What are AI evals?
If a prompt change fixes one case and silently breaks three others, how would you know? Evals are how. They turn 'it felt better' into a number you can track: the verify gate from loops, applied at scale.
The problem evals solve
LLM output is non-deterministic and easy to fool yourself about. You tweak a prompt, the one example you tried looks great, you ship, and three other cases quietly regressed. An eval (short for evaluation) is a repeatable test that measures output quality across many cases, so “it seemed to work” becomes “it passes 47 of 50.” It is the same idea as the verify gate in AI loops, scaled from one run to a whole test set, and it is how you catch hallucinations before users do.
The anatomy of an eval
Every eval has three parts: a dataset of inputs (often with an expected output or a rubric), a scorer that decides pass or fail (or a number) for each, and a metric you track over time. Run it after every prompt or model change and it becomes regression testing for AI: you see instantly whether a change helped overall or just helped the one case you were staring at. A small, carefully chosen set of real cases beats a huge sloppy one.
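The three parts can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: `Case`, `exact_match`, `run_eval`, and `toy_model` are hypothetical names, and `toy_model` is a hard-coded stand-in for whatever real model call you would make.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One dataset entry: an input and the answer we expect."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Scorer: pass if the output equals the expected answer, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    model: Callable[[str], str],
    cases: list[Case],
    scorer: Callable[[str, str], bool],
) -> float:
    """Metric: the fraction of cases the model passes."""
    passed = sum(scorer(model(case.input), case.expected) for case in cases)
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


cases = [
    Case("2+2", "4"),
    Case("capital of France", "paris"),
    Case("capital of Peru", "Lima"),  # the toy model gets this one wrong
]

pass_rate = run_eval(toy_model, cases, exact_match)
print(f"pass rate: {pass_rate:.0%}")
```

Rerunning `run_eval` on the same `cases` after each prompt or model change and comparing the pass rate to the previous run is the regression check described above.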
The four ways to score
Picking the right scorer is the whole game, and it depends entirely on what “good” means for your task:
- Exact / deterministic: the output must match a known answer, or pass a test, a regex, or a JSON schema. Cheap, fast, and unambiguous. Best when there is one right answer (extraction, classification, code that must pass tests).
- Rubric scoring: grade against an explicit checklist (covers the key points, right tone, no banned phrases). Best for open-ended quality where there is no single correct string.
- LLM-as-judge: a second model scores the output against your criteria. This scales rubric grading to thousands of cases for cents, but the judge is itself a model and can be biased or gamed, so validate it against human scores before trusting it.
- Human review: the gold standard. Slow and expensive, reserved for high-stakes decisions and for calibrating the cheaper methods above.
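The first two scoring styles are simple enough to sketch with the Python standard library. The example below is illustrative only: the function names, the required-keys check (a lightweight stand-in for full JSON schema validation), and the rubric items are all assumptions chosen for the demonstration.

```python
import json
import re
from typing import Callable


def matches_regex(output: str, pattern: str) -> bool:
    """Deterministic scorer: the whole output must match the pattern."""
    return re.fullmatch(pattern, output.strip()) is not None


def valid_json_with_keys(output: str, required: set[str]) -> bool:
    """Deterministic scorer: output must parse as a JSON object containing the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data)


def rubric_score(output: str, checks: dict[str, Callable[[str], bool]]) -> float:
    """Rubric scorer: the fraction of checklist items the output satisfies."""
    results = [check(output) for check in checks.values()]
    return sum(results) / len(results)


# Hypothetical rubric for a customer-support reply.
rubric = {
    "mentions refund": lambda s: "refund" in s.lower(),
    "no banned phrase": lambda s: "unfortunately" not in s.lower(),
    "under 50 words": lambda s: len(s.split()) <= 50,
}

print(matches_regex("2024-01-15", r"\d{4}-\d{2}-\d{2}"))
print(valid_json_with_keys('{"name": "Ada", "age": 36}', {"name", "age"}))
print(rubric_score("We have issued your refund today.", rubric))
```

An LLM-as-judge scorer would have the same shape as `rubric_score`, with each check delegated to a second model instead of a hand-written predicate.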
The honest caveats
Evals are powerful and easy to misuse. Three rules keep them honest. Validate your judge: an LLM-as-judge is not ground truth until you have checked its scores against human ones on a sample. Do not optimize to the eval: if you tune endlessly against the same 50 cases, you start fitting the test instead of the task (Goodhart’s law). And keep a holdout: some cases you do not look at while iterating, so your number reflects real performance, not memorized answers. An eval you have quietly gamed is worse than none, because it gives false confidence.
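Two of these rules are mechanical enough to sketch in code. The helpers below are hypothetical and use assumed defaults: a simple agreement rate as one way to compare judge and human verdicts, and a 20% holdout fraction with a fixed seed so the split is reproducible.

```python
import random


def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of sampled cases where the LLM judge and the human gave the same verdict."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty verdict lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def split_holdout(cases: list, holdout_fraction: float = 0.2, seed: int = 0) -> tuple[list, list]:
    """Shuffle once with a fixed seed, then set aside a holdout you do not inspect while iterating."""
    shuffled = cases[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Judge versus human verdicts on a small labeled sample (made-up values).
judge_verdicts = [True, True, False, True]
human_verdicts = [True, False, False, True]
print(f"judge agrees with humans on {agreement_rate(judge_verdicts, human_verdicts):.0%}")

# 50 placeholder case IDs standing in for real eval cases.
dev_set, holdout_set = split_holdout(list(range(50)))
print(len(dev_set), len(holdout_set))
```

How much agreement is enough depends on the stakes of the task; the point is to measure it on a human-labeled sample before relying on the judge, and to report the holdout number only occasionally so it stays unfitted.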
Where to go next
- AI loops, where a single eval becomes the gate that decides when to stop.
- Hallucinations, the failure mode evals are built to catch.
- Structured outputs, what makes exact-match scoring possible.