AI Pattern · Evaluation
Judge Pattern
One model evaluates another’s output against a rubric (LLM-as-judge).
A judge model scores or compares outputs against defined criteria: a pass/fail, a numeric score, or a pick-the-better-of-two. It is the backbone of automated evaluation and of best-of-N selection, where you generate several candidates and keep the one the judge rates highest.
How it works
1. Define the criteria or rubric
2. Generate one or more candidate outputs
3. Ask the judge model to score or compare them
4. Use the verdict: gate, rank, or pick best-of-N
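The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `best_of_n`, `passes_gate`, and `toy_judge` are hypothetical names, and the judge is assumed to be any callable that maps a candidate to a numeric score. In practice that callable would wrap a call to the judge model; here a toy word-count scorer stands in so the example runs without any API.

```python
from typing import Callable, Sequence

# Assumption: a "judge" is any callable that returns a higher number
# for a better candidate. A real one would call the judge model.
Judge = Callable[[str], float]


def best_of_n(candidates: Sequence[str], judge: Judge) -> str:
    """Return the candidate the judge scores highest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=judge)


def passes_gate(candidate: str, judge: Judge, threshold: float) -> bool:
    """Quality gate: accept the candidate only if its score meets the threshold."""
    return judge(candidate) >= threshold


def toy_judge(text: str) -> float:
    """Stand-in scorer (word count) used only to make the sketch runnable."""
    return float(len(text.split()))


candidates = ["short answer", "a somewhat longer answer", "ok"]
winner = best_of_n(candidates, toy_judge)
shipped = passes_gate(winner, toy_judge, threshold=3.0)
```

Ranking is the same idea with `sorted(candidates, key=judge, reverse=True)` in place of `max`.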
Strengths
- Scales evaluation without human raters
- Best-of-N: generate several, keep the judged best
- Applies a consistent rubric every time
Watch-outs
- The judge is not ground truth; it has its own biases
- A vague rubric produces noisy verdicts
- Judging is itself a model call, so its output needs validating too
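One commonly reported judge bias in pairwise comparison is position bias, where the verdict depends on which answer is shown first. A common mitigation is to ask twice with the order swapped and accept a winner only when both verdicts agree. The sketch below assumes a hypothetical `compare(first, second)` callable that returns `"A"` if it prefers the first answer shown and `"B"` otherwise; the function names are illustrative, not part of any library.

```python
from typing import Callable

# Assumption: compare(first, second) returns "A" to prefer the first
# answer shown, or "B" to prefer the second.
Compare = Callable[[str, str], str]


def debiased_compare(a: str, b: str, compare: Compare) -> str:
    """Judge the pair in both orders; return "a", "b", or "tie" on disagreement."""
    a_wins_when_first = compare(a, b) == "A"
    a_wins_when_second = compare(b, a) == "B"
    if a_wins_when_first and a_wins_when_second:
        return "a"
    if not a_wins_when_first and not a_wins_when_second:
        return "b"
    return "tie"  # the verdict flipped with the order, so do not trust it


def always_first(first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers whatever is shown first."""
    return "A"


def prefers_longer(first: str, second: str) -> str:
    """Toy judge with no position bias: prefers the longer answer."""
    return "A" if len(first) >= len(second) else "B"
```

With `always_first`, every pair comes back as `"tie"`, which is the signal that the verdict was driven by order rather than content.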
When to use it
Use it for eval pipelines, quality gates before an answer ships, and ranking multiple candidates. Pair it with a clear rubric, and spot-check the judge's verdicts against human ratings.
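A spot-check against human ratings can be as simple as scoring the same sample twice, once by the judge and once by people, and comparing. The sketch below assumes both sets of scores are integers on the same scale and in the same order; `agreement` is a hypothetical helper, and the numbers are made up for illustration.

```python
from typing import Sequence, Tuple


def agreement(judge_scores: Sequence[int],
              human_scores: Sequence[int]) -> Tuple[float, float]:
    """Return (exact-match rate, mean absolute difference) for paired scores."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(judge_scores, human_scores))
    exact = sum(j == h for j, h in pairs) / len(pairs)
    mean_abs_diff = sum(abs(j - h) for j, h in pairs) / len(pairs)
    return exact, mean_abs_diff


# Illustrative data only: four answers rated 1-5 by the judge and by a human.
judge_scores = [5, 4, 3, 2]
human_scores = [5, 3, 3, 1]
exact, mean_abs_diff = agreement(judge_scores, human_scores)
```

What counts as acceptable agreement depends on the task; the point of the check is to notice when the judge drifts away from the human ratings, not to hit a fixed number.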
Example prompt
Score the answer below from 1–5 for factual accuracy and relevance, using this rubric:
<rubric>
Return the score and a one-line reason.
Answer: <text>
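To use a prompt like this in a pipeline, the placeholders have to be filled in and the judge's reply turned back into a number. The sketch below is one way to do that under a stated assumption: the reply puts the score first, as a single digit from 1 to 5, followed by the reason. The prompt itself does not pin down a reply format, so a stricter instruction (or structured output) would be safer in practice. `build_prompt` and `parse_verdict` are hypothetical helper names.

```python
import re
from typing import Tuple

PROMPT_TEMPLATE = (
    "Score the answer below from 1–5 for factual accuracy and relevance, "
    "using this rubric:\n{rubric}\n"
    "Return the score and a one-line reason.\n"
    "Answer: {answer}"
)


def build_prompt(rubric: str, answer: str) -> str:
    """Fill the rubric and the answer into the example prompt."""
    return PROMPT_TEMPLATE.format(rubric=rubric, answer=answer)


def parse_verdict(reply: str) -> Tuple[int, str]:
    """Extract (score, reason), assuming the first standalone 1-5 digit is the score."""
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no 1-5 score found in judge reply: {reply!r}")
    score = int(match.group(1))
    # Drop separator punctuation between the score and the reason.
    reason = reply[match.end():].lstrip(" \t-:.").strip()
    return score, reason


prompt = build_prompt("5 = fully correct and on topic", "Paris is the capital of France.")
score, reason = parse_verdict("4 - Accurate but misses the second question.")
```

Raising on an unparseable reply, rather than defaulting to a score, keeps malformed judge output from silently passing a quality gate.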