AI Pattern · Evaluation

Judge Pattern

One model evaluates another’s output against a rubric (LLM-as-judge).

A judge model scores or compares outputs against defined criteria and returns a verdict: a pass/fail, a numeric score, or a pick-the-better-of-two. It is the backbone of automated evaluation and of best-of-N selection, where you generate several candidates and keep the one the judge rates highest.

How it works

1. Define the criteria or rubric
2. Generate one or more candidate outputs
3. Ask the judge model to score or compare them
4. Use the verdict to gate, rank, or pick best-of-N
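
The four steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Judge` type, `best_of_n`, `passes_gate`, and `toy_judge` are hypothetical names, and `toy_judge` is a keyword-counting stand-in for what would be a real model call in practice.

```python
from typing import Callable, Sequence

# Hypothetical interface: a judge maps (rubric, candidate) to a numeric score.
# In a real system this would wrap a call to the judge model.
Judge = Callable[[str, str], float]


def best_of_n(candidates: Sequence[str], rubric: str, judge: Judge) -> tuple[str, float]:
    """Score every candidate and return the highest-rated one with its score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(judge(rubric, c), i, c) for i, c in enumerate(candidates)]
    # On equal scores, prefer the earliest candidate so the result is deterministic.
    score, _, best = max(scored, key=lambda t: (t[0], -t[1]))
    return best, score


def passes_gate(score: float, threshold: float) -> bool:
    """Quality gate: accept only outputs scoring at or above the threshold."""
    return score >= threshold


def toy_judge(rubric: str, candidate: str) -> float:
    """Stand-in judge for illustration: counts rubric words found in the candidate."""
    words = set(rubric.lower().split())
    text = candidate.lower()
    return float(sum(1 for w in words if w in text))


candidates = ["Paris", "The capital of France is Paris", "I don't know"]
best, score = best_of_n(candidates, "capital France Paris", toy_judge)
```

The same scores serve all three uses in step 4: compare against a threshold to gate, sort to rank, or take the maximum for best-of-N.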

Strengths

Scales evaluation without human raters
Best-of-N: generate several, keep the judged best
Applies a consistent rubric every time

Watch-outs

The judge is not ground truth; it has biases of its own
A vague rubric produces noisy verdicts
Judging is itself a model call, so its output needs validating too
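
One bias that often comes up in pairwise judging is a preference for whichever answer is shown first. A common mitigation, sketched here under assumptions, is to run the comparison in both orders and count a win only when the two verdicts agree. The `PairJudge` type and `compare_both_orders` are hypothetical names, and the two toy judges stand in for real model calls.

```python
from typing import Callable, Literal

# Hypothetical interface: the judge sees two answers in order and
# says whether the first or the second one is better.
PairJudge = Callable[[str, str], Literal["first", "second"]]


def compare_both_orders(a: str, b: str, judge: PairJudge) -> Literal["a", "b", "tie"]:
    """Judge (a, b) and (b, a); declare a winner only if both orders agree."""
    a_wins_forward = judge(a, b) == "first"    # a shown first
    a_wins_backward = judge(b, a) == "second"  # a shown second
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # the verdict flipped with the order, so treat it as unreliable


def always_first(x: str, y: str) -> Literal["first", "second"]:
    """Toy judge with pure position bias: always prefers the first answer."""
    return "first"


def prefers_longer(x: str, y: str) -> Literal["first", "second"]:
    """Toy judge whose verdict does not depend on order (it depends on length)."""
    return "first" if len(x) >= len(y) else "second"
```

A purely position-biased judge yields a tie for every pair under this scheme, which makes the bias visible instead of letting it silently pick winners.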

When to use it

Eval pipelines, quality gates before an answer ships, and ranking multiple candidates. Pair it with a clear rubric and spot-check the judge against human ratings.
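
Spot-checking the judge against human ratings can be as simple as comparing labels on a shared sample. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance); the function names and the sample labels are illustrative assumptions, not measurements.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_e is chance agreement."""
    n = len(judge)
    p_o = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    p_e = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up sample: four items labeled by the judge and by a human rater.
judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
rate = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

Kappa is worth reporting alongside raw agreement because a judge that passes nearly everything can agree with humans often purely by chance.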

Example prompt

Score the answer below from 1–5 for factual accuracy and relevance, using this rubric:
<rubric>

Return the score and a one-line reason.

Answer: <text>
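
To use a prompt like this in a pipeline, the judge's reply has to be parsed into something a program can act on. The sketch below assumes the judge is additionally told to answer in a fixed "Score: N / Reason: ..." layout; that layout, and the names `build_prompt`, `parse_verdict`, and `Verdict`, are assumptions added for illustration and are not part of the prompt above.

```python
import re
from dataclasses import dataclass

# The last line of the template is an added assumption: pinning the reply
# format makes the verdict easier to parse reliably.
PROMPT_TEMPLATE = (
    "Score the answer below from 1–5 for factual accuracy and relevance, "
    "using this rubric:\n{rubric}\n\n"
    "Return the score and a one-line reason.\n"
    "Reply exactly as:\nScore: <1-5>\nReason: <one line>\n\n"
    "Answer: {answer}"
)


@dataclass
class Verdict:
    score: int
    reason: str


def build_prompt(rubric: str, answer: str) -> str:
    """Fill the rubric and the answer into the judge prompt."""
    return PROMPT_TEMPLATE.format(rubric=rubric, answer=answer)


_SCORE_RE = re.compile(r"score\s*[:=]?\s*([1-5])\b", re.IGNORECASE)
_REASON_RE = re.compile(r"reason\s*[:=]\s*(.+)", re.IGNORECASE)


def parse_verdict(reply: str) -> Verdict:
    """Extract a 1-5 score and the reason; raise if no valid score is present."""
    score_match = _SCORE_RE.search(reply)
    if score_match is None:
        raise ValueError(f"no score in range 1-5 found in reply: {reply!r}")
    reason_match = _REASON_RE.search(reply)
    reason = reason_match.group(1).strip() if reason_match else ""
    return Verdict(score=int(score_match.group(1)), reason=reason)
```

Raising on an unparseable reply, instead of defaulting to a score, keeps malformed judge output from passing silently through a quality gate.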