AI Pattern · Evaluation

Judge Pattern

One model evaluates another’s output against a rubric (LLM-as-judge).

A judge model scores or compares outputs against defined criteria and returns a verdict: a pass/fail, a numeric score, or a pick-the-better-of-two. It is the backbone of automated evaluation and of best-of-N selection, where you generate several candidates and keep the one the judge rates highest.

How it works

  1. Define the criteria or rubric
  2. Generate one or more candidate outputs
  3. Ask the judge model to score or compare them
  4. Use the verdict: gate, rank, or pick best-of-N
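
The four steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `toy_judge` is a hypothetical stand-in that counts rubric keywords so the example runs without a model, and in practice it would be replaced by a call to a real judge model. The names `Judge`, `best_of_n`, and `passes_gate` are also assumptions made for this sketch.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (rubric, candidate) -> numeric score.
Judge = Callable[[str, str], float]


def toy_judge(rubric: str, candidate: str) -> float:
    """Stand-in for a judge model call: counts rubric keywords found in the candidate."""
    keywords = [word.strip().lower() for word in rubric.split(",")]
    text = candidate.lower()
    return float(sum(1 for keyword in keywords if keyword and keyword in text))


def best_of_n(rubric: str, candidates: Sequence[str], judge: Judge) -> Tuple[str, float]:
    """Score every candidate and return the highest-rated one with its score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(judge(rubric, candidate), candidate) for candidate in candidates]
    # max() keeps the first candidate when scores tie.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def passes_gate(score: float, threshold: float) -> bool:
    """Quality gate: let the answer through only if the score meets the threshold."""
    return score >= threshold


# Step 1: define the rubric (here, a comma-separated keyword list for the toy judge).
rubric = "paris, capital, france"
# Step 2: generate candidates (hard-coded here).
candidates = [
    "I am not sure.",
    "Paris is the capital of France.",
    "France is in Europe.",
]
# Steps 3 and 4: judge, then pick the best and apply a gate.
best, score = best_of_n(rubric, candidates, toy_judge)
shipped = passes_gate(score, threshold=2.0)
```

The same `best_of_n` function covers ranking as well: sorting `scored` instead of taking the maximum gives the full order.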

Strengths

  • Scales evaluation without human raters
  • Best-of-N: generate several, keep the judged best
  • Applies a consistent rubric every time

Watch-outs

  • The judge is not ground truth; it can have biases of its own, such as favoring longer answers or whichever option is shown first
  • A vague rubric produces noisy verdicts
  • The judge is itself a model call, so its verdicts need validating too
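
One bias that can be checked cheaply in pairwise comparisons is sensitivity to presentation order. The sketch below asks the judge twice with the order swapped and accepts a winner only when both runs agree. The `PairwiseJudge` signature (returning "A" or "B") and the two toy judges are assumptions made for illustration, not any particular library's API.

```python
from typing import Callable

# Hypothetical pairwise judge: sees (answer_a, answer_b) and returns "A" or "B".
PairwiseJudge = Callable[[str, str], str]


def debiased_compare(first: str, second: str, judge: PairwiseJudge) -> str:
    """Run the comparison in both orders; report a winner only if the runs agree."""
    forward = judge(first, second)    # `first` is shown in position A
    backward = judge(second, first)   # `second` is shown in position A
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    # The verdict flipped with the order, so treat it as inconclusive.
    return "tie"


def always_a(answer_a: str, answer_b: str) -> str:
    """Toy judge with pure position bias: always prefers whatever is shown first."""
    return "A"


def prefers_longer(answer_a: str, answer_b: str) -> str:
    """Toy judge that is order-independent: prefers the longer answer."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


biased_verdict = debiased_compare("short", "a longer answer", always_a)
stable_verdict = debiased_compare("short", "a longer answer", prefers_longer)
```

This doubles the number of judge calls per comparison, which is the price of catching order-dependent verdicts.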

When to use it

Eval pipelines, quality gates before an answer ships, and ranking multiple candidates. Pair it with a clear rubric and spot-check the judge against human ratings.
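
A spot-check against human ratings can be as simple as comparing paired scores on a small sample. The sketch below assumes both the judge and the humans scored the same items on the same numeric scale; the function name `agreement_report`, the `tolerance` parameter, and the sample scores are made up for illustration.

```python
from typing import Dict, Sequence


def agreement_report(
    judge_scores: Sequence[int],
    human_scores: Sequence[int],
    tolerance: int = 1,
) -> Dict[str, float]:
    """Compare judge scores with human scores for the same items, in the same order."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    if not judge_scores:
        raise ValueError("need at least one rated item")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    return {
        # Share of items where judge and human gave the identical score.
        "exact_agreement": sum(1 for d in diffs if d == 0) / n,
        # Share of items where the scores differ by at most `tolerance` points.
        "within_tolerance": sum(1 for d in diffs if d <= tolerance) / n,
        # Average size of the disagreement, in score points.
        "mean_abs_error": sum(diffs) / n,
    }


# Illustrative data only: four answers scored 1-5 by the judge and by a human.
report = agreement_report(judge_scores=[5, 4, 2, 3], human_scores=[5, 3, 2, 1])
```

Low agreement on the sample is a signal to tighten the rubric or change the judge before trusting its verdicts at scale.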

Example prompt

Score the answer below from 1–5 for factual accuracy and relevance, using this rubric:
<rubric>

Return the score and a one-line reason.

Answer: <text>
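
To use a prompt like this in a pipeline, the placeholders have to be filled in and the reply parsed back into a score. The sketch below mirrors the example prompt but adds one assumption that is not in the original: it asks the judge to reply in a fixed "Score: / Reason:" format so the verdict can be parsed reliably. The names `build_prompt` and `parse_verdict` are hypothetical.

```python
import re
from typing import Tuple

# Mirrors the example prompt; the explicit reply format is an added assumption.
PROMPT_TEMPLATE = (
    "Score the answer below from 1-5 for factual accuracy and relevance, "
    "using this rubric:\n{rubric}\n\n"
    "Return the score and a one-line reason, formatted as 'Score: <n>' on the "
    "first line and 'Reason: <text>' on the second.\n\n"
    "Answer: {answer}"
)


def build_prompt(rubric: str, answer: str) -> str:
    """Fill the rubric and the answer into the judge prompt."""
    return PROMPT_TEMPLATE.format(rubric=rubric, answer=answer)


def parse_verdict(reply: str) -> Tuple[int, str]:
    """Extract (score, reason) from a judge reply; reject anything outside 1-5."""
    score_match = re.search(r"Score:\s*([1-5])\b", reply)
    if score_match is None:
        raise ValueError("no score between 1 and 5 found in judge reply")
    reason_match = re.search(r"Reason:\s*(.+)", reply)
    reason = reason_match.group(1).strip() if reason_match else ""
    return int(score_match.group(1)), reason


prompt = build_prompt(
    rubric="5 = fully accurate and on-topic; 1 = wrong or off-topic.",
    answer="Paris is the capital of France.",
)
# A reply written by hand for illustration; a real one would come from the judge model.
score, reason = parse_verdict("Score: 4\nReason: Accurate but misses one detail.")
```

Raising an error on an unparseable reply, rather than defaulting to a score, keeps malformed judge output from silently passing a quality gate.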