LLM-as-Judge: Automating Your Evaluations
Scoring twenty outputs by hand against a rubric is fine. Scoring two hundred, every time you tweak a prompt, is not. The standard solution is LLM-as-judge: a second model call, given your rubric, scores the outputs of your prompt. Done well, it lets you run a full eval quickly and re-run it on every change. Done badly, it gives you confident numbers that are quietly wrong.
This lesson shows you how to build a judge you can trust, and the specific biases that make naive judges unreliable.
What You'll Learn
- How an LLM-as-judge eval is structured
- Why you should reuse your rubric, not invent a new one for the judge
- The main biases that distort AI judgments and how to control them
- Pointwise scoring versus pairwise comparison, and when to use each
- How to validate that your judge agrees with human judgment before you trust it
The Basic Setup
An LLM-as-judge eval has a simple shape. For each case in your eval set:
- Run your task prompt on the input to produce an output.
- Send that output, plus the input and your rubric, to a separate judge prompt.
- Collect the score the judge returns (and, ideally, a short justification).
- Aggregate the scores across the eval set into one number.
The judge prompt is just the rubric-as-instructions you wrote in the previous lesson, wrapped so the model returns a structured score you can collect. Asking the judge to explain its score in one line before giving the number both improves the score's quality and lets you audit it when something looks off.
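The four steps above can be sketched as a small loop. This is a minimal sketch, not a definitive implementation: `task` and `judge` are hypothetical callables standing in for whatever model client you use, and the template wording, the `Reason:` / `Score:` reply format, and the 0 to 5 scale are assumptions chosen for illustration.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Assumed template: the rubric goes in as instructions, and the judge is
# asked for a one-line reason before the number.
JUDGE_TEMPLATE = """You are grading an output against a rubric.

Rubric:
{rubric}

Input:
{input}

Output to grade:
{output}

Reply with exactly two lines:
Reason: <one sentence>
Score: <integer from 0 to 5>"""


def parse_score(reply: str) -> Optional[int]:
    """Return the integer after 'Score:' in the judge's reply, or None."""
    match = re.search(r"Score:\s*([0-5])\b", reply)
    return int(match.group(1)) if match else None


def run_eval(cases, task: Callable[[str], str],
             judge: Callable[[str], str], rubric: str):
    """Run the task on each input, judge each output, and aggregate.

    Returns (mean score over parseable replies, per-case rows).
    """
    rows = []
    for case_input in cases:
        output = task(case_input)
        prompt = JUDGE_TEMPLATE.format(
            rubric=rubric, input=case_input, output=output)
        reply = judge(prompt)
        rows.append({"input": case_input, "output": output,
                     "score": parse_score(reply), "reply": reply})
    scored = [row["score"] for row in rows if row["score"] is not None]
    return (mean(scored) if scored else None), rows
```

Keeping the raw `reply` in each row is what makes the audit possible: when a score looks off, you can read the judge's stated reason instead of guessing.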
When LLM-as-Judge Fits, and When It Does Not
Use a judge when the criterion is nuanced: helpfulness, tone, clarity, instruction-following, or whether an answer is grounded in a source. These are exactly the things a script cannot check but a capable model can assess reasonably.
Do not use a judge when a programmatic check would do. If you can verify the answer with an exact match, a keyword check, or a JSON-valid test, do that instead. It is cheaper, faster, and not subject to the biases below. Reserve the judge for genuinely subjective quality.
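For reference, the three programmatic checks named above each fit in a few lines of standard-library Python. The function names are illustrative, and the whitespace stripping and case-insensitive matching are assumptions you may want to tighten for your task.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """True when the output equals the expected answer, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def contains_keywords(output: str, keywords: list[str]) -> bool:
    """True when every keyword appears in the output (case-insensitive)."""
    lowered = output.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def is_valid_json(output: str) -> bool:
    """True when the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```

If one of these covers your criterion, it can replace the judge call for that criterion entirely.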
The Biases You Must Control
A judge model is not a neutral oracle. Research and practice in 2026 consistently identify a few biases that will skew your numbers if you ignore them.
- Position bias. In a head-to-head comparison, judges tend to favor whichever answer is shown first, regardless of quality. Control: run each comparison both ways (A then B, and B then A) and only trust the result if the verdict is consistent.
- Verbosity bias. Judges tend to reward longer, more elaborate answers even when the extra length adds nothing. Control: add a rubric line that explicitly penalizes padding, and watch for a prompt that "wins" only by getting wordier.
- Self-preference (family) bias. A judge tends to prefer outputs from its own model family. Control: where it matters, use a judge from a different model family than the one that produced the output.
- Lenient drift. Pointwise scores tend to clump high and drift between runs. Control: prefer pairwise comparison for ranking decisions, and re-run to check stability.
None of these mean LLM-as-judge is unusable. They mean an uncontrolled judge is unreliable, and a controlled one is a powerful tool.
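Two of these controls, watching for wordier "wins" and re-running for stability, can be turned into quick numeric checks. The sketch below is one possible way to do that, with hypothetical function names; it assumes you already have outputs and pointwise scores for an old and a new prompt version, that outputs are non-empty, and that word count is an acceptable proxy for length.

```python
from statistics import mean


def verbosity_check(old_outputs, new_outputs, old_scores, new_scores):
    """Compare the score gain with the length growth between two versions.

    A positive score_gain alongside a length_ratio well above 1.0 is a hint
    (not proof) that the new prompt may be winning by getting wordier.
    """
    old_len = mean(len(text.split()) for text in old_outputs)
    new_len = mean(len(text.split()) for text in new_outputs)
    return {
        "score_gain": mean(new_scores) - mean(old_scores),
        "length_ratio": new_len / old_len,
    }


def score_spread(runs):
    """Per-case (max - min) score across repeated judge runs.

    runs is a list of score lists, one list per run, all over the same
    cases in the same order. Large spreads mark unstable cases.
    """
    return [max(case_scores) - min(case_scores) for case_scores in zip(*runs)]
```

Neither number decides anything on its own. They tell you which cases to read by hand.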
Pointwise Versus Pairwise
There are two ways to ask a judge to evaluate.
Pointwise (absolute): "Score this output from 0 to 5 on the rubric." Easy to aggregate, scales linearly with your eval set, but drifts and clumps. Best for tracking an absolute quality bar over time.
Pairwise (relative): "Here are two outputs for the same input. Which better satisfies the rubric, A or B?" More reliable, because choosing is easier and more stable than grading. The cost is more comparisons and the need to control position bias. Best when you are deciding which of two prompt versions to keep.
A practical rule: use pointwise to watch your absolute quality over many versions, and switch to pairwise when you need a confident verdict between two specific candidates. For the highest-stakes pairwise calls, ask the judge twice with the order flipped and only count the result if it agrees with itself.
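The order-flip check can be sketched as follows. Here `judge` is a hypothetical callable that takes the input and two outputs in display order and returns "A" if the first one shown wins or "B" if the second does; how it calls a model is left out.

```python
from typing import Callable, Optional


def consistent_verdict(judge: Callable[[str, str, str], str],
                       case_input: str, out_1: str, out_2: str) -> Optional[str]:
    """Ask the judge twice with the display order flipped.

    Returns "out_1" or "out_2" when both orderings pick the same output,
    and None when they disagree or a reply is not "A" or "B".
    """
    forward = judge(case_input, out_1, out_2)   # out_1 is shown as A
    backward = judge(case_input, out_2, out_1)  # out_2 is shown as A
    if forward not in ("A", "B") or backward not in ("A", "B"):
        return None
    forward_winner = "out_1" if forward == "A" else "out_2"
    backward_winner = "out_2" if backward == "A" else "out_1"
    return forward_winner if forward_winner == backward_winner else None
```

A judge that always answers "A" is purely position-biased, and this function returns None for it on every case, which is the behavior you want: an inconsistent verdict is discarded rather than counted.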
A Reliable Pairwise Judge Prompt
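One possible shape for such a prompt is sketched below. The wording is illustrative rather than a canonical template: the `Reason:` / `Verdict:` reply format, the anti-padding instruction, and the helper names are all assumptions. Your own rubric from the previous lesson fills the `{rubric}` slot.

```python
import re
from typing import Optional

# Illustrative wording only; adapt it to your rubric and model.
PAIRWISE_TEMPLATE = """You are comparing two outputs for the same input.

Rubric:
{rubric}

Input:
{input}

Output A:
{output_a}

Output B:
{output_b}

Judge only against the rubric. Do not reward length for its own sake.
Reply with exactly two lines:
Reason: <one sentence>
Verdict: <A or B>"""


def build_pairwise_prompt(rubric: str, case_input: str,
                          output_a: str, output_b: str) -> str:
    """Fill the template; call it twice with A and B swapped to control position bias."""
    return PAIRWISE_TEMPLATE.format(
        rubric=rubric, input=case_input,
        output_a=output_a, output_b=output_b)


def parse_verdict(reply: str) -> Optional[str]:
    """Return "A" or "B" from the judge's reply, or None if it is missing."""
    match = re.search(r"Verdict:\s*([AB])\b", reply)
    return match.group(1) if match else None
```

Asking for the reason before the verdict follows the same logic as in the basic setup: the justification comes first, and you can audit it later.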
Validate the Judge Before You Trust It
Here is the step almost everyone skips: check that your judge agrees with a human. Before you let a judge make decisions for you, hand-score ten to fifteen cases yourself, then have the judge score the same cases. Compare the two sets of scores case by case. If the judge mostly agrees with you, you can trust it on the rest of the set. If it disagrees often, your rubric is too vague or the judge needs a stronger prompt. Fix that first.
This calibration step is what separates a real eval from a number that merely looks rigorous. A judge you have not validated is just another guess wearing a lab coat.
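The comparison itself is simple arithmetic. A minimal sketch, assuming both you and the judge used the same integer scale on the same cases in the same order; the function names and the optional `tolerance` parameter are illustrative, and what counts as "mostly agrees" is left for you to decide.

```python
def agreement(human_scores, judge_scores, tolerance=0):
    """Fraction of cases where the judge is within `tolerance` of the human score."""
    if not human_scores or len(human_scores) != len(judge_scores):
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(h - j) <= tolerance
               for h, j in zip(human_scores, judge_scores))
    return hits / len(human_scores)


def disagreements(human_scores, judge_scores, tolerance=0):
    """Indexes of the cases to re-read by hand."""
    return [i for i, (h, j) in enumerate(zip(human_scores, judge_scores))
            if abs(h - j) > tolerance]


human = [5, 4, 2, 3, 5]
judge = [5, 4, 4, 3, 4]
print(agreement(human, judge))                   # exact agreement
print(agreement(human, judge, tolerance=1))      # within one point
print(disagreements(human, judge, tolerance=1))  # cases worth re-reading
```

The list of disagreeing cases is usually more useful than the rate: reading those cases tells you whether the rubric is vague or the judge is misreading it.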
Key Takeaways
- LLM-as-judge automates rubric scoring so you can run full evals on every change.
- Reuse your rubric as the judge's instructions and ask for a one-line justification.
- Control the known biases: position (swap order), verbosity (penalize padding), self-preference (cross-family judge), and lenient drift (prefer pairwise, re-run for stability).
- Use pointwise to track absolute quality over time and pairwise to decide between two candidates.
- Always validate the judge against your own hand-scoring before trusting it.

