Rubrics and Metrics: Scoring Prompt Quality Objectively
You have an eval set. Now you need a way to turn each output into a number so you can compare prompt versions. This lesson is about scoring: choosing the right metric for the task and writing rubrics precise enough that two different people (or the same AI judge twice) would score the same output the same way.
The goal is consistency. A score that swings with the grader's mood is worse than no score at all, because it can tell you a prompt change helped when it did nothing.
What You'll Learn
- The three families of metrics and when each fits
- How to write a rubric with clear, separated criteria
- Why a 1 to 5 scale needs anchored descriptions, not just numbers
- How to combine multiple criteria into one comparable score
- How to spot a vague rubric that will produce noisy scores
Three Families of Metrics
Match the metric to what the task actually requires.
1. Exact and programmatic metrics. Used when correctness is mechanical. Examples: did the output equal the expected label, does it contain the required keyword, is it valid JSON, is it under the word limit, does the extracted number match. These are pass/fail or a simple percentage, and a script or a careful human can apply them with zero ambiguity. Prefer these whenever the task allows, because they are the most trustworthy.
2. Rubric metrics. Used when quality is a matter of degree: is this summary good, is this email professional, is this explanation clear. You define named criteria and a scale, then score each criterion. This is where most open-ended prompts live.
3. Preference metrics. Used when you care about relative quality: is output A better than output B. Instead of an absolute score you get a win rate. You will go deep on this in the A/B testing lesson; for now, know it exists as a third option.
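As a minimal sketch, the exact and programmatic checks from the first family could look like the following. The function names and the sample cases are hypothetical, chosen only for illustration; the point is that each check returns a plain pass/fail that a script can apply the same way every time.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Pass if the output equals the expected label, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()


def is_valid_json(output: str) -> bool:
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def within_word_limit(output: str, limit: int) -> bool:
    """Pass if the output has at most `limit` whitespace-separated words."""
    return len(output.split()) <= limit


def pass_rate(results: list[bool]) -> float:
    """Share of cases that passed, as a fraction between 0 and 1."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical eval cases: (model output, expected label)
cases = [("Positive", "positive"), ("negative ", "negative"), ("neutral", "positive")]
results = [exact_match(out, exp) for out, exp in cases]
print(f"exact-match pass rate: {pass_rate(results):.2f}")
```

Whether to normalize case and whitespace before comparing is a design choice; a stricter task might compare the raw strings instead.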
Anatomy of a Good Rubric
A rubric is a set of criteria, each with a scale, where each point on the scale has a written description. The descriptions are the part people skip, and skipping them is why their scores are noisy.
Compare these two rubrics for scoring a meeting-summary prompt.
Weak rubric (will produce noisy scores):
- Quality: 1 to 5
That is just a vibe with a number attached. Two graders will disagree constantly.
Strong rubric (separated criteria, anchored scale):
- Accuracy (0 to 2): 0 = contains a claim not supported by the transcript. 1 = all claims supported but missing a key decision. 2 = all claims supported and includes every decision made.
- Completeness (0 to 2): 0 = misses major topics. 1 = covers main topics, skips action items. 2 = covers topics and lists action items with owners.
- Concision (0 to 1): 0 = over 150 words or padded with filler. 1 = 150 words or fewer, no filler.
Total possible: 5. Now two graders looking at the same summary will almost always land on the same number, because the rubric tells them exactly what each score means.
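One way to keep graders honest about the anchors is to store the rubric as data and reject any score that has no written description. This is a sketch under assumptions: the dictionary layout and the helper names (validate_scores, total) are hypothetical, and the level descriptions are paraphrased from the strong rubric above.

```python
# Each criterion maps a score level to its anchored description
# (paraphrased from the meeting-summary rubric in this lesson).
RUBRIC = {
    "accuracy": {
        0: "contains a claim not supported by the transcript",
        1: "all claims supported but missing a key decision",
        2: "all claims supported and includes every decision made",
    },
    "completeness": {
        0: "misses major topics",
        1: "covers main topics, skips action items",
        2: "covers topics and lists action items with owners",
    },
    "concision": {
        0: "over the word limit or padded with filler",
        1: "within the word limit, no filler",
    },
}


def validate_scores(scores: dict[str, int]) -> None:
    """Raise ValueError if a criterion is missing, unknown, or scored at an undefined level."""
    if set(scores) != set(RUBRIC):
        raise ValueError(f"expected criteria {sorted(RUBRIC)}, got {sorted(scores)}")
    for name, level in scores.items():
        if level not in RUBRIC[name]:
            raise ValueError(f"{name}: level {level} has no anchored description")


def total(scores: dict[str, int]) -> int:
    """Validate the per-criterion scores, then add them up."""
    validate_scores(scores)
    return sum(scores.values())


# Highest level of each criterion: 2 + 2 + 1 = 5
MAX_TOTAL = sum(max(levels) for levels in RUBRIC.values())

print(total({"accuracy": 2, "completeness": 1, "concision": 1}), "out of", MAX_TOTAL)
```

Keeping the per-criterion scores alongside the total preserves the reason behind a low number, which a single "quality" score would hide.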
Three Rules for Rubrics That Hold Up
- Separate the criteria. Score accuracy, completeness, and tone independently. If you mash them into one "quality" number, a beautifully written but factually wrong output gets a confusingly middling score and you can't see why.
- Anchor every level. Each point on the scale needs a description of what earns it. "3 out of 5" means nothing; "all claims supported, missing one action item" means something.
- Make the failure modes explicit. Put the things you most want to avoid (a hallucinated fact, an omitted deadline) into the lowest level of a criterion so they are penalized hard and consistently.
Turning a Rubric Into a Prompt
You can hand the rubric to an AI judge, which you will do in the next lesson. To do that well, the rubric has to be written as instructions, not as a private note to yourself: name each criterion, spell out every level, and tell the judge what format to return the scores in.
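As a minimal sketch, the meeting-summary rubric from earlier might be expressed as a scoring prompt like this. The exact wording, the JSON output format, and the build_scoring_prompt helper are assumptions for illustration, not a prescribed template; treating exactly 150 words as within the limit is also an assumption.

```python
# Hypothetical scoring prompt built from the meeting-summary rubric.
# Doubled braces ({{ }}) are literal braces in str.format templates.
SCORING_PROMPT = """\
You are grading a meeting summary against its transcript.
Score each criterion independently, using only the levels defined below.

Accuracy (0 to 2):
  0 = contains a claim not supported by the transcript
  1 = all claims supported but missing a key decision
  2 = all claims supported and includes every decision made

Completeness (0 to 2):
  0 = misses major topics
  1 = covers main topics, skips action items
  2 = covers topics and lists action items with owners

Concision (0 to 1):
  0 = over 150 words or padded with filler
  1 = 150 words or fewer, no filler

Return only JSON in this form:
{{"accuracy": <int>, "completeness": <int>, "concision": <int>}}

Transcript:
{transcript}

Summary to grade:
{summary}
"""


def build_scoring_prompt(transcript: str, summary: str) -> str:
    """Fill the template with one eval case."""
    return SCORING_PROMPT.format(transcript=transcript, summary=summary)


print(build_scoring_prompt("Ana: ship Friday. Ben: I own QA.", "Ship Friday; Ben owns QA."))
```

Asking for structured output makes the judge's scores easy to parse and to check against the allowed levels.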
Combining Criteria Into One Number
Once you score several criteria, you usually want one number to compare prompt versions. Two common ways:
- Sum or weighted sum. Add the criteria, optionally weighting the ones you care about most (accuracy might count double). This gives a single comparable score per output, and you average across the eval set.
- Pass rate on a bar. Define "acceptable" as a threshold (for example, total of at least 4 and accuracy of at least 1), then report the percentage of cases that clear the bar. This is often more decision-useful than an average, because it answers "how often is this prompt good enough to ship?"
Both are valid. Pick one and stay consistent so your numbers stay comparable across versions.
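Both ways of combining criteria can be sketched in a few lines. The per-case scores below are made-up illustration data, and the weights (accuracy counted double) and the bar (total of at least 4, accuracy of at least 1) follow the examples in the text rather than any recommended setting.

```python
# Hypothetical rubric scores for four eval cases under one prompt version.
scored_cases = [
    {"accuracy": 2, "completeness": 2, "concision": 1},
    {"accuracy": 2, "completeness": 1, "concision": 1},
    {"accuracy": 0, "completeness": 2, "concision": 1},
    {"accuracy": 1, "completeness": 1, "concision": 0},
]

# Assumed weights: accuracy counts double.
WEIGHTS = {"accuracy": 2.0, "completeness": 1.0, "concision": 1.0}


def weighted_sum(scores: dict[str, int], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted total for one output."""
    return sum(weights[name] * value for name, value in scores.items())


def clears_bar(scores: dict[str, int], min_total: int = 4, min_accuracy: int = 1) -> bool:
    """True if the unweighted total and the accuracy score both meet the threshold."""
    return sum(scores.values()) >= min_total and scores["accuracy"] >= min_accuracy


mean_weighted = sum(weighted_sum(s) for s in scored_cases) / len(scored_cases)
bar_pass_rate = sum(clears_bar(s) for s in scored_cases) / len(scored_cases)

print(f"mean weighted score: {mean_weighted:.2f}")  # (7 + 6 + 3 + 3) / 4 = 4.75
print(f"pass rate on the bar: {bar_pass_rate:.0%}")  # 2 of 4 cases clear the bar
```

In this toy data the average hides that half the cases fail the bar, which is the kind of difference the pass rate is meant to surface.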
A Quick Diagnostic
Look at any rubric and ask: if I gave this rubric and the same output to two careful people, would they produce the same score? If the honest answer is "probably not," the rubric is too vague. Add anchored descriptions until the answer is yes.
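If you have scores from two graders (or two runs of the same judge) on the same outputs, the diagnostic can be made concrete with a simple agreement rate. This is a rough sketch: the agreement_rate helper and the sample scores are hypothetical, and raw agreement does not correct for agreement by chance the way a statistic such as Cohen's kappa does.

```python
def agreement_rate(grader_a: list[int], grader_b: list[int]) -> float:
    """Fraction of outputs on which both graders gave exactly the same score."""
    if len(grader_a) != len(grader_b):
        raise ValueError("both graders must score the same outputs")
    if not grader_a:
        raise ValueError("need at least one scored output")
    matches = sum(a == b for a, b in zip(grader_a, grader_b))
    return matches / len(grader_a)


# Hypothetical total scores from two graders on the same five summaries.
scores_a = [5, 4, 2, 5, 3]
scores_b = [5, 4, 3, 5, 3]

print(f"exact agreement: {agreement_rate(scores_a, scores_b):.0%}")  # 4 of 5 match
```

The cases where the graders disagree are worth reading closely, since they usually point at the level description that needs a sharper anchor.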
Key Takeaways
- Choose the metric family that fits: exact checks, rubric scoring, or preference comparison.
- A good rubric separates criteria, anchors every scale level with a description, and explicitly penalizes the worst failure modes.
- Vague "rate quality 1 to 5" rubrics produce noisy scores that mislead you.
- Combine criteria into one comparable number with a weighted sum or a pass-rate threshold, and keep the method consistent.
- The test of a rubric is reproducibility: two careful graders should reach the same score.

