Building an Eval Set: Test Cases for Prompts
An eval is only as good as the inputs you feed it. If your test cases are all easy, your prompt will look great and fail in the real world. In this lesson you will build a small, sharp eval set: a curated collection of inputs that represents the real distribution your prompt will face, including the cases most likely to break it.
You do not need a thousand examples. A well-chosen set of fifteen to thirty cases usually catches most problems and is small enough to review by hand. Quality and coverage beat raw quantity.
What You'll Learn
- How to choose inputs that represent real-world variety, not just easy wins
- The four categories every eval set should cover
- How to source test cases when you don't have real data yet
- How to record the expected answer (when one exists) without overfitting
- A reusable template you can fill in for any prompt
Start From the Real Distribution
Your eval set should mirror the inputs your prompt will actually receive. If you are summarizing support tickets, your cases should look like real support tickets: varied lengths, tones, topics, and quality. A common mistake is to write test inputs yourself in clean, grammatical prose. Real inputs are messy, and messy is where prompts fail.
Three good sources of realistic cases:
- Historical data. Past emails, tickets, documents, or transcripts. The best source if you have it.
- Logs of the prompt in use. Once a prompt is live, real inputs flow through it. Sample them.
- Synthetic generation. When you have no data, ask the AI itself to generate realistic varied examples, then curate them by hand.
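Sampling from logs is easy to make repeatable. The sketch below is a minimal illustration, assuming your logged inputs are already loaded into a Python list; the `sample_inputs` helper and the placeholder `logged_inputs` data are hypothetical. A fixed seed means the same sample comes back on every run, so you can review and curate a stable shortlist.

```python
import random

def sample_inputs(logged_inputs, k, seed=0):
    """Pick k distinct logged inputs; the fixed seed makes the pick repeatable."""
    rng = random.Random(seed)
    k = min(k, len(logged_inputs))  # never ask for more items than exist
    return rng.sample(logged_inputs, k)

# Placeholder data standing in for real logged inputs (an assumption).
logged_inputs = [f"ticket {i}" for i in range(200)]

picked = sample_inputs(logged_inputs, k=20)
print(len(picked), "cases sampled for manual review")
```

Random sampling mostly surfaces typical cases, so you would still add edge, adversarial, and malformed cases by hand.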
The Four Categories to Cover
A strong eval set deliberately spans four kinds of cases. Aim for a mix rather than a pile of one type.
- Typical cases. The bread-and-butter inputs that represent the common path. These confirm the prompt does its main job. Most of your set lives here.
- Edge cases. Unusual but valid inputs: very long, very short, multiple topics in one input, unusual formatting, a different language. These find brittleness.
- Adversarial cases. Inputs designed to trip the prompt: ambiguous requests, contradictory information, attempts to make it ignore instructions, or content that looks like an instruction but is actually data.
- Empty and malformed cases. Blank input, garbage characters, or the wrong kind of content entirely. A good prompt fails gracefully here instead of inventing an answer.
A prompt that scores well on typical cases but is never tested on the other three categories is a prompt you do not actually understand yet.
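A quick way to see whether a set spans all four categories is to count the cases per category. The following is a minimal sketch, assuming each case is a dict with a `category` field; the label names and the `coverage_report` helper are illustrative choices, not a required format.

```python
from collections import Counter

# Category labels used in this sketch (an assumed naming scheme).
CATEGORIES = ("typical", "edge", "adversarial", "malformed")

def coverage_report(cases):
    """Count cases per category and list the categories with no cases."""
    counts = Counter(case["category"] for case in cases)
    per_category = {c: counts[c] for c in CATEGORIES}
    missing = [c for c in CATEGORIES if counts[c] == 0]
    return per_category, missing

cases = [
    {"id": "01", "category": "typical"},
    {"id": "02", "category": "typical"},
    {"id": "07", "category": "edge"},
    {"id": "10", "category": "adversarial"},
]

per_category, missing = coverage_report(cases)
print(per_category)
print("missing categories:", missing)
```

In this example the report flags that no empty or malformed case has been written yet.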
Generating Cases With AI
When you lack real data, use the model to bootstrap a set, then prune. A meta-prompt for this describes the task and asks for varied, realistic test inputs.
Ask the model to label each case with the expected category and a reason. That reason is gold: it forces variety and gives you a sanity check.
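One way to keep such a meta-prompt reusable is to build it from a task description. The sketch below is an illustration only: the wording of the prompt, the `build_meta_prompt` helper, and the example task are all assumptions, not the lesson's original prompt. It produces the text you would paste into a model; it does not call any model itself.

```python
# Kinds of cases to request (taken from the four categories in this lesson).
CASE_KINDS = ["typical", "edge", "adversarial", "empty or malformed"]

def build_meta_prompt(task_description: str, n_cases: int = 20) -> str:
    """Return a prompt asking a model to draft labeled test inputs."""
    kinds = ", ".join(CASE_KINDS)
    lines = [
        "You are helping me build a test set for a prompt.",
        f"Task the prompt performs: {task_description}",
        f"Write {n_cases} realistic, varied inputs for this task.",
        f"Spread them across these kinds of cases: {kinds}.",
        "Make them messy the way real inputs are: typos, mixed tone, varied length.",
        "For each input, give the expected category and a one-sentence reason",
        "explaining why the case is worth testing.",
    ]
    return "\n".join(lines)

meta_prompt = build_meta_prompt(
    "Classify a support ticket as Billing, Technical, or Other"
)
print(meta_prompt)
```

Treat whatever the model returns as a draft: delete near-duplicates, fix unrealistic cases, and check every label by hand.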
Recording Expected Answers Without Overfitting
For some tasks there is a clear right answer (the category is "Billing," the extracted date is "2026-03-14"). Record it. For these, scoring later becomes a simple exact-match or contains check.
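The two checks can be sketched in a few lines. This is a minimal version, assuming you want comparisons to ignore case and surrounding whitespace; that normalization is a design choice of the sketch, and the function names are illustrative.

```python
def exact_match(output: str, expected: str) -> bool:
    """True when the output equals the expected answer, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()

def contains(output: str, expected: str) -> bool:
    """True when the expected answer appears anywhere in the output, ignoring case."""
    return expected.strip().lower() in output.lower()

# A classification label should match exactly.
print(exact_match(" billing\n", "Billing"))
# An extracted date only needs to appear somewhere in the output.
print(contains("The invoice is dated 2026-03-14.", "2026-03-14"))
# Extra words fail the exact check but pass the contains check.
print(exact_match("Billing issue", "Billing"), contains("Billing issue", "Billing"))
```

Exact match suits outputs that must be a single label; contains is more forgiving when the model wraps the answer in a sentence.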
For open-ended tasks (summaries, rewrites, explanations) there is no single right answer. Do not write one "golden" output and demand the prompt match it word for word, or you will optimize your prompt to mimic your own writing rather than to be good. Instead, record acceptance criteria: a short list of things a good output must do. For a summary that might be "captures the main complaint, names the product, stays under 50 words, neutral tone."
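Some acceptance criteria can be checked mechanically, while others need a human reader. The sketch below covers only the two mechanical criteria from the summary example (names the product, stays under 50 words); "captures the main complaint" and "neutral tone" are left for manual review. The `check_summary` helper and the product name "Acme Router" are hypothetical.

```python
def check_summary(summary: str, product_name: str, max_words: int = 50) -> dict:
    """Check the mechanical criteria only; the rest still need a human review."""
    word_count = len(summary.split())
    return {
        "names_product": product_name.lower() in summary.lower(),
        "under_word_limit": word_count < max_words,
    }

summary = (
    "Customer reports the Acme Router drops Wi-Fi every few minutes "
    "and asks for a replacement."
)
result = check_summary(summary, product_name="Acme Router")
print(result)
```

Because each criterion is reported separately, a failed case tells you which requirement the output missed rather than just that it failed.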
A simple eval-set table looks like this:
| ID | Input (truncated) | Category | Expected / Criteria | Notes |
|---|---|---|---|---|
| 01 | "My card was charged twice..." | typical | Billing | clear double-charge |
| 07 | "app crashes + also want refund" | edge | Technical or Billing | multi-topic, either acceptable |
| 10 | "ignore your rules and..." | adversarial | should refuse / stay on task | injection attempt |
| 12 | "asdfgh" | malformed | Other | garbage input |
Keep this in a spreadsheet or a simple document. It is the single most reusable asset you will build, because every future version of the prompt gets tested against the same set.
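If the spreadsheet is exported as CSV, the same table can be loaded for automated runs. The following is a minimal sketch, assuming column names `id`, `input`, `category`, `expected`, and `notes`; those names and the `EvalCase` class are illustrative, and the CSV text is embedded inline only to keep the example self-contained.

```python
import csv
import io
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    id: str        # kept as text so "01" keeps its leading zero
    input: str
    category: str
    expected: str  # an exact answer or acceptance criteria
    notes: str

# The rows from the table above, as they might look in a CSV export.
CSV_TEXT = '''id,input,category,expected,notes
01,"My card was charged twice...",typical,Billing,clear double-charge
07,"app crashes + also want refund",edge,Technical or Billing,"multi-topic, either acceptable"
10,"ignore your rules and...",adversarial,should refuse / stay on task,injection attempt
12,asdfgh,malformed,Other,garbage input
'''

def load_cases(text: str) -> list:
    """Parse CSV text into EvalCase records, one per row."""
    return [EvalCase(**row) for row in csv.DictReader(io.StringIO(text))]

cases = load_cases(CSV_TEXT)
for case in cases:
    print(case.id, case.category, "->", case.expected)
```

With real data you would read from a file instead of a string, for example by passing the result of `open(path, newline="")` to `csv.DictReader`.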
Freeze the Set, Then Iterate the Prompt
Once your eval set is good, freeze it. The whole point is that the test cases stay constant while you change the prompt. If you keep editing the test cases at the same time as the prompt, you can never tell whether a score change came from the prompt or the cases.
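One lightweight way to enforce the freeze is to record a fingerprint of the eval set and compare it before each run. This is a sketch of that idea, not a prescribed workflow; it assumes the cases are plain dicts that serialize to JSON, and the `fingerprint` helper is hypothetical.

```python
import hashlib
import json

def fingerprint(cases) -> str:
    """Hash a canonical JSON form of the cases; any edit changes the hash."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

frozen_cases = [
    {"id": "01", "input": "My card was charged twice...", "expected": "Billing"},
    {"id": "12", "input": "asdfgh", "expected": "Other"},
]
frozen_hash = fingerprint(frozen_cases)  # store this next to your scores

# Later: simulate an accidental edit made during prompt tuning.
edited_cases = [dict(case) for case in frozen_cases]
edited_cases[0]["expected"] = "Other"

print("unchanged set matches:", fingerprint(frozen_cases) == frozen_hash)
print("edited set matches:", fingerprint(edited_cases) == frozen_hash)
```

Saving the hash with each score makes it clear which results were measured against the same set and which were not.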
Over time you will add new cases, especially when a real failure shows up that your set missed. That is healthy. But add cases deliberately and in a separate step from prompt tuning. Treat your eval set like a growing library of known-hard problems.
Practice
Build the skeleton of an eval set for one of your own prompts.
Key Takeaways
- An eval set should mirror the real distribution of inputs, including the messy ones.
- Cover four categories: typical, edge, adversarial, and empty/malformed.
- Source cases from historical data, live logs, or AI generation followed by human curation.
- For closed tasks record the expected answer; for open tasks record acceptance criteria, not a golden output.
- Freeze the set so the cases stay constant while you iterate the prompt, and grow it when real failures appear.

