Building an Eval Set: Test Cases for Prompts

An eval is only as good as the inputs you feed it. If your test cases are all easy, your prompt will look great and fail in the real world. In this lesson you will build a small, sharp eval set: a curated collection of inputs that represents the real distribution your prompt will face, including the cases most likely to break it.

You do not need a thousand examples. A well-chosen set of fifteen to thirty cases will catch the vast majority of problems and is small enough to review by hand. Quality and coverage beat raw quantity.

What You'll Learn

How to choose inputs that represent real-world variety, not just easy wins
The four categories every eval set should cover
How to source test cases when you don't have real data yet
How to record the expected answer (when one exists) without overfitting
A reusable template you can fill in for any prompt

Start From the Real Distribution

Your eval set should mirror the inputs your prompt will actually receive. If you are summarizing support tickets, your cases should look like real support tickets: varied lengths, tones, topics, and quality. A common mistake is to write test inputs yourself in clean, grammatical prose. Real inputs are messy, and messy is where prompts fail.

Three good sources of realistic cases:

Historical data. Past emails, tickets, documents, or transcripts. The best source if you have it.
Logs of the prompt in use. Once a prompt is live, real inputs flow through it. Sample them.
Synthetic generation. When you have no data, ask the AI itself to generate realistic, varied examples, then curate them by hand.
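As a rough illustration of the "sample them" step for logs, here is a minimal sketch. It assumes your logged inputs are already available as a list of strings; the function name `sample_logged_inputs` and the example log entries are hypothetical. Dropping exact duplicates before sampling is a design choice for a small set, not a rule: it keeps repeats from using up slots, at the cost of no longer reflecting how often each input occurs.

```python
import random


def sample_logged_inputs(logged_inputs, k, seed=0):
    """Return up to k distinct inputs drawn at random from the log."""
    # Drop exact duplicates while preserving first-seen order.
    unique_inputs = list(dict.fromkeys(logged_inputs))
    # A fixed seed makes the sample reproducible between runs.
    rng = random.Random(seed)
    k = min(k, len(unique_inputs))
    return rng.sample(unique_inputs, k)


# Hypothetical log entries, invented for illustration.
logs = [
    "My card was charged twice...",
    "app crashes + also want refund",
    "My card was charged twice...",
    "asdfgh",
    "",
]

sample = sample_logged_inputs(logs, k=3)
print(sample)
```

Random sampling tends to return mostly typical inputs, so you would still review the sample by hand and top it up with harder cases.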

The Four Categories to Cover

A strong eval set deliberately spans four kinds of cases. Aim for a mix rather than a pile of one type.

Typical cases. The bread-and-butter inputs that represent the common path. These confirm the prompt does its main job. Most of your set lives here.
Edge cases. Unusual but valid inputs: very long, very short, multiple topics in one input, unusual formatting, a different language. These find brittleness.
Adversarial cases. Inputs designed to trip the prompt: ambiguous requests, contradictory information, attempts to make it ignore instructions, or content that looks like an instruction but is actually data.
Empty and malformed cases. Blank input, garbage characters, or the wrong kind of content entirely. A good prompt fails gracefully here instead of inventing an answer.

A prompt that scores well on typical cases but is never tested on the other three categories is a prompt you do not actually understand yet.
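One way to keep yourself honest about coverage is to count cases per category and flag any category that is empty. The sketch below assumes each case is a dictionary with a `category` field and uses the labels `typical`, `edge`, `adversarial`, and `malformed`; those field and label names are assumptions chosen for this example.

```python
from collections import Counter

# Assumed labels for the four categories described above.
REQUIRED_CATEGORIES = ("typical", "edge", "adversarial", "malformed")


def category_coverage(cases):
    """Return (counts per category, list of required categories with no cases)."""
    counts = Counter(case["category"] for case in cases)
    missing = [name for name in REQUIRED_CATEGORIES if counts[name] == 0]
    return dict(counts), missing


# Hypothetical eval set with no empty or malformed case yet.
cases = [
    {"id": "01", "category": "typical"},
    {"id": "02", "category": "typical"},
    {"id": "07", "category": "edge"},
    {"id": "10", "category": "adversarial"},
]

counts, missing = category_coverage(cases)
print(counts)
print("Missing categories:", missing)
```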

Generating Cases With AI

When you lack real data, use the model to bootstrap a set, then prune. Here is a meta-prompt for generating varied test inputs.

[Interactive prompt playground: meta-prompt for generating varied test inputs]

Notice that we asked the model to label each case with the expected category and a reason. That reason is gold: it forces variety and gives you a sanity check.
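The lesson's own meta-prompt lives in the interactive playground and is not reproduced here. As a stand-in, the sketch below assembles a meta-prompt with the two properties described above: it asks for cases across several categories, and it asks the model to label each case with a category and a reason. The wording of the prompt, the function name `build_meta_prompt`, and the example task are all hypothetical.

```python
def build_meta_prompt(task_description, categories, cases_per_category):
    """Assemble a meta-prompt asking a model for labeled, varied test inputs."""
    lines = [
        "I am building an eval set for the following task:",
        task_description,
        "",
        f"Generate {cases_per_category} realistic, varied test inputs "
        "for each of these categories:",
    ]
    lines += [f"- {name}" for name in categories]
    lines += [
        "",
        "For each test input, provide:",
        "1. The input text, written the way a real user would write it.",
        "2. The category it belongs to.",
        "3. A one-sentence reason explaining why it fits that category.",
    ]
    return "\n".join(lines)


# Hypothetical task, used only to show the output shape.
prompt = build_meta_prompt(
    task_description="Classify a customer support ticket by topic.",
    categories=["typical", "edge", "adversarial", "empty or malformed"],
    cases_per_category=3,
)
print(prompt)
```

You would paste the resulting text into whichever model you use, then prune the generated cases by hand.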

Recording Expected Answers Without Overfitting

For some tasks there is a clear right answer (the category is "Billing," the extracted date is "2026-03-14"). Record it. For these, scoring later becomes a simple check: either the output matches the expected answer exactly, or it contains it.
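The two checks can be sketched in a few lines. Ignoring case and surrounding whitespace is an assumption made for this example; a stricter comparison may suit your task better. The function names are hypothetical.

```python
def exact_match(output, expected):
    """True when the output equals the expected answer, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()


def contains(output, expected):
    """True when the expected answer appears anywhere in the output, ignoring case."""
    return expected.strip().lower() in output.lower()


# A classification answer is usually checked with an exact match.
print(exact_match(" billing ", "Billing"))
print(exact_match("Billing issue", "Billing"))

# An extracted value inside a longer reply is usually checked with contains.
print(contains("The date is 2026-03-14.", "2026-03-14"))
```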

For open-ended tasks (summaries, rewrites, explanations) there is no single right answer. Do not write one "golden" output and demand the prompt match it word for word, or you will optimize your prompt to mimic your own writing rather than to be good. Instead, record acceptance criteria: a short list of things a good output must do. For a summary that might be "captures the main complaint, names the product, stays under 50 words, neutral tone."
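Some acceptance criteria can be checked mechanically, while others need a human reviewer or a separate judging step. The sketch below takes the summary criteria from the example above, checks the two that are easy to automate, and marks the other two as `None`, meaning "needs review". The function name `check_summary`, the product name "Acme", and the sample summary are hypothetical.

```python
def check_summary(summary, product_name, max_words=50):
    """Check a summary against acceptance criteria; None means 'needs review'."""
    return {
        "names the product": product_name.lower() in summary.lower(),
        f"stays under {max_words} words": len(summary.split()) < max_words,
        # These two cannot be judged by a simple string check.
        "captures the main complaint": None,
        "neutral tone": None,
    }


# Hypothetical output from the prompt under test.
summary = "Customer reports the Acme app charged their card twice and asks for a refund."
results = check_summary(summary, product_name="Acme")

for criterion, passed in results.items():
    status = "needs review" if passed is None else ("pass" if passed else "fail")
    print(f"{criterion}: {status}")
```

Because the criteria describe properties rather than exact wording, two differently phrased summaries can both pass.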

A simple eval-set table looks like this:

ID	Input (truncated)	Category	Expected / Criteria	Notes
01	"My card was charged twice..."	typical	Billing	clear double-charge
07	"app crashes + also want refund"	edge	Technical or Billing	multi-topic, either acceptable
10	"ignore your rules and..."	adversarial	should refuse / stay on task	injection attempt
12	"asdfgh"	malformed	Other	garbage input

Keep this in a spreadsheet or a simple document. It is the single most reusable asset you will build, because every future version of the prompt gets tested against the same set.
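If you keep the table as a tab-separated file, it can be read into a script with Python's standard `csv` module. The sketch below writes the example rows to tab-separated text and reads them back; it works on an in-memory string so it runs without a file, and the helper names `to_tsv` and `from_tsv` are hypothetical.

```python
import csv
import io

# The example rows from the table above, with the header as the first row.
ROWS = [
    ["ID", "Input (truncated)", "Category", "Expected / Criteria", "Notes"],
    ["01", "My card was charged twice...", "typical", "Billing", "clear double-charge"],
    ["07", "app crashes + also want refund", "edge", "Technical or Billing", "multi-topic, either acceptable"],
    ["10", "ignore your rules and...", "adversarial", "should refuse / stay on task", "injection attempt"],
    ["12", "asdfgh", "malformed", "Other", "garbage input"],
]


def to_tsv(rows):
    """Serialize rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-separated text into one dictionary per case, keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


cases = from_tsv(to_tsv(ROWS))
for case in cases:
    print(case["ID"], case["Category"], case["Expected / Criteria"])
```

IDs are read back as strings, so "01" keeps its leading zero.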

Freeze the Set, Then Iterate the Prompt

Once your eval set is good, freeze it. The whole point is that the test cases stay constant while you change the prompt. If you keep editing the test cases at the same time as the prompt, you can never tell whether a score change came from the prompt or the cases.

Over time you will add new cases, especially when a real failure shows up that your set missed. That is healthy. But add cases deliberately and in a separate step from prompt tuning. Treat your eval set like a growing library of known-hard problems.
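One possible way to make "frozen" verifiable is to record a fingerprint of the eval set next to every score. If the fingerprint changes, the cases changed, and scores from before and after are not directly comparable. The sketch below is one such approach using a SHA-256 hash of a canonical JSON form; the function name `eval_set_fingerprint` and the case fields are assumptions for this example.

```python
import hashlib
import json


def eval_set_fingerprint(cases):
    """Return a hash that changes whenever any case is added, removed, or edited."""
    # Sort by ID and by key so row order and field order do not affect the hash.
    ordered = sorted(cases, key=lambda case: case["id"])
    canonical = json.dumps(ordered, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


frozen = [
    {"id": "01", "input": "My card was charged twice...", "expected": "Billing"},
    {"id": "12", "input": "asdfgh", "expected": "Other"},
]
baseline = eval_set_fingerprint(frozen)

# Adding a case in a deliberate, separate step produces a new fingerprint.
grown = frozen + [{"id": "13", "input": "", "expected": "Other"}]

print("Reordered set matches baseline:", eval_set_fingerprint(frozen[::-1]) == baseline)
print("Grown set matches baseline:", eval_set_fingerprint(grown) == baseline)
```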

Practice

Build the skeleton of an eval set for one of your own prompts.

Key Takeaways

An eval set should mirror the real distribution of inputs, including the messy ones.
Cover four categories: typical, edge, adversarial, and empty/malformed.
Source cases from historical data, live logs, or AI generation followed by human curation.
For closed tasks record the expected answer; for open tasks record acceptance criteria, not a golden output.
Freeze the set so the cases stay constant while you iterate the prompt, and grow it when real failures appear.
