Refinement Loops: Improving a Prompt With Its Own Failures

The most powerful way to improve a prompt is to feed it its own failures. A refinement loop is a tight cycle: run the prompt on your eval set, find the cases it gets wrong, understand why, change one thing, and re-run. Each turn of the loop is guided by evidence, not by hunches. This is where evaluation and meta-prompting combine into a real optimization process.

This lesson gives you the loop as a concrete, repeatable procedure you can run by hand in a chat window or scale up with tooling later.

What You'll Learn

The five steps of a refinement loop
Why you change one thing at a time
How to read failures for their root cause instead of patching symptoms
How to use the model to propose the next fix without losing control
When to stop iterating

The Loop

A refinement loop has five steps that repeat:

Run the current prompt against your frozen eval set.
Score every output with your rubric or checks, and record the aggregate.
Inspect failures. Pull the lowest-scoring cases and read them.
Hypothesize and change. Form one hypothesis about why they fail, then make one targeted change to the prompt.
Re-run and compare. Score again. If the number went up, keep the change. If it went down or stayed flat, revert and try a different hypothesis.

That is the whole engine. Its power comes from discipline: a frozen eval set, one change per turn, and a recorded score every time.
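The five steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `EVAL_SET`, `evaluate`, `refine_once`, and `stub_model` are hypothetical names, the exact-match check stands in for whatever rubric you use, and the "model" is a stub so the example runs without any API.

```python
# Hypothetical frozen eval set: (input, expected) pairs that never change between turns.
EVAL_SET = [("2+2", "4"), ("3+3", "6"), ("10-7", "3"), ("5*5", "25")]


def evaluate(prompt, run_prompt):
    """Steps 1-3: run the prompt on every case, score it, and collect the failures."""
    failures = []
    for case_input, expected in EVAL_SET:
        output = run_prompt(prompt, case_input)
        if output.strip() != expected:  # exact match stands in for your rubric
            failures.append((case_input, expected, output))
    score = 1 - len(failures) / len(EVAL_SET)
    return score, failures


def refine_once(current, candidate, run_prompt):
    """Steps 4-5: try one candidate change; keep it only if the score strictly improves."""
    current_score, _ = evaluate(current, run_prompt)
    candidate_score, _ = evaluate(candidate, run_prompt)
    if candidate_score > current_score:
        return candidate, candidate_score
    return current, current_score  # flat or worse: revert


def stub_model(prompt, case_input):
    """Stand-in for a real model call: it answers tersely only when told to."""
    answer = dict(EVAL_SET)[case_input]
    return answer if "only the number" in prompt else f"The answer is {answer}."


baseline = "Solve the arithmetic problem."
candidate = baseline + " Reply with only the number."  # the one change for this turn
best_prompt, best_score = refine_once(baseline, candidate, stub_model)
print(best_prompt, best_score)
```

In real use you would replace `stub_model` with your actual model call and keep everything else fixed, so that the only thing varying between turns is the prompt.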

Change One Thing at a Time

It is tempting to fix five things at once. Resist it. If you change the role, add two examples, and rewrite the output format in a single turn, and the score goes up, you have no idea which change helped. Worse, one change might help while another hurts; the two effects cancel out, the score stays flat, and you discard a good idea.

Change one variable, re-run, observe. This is slower per turn but far faster overall, because you build real knowledge of what moves your score. Keep a simple log:

Version   Change made                            Score   Keep?
v1        baseline                               68%     -
v2        added expert role                      71%     yes
v3        added 2 examples                       79%     yes
v4        stricter output format                 77%     revert
v5        reworded the failing edge case rule    84%     yes

This log is the record of what actually works for your task. It is also how you explain your choices to a teammate.
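If you would rather keep the log in code than in a spreadsheet, one possible shape is sketched below. The `LogEntry` and `build_log` names are illustrative assumptions. The keep-or-revert decision compares each score with the best kept score so far, which mirrors the rule from the loop.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    version: str
    change: str
    score: float  # aggregate score on the frozen eval set, in percent
    keep: str     # "-", "yes", or "revert"


def build_log(trials):
    """Compare each trial with the best kept score so far and record the decision."""
    log = []
    best = None
    for version, change, score in trials:
        if best is None:
            decision = "-"       # baseline: nothing to compare against
            best = score
        elif score > best:
            decision = "yes"
            best = score
        else:
            decision = "revert"  # flat or lower: discard the change
        log.append(LogEntry(version, change, score, decision))
    return log


trials = [
    ("v1", "baseline", 68),
    ("v2", "added expert role", 71),
    ("v3", "added 2 examples", 79),
    ("v4", "stricter output format", 77),
    ("v5", "reworded the failing edge case rule", 84),
]
log = build_log(trials)
for entry in log:
    print(f"{entry.version}\t{entry.change}\t{entry.score}%\t{entry.keep}")
```

Note that v5 is judged against v3's 79%, not v4's 77%, because v4 was reverted and never became the working prompt.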

Read Failures for Root Cause

The highest-value skill in the loop is diagnosis. When a case fails, do not patch the symptom; find the cause. Ask:

Is it an instruction problem? The prompt is ambiguous or missing a rule, so the model guessed.
Is it a context problem? The model lacked information it needed and filled the gap by inventing.
Is it a format problem? The answer was right but in a shape your check could not accept.
Is it a capability problem? The task is genuinely hard and the model cannot do it reliably no matter how you phrase it.

The fix is different for each. An instruction problem needs a clearer rule. A context problem needs more input or a retrieval step. A format problem needs a tighter output spec. A capability problem needs decomposition into smaller steps, or a different approach entirely. Patching a capability problem with prompt wording wastes loops.
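One way to make diagnosis concrete is to label each failing case with one of the four causes as you read it, then tally the labels. The sketch below assumes you assign the labels by hand; `FIX_FOR` and `most_common_cause` are hypothetical names, and the example labels are invented for illustration.

```python
from collections import Counter

# The four root causes and the kind of fix each one calls for.
FIX_FOR = {
    "instruction": "add or clarify a rule in the prompt",
    "context": "supply more input or add a retrieval step",
    "format": "tighten the output spec",
    "capability": "decompose the task or change approach",
}


def most_common_cause(labels):
    """Return the dominant root cause, its count, and the matching kind of fix."""
    if not labels:
        raise ValueError("no failures to diagnose")
    unknown = set(labels) - set(FIX_FOR)
    if unknown:
        raise ValueError(f"unrecognized root-cause labels: {sorted(unknown)}")
    cause, count = Counter(labels).most_common(1)[0]
    return cause, count, FIX_FOR[cause]


# Hypothetical labels assigned while reading eight failing cases.
labels = ["instruction"] * 6 + ["format", "context"]
result = most_common_cause(labels)
print(result)
```

The tally tells you which cluster to attack in the next turn; the smaller clusters wait for later turns.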

Let the Model Propose the Fix

You can put the model inside the loop. Show it the failing cases and ask it to propose a single targeted change. You stay in control by deciding whether to accept the proposal and by measuring the result.


Asking for the single most common root cause and exactly one change keeps the loop clean. You then run the revised prompt and check whether your score rose.
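If you run the loop from a script, the request to the model can be assembled from the current prompt and the failing cases. The wording below is an illustrative guess at such a request, not the lesson's own prompt, and `build_fix_request` is a hypothetical helper.

```python
def build_fix_request(current_prompt, failures):
    """Assemble a meta-prompt asking for one root cause and exactly one change."""
    lines = [
        "Here is a prompt and some cases where it produced the wrong output.",
        "",
        "PROMPT:",
        current_prompt,
        "",
        "FAILING CASES:",
    ]
    for i, case in enumerate(failures, start=1):
        lines.append(f"{i}. input: {case['input']}")
        lines.append(f"   expected: {case['expected']}")
        lines.append(f"   got: {case['got']}")
    lines += [
        "",
        "Name the single most common root cause of these failures.",
        "Then propose exactly one targeted change to the prompt that addresses it.",
        "Do not rewrite the whole prompt.",
    ]
    return "\n".join(lines)


# Invented example data for illustration only.
failures = [
    {"input": "Arriving next Tuesday.", "expected": "2024-03-05", "got": "next Tuesday"},
]
request = build_fix_request("Extract the delivery date from the email.", failures)
print(request)
```

The model's reply is only a proposal. You still apply the change yourself, re-run the eval set, and keep the change only if the score improves.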

A Worked Mini-Example

Suppose a prompt that extracts a delivery date from order emails scores 72%. You inspect the eight failures. Six of them are emails where the date is written as "next Tuesday" rather than an explicit date, and the model returns the literal phrase instead of a resolved date. That is an instruction problem: the prompt never said how to handle relative dates.

One targeted change: add the rule "If the date is relative (such as 'next Tuesday'), resolve it to an absolute date using the email's send date, and if the send date is unknown, return null." Re-run. Score jumps to 89%. You keep it, log it, and move to the next failure cluster.

Notice what made this work: you did not guess. You read the failures, saw a pattern, named the root cause, and made one change aimed at it.
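To score relative-date cases you need expected absolute dates in your eval set. The sketch below shows one way to compute them, assuming that "next Tuesday" means the first Tuesday strictly after the send date. That reading is an assumption; some people use the phrase to mean the Tuesday of the following week. `resolve_next_weekday` is a hypothetical helper that handles only the "next <weekday>" pattern.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_next_weekday(phrase, send_date):
    """Resolve 'next <weekday>' relative to send_date; return None if it cannot be resolved."""
    if send_date is None:
        return None  # mirrors the rule: unknown send date means null
    words = phrase.lower().split()
    if len(words) != 2 or words[0] != "next" or words[1] not in WEEKDAYS:
        return None  # this sketch handles only the 'next <weekday>' pattern
    target = WEEKDAYS.index(words[1])  # Monday is 0, matching date.weekday()
    days_ahead = (target - send_date.weekday() - 1) % 7 + 1  # always between 1 and 7
    return send_date + timedelta(days=days_ahead)


sent = date(2024, 3, 1)  # a Friday
print(resolve_next_weekday("next Tuesday", sent))  # 2024-03-05
print(resolve_next_weekday("next Tuesday", None))  # None
```

Whichever reading you choose, state it in the prompt rule as well, so the model and the expected answers in your eval set follow the same convention.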

When to Stop

The loop has diminishing returns. Stop when:

Your score clears the bar you set for shipping, or
Two or three consecutive turns fail to move the number, suggesting you have hit a capability ceiling, or
The remaining failures are genuinely ambiguous cases where even a human would disagree on the right answer.

Chasing the last few percent on hard, ambiguous cases often costs more than it is worth. Ship when the prompt reliably clears your bar, and add any new real-world failures to the eval set as they appear.
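The first two stopping conditions are mechanical and can be checked in code. The third, genuinely ambiguous cases, remains a human judgment. The sketch below is one possible check; `should_stop`, `ship_bar`, and the default `patience` of three turns are illustrative assumptions.

```python
def should_stop(scores, ship_bar, patience=3):
    """Decide whether to stop, given the score from each turn so far (oldest first)."""
    if not scores:
        return False, "no runs yet"
    if max(scores) >= ship_bar:
        return True, "cleared the shipping bar"
    # Stalled: none of the last `patience` turns beat the best score seen before them.
    if len(scores) > patience:
        earlier_best = max(scores[:-patience])
        if max(scores[-patience:]) <= earlier_best:
            return True, "score has stalled"
    return False, "keep iterating"


print(should_stop([68, 71, 79, 77, 84], ship_bar=90))  # still improving
print(should_stop([68, 71, 79, 77, 84], ship_bar=80))  # bar cleared
print(should_stop([68, 80, 79, 80, 78], ship_bar=90))  # three turns without progress
```

A stall is a signal to re-check your diagnosis before giving up: if the remaining failures look like capability problems, more wording changes are unlikely to help.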


Key Takeaways

A refinement loop is: run, score, inspect failures, change one thing, re-run and compare.
Change a single variable per turn so you know what moved the score, and log every version.
Diagnose root cause (instruction, context, format, or capability) instead of patching symptoms.
You can let the model propose a single targeted fix, but you decide and you measure.
Stop when you clear your bar, the score stalls, or the remaining cases are genuinely ambiguous.

