Refinement Loops: Improving a Prompt With Its Own Failures
The most powerful way to improve a prompt is to feed it its own failures. A refinement loop is a tight cycle: run the prompt on your eval set, find the cases it gets wrong, understand why, change one thing, and re-run. Each turn of the loop is guided by evidence, not by hunches. This is where evaluation and meta-prompting combine into a real optimization process.
This lesson gives you the loop as a concrete, repeatable procedure you can run by hand in a chat window or scale up with tooling later.
What You'll Learn
- The five steps of a refinement loop
- Why you change one thing at a time
- How to read failures for their root cause instead of patching symptoms
- How to use the model to propose the next fix without losing control
- When to stop iterating
The Loop
A refinement loop has five steps that repeat:
- Run the current prompt against your frozen eval set.
- Score every output with your rubric or checks, and record the aggregate.
- Inspect failures. Pull the lowest-scoring cases and read them.
- Hypothesize and change. Form one hypothesis about why they fail, then make one targeted change to the prompt.
- Re-run and compare. Score again. If the number went up, keep the change. If it went down or stayed flat, revert and try a different hypothesis.
That is the whole engine. Its power comes from discipline: a frozen eval set, one change per turn, and a recorded score every time.
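The five steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `score_prompt`, `refinement_turn`, and `fake_model` are hypothetical names, scoring is assumed to be exact match, and `fake_model` stands in for a real model call so the sketch runs offline.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (input text, expected output); hypothetical shape


def score_prompt(prompt: str, eval_set: list[EvalCase],
                 run_model: Callable[[str, str], str]) -> tuple[float, list[EvalCase]]:
    """Steps 1-3: run every case, score by exact match, collect the failures."""
    failures = [(text, expected) for text, expected in eval_set
                if run_model(prompt, text) != expected]
    return (len(eval_set) - len(failures)) / len(eval_set), failures


def refinement_turn(current: str, candidate: str, eval_set: list[EvalCase],
                    run_model: Callable[[str, str], str]) -> tuple[str, float]:
    """Steps 4-5: keep the candidate only if its score is strictly higher."""
    current_score, _ = score_prompt(current, eval_set, run_model)
    candidate_score, _ = score_prompt(candidate, eval_set, run_model)
    if candidate_score > current_score:
        return candidate, candidate_score
    return current, current_score  # flat or worse: revert


def fake_model(prompt: str, text: str) -> str:
    """Stand-in for a real model call (assumption, for illustration only)."""
    return text.upper() if "uppercase" in prompt else text


eval_set = [("ab", "AB"), ("cd", "CD")]  # frozen: never edited between turns
kept, score = refinement_turn("Echo the input.", "Echo the input in uppercase.",
                              eval_set, fake_model)
print(kept, score)
```

In practice you would swap `fake_model` for your own model call and exact match for your rubric; the keep-or-revert comparison stays the same.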
Change One Thing at a Time
It is tempting to fix five things at once. Resist it. If you change the role, add two examples, and rewrite the output format in a single turn, and the score goes up, you have no idea which change helped. Worse, one change might help while another hurts; the two cancel out, the score stays flat, and you discard a good idea.
Change one variable, re-run, observe. This is slower per turn but far faster overall, because you build real knowledge of what moves your score. Keep a simple log:
| Version | Change made | Score | Keep? |
|---|---|---|---|
| v1 | baseline | 68% | - |
| v2 | added expert role | 71% | yes |
| v3 | added 2 examples | 79% | yes |
| v4 | stricter output format | 77% | revert |
| v5 | reworded the failing edge case rule | 84% | yes |
This log is the record of what actually works for your task. It is also how you explain your choices to a teammate.
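If you would rather keep the log in code than in a table, one possible shape is sketched below. `LogEntry` and `record` are hypothetical names, and the keep rule (strictly better than the best kept score so far, with the baseline treated as kept) is an assumption that happens to reproduce the decisions in the table above.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    version: str
    change: str
    score: float  # percent on the frozen eval set
    keep: bool


def record(log: list[LogEntry], version: str, change: str, score: float) -> bool:
    """Append an entry; keep it only if it beats the best kept score so far."""
    best_kept = max((entry.score for entry in log if entry.keep), default=None)
    keep = best_kept is None or score > best_kept  # the baseline is always kept
    log.append(LogEntry(version, change, score, keep))
    return keep


log: list[LogEntry] = []
for version, change, score in [
    ("v1", "baseline", 68),
    ("v2", "added expert role", 71),
    ("v3", "added 2 examples", 79),
    ("v4", "stricter output format", 77),
    ("v5", "reworded the failing edge case rule", 84),
]:
    record(log, version, change, score)

decisions = [entry.keep for entry in log]
print(decisions)  # [True, True, True, False, True]
```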
Read Failures for Root Cause
The high-value skill in the loop is diagnosis. When a case fails, do not patch the symptom; find the cause. Ask:
- Is it an instruction problem? The prompt is ambiguous or missing a rule, so the model guessed.
- Is it a context problem? The model lacked information it needed and filled the gap by inventing.
- Is it a format problem? The answer was right but in a shape your check could not accept.
- Is it a capability problem? The task is genuinely hard and the model cannot do it reliably no matter how you phrase it.
The fix is different for each. An instruction problem needs a clearer rule. A context problem needs more input or a retrieval step. A format problem needs a tighter output spec. A capability problem needs decomposition into smaller steps, or a different approach entirely. Patching a capability problem with prompt wording wastes loops.
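Once you have labeled each failure by hand, a small tally can point at the cluster to attack first. The sketch below assumes you assign one of the four labels yourself; `FIX_FOR` and `next_fix` are hypothetical names, and the sample labels are made up for illustration.

```python
from collections import Counter

# Root cause -> the kind of fix it calls for, as described above.
FIX_FOR = {
    "instruction": "add or clarify a rule",
    "context": "supply more input or add a retrieval step",
    "format": "tighten the output spec",
    "capability": "decompose the task or change approach",
}


def next_fix(labels: list[str]) -> tuple[str, str]:
    """Return the most common root cause and the kind of fix it needs."""
    cause, _count = Counter(labels).most_common(1)[0]
    return cause, FIX_FOR[cause]


# Made-up labels for eight failing cases (illustration only).
labels = ["instruction"] * 6 + ["format", "context"]
print(next_fix(labels))
```

The labeling itself is the diagnostic work; the code only keeps you from chasing a one-off failure while a larger cluster sits untouched.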
Let the Model Propose the Fix
You can put the model inside the loop. Show it the failing cases and ask it to propose a single targeted change. You stay in control by deciding whether to accept the proposal and by measuring the result.
When you write that request, ask for the single most common root cause and exactly one change; this keeps the loop clean. You then run the revised prompt and check whether your score rose.
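A request of that shape could be assembled as follows. This is only a sketch: `build_fix_request` is a hypothetical helper, the wording of the meta-prompt is one possible phrasing, and the `input`/`expected`/`got` keys are an assumed layout for your failure records.

```python
def build_fix_request(prompt: str, failures: list[dict]) -> str:
    """Assemble a meta-prompt asking for one root cause and exactly one change."""
    cases = "\n\n".join(
        f"Case {i}\nInput: {case['input']}\n"
        f"Expected: {case['expected']}\nGot: {case['got']}"
        for i, case in enumerate(failures, start=1)
    )
    return (
        "Here is a prompt and the cases it currently fails.\n\n"
        f"PROMPT:\n{prompt}\n\n"
        f"FAILING CASES:\n{cases}\n\n"
        "Name the single most common root cause of these failures, "
        "then propose exactly one targeted change to the prompt. "
        "Do not rewrite anything else."
    )


request = build_fix_request(
    "Extract the delivery date from the email.",
    [{"input": "Arriving next Tuesday.", "expected": "2024-03-12", "got": "next Tuesday"}],
)
print(request)
```

You paste or send `request` to the model, read its proposal, and decide whether to apply it. The measurement step is unchanged: the proposal only counts if the score rises.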
A Worked Mini-Example
Suppose a prompt that extracts a delivery date from order emails scores 72%. You inspect the eight failures. Six of them are emails where the date is written as a relative phrase such as "next Tuesday" rather than an explicit date, and the model returns the literal phrase instead of a resolved date. That is an instruction problem: the prompt never said how to handle relative dates.
One targeted change: add the rule "If the date is relative (such as 'next Tuesday'), resolve it to an absolute date using the email's send date, and if the send date is unknown, return null." Re-run. Score jumps to 89%. You keep it, log it, and move to the next failure cluster.
Notice what made this work: you did not guess. You read the failures, saw a pattern, named the root cause, and made one change aimed at it.
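To score cases like these you need the expected absolute date for each email. The sketch below computes it for the "next <weekday>" pattern only. `resolve_next_weekday` is a hypothetical helper, and it assumes "next Tuesday" means the first Tuesday strictly after the send date, which is one common reading but not the only one.

```python
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_next_weekday(phrase: str, send_date: Optional[date]) -> Optional[str]:
    """Resolve 'next <weekday>' to an ISO date; None if it cannot be resolved."""
    if send_date is None:
        return None  # mirrors the rule: unknown send date -> null
    words = phrase.lower().split()
    if len(words) != 2 or words[0] != "next" or words[1] not in WEEKDAYS:
        return None  # pattern not handled by this sketch
    target = WEEKDAYS.index(words[1])
    # Days until the first matching weekday strictly after the send date (1..7).
    days_ahead = (target - send_date.weekday() - 1) % 7 + 1
    return (send_date + timedelta(days=days_ahead)).isoformat()


# An email sent on Wednesday 2024-03-06 that promises delivery "next Tuesday".
print(resolve_next_weekday("next Tuesday", date(2024, 3, 6)))  # 2024-03-12
```

Whatever reading of "next" you pick, state it in the prompt rule and use the same one in your expected answers, or the check will penalize correct outputs.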
When to Stop
The loop has diminishing returns. Stop when:
- Your score clears the bar you set for shipping, or
- Two or three consecutive turns fail to move the number, suggesting you have hit a capability ceiling, or
- The remaining failures are genuinely ambiguous cases where even a human would disagree on the right answer.
Chasing the last few percent on hard, ambiguous cases often costs more than it is worth. Ship when the prompt reliably clears your bar, and add any new real-world failures to the eval set as they appear.
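The first two stopping rules are mechanical enough to check in code; the third (genuinely ambiguous cases) still needs human judgment. In this sketch `should_stop` is a hypothetical helper, `scores` is assumed to hold the score of every turn in order, and `patience` is an assumed name for the number of stalled turns you tolerate.

```python
def should_stop(scores: list[float], bar: float, patience: int = 3) -> bool:
    """Stop when the latest score clears the bar or recent turns have stalled."""
    if not scores:
        return False
    if scores[-1] >= bar:
        return True  # rule 1: the shipping bar is cleared
    if len(scores) <= patience:
        return False  # not enough history to call it a stall
    best_before = max(scores[:-patience])
    # Rule 2: none of the last `patience` turns beat the earlier best.
    return max(scores[-patience:]) <= best_before


history = [68, 71, 79, 77, 84]
print(should_stop(history, bar=80))  # True: 84 clears the bar
print(should_stop(history, bar=90))  # False: still improving
```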
Key Takeaways
- A refinement loop is: run, score, inspect failures, change one thing, re-run and compare.
- Change a single variable per turn so you know what moved the score, and log every version.
- Diagnose root cause (instruction, context, format, or capability) instead of patching symptoms.
- You can let the model propose a single targeted fix, but you decide and you measure.
- Stop when you clear your bar, the score stalls, or the remaining cases are genuinely ambiguous.

