Iterating on Failures: Vague Specs, Hallucination, Scope Creep
Verification will fail sometimes. That is the system working, not breaking. A failed check means you caught a problem before it shipped. The skill is reacting to the failure correctly: diagnosing why the output is wrong and fixing the right thing, instead of blindly re-prompting until the symptoms go away.
This lesson is the fourth step of the loop — iterate — done deliberately. The single most important question when a check fails is: was the spec wrong, or was the output wrong? Answer that first, every time.
What You'll Learn
- The diagnostic question that drives every iteration
- How to fix a vague spec instead of patching its symptoms
- How to recognize and recover from agent hallucination
- How to stop scope creep before it compounds
- When to re-prompt, when to reset, and when to step in
The diagnostic fork
When a criterion fails verification, do not immediately tell the agent "fix it." First decide which of two things is broken.
Decision
A verification check failed. Why?
- If The spec was clear and the output disobeys it
Output is wrong — re-prompt with the specific failing criterion
The agent misimplemented a well-defined requirement
- If The output is reasonable given what you wrote
Spec is wrong — fix the spec, then re-run
You under-specified; the agent guessed and guessed differently than you meant
This fork matters because the fixes are opposite. If the spec was unclear, telling the agent "fix it" just invites a different guess. If the spec was clear and the agent disobeyed, sharpening the spec wastes effort on something already correct. Diagnose first.
Failure mode 1: the vague spec
The most common failure is that verification reveals your spec was ambiguous. The agent did something defensible; it just is not what you wanted. This is good news — you found the ambiguity in a test instead of in production.
The fix is to update the spec, not just the prompt. Add the missing criterion or edge case, then re-run the loop. If you only patch the immediate output ("no, make empty strings return false"), the ambiguity still lives in your spec, and it will bite again the next time you touch this code or hand the spec to a different agent.
Example: your discount spec said "code is case-insensitive" but verification shows applyCode(" SAVE10 ") with surrounding spaces fails. The agent handled case but not whitespace. The spec was silent on trimming. Do not just say "also trim it" — add the edge case to the spec:
Edge cases:
- Code is matched case-insensitively AND after trimming surrounding whitespace.
Now the spec is complete, the test exists, and the requirement survives the next change.
Failure mode 2: hallucination
Agents sometimes invent things: a function that does not exist, an API parameter that was never real, an import from a package you do not have. This is hallucination — confident output that references something fictional.
Spec-driven development reduces hallucination because a tight spec gives the agent less room to invent, but it does not eliminate it. How to catch and recover:
- Verification catches the runtime ones. Code that calls a nonexistent function fails when you run the tests. This is another reason "it runs" matters — hallucinated APIs usually do not run.
- The human read catches the plausible ones. An invented-but-real-looking config option or a subtly wrong library method may not crash. Scan for unfamiliar calls and confirm they exist.
- Ground the agent in reality. When you re-prompt, point at the truth: "There is no
cart.applyAll()method. The available methods are incart.ts— read it and use only what exists." Naming the real source beats saying "that's wrong." - Constrain dependencies. A "no new dependencies" constraint turns a hallucinated import into an obvious, easy-to-spot violation.
Do not argue with a hallucination in prose. Give the agent the real artifact — the actual file, the real API docs, the genuine type signature — and let it correct against ground truth.
Failure mode 3: scope creep
Scope creep is the agent doing more than the spec asked. It refactors a neighboring function, adds a caching layer "for performance," introduces an abstraction for a future that may never come, or pulls in a dependency to save five lines. Each addition is unrequested, untested by your criteria, and a place for surprises to hide.
The human-read step of verification is where you catch it: anything in the diff that satisfies no criterion is suspect. To prevent it in the first place:
- State the scope constraint up front. "Do the simplest thing that satisfies the criteria. No speculative abstractions. Only modify these files."
- Reject out-of-scope additions even when they look helpful. A caching layer you did not ask for is a caching layer you did not test, did not review for correctness, and now have to maintain. Out it goes.
- If the addition is genuinely good, make it its own spec. Do not let it ride along untested inside another change. Pull it out, write criteria for it, run the loop.
The discipline here pays compounding interest. Every accepted out-of-scope change makes the next diff harder to review and the next failure harder to diagnose.
Re-prompt, reset, or step in
When you do go back to the agent, you have three moves. Pick by how far off the output is.
- Re-prompt when the output is close and the issue is specific. Point at the exact failing criterion: "Criterion 3 fails —
applyCode(' ')should return the unchanged total, but it threw. Fix only that." Tight, surgical, keeps the working parts. - Reset when the output is tangled or the agent has gone down a wrong path and keeps patching around it. Start a fresh session with the corrected spec. A clean context with a better spec often beats ten rounds of arguing with a confused one. This is cheaper than it feels.
- Step in when the problem is one you can fix faster by hand than by explaining — which the next lesson is entirely about. Iteration has a budget; spending it is fine, but recognize when you have spent enough.
A practical rule of thumb: if you have re-prompted the same issue three times without progress, stop. Either your spec is still wrong (go fix it and reset) or this is a hand-code job.
Iteration improves the spec, not just the code
The quiet payoff of disciplined iteration: your spec gets better every time. Each failed check that traces back to an ambiguity becomes a new criterion or edge case. Over a project, your specs become sharper, your agents need fewer rounds, and the same spec produces reliable results across different tools. You are not just fixing this output; you are building a reusable, battle-tested contract.
Key Takeaways
- When a check fails, first diagnose: was the spec wrong or the output wrong? The fixes are opposite.
- For a vague spec, fix the spec — add the missing criterion or edge case — not just the immediate output.
- Catch hallucination with verification and a human read; recover by grounding the agent in the real artifact, not by arguing in prose.
- Stop scope creep at the human-read step; reject unrequested additions or promote the good ones to their own spec.
- Choose your move by distance: re-prompt for close-and-specific, reset with a corrected spec for tangled, step in when hand-coding is faster.
- Disciplined iteration improves the spec itself, so future runs need fewer rounds.

