Verification: Testing Output Against the Spec, Not Vibes

This is the lesson that separates spec-driven development from fancier prompting. Writing a good spec and structuring a good prompt get you a plausible result. Verification gets you a correct result. Without it, you are back to vibe-checking the agent's work, and a confident agent will sail a subtle bug right past a confident reviewer.

The rule is simple: every acceptance criterion in your spec becomes a check on the output. You do not ask "does this look right?" You ask "does this satisfy criterion 3?" — and you answer with evidence.

What You'll Learn

Why "it runs" is the most dangerous false positive
How acceptance criteria map one-to-one onto tests
Who should write the tests, and why it usually is not the implementer
How to read the agent's own test results skeptically
A verification checklist you can apply to any output

"It runs" is not "it works"

The most expensive mistake in agent-assisted coding is treating "the code runs without error" as "the code is correct." Running code that does the wrong thing is worse than code that crashes, because the crash at least tells you something is wrong. Silent incorrectness ships.

Agents are exceptionally good at producing code that runs. That is exactly why you cannot use "it runs" as your bar. Your bar is the spec, and the spec is a list of specific, checkable claims. Verification is the act of confirming each claim with evidence rather than impression.
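To make the gap concrete, here is a minimal sketch using the discount-code example. The cart shape, the rate table, and the function name `applyCodeStacking` are assumptions made for illustration, not part of any real codebase. The function runs cleanly and returns a plausible number, yet it violates the criterion that a second code replaces the first.

```typescript
// Hypothetical cart shape and rate table, assumed for illustration only.
type SimpleCart = { subtotal: number; total: number };
const RATES_PERCENT = new Map<string, number>([
  ["SAVE10", 10],
  ["SAVE20", 20],
]);

// Runs cleanly, never throws, returns a plausible number.
// Bug: it discounts the current total, so a second code stacks on the first.
function applyCodeStacking(cart: SimpleCart, code: string): SimpleCart {
  const percent = RATES_PERCENT.get(code) ?? 0;
  return { ...cart, total: (cart.total * (100 - percent)) / 100 };
}

let cart: SimpleCart = { subtotal: 100, total: 100 };
cart = applyCodeStacking(cart, "SAVE10"); // total is 90
cart = applyCodeStacking(cart, "SAVE20"); // total is 72, and no error is raised

// The criterion "a second code replaces the first" expects 20% off only.
const expectedTotal = 80;
const satisfiesCriterion = cart.total === expectedTotal;
console.log(`ran without error: true, satisfies criterion: ${satisfiesCriterion}`);
```

Nothing in the run itself signals a problem. Only the comparison against the criterion's expected value does.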

Criteria map onto tests

Here is the mechanical heart of verification: each acceptance criterion is a test case. You wrote criteria in the checkable form "input X produces output Y" precisely so this mapping would be trivial.

Take the discount-code spec from earlier. Each criterion becomes an assertion:

Criterion: applyCode("SAVE10") reduces subtotal by 10%
  -> test: cart with subtotal 100, applyCode("SAVE10") => total 90

Criterion: unknown code returns unchanged total + { applied: false }
  -> test: applyCode("NOPE") => total unchanged, applied === false

Criterion: empty/whitespace code treated as unknown
  -> test: applyCode("   ") => total unchanged, applied === false

Criterion: code is case-insensitive
  -> test: applyCode("save10") => same result as applyCode("SAVE10")

Criterion: a second code replaces the first
  -> test: applyCode("SAVE10") then applyCode("SAVE20") => 20% off only

When every criterion has a passing test, the output is verified — not "looks good", verified. When a criterion has no test, that criterion is unverified, regardless of how confident anyone feels.
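The mapping above could be sketched as runnable checks like the following. The spec excerpt names `applyCode` and the `applied` flag; the cart shape, the `newCart` and `check` helpers, the rate table, and the implementation itself are assumptions, included only so the checks have something to run against. In a real project you would likely use your test framework instead of the hand-rolled `check`.

```typescript
// Hypothetical cart shape and rate table, assumed for illustration.
type Cart = { subtotal: number; total: number; applied: boolean };
const PERCENT_OFF = new Map<string, number>([
  ["SAVE10", 10],
  ["SAVE20", 20],
]);

function newCart(subtotal: number): Cart {
  return { subtotal, total: subtotal, applied: false };
}

// One possible implementation, not the definitive one.
function applyCode(cart: Cart, code: string): Cart {
  const percent = PERCENT_OFF.get(code.trim().toUpperCase());
  if (percent === undefined) return { ...cart, applied: false };
  // Discount from the subtotal so a new code replaces the previous one.
  return { ...cart, total: (cart.subtotal * (100 - percent)) / 100, applied: true };
}

// Tiny check helper so the sketch needs no test framework.
let checksRun = 0;
function check(criterion: string, passed: boolean): void {
  checksRun += 1;
  if (!passed) throw new Error(`FAILED: ${criterion}`);
}

const base = newCart(100);

check("SAVE10 reduces subtotal by 10%", applyCode(base, "SAVE10").total === 90);

const unknownResult = applyCode(base, "NOPE");
check("unknown code leaves total unchanged", unknownResult.total === 100 && unknownResult.applied === false);

const blankResult = applyCode(base, "   ");
check("empty/whitespace code treated as unknown", blankResult.total === 100 && blankResult.applied === false);

check("code is case-insensitive", applyCode(base, "save10").total === applyCode(base, "SAVE10").total);

const replaced = applyCode(applyCode(base, "SAVE10"), "SAVE20");
check("a second code replaces the first", replaced.total === 80);

console.log(`${checksRun} criteria checked, all passed`);
```

Each `check` call carries the criterion's wording as its label, so a failure names the criterion it breaks rather than a line number.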

Vibes versus tests

The contrast is worth making explicit, because the pull toward vibe-checking is strong when the code looks clean.

Why tests beat impressions for verifying agent output

Aspect	Vibe-checking	Spec verification
The question	Does this look right?	Does it satisfy each criterion?
The evidence	A quick read, a manual click	A passing test per criterion
Catches edge cases	Rarely — happy path only	Yes — each edge case is a test
Survives a refactor	No — must re-check by hand	Yes — tests re-run automatically
Scales with the project	Degrades as complexity grows	Holds; the suite grows with it

Vibe-checking is not worthless — a human read still catches things tests miss, like a confusing API or a security smell. But it is a supplement, not the verification itself.

Who writes the tests

A subtle trap: if the same agent writes both the implementation and the tests in one pass, the tests encode the agent's interpretation, not your spec. The agent's tests will pass because they test what the agent built, which may not be what you asked for. The bug and its test agree with each other.

Three ways to avoid this, roughly in order of rigor:

1. Write the tests yourself, from the spec. Most reliable. Your tests come straight from your criteria and are blind to the implementation. This is also a good use of the spec, since you already did the thinking.
2. Have a separate agent or a fresh session write tests from the spec only. Give it the spec, not the implementation. With fresh context and no knowledge of how the code works, it tests the contract rather than the code.
3. Have the implementing agent write tests, then review each test against your spec. Lowest effort, lowest assurance. Read each test and confirm it actually checks a criterion, not just that the code does what the code does.

The principle underneath all three: the verification must be derived from the spec, independently of the implementation. Tests written by looking at the code only prove the code does what it does.
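A small sketch of how a bug and its test can agree with each other. The buggy function and both test functions are hypothetical, written for this illustration. The implementation does a case-sensitive lookup, which contradicts the criterion "code is case-insensitive". The first test was written by reading the code, the second from the spec alone.

```typescript
type Outcome = { total: number; applied: boolean };

// Hypothetical buggy implementation: the lookup is case-sensitive,
// which contradicts the criterion "code is case-insensitive".
function applyCodeBuggy(subtotal: number, code: string): Outcome {
  if (code === "SAVE10") return { total: (subtotal * 90) / 100, applied: true };
  return { total: subtotal, applied: false };
}

// Written by reading the code: it records what the code does.
function implementationDerivedTest(): boolean {
  const result = applyCodeBuggy(100, "save10");
  return result.total === 100 && result.applied === false; // agrees with the bug
}

// Written from the spec alone: it records what the spec requires.
function specDerivedTest(): boolean {
  const lower = applyCodeBuggy(100, "save10");
  const upper = applyCodeBuggy(100, "SAVE10");
  return lower.total === upper.total && lower.applied === upper.applied;
}

console.log("implementation-derived test passes:", implementationDerivedTest()); // true
console.log("spec-derived test passes:", specDerivedTest()); // false
```

Both tests exercise the same input. Only the one whose expected value came from the spec can expose the bug.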

Reading the agent's results skeptically

Agents will often run the tests and report back "All tests pass." Treat that as a claim to audit, not a conclusion to accept. A few habits:

Check the test count. "All 3 tests pass" when your spec had eight criteria means at least five criteria are unverified.
Make sure each test can actually fail. A test with no real assertion, or one that asserts the code's actual behavior rather than the spec's expected behavior, passes meaninglessly. Glance at what each test actually checks.
Insist on real output. Ask the agent to paste the actual test runner output, not a summary. "Tests pass" is a summary; the runner's output is evidence.
Re-run them yourself when it matters. For anything important, run the suite in your own environment. For a small suite this usually takes seconds, and it removes the agent from the trust chain entirely.
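One way to confirm that a test can fail is to run it against a deliberately broken implementation and see whether it goes red. The sketch below assumes a simplified `Apply` signature and two hypothetical implementations. It is a hand-rolled illustration of the idea, not a substitute for a real mutation-testing tool.

```typescript
type Apply = (subtotal: number, code: string) => number;

// Hypothetical implementations: one intended, one deliberately broken.
const correct: Apply = (subtotal, code) =>
  code.trim().toUpperCase() === "SAVE10" ? (subtotal * 90) / 100 : subtotal;
const broken: Apply = (subtotal, _code) => subtotal; // never applies any discount

// A test takes an implementation and reports pass (true) or fail (false).
type SpecTest = { label: string; run: (apply: Apply) => boolean };

const suite: SpecTest[] = [
  // Vacuous: calls the function but asserts nothing about the result.
  {
    label: "applies SAVE10 (vacuous)",
    run: (apply) => {
      apply(100, "SAVE10");
      return true;
    },
  },
  // Meaningful: asserts the value the spec requires.
  { label: "applies SAVE10 (asserts 90)", run: (apply) => apply(100, "SAVE10") === 90 },
];

// A test that still passes against the broken implementation cannot fail,
// so its green result is not evidence.
const cannotFail = suite.filter((t) => t.run(broken)).map((t) => t.label);
console.log("tests that pass even when the code is broken:", cannotFail);
```

Both tests are green against the correct implementation, which is exactly why the green result alone tells you nothing about which one is meaningful.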

The verification checklist

Apply this to any agent output before you accept it:

1. List the criteria. Pull each acceptance criterion and edge case from the spec.
2. Map each to a check. Confirm there is a test (or a manual check, for the few things tests cannot cover) for every one.
3. Confirm the tests are spec-derived. They check expected behavior from the spec, not the implementation's actual behavior.
4. Run them and read real output. You are looking for a green suite with the right number of meaningful tests.
5. Add a human read. Scan for security, clarity, and anything outside the spec that the agent changed.
6. Resolve gaps. Any unverified criterion goes back into the loop, which is the next lesson.
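The first two steps and the last can be partly mechanized. The sketch below assumes you label each criterion with an ID and record which criterion each test checks. The IDs, test names, and results are made up for illustration, and in practice the results would be transcribed from real runner output.

```typescript
// Hypothetical IDs for the five discount-code criteria.
const criteria = [
  "C1-percent-off",
  "C2-unknown-code",
  "C3-blank-code",
  "C4-case-insensitive",
  "C5-replace",
];

// Which criterion each test checks, and whether it passed (made-up values).
type TestResult = { test: string; criterion: string; passed: boolean };
const testResults: TestResult[] = [
  { test: "SAVE10 gives 10% off", criterion: "C1-percent-off", passed: true },
  { test: "NOPE leaves total unchanged", criterion: "C2-unknown-code", passed: true },
  { test: "save10 matches SAVE10", criterion: "C4-case-insensitive", passed: true },
];

// A criterion counts as verified only if at least one passing test maps to it.
function unverifiedCriteria(ids: string[], results: TestResult[]): string[] {
  const verified = new Set(results.filter((r) => r.passed).map((r) => r.criterion));
  return ids.filter((id) => !verified.has(id));
}

const gaps = unverifiedCriteria(criteria, testResults);
console.log(`${testResults.length} tests pass, ${gaps.length} criteria unverified:`, gaps);
```

Here "all 3 tests pass" is true and the output is still not verified: two criteria have no passing test and go back into the loop.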

A note on scope drift

Verification also catches a quiet failure mode: the agent did more than you asked. Maybe it "helpfully" added a logging dependency or refactored a neighboring function. Step five of the checklist — the human read — is where you catch changes that satisfy no criterion. Code that no criterion asked for is code no test covers, and it is exactly where surprises hide. If it is not in the spec, question why it is in the diff.
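A rough way to support the human read is to compare the files the diff touched against the files the spec puts in scope. The paths below are hypothetical. The changed list could be pasted from something like `git diff --name-only`, and the in-scope list is an assumption about what your spec names. This only narrows where to look; it does not replace reading the diff.

```typescript
// Paths the spec puts in scope for this task (hypothetical).
const inScope = ["src/cart/applyCode.ts", "src/cart/applyCode.test.ts"];

// Paths the agent's diff touched (hypothetical values).
const changed = [
  "src/cart/applyCode.ts",
  "src/cart/applyCode.test.ts",
  "src/logger.ts",
  "package.json",
];

// Anything changed but not in scope satisfies no criterion and needs a reason.
function outOfScope(changedPaths: string[], allowedPaths: string[]): string[] {
  const allowed = new Set(allowedPaths);
  return changedPaths.filter((path) => !allowed.has(path));
}

const drift = outOfScope(changed, inScope);
console.log("changed outside the spec's scope:", drift);
```

A non-empty result is not automatically wrong, but each entry is a question to ask the agent before accepting the change.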

Key Takeaways

"It runs" is a false positive; verify against the spec, not against impressions.
Each acceptance criterion and edge case maps to a test case — that is why you wrote them checkably.
Verification must be derived from the spec independently of the implementation, or the tests just confirm the code does what it does.
Prefer writing tests yourself or with a spec-only fresh session over letting the implementer test itself.
Audit the agent's "all tests pass" claim: check the count, the assertions, and the real output, and re-run when it matters.
A human read catches scope drift — changes that satisfy no criterion are where surprises hide.
