A/B Comparing Prompts and Reducing Variance
You have two prompt versions and need to know which is better, based on evidence rather than impression. You have also noticed that running the same prompt on the same input twice can give different answers. Both problems are about variance: the natural run-to-run variation in model output. This lesson shows you how to A/B compare prompts properly so the winner is real, and how to reduce variance so your prompt behaves consistently in the wild.
What You'll Learn
- Why a single output is a noisy signal and how to compare prompts fairly
- How to run an A/B comparison across an eval set
- The difference between a real improvement and random noise
- Concrete techniques to reduce output variance
- The tradeoff between consistency and creativity
One Output Is a Noisy Signal
At typical sampling settings, models are non-deterministic: identical input can produce different output on different runs. So if you compare prompt A's single output to prompt B's single output and B looks better, you have learned almost nothing. B might just have gotten a luckier roll. To compare fairly, you have to compare across many cases, and ideally average over multiple runs per case.
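To make the point concrete, here is a small simulation, not a measurement of any real model. It assumes two hypothetical prompts whose per-output scores are drawn from normal distributions with means 0.70 and 0.75 and a standard deviation of 0.15 (all invented numbers), then counts how often B looks better when you compare one output each versus the mean over 30 cases.

```python
import random
from statistics import mean

# Assumed, made-up numbers: B is truly a little better than A.
TRUE_MEAN = {"A": 0.70, "B": 0.75}
NOISE_SD = 0.15


def sample_score(prompt: str, rng: random.Random) -> float:
    """Simulate one noisy score for one output of the given prompt."""
    return rng.gauss(TRUE_MEAN[prompt], NOISE_SD)


def b_wins_rate(n_cases: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated comparisons in which B's mean score beats A's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        a = mean(sample_score("A", rng) for _ in range(n_cases))
        b = mean(sample_score("B", rng) for _ in range(n_cases))
        wins += b > a
    return wins / trials


single = b_wins_rate(n_cases=1)
averaged = b_wins_rate(n_cases=30)
print(f"B looks better with 1 output each: {single:.0%}")
print(f"B looks better averaged over 30:   {averaged:.0%}")
```

Under these assumptions, theory puts the single-output comparison at roughly 59% in favor of the truly better prompt, not far from a coin flip, while the 30-case average picks it roughly 90% of the time. The simulated figures will land near those values rather than exactly on them.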
This is exactly why you built an eval set. A/B testing prompts is just running both prompts against the same frozen set and comparing their aggregate scores.
Running a Fair A/B Comparison
The procedure:
- Take both prompt versions, A and B.
- Run each against the same frozen eval set.
- Score every output with the same rubric or checks.
- Compare the aggregate scores (average score or pass rate).
- Where it matters, also do a pairwise judge comparison: for each case, ask which output is better, controlling for position bias by running both orderings.
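The first four steps above can be sketched as a small harness. Everything named here is hypothetical: `fake_model` stands in for a real model call so the example runs offline, `EVAL_SET` is a three-case toy set, and `passes` is a trivial exact-match check. Swap in your own client, cases, and rubric.

```python
from typing import Callable

# Toy frozen eval set (hypothetical): each case has an input and an expected label.
EVAL_SET = [
    {"input": "I love this product", "expected": "positive"},
    {"input": "This is terrible", "expected": "negative"},
    {"input": "It arrived on Tuesday", "expected": "neutral"},
]

PROMPT_A = "Classify the sentiment: {input}"
PROMPT_B = "Classify the sentiment as positive, negative, or neutral: {input}"


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    text = prompt.lower()
    if "love" in text:
        return "positive"
    if "terrible" in text:
        return "negative"
    # Pretend the model only answers "neutral" when that label is offered.
    return "neutral" if "or neutral" in text else "unsure"


def passes(output: str, expected: str) -> bool:
    """The same check is applied to every output from both prompts."""
    return output.strip().lower() == expected


def pass_rate(template: str, model: Callable[[str], str]) -> float:
    """Run one prompt version over the whole frozen set and aggregate."""
    results = [
        passes(model(template.format(input=case["input"])), case["expected"])
        for case in EVAL_SET
    ]
    return sum(results) / len(results)


rate_a = pass_rate(PROMPT_A, fake_model)
rate_b = pass_rate(PROMPT_B, fake_model)
print(f"A: {rate_a:.0%}  B: {rate_b:.0%}")
```

The stand-in model is deterministic, so one pass is enough here. With a real model you would likely call `pass_rate` several times per prompt and average the results.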
The two views answer different questions. Aggregate scores tell you "how good is each prompt overall." The pairwise win rate tells you "how often does B beat A head-to-head," which is often the more decision-relevant number.
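One way to control for position bias is to ask the judge twice per case, once with A shown first and once with B shown first, and count a win only when both orderings agree. The sketch below assumes a hypothetical `judge(first, second)` function that returns `"first"` or `"second"`. The toy judge prefers the longer answer and falls back to the first position on ties, which mimics a position bias.

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "first" or "second"


def toy_judge(first: str, second: str) -> str:
    """Hypothetical judge: prefers the longer answer, favors position one on ties."""
    return "second" if len(second) > len(first) else "first"


def pairwise_verdict(out_a: str, out_b: str, judge: Judge) -> str:
    """Return "A", "B", or "tie" after judging both orderings."""
    a_first = "A" if judge(out_a, out_b) == "first" else "B"
    b_first = "B" if judge(out_b, out_a) == "first" else "A"
    # If the verdict changes with the ordering, treat the case as a tie.
    return a_first if a_first == b_first else "tie"


def win_rate_b(pairs: list[tuple[str, str]], judge: Judge) -> float:
    """Share of non-tied cases that B wins head-to-head."""
    verdicts = [pairwise_verdict(a, b, judge) for a, b in pairs]
    decided = [v for v in verdicts if v != "tie"]
    return decided.count("B") / len(decided) if decided else 0.5


pairs = [
    ("ok", "a fuller answer"),      # B is longer
    ("a detailed reply", "short"),  # A is longer
    ("same", "size"),               # equal length: the ordering decides
]
verdicts = [pairwise_verdict(a, b, toy_judge) for a, b in pairs]
print(verdicts, win_rate_b(pairs, toy_judge))
```

Treating ordering-dependent verdicts as ties is one design choice among several. Another is to count each ordering as half a vote.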
Is It a Real Improvement or Just Noise?
Suppose A averages 80% and B averages 82% on a fifteen-case set. Is B actually better? Maybe not. With a small set and run-to-run variance, a two-point gap could easily be noise. A few sanity checks before you commit:
- Margin versus set size. On fifteen cases, a single case flipping from fail to pass moves the pass rate by almost seven points, so a two-point win is weak. A ten-point win is more believable. Bigger gaps and bigger sets give you more confidence.
- Consistency across runs. Run the comparison again. If B wins by a similar margin each time, the result is stable. If the winner flips between runs, you are looking at noise, and the two prompts are effectively tied.
- Where the wins come from. If B's advantage is concentrated in one or two cases, it may be luck on those specific inputs. If B wins broadly across the set, the improvement is more likely to be real.
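These three checks can be run without formal statistics. The sketch below assumes you have already recorded per-case scores for each prompt over several repeated runs. The numbers in `runs` are invented for illustration, and the helper names are hypothetical.

```python
from statistics import mean

# Invented scores: runs[i][prompt] is a list of per-case scores (0 or 1) for run i.
runs = [
    {"A": [1, 1, 0, 1, 0, 1], "B": [1, 1, 1, 1, 0, 1]},
    {"A": [1, 0, 1, 1, 1, 1], "B": [1, 1, 0, 1, 0, 1]},
    {"A": [1, 1, 0, 1, 0, 1], "B": [1, 1, 1, 1, 1, 1]},
]


def margins(runs: list[dict]) -> list[float]:
    """B's aggregate score minus A's, one value per run."""
    return [mean(r["B"]) - mean(r["A"]) for r in runs]


def winner_flips(runs: list[dict]) -> bool:
    """True if A wins some runs and B wins others."""
    signs = {m > 0 for m in margins(runs) if m != 0}
    return len(signs) > 1


def cases_where_b_leads(runs: list[dict]) -> list[int]:
    """Indices of cases where B's score, averaged over runs, beats A's."""
    n_cases = len(runs[0]["A"])
    return [
        i for i in range(n_cases)
        if mean(r["B"][i] for r in runs) > mean(r["A"][i] for r in runs)
    ]


one_case = 1 / len(runs[0]["A"])  # how much a single case moves the pass rate
print("margin per run:", [round(m, 2) for m in margins(runs)])
print("winner flips between runs:", winner_flips(runs))
print("cases where B leads:", cases_where_b_leads(runs))
print(f"one case is worth {one_case:.0%} of the pass rate")
```

In this invented data the winner flips between runs and B's lead comes from only two cases, so the sensible reading is that the prompts are effectively tied.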
You do not need formal statistics to be disciplined here. The instinct to capture is: a small, flickering gap is probably noise, and a large, repeatable gap is probably real. When two prompts are genuinely tied on quality, choose on the secondary factors from the next lesson, such as cost and latency.
Reducing Variance
Sometimes the problem is not which prompt is better but that a single prompt is inconsistent: great one run, mediocre the next. Several techniques tighten that up.
- Lower the temperature. Temperature is the model setting that controls randomness. Lower values make output more deterministic and repeatable; higher values make it more varied and creative. For extraction, classification, and anything a machine consumes, run at a low temperature. Many tools expose this setting directly. Even at the lowest setting, some hosted models are not perfectly repeatable, so treat low temperature as reducing variance rather than eliminating it.
- Be more specific. Variance often comes from ambiguity. If the prompt leaves a choice open, the model makes a different choice each run. Nail down the format, the vocabulary, and the rules, and there is less room to wander.
- Constrain the output shape. A tightly specified structure (fixed fields, enums) has far less room to vary than free-form prose.
- Add examples. A few examples anchor the model to a consistent style and format, reducing drift between runs.
- Self-consistency for reasoning tasks. For tasks with a single correct answer, run the prompt a few times and take the majority answer. The majority outvotes individual bad runs, at the cost of extra calls. Reserve it for high-stakes correctness, not everyday use.
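Self-consistency can be sketched as a majority vote over repeated calls. The model below is a hypothetical stand-in that returns the right answer about 70% of the time and a wrong one otherwise (an invented rate). In practice you would replace it with your own model call.

```python
import random
from collections import Counter
from typing import Callable


def make_noisy_model(seed: int) -> Callable[[str], str]:
    """Build a hypothetical model that answers "42" about 70% of the time."""
    rng = random.Random(seed)

    def noisy_model(prompt: str) -> str:
        return "42" if rng.random() < 0.7 else rng.choice(["41", "43"])

    return noisy_model


def self_consistent_answer(prompt: str, model: Callable[[str], str], n: int = 5) -> str:
    """Call the model n times and return the most common answer."""
    answers = [model(prompt).strip() for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def accuracy(answer_fn: Callable[[], str], trials: int = 1000) -> float:
    """Share of trials in which answer_fn returns the correct answer "42"."""
    return sum(answer_fn() == "42" for _ in range(trials)) / trials


model = make_noisy_model(seed=0)
single = accuracy(lambda: model("What is 6 * 7?"))
voted = accuracy(lambda: self_consistent_answer("What is 6 * 7?", model, n=5))
print(f"single call: {single:.0%}  majority of 5: {voted:.0%}")
```

The vote costs five calls per answer in this sketch, which is why the technique is best kept for cases where a wrong answer is expensive.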
Consistency Versus Creativity
Reducing variance is not always the goal. There is a real tradeoff:
- For structured, correctness-driven tasks (extraction, classification, data transformation), you want low variance. Same input, same output, every time. Push temperature down and constrain hard.
- For generative, exploratory tasks (brainstorming, drafting, ideating), some variance is a feature. You want different angles across runs, so a higher temperature helps.
The skill is knowing which mode you are in and tuning accordingly. Do not crank determinism on a brainstorming prompt and then wonder why every run gives the same three boring ideas. Do not run a data-extraction prompt hot and then wonder why the JSON keeps changing shape.
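A quick way to check which mode a prompt is actually running in is to call it several times on the same input and count how often the outputs agree. The sketch below simulates temperature with a toy sampler over three invented candidate outputs. It uses temperature-scaled softmax weights as an illustration only and does not claim to match how any particular model or API implements the setting.

```python
import math
import random
from collections import Counter

# Invented candidate outputs with made-up preference scores.
CANDIDATES = {"idea A": 2.0, "idea B": 1.5, "idea C": 1.0}


def sample_output(temperature: float, rng: random.Random) -> str:
    """Sample one candidate using temperature-scaled softmax weights."""
    top = max(CANDIDATES.values())  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in CANDIDATES.values()]
    return rng.choices(list(CANDIDATES), weights=weights, k=1)[0]


def agreement_rate(temperature: float, runs: int = 200, seed: int = 0) -> float:
    """Share of runs that return the single most common output."""
    rng = random.Random(seed)
    outputs = [sample_output(temperature, rng) for _ in range(runs)]
    return Counter(outputs).most_common(1)[0][1] / runs


for t in (0.1, 1.0, 2.0):
    print(f"temperature {t}: agreement {agreement_rate(t):.0%}")
```

With a real prompt, the same `agreement_rate` idea applies: replace `sample_output` with your model call and compare the rate against what the task needs. High agreement suits extraction, lower agreement suits brainstorming.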
Key Takeaways
- A single output is a noisy signal; compare prompts across the same frozen eval set, scored the same way.
- Use aggregate scores for overall quality and a position-controlled pairwise win rate for head-to-head decisions.
- A small, flickering score gap is probably noise; a large, repeatable, broadly-distributed gap is probably real.
- Reduce variance with lower temperature, more specificity, constrained output shape, and examples; use self-consistency for high-stakes reasoning.
- Match the setting to the task: low variance for structured correctness, higher variance for creative exploration.

