Pairwise judges: A/B agreement at scale

Pairwise comparison tends to be more reliable than absolute scoring, and the pattern holds up at scale.

Yash Shah · April 22, 2026 · 2 min read

Absolute scoring ("rate this output 1-5") is hard. Different humans calibrate differently. The same human may score differently on different days. Pairwise comparison ("which of these two is better?") sidesteps the calibration problem.

For LLM-as-judge evals, pairwise comparison often beats absolute scoring.

The pairwise pattern

For each comparison:

Two outputs (A and B) for the same input.
The judge picks the better one (or marks a tie).
Optionally, the judge explains its choice with a rationale.

Aggregated across comparisons:

A wins X% of the time.
B wins Y% of the time.
Ties Z% of the time.

Across enough comparisons, these rates give a reliable ranking of the two candidates.
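The aggregation step above can be sketched in a few lines. This is a minimal illustration, assuming each judge verdict has already been normalized to one of the labels "A", "B", or "tie"; the function name and the sample verdicts are hypothetical, not taken from any particular eval framework.

```python
from collections import Counter

VALID_LABELS = ("A", "B", "tie")  # assumed label set for judge verdicts


def aggregate(verdicts):
    """Return the share of A wins, B wins, and ties across all comparisons."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    counts = Counter(verdicts)
    unknown = set(counts) - set(VALID_LABELS)
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = len(verdicts)
    return {label: counts[label] / total for label in VALID_LABELS}


# Hypothetical verdicts for ten comparisons.
verdicts = ["B", "A", "B", "tie", "B", "B", "A", "B", "tie", "B"]
rates = aggregate(verdicts)
for label in VALID_LABELS:
    print(f"{label}: {rates[label]:.0%}")
```

Rejecting unknown labels up front keeps a malformed judge response from silently shifting the percentages.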

Cost shape

Pairwise costs more per case (two outputs to produce and evaluate, plus the judge call). In return:

Calibration is easier.
Agreement between judges is higher.
The signal is cleaner.

For high-stakes evals, the cost is justified.
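One way to check the claim about agreement between judges is to run two judges over the same pairs and measure how often they match. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The verdict lists are hypothetical, and kappa is one common choice of statistic rather than something this article prescribes.

```python
from collections import Counter


def agreement_and_kappa(judge_1, judge_2):
    """Return (raw agreement, Cohen's kappa) for two lists of verdicts."""
    if len(judge_1) != len(judge_2) or not judge_1:
        raise ValueError("verdict lists must be non-empty and equal length")
    n = len(judge_1)
    observed = sum(a == b for a, b in zip(judge_1, judge_2)) / n
    counts_1, counts_2 = Counter(judge_1), Counter(judge_2)
    labels = set(counts_1) | set(counts_2)
    # Chance agreement: probability both judges pick the same label
    # if each picked independently at its own observed label rates.
    expected = sum(counts_1[l] * counts_2[l] for l in labels) / (n * n)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined when both are constant
    return observed, (observed - expected) / (1 - expected)


# Hypothetical verdicts from two judges on the same eight pairs.
judge_1 = ["A", "B", "B", "tie", "B", "A", "B", "B"]
judge_2 = ["A", "B", "A", "tie", "B", "A", "B", "tie"]
observed, kappa = agreement_and_kappa(judge_1, judge_2)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

The same function works for comparing an LLM judge against human labels, which is one way to approach judge calibration.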

Reviewer ritual

When reviewing pairwise eval results, look at:

A vs. B agreement rate.
Confidence intervals.
Per-category breakdowns.

A team comparing two prompt versions can clearly say "B is better X% of the time."
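The confidence intervals and per-category breakdowns listed above could be produced along these lines. This sketch uses the Wilson score interval, which is a common choice for proportions but an assumption here, since the article does not name a method. The category names and counts are invented for illustration.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical per-category results: (wins for B, total comparisons).
results = {
    "summarization": (31, 40),
    "extraction": (22, 35),
    "reasoning": (14, 25),
}
for category, (wins_b, n) in results.items():
    low, high = wilson_interval(wins_b, n)
    print(f"{category}: B wins {wins_b / n:.0%} (95% CI {low:.0%} to {high:.0%})")
```

Small categories produce wide intervals, which is a useful reminder not to read too much into a per-category win rate built on a few dozen pairs.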

A real run

A team comparing prompt versions:

100 inputs.
Each input run through both prompts.
Outputs paired and judged.
Result: prompt B wins 67%, prompt A wins 18%, ties 15%.
Decision: ship prompt B.

The decision is data-driven. Prompt B is clearly preferred: it wins 67 of the 85 comparisons that were not ties (about 79%).
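To check that a split like this is unlikely to be noise, one option is an exact two-sided sign test on the decisive comparisons. The sketch below applies it to the counts from the run above (67 wins for B, 18 for A). Dropping the 15 ties is an assumption of this test, not something the article specifies, and the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test p-value under the null of a 50/50 split.

    Ties are assumed to have been excluded before calling this function.
    """
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("no decisive comparisons")
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_p_value(wins_a=18, wins_b=67)
print(f"p = {p_value:.2e}")
```

A very small p-value here supports the decision to ship, provided the judge itself has been calibrated.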

Trade-offs

Pairwise: more reliable, more expensive.
Absolute: cheaper, noisier.

Use pairwise for high-stakes decisions and absolute scoring for ongoing monitoring.

What we won't ship

Pairwise without judge calibration.

Conclusions based on small pairwise samples. Run a power calculation first to size the eval.

Skipping the rationale. Why one wins matters.

Pairwise where absolute would do. Don't over-engineer.
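The power calculation mentioned above can be approximated before running an eval. This sketch uses the standard normal-approximation sample size for testing a win rate against 50%, then inflates it for an expected tie rate. The target win rate, tie rate, significance level, and power are all assumed inputs you would choose yourself, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def required_decisive_comparisons(win_rate, alpha=0.05, power=0.80):
    """Approximate non-tie comparisons needed to detect win_rate vs. 0.5."""
    if not 0.5 < win_rate < 1.0:
        raise ValueError("win_rate must be between 0.5 and 1.0 (exclusive)")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * 0.5 + z_beta * math.sqrt(win_rate * (1 - win_rate))
    return math.ceil((numerator / (win_rate - 0.5)) ** 2)


def required_total_comparisons(win_rate, tie_rate, alpha=0.05, power=0.80):
    """Inflate the decisive count to account for comparisons lost to ties."""
    if not 0 <= tie_rate < 1:
        raise ValueError("tie_rate must be in [0, 1)")
    decisive = required_decisive_comparisons(win_rate, alpha, power)
    return math.ceil(decisive / (1 - tie_rate))


# Assumed planning inputs: detect a 60% win rate among decisive pairs,
# expecting about 15% of comparisons to be ties.
decisive_needed = required_decisive_comparisons(0.60)
total_needed = required_total_comparisons(0.60, tie_rate=0.15)
print(decisive_needed, total_needed)
```

The smaller the win-rate gap you want to detect, the faster the required sample grows, so it helps to decide up front what size of improvement is worth shipping.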

Close

Pairwise judges produce a more reliable signal than absolute scoring. The cost is higher, but the decisions are clearer. For consequential prompt comparisons, the pattern is worth the cost.
