Jaypore Labs
Engineering

Calibrating your judge: meta-evals

The judge needs its own eval. The meta-eval is the discipline.

Yash Shah · March 23, 2026 · 2 min read

The team's LLM-as-judge had been running for six months. Output quality looked stable. Then a customer complaint revealed the judge had drifted — it was now scoring outputs more leniently than humans would. The judge's own quality had degraded silently.

The judge needs its own eval. Meta-evals are the discipline.

The meta-eval

Quarterly:

  • Sample 100 cases.
  • Have humans score them.
  • Have the judge score them.
  • Measure agreement.
  • If agreement is below threshold, recalibrate.

This is restaurant-health-inspection-style discipline applied to the judge itself.
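
A minimal sketch of that loop in Python, assuming exact-match scoring and a placeholder 80% threshold (the post doesn't fix a number). The score_by_human and score_by_judge callables are hypothetical stand-ins for your labeling queue and your judge call.

    import random
    from typing import Callable

    # Placeholder threshold; the post doesn't fix a number, so set one per team.
    AGREEMENT_THRESHOLD = 0.80

    def agreement_rate(human: list[int], judge: list[int]) -> float:
        """Fraction of sampled cases where judge and human scores match exactly."""
        return sum(h == j for h, j in zip(human, judge)) / len(human)

    def run_meta_eval(
        cases: list[dict],
        score_by_human: Callable[[dict], int],  # hypothetical: your labeling queue
        score_by_judge: Callable[[dict], int],  # hypothetical: your judge call
        n: int = 100,
    ) -> bool:
        """Sample n cases, score them both ways, and flag when agreement drops."""
        sample = random.sample(cases, k=min(n, len(cases)))
        human = [score_by_human(c) for c in sample]
        judge = [score_by_judge(c) for c in sample]
        rate = agreement_rate(human, judge)
        print(f"agreement: {rate:.0%}")
        return rate >= AGREEMENT_THRESHOLD  # False means: recalibrate

Returning a boolean keeps the check easy to wire into whatever job runs the quarterly cadence.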

Cadence

Most teams should meta-eval:

  • Quarterly (default).
  • After judge prompt updates.
  • After model bumps for the judge.
  • When output quality complaints arise.
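
One way to encode those triggers, sketched below. The event names and the 90-day quarterly clock are assumptions, not a prescribed scheme.

    from datetime import datetime, timedelta

    QUARTER = timedelta(days=90)  # the quarterly default
    # Hypothetical event names; use whatever your deploy pipeline actually emits.
    TRIGGER_EVENTS = {"judge_prompt_updated", "judge_model_bumped", "quality_complaint"}

    def meta_eval_due(last_run: datetime, recent_events: set[str]) -> bool:
        """True when the quarterly clock has elapsed or a trigger event fired."""
        if datetime.now() - last_run >= QUARTER:
            return True
        return bool(recent_events & TRIGGER_EVENTS)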

Reviewer ritual

Each meta-eval run should report:

  • Agreement rate per dimension.
  • Disagreement patterns (where the judge deviates, and in which direction).
  • An investigation of any recurring pattern.

If the judge consistently disagrees with humans on certain dimensions, the judge's prompt needs work.
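
A sketch of that per-dimension readout, assuming each meta-eval record carries a dimension label plus both scores (the record shape is hypothetical). The signed mean makes lenient drift, like the one in the opening story, visible as a positive offset.

    from collections import defaultdict

    # Hypothetical record shape:
    # {"dimension": "helpfulness", "human": 4, "judge": 5}

    def per_dimension_agreement(records: list[dict]) -> dict[str, float]:
        """Exact-match agreement rate, broken out by scoring dimension."""
        hits: dict[str, int] = defaultdict(int)
        totals: dict[str, int] = defaultdict(int)
        for r in records:
            totals[r["dimension"]] += 1
            hits[r["dimension"]] += r["human"] == r["judge"]
        return {d: hits[d] / totals[d] for d in totals}

    def mean_offset(records: list[dict]) -> dict[str, float]:
        """Mean (judge - human) per dimension; positive means the judge is lenient."""
        sums: dict[str, int] = defaultdict(int)
        totals: dict[str, int] = defaultdict(int)
        for r in records:
            totals[r["dimension"]] += 1
            sums[r["dimension"]] += r["judge"] - r["human"]
        return {d: sums[d] / totals[d] for d in totals}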

A real calibration

A team's quarterly meta-eval:

  • Q1: 84% agreement.
  • Q2: 86% (improved after rubric clarification).
  • Q3: 79% (regressed; investigated; provider model bump caused it).
  • Q3 recalibration: 85% (judge prompt adjusted for new model).

Without meta-evals, the Q3 regression would have gone undetected for months.
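
One caveat when reading quarter-over-quarter numbers: with a 100-case sample, a few points of movement can be sampling noise, which is exactly why the investigation step matters. A quick normal-approximation interval (our addition, not something the cadence prescribes) helps decide how hard to push:

    import math

    def agreement_interval(rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
        """95% normal-approximation interval for an observed agreement rate."""
        half = z * math.sqrt(rate * (1 - rate) / n)
        return rate - half, rate + half

    lo, hi = agreement_interval(0.79, 100)
    print(f"Q3: {lo:.0%} to {hi:.0%}")  # roughly 71% to 87%, which still brackets Q2's 86%

The overlap doesn't mean the Q3 drop was noise (the investigation confirmed a real cause); it means a single number on 100 cases shouldn't end the conversation.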

Limits

Meta-evals don't catch:

  • Cases where humans themselves systematically disagree or err (the judge can't be more right than the humans it's calibrated against).
  • Edge cases not in the meta-eval sample.

The discipline is necessary but not sufficient.

What we won't ship

Judges without meta-eval cadence.

Meta-evals without investigation when agreement drops.

Judges the team trusts blindly.

Skipping the calibration after model bumps for the judge.

Close

Judges need their own evals. The meta-eval is the discipline. Quarterly checks. Investigation on disagreement. Recalibration when needed. The judge stays trustworthy because the team measures.

We build AI-enabled software and help businesses put AI to work. If you're calibrating judges, we'd love to hear about it. Get in touch.

Tagged: Evals, Calibration, Engineering, Output Testing, Meta-eval