Jaypore Labs
Engineering

LLM-as-judge: when to trust it, when not

LLM judges are useful and limited. The discipline is calibration.

Yash Shah · March 30, 2026 · 7 min read

A team I helped last summer was using an LLM to grade their own LLM's outputs. The judge had been running for six months. Output quality looked stable on the dashboard. Then a customer complaint revealed the judge had quietly drifted — it was now scoring outputs more leniently than the team's own humans would. The team had a "passing" feature that was, in fact, getting subtly worse.

The judge's own quality had degraded. Nobody had been checking it.

LLM-as-judge — using a model to grade the outputs of another model (or the same model with a different prompt) — is the eval pattern that scales. You can grade dimensions humans can't easily encode in rules. You can run thousands of cases nightly without paying a thousand human reviewers. You can iterate prompts at the speed the model allows.

It's also limited. Judges have biases. They drift. They share training data with the agents they're grading, which means they sometimes share the agent's blind spots too. The discipline that makes LLM-as-judge reliable is calibration, repeated.

The judge contract

A well-designed judge is more than "ask the model to score this output." A working judge has:

  • A specific rubric. Not "is this good?" — concrete dimensions with anchored scores.
  • A structured score format. JSON with bounded values, not free prose.
  • A rationale field. Why did the judge score this way? Auditable.
  • Calibration data. Human-graded comparison cases, refreshed regularly.
  • Audit cadence. Periodic re-scoring against humans to detect drift.

Without these, the judge is unreliable. With them, the judge is a useful tool.

A working judge prompt looks roughly like this:

JUDGE_PROMPT = """
You are evaluating a customer-support agent's response.

Score the response on three dimensions, each 1-5:

TONE
  1 = rude or dismissive
  3 = neutral, professional
  5 = warm, friendly, on-brand

ACCURACY
  1 = factually wrong; would mislead the customer
  3 = correct but missing useful context
  5 = correct, complete, includes proactive next steps

RELEVANCE
  1 = does not address the customer's actual question
  3 = addresses the question but with extra unrelated content
  5 = addresses exactly what was asked, no more, no less

For each dimension, provide:
  - The integer score (1-5).
  - A one-sentence rationale.

Return JSON matching this schema:

{
  "tone": {"score": int, "rationale": str},
  "accuracy": {"score": int, "rationale": str},
  "relevance": {"score": int, "rationale": str},
  "overall_pass": bool  // true if all three scores >= 3
}

Customer message:
{customer_message}

Agent response:
{agent_response}
"""

Notice the structure. Each dimension has anchored scores at 1, 3, and 5. The judge can't drift toward "everything is a 4" because the rubric anchors at concrete behaviour. The output is JSON with specific shape, validated client-side. The overall pass/fail is derived from the dimensions, not asked of the judge directly — which means the gate is auditable in terms of the underlying scores.
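
Here's a minimal sketch of that client-side validation, using pydantic v2 (the model names and the raw_judge_output variable are illustrative, not a prescribed implementation):

from pydantic import BaseModel, Field, model_validator

class DimensionScore(BaseModel):
    score: int = Field(ge=1, le=5)  # reject scores outside the 1-5 scale
    rationale: str

class JudgeVerdict(BaseModel):
    tone: DimensionScore
    accuracy: DimensionScore
    relevance: DimensionScore
    overall_pass: bool

    @model_validator(mode="after")
    def pass_must_match_scores(self):
        # Recompute the gate from the dimension scores; a verdict whose
        # overall_pass disagrees with its own scores is treated as malformed.
        derived = all(
            d.score >= 3 for d in (self.tone, self.accuracy, self.relevance)
        )
        if self.overall_pass != derived:
            raise ValueError("overall_pass inconsistent with dimension scores")
        return self

# raw_judge_output is the judge's raw JSON reply as a string.
verdict = JudgeVerdict.model_validate_json(raw_judge_output)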

Calibration: the load-bearing step

Calibration is what makes the judge trustworthy. Without it, you're inheriting whatever biases the judge has out of the box.

The calibration loop:

  1. Sample 100 representative cases from your eval set.
  2. Have humans (ideally 2-3 per case) score them using the same rubric the judge uses, then collapse their grades into one consensus label per case (a sketch of that aggregation follows the code below).
  3. Have the judge score the same 100 cases.
  4. Measure agreement. Cohen's kappa (which corrects for chance agreement) or simple percent agreement, depending on your tolerance for statistical rigor.
  5. If agreement is below threshold (we use ≥ 0.7 kappa, or ≥ 80% binary-pass agreement), iterate the judge prompt and re-calibrate.

In code, the measurement step looks roughly like this:

from sklearn.metrics import cohen_kappa_score

DIMENSIONS = ("tone", "accuracy", "relevance")

def calibrate_judge(cases: list[Case], judge_prompt: str, human_scores: dict) -> dict:
    """Return agreement statistics between the judge and human graders."""
    judge_scores = [run_judge(case, judge_prompt) for case in cases]
    # Kappa per dimension: agreement on the 1-5 scores, corrected for chance.
    stats = {
        f"{dim}_kappa": cohen_kappa_score(
            human_scores[dim],
            [s[dim]["score"] for s in judge_scores],
        )
        for dim in DIMENSIONS
    }
    # Binary gate: fraction of cases where judge and humans agree on pass/fail.
    judge_pass = [s["overall_pass"] for s in judge_scores]
    stats["pass_agreement_pct"] = sum(
        j == h for j, h in zip(judge_pass, human_scores["pass"])
    ) / len(cases)
    return stats
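
Step 2's "ideally 2-3 per case" implies a consensus step before agreement can be measured: the multiple human grades have to collapse into one label per case. A minimal sketch of that aggregation, assuming each grade is a dict mirroring the human_scores shape above:

from statistics import median

def aggregate_human_grades(grades_per_case: list[list[dict]]) -> dict:
    """Collapse 2-3 human grades per case into one consensus label per case.

    grades_per_case[i] holds the grades for case i, each shaped like
    {"tone": int, "accuracy": int, "relevance": int, "pass": bool}.
    """
    consensus = {"tone": [], "accuracy": [], "relevance": [], "pass": []}
    for grades in grades_per_case:
        for dim in ("tone", "accuracy", "relevance"):
            # Median of the 1-5 scores is robust to a single outlier grader.
            consensus[dim].append(int(median(g[dim] for g in grades)))
        # Majority vote on the binary pass label.
        consensus["pass"].append(sum(g["pass"] for g in grades) > len(grades) / 2)
    return consensus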

A judge whose agreement with humans is poor isn't ready. We've shipped judges whose initial calibration kappa was as low as 0.45; after two or three rounds of rubric tightening, the same judges hit 0.75-0.85.

Where judges shine

  • Open-ended prose outputs. Summaries, explanations, customer-facing text. Hard to grade with exact-match rules.
  • Multi-dimensional rubrics. Tone × Accuracy × Relevance. A judge can score multiple dimensions in one call, far more cheaply than human graders can.
  • High-volume cases. A nightly run over 5,000 production samples doesn't scale to humans. It scales to a calibrated judge.
  • Adversarial detection. Judges can pattern-match on suspicious outputs (prompt-injection echoes, scope-bypass attempts) more reliably than rules.

The team I started this article with was using their judge in exactly the right places — open-ended customer-support prose, daily eval over thousands of production samples. The drift wasn't a problem with the use case; it was a problem with the absence of recalibration.

Where judges fail

  • Domain-specific accuracy in specialised fields (medicine, law, regulated finance). The judge is unlikely to have the depth to detect subtle factual errors. Use domain experts, with the judge as a first-pass filter.
  • Cases where the judge shares the agent's biases. Both come from similar training data. Both can be wrong in the same way and look like agreement.
  • Truly ambiguous cases where reasonable humans disagree. The judge can't be more right than humans.
  • Tasks the judge doesn't have the inputs to evaluate properly. If grading "did the agent escalate appropriately" requires knowing the customer's history and the team's escalation policy, give the judge those inputs explicitly, as in the sketch below.
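
For that last failure mode, the fix is mostly prompt plumbing: pass the missing context in as first-class inputs. A hypothetical sketch (the escalation_policy and customer_history fields are illustrative, not from any real deployment):

ESCALATION_JUDGE_PROMPT = """
You are evaluating whether a support agent escalated appropriately.

Escalation policy (treat as ground truth for this evaluation):
{escalation_policy}

Customer history:
{customer_history}

Customer message:
{customer_message}

Agent response:
{agent_response}

Score ESCALATION 1-5 against the policy above, with a one-sentence rationale,
and return JSON: {"escalation": {"score": int, "rationale": str}}
"""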

Audit cadence

Judges drift. Quarterly meta-evals catch it.

@scheduled("0 0 1 */3 *")  # midnight on the 1st of every third month, i.e. quarterly
def quarterly_judge_audit():
    # Fixed 100-case audit set with human grades from the most recent calibration.
    cases = load_audit_set("support-judge-audit-100.jsonl")
    human_scores = load_human_scores_from_last_audit(cases)
    agreement = calibrate_judge(cases, current_judge_prompt(), human_scores)

    # Gate on the binary pass/fail agreement rate from calibrate_judge.
    if agreement["pass_agreement_pct"] < 0.80:
        page_team("Judge agreement below 80% — recalibration needed")
    else:
        log_metric("judge_audit_passed", agreement)
The team I helped now runs this. The original drift would have been caught in the first quarterly audit. Going forward, drift surfaces within 90 days at worst.

What we won't ship

Judges without calibration. Every time we've shipped one, we've ended up with the kind of incident that opened this article.

Judges with vague rubrics. "Is this good?" is not a rubric. Anchored scores are the discipline.

Skipping the periodic audit. Judges drift; the audit is the discipline that catches it.

Trusting judges in domains where they share the agent's biases. Use humans for those.

Close

LLM-as-judge is a powerful eval pattern when calibrated. The rubric is specific. The agreement rate is measured. The audit cadence holds. Skip the calibration and the judge produces noise dressed as signal, and you won't notice until a customer does.

The team I started with re-calibrated their judge, fixed the drift, and now runs quarterly audits as a permanent part of their eval infrastructure. The cost is half a day of human-grading time per quarter. The benefit is that the judge they're trusting at scale is a judge they've actually verified, repeatedly.

We build AI-enabled software and help businesses put AI to work. If you're calibrating LLM judges, we'd love to hear about it. Get in touch.

Tagged
Evals · LLM-as-judge · Engineering · Output Testing · Calibration