A team's human-eval workflow had reviewers scoring outputs on a 1-5 scale across multiple dimensions. Inter-rater agreement was poor. Same output, different scores. The reviewers were operating without clear instructions.
Human evaluators need instructions clear enough that different humans agree. The workflow is the discipline.
The reviewer's brief
For each eval task:
- Specific dimensions to score.
- Clear definitions for each dimension.
- Anchored examples.
- What to do in edge cases.
- Typical time per case.
Without these, reviewers improvise. Improvisation is variance.
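One way to keep the brief enforceable is to encode it as data the eval harness can render to reviewers and check submissions against. The sketch below is illustrative only; the dimension names, anchors, timings, and edge-case rule are made up, not from any team's actual brief.

```python
# Illustrative sketch: a reviewer brief encoded as data so the eval harness
# can render it to reviewers and validate submitted score sheets against it.
# Dimension names, anchors, and timings here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str
    definition: str
    anchors: dict[int, str]     # score -> anchored example/description
    edge_case_rule: str         # what to do when the dimension doesn't apply cleanly

@dataclass
class ReviewerBrief:
    task: str
    scale: tuple[int, int] = (1, 5)
    minutes_per_case: int = 5
    dimensions: list[Dimension] = field(default_factory=list)

    def validate(self, scores: dict[str, int]) -> list[str]:
        """Return a list of problems with a submitted score sheet."""
        problems = []
        lo, hi = self.scale
        for dim in self.dimensions:
            if dim.name not in scores:
                problems.append(f"missing score for '{dim.name}'")
            elif not lo <= scores[dim.name] <= hi:
                problems.append(f"'{dim.name}' score {scores[dim.name]} outside {lo}-{hi}")
        return problems

brief = ReviewerBrief(
    task="summarization quality",
    dimensions=[
        Dimension(
            name="faithfulness",
            definition="Every claim in the summary is supported by the source.",
            anchors={1: "invents facts", 3: "minor unsupported details", 5: "fully supported"},
            edge_case_rule="If the source contradicts itself, score 3 and flag for the lead.",
        ),
    ],
)
```

A brief the harness can validate is also a brief that can't silently drift from what reviewers actually see.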
Onboarding
New reviewers go through:
- Reading the brief.
- Practice cases with feedback.
- Calibration session with experienced reviewers.
- First independent batch with audit.
Skipping onboarding produces inconsistent reviews from day one.
Reviewer ritual
Periodic:
- Inter-rater agreement audits (one way to compute them is sketched after this list).
- Brief updates as new edge cases emerge.
- Reviewer feedback sessions.
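The audit itself is cheap to compute. A minimal sketch for paired scores, assuming each case was scored by exactly two reviewers on the same dimension; the example numbers are invented:

```python
# Sketch of an inter-rater agreement audit for paired scores.
# Raw agreement is easy to read; Cohen's kappa corrects for chance agreement.
from collections import Counter

def percent_agreement(a: list[int], b: list[int]) -> float:
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

def cohens_kappa(a: list[int], b: list[int]) -> float:
    n = len(a)
    observed = percent_agreement(a, b)
    # Expected agreement if both reviewers scored at random with their own marginal rates.
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

reviewer_1 = [4, 5, 3, 4, 2, 5, 4]
reviewer_2 = [4, 5, 3, 3, 2, 5, 4]
print(f"agreement: {percent_agreement(reviewer_1, reviewer_2):.0%}")
print(f"kappa:     {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

Report both: raw agreement is what reviewers understand, kappa keeps a skewed score distribution from flattering the number.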
A real workflow
A team's human-eval setup:
- 5 reviewers trained.
- Each case scored by 2 reviewers.
- Disagreements escalate to the lead (routing sketched below).
- Quarterly inter-rater agreement audit.
- Brief updated based on patterns surfaced by the audits.
Inter-rater agreement: 87%. Workable signal.
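The double-scoring and escalation steps are mechanical enough to automate. A hypothetical sketch of the routing, assuming a one-point disagreement threshold (the team's actual threshold isn't specified above, and the reviewer names are placeholders):

```python
# Hypothetical sketch: route each case to two reviewers, escalate disagreements.
# A "disagreement" here is any dimension where the two scores differ by more
# than one point; the exact threshold is an assumption.
from dataclasses import dataclass

DISAGREEMENT_THRESHOLD = 1  # max allowed gap before the lead adjudicates

@dataclass
class Review:
    reviewer: str
    scores: dict[str, int]   # dimension -> score

def resolve(case_id: str, first: Review, second: Review, escalation_queue: list) -> dict[str, float]:
    """Merge two reviews; push disagreements to the lead instead of averaging them away."""
    final: dict[str, float] = {}
    for dim in first.scores:
        a, b = first.scores[dim], second.scores[dim]
        if abs(a - b) > DISAGREEMENT_THRESHOLD:
            escalation_queue.append((case_id, dim, first.reviewer, a, second.reviewer, b))
        else:
            final[dim] = (a + b) / 2
    return final

queue: list = []
final = resolve(
    "case-0042",
    Review("alice", {"faithfulness": 5, "clarity": 4}),
    Review("bob", {"faithfulness": 2, "clarity": 4}),
    queue,
)
# faithfulness (5 vs 2) lands in the escalation queue; clarity is averaged.
```

Averaging away large disagreements would hide exactly the cases the lead needs to see; escalating them is the point.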
Trade-offs
- Strict workflow: higher agreement, slower onboarding, less reviewer judgment.
- Loose workflow: faster onboarding, more reviewer judgment, lower agreement.
Most teams should err on the strict side. Loose feels good but produces noise.
What we won't ship
- Human-eval workflows without explicit instructions.
- Reviewers without onboarding.
- Single-reviewer scoring for high-stakes evals.
- Skipping the inter-rater agreement audit.
Close
Human eval requires discipline. The brief is the spec. Onboarding is non-optional. Agreement is measured. The workflow stays consistent because the team invests in it. Skip these and human eval produces noise.
Related reading
- Building your first eval set — companion topic.
- Judging open-ended output — rubric discipline.
- Eval taxonomy — surrounding context.
We build AI-enabled software and help businesses put AI to work. If you're improving human-eval workflows, we'd love to hear about it. Get in touch.