A team's eval scored 96% on model X. The provider released model Y. The same eval scored 78%. The team panicked. Investigation revealed the eval was over-fit to model X's specific phrasings; model Y was producing different but equally valid outputs.
Evals that survive a model bump are decoupled from specific model behaviours.
The migration discipline
When a model version bumps:
- Run eval on new model.
- If significant regression, investigate.
- Distinguish: is this a real regression, or is the eval over-fit?
- Adjust eval if over-fit; adjust prompt if real regression.
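The steps above can be sketched as a small check script. Everything here is illustrative: `run_eval`, the stand-in scores, and the 5-point threshold are assumptions, not a real harness API.

```python
# Sketch of the bump check. run_eval is a hypothetical harness hook
# that scores an eval suite against a named model (0.0 - 1.0).

REGRESSION_THRESHOLD = 0.05  # assumed tolerance for illustration

def run_eval(model: str) -> float:
    # Stand-in scores; a real harness would execute every eval case.
    return {"model-x": 0.96, "model-y": 0.78}[model]

def check_bump(current: str, candidate: str) -> str:
    old, new = run_eval(current), run_eval(candidate)
    if old - new <= REGRESSION_THRESHOLD:
        return "bump"  # scores held up: safe to proceed
    # Significant drop: a human decides whether the eval is over-fit
    # (adjust the eval) or the regression is real (adjust the prompt).
    return "investigate"

print(check_bump("model-x", "model-y"))  # prints "investigate"
```

The script never decides over-fit vs real regression on its own; it only routes significant drops to a human, which is the discipline the list describes.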
The investigation is the discipline. Skipping it leads to either rolling back unnecessarily or hiding real regressions.
Reviewer ritual
Pre-bump:
- Eval against new model on a branch.
- Compare to current model.
- Identify cohorts that moved.
- Decide: bump, adjust, defer.
Post-bump:
- Monitor production traffic.
- Watch for issues eval didn't catch.
- Be ready to roll back.
A real migration
A team's bump from Opus 4.6 to Opus 4.7:
- Pre-bump eval: 96% on 4.6, 91% on 4.7.
- Investigation: 4.7 was more concise; eval cases that expected longer responses regressed.
- Decision: update eval to accept varied lengths (real improvement, not regression).
- Post-update eval: 96% on 4.6, 95% on 4.7.
- Bump shipped.
The migration was three days of work. Without the investigation, the team would have rolled back and missed a quality improvement.
Decoupling
Evals decouple from specific model behaviours by:
- Behavioural rather than exact-match assertions.
- Multiple acceptable outputs per case.
- Rubric-based scoring.
These survive model variance. Exact-match against specific phrasing doesn't.
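The three decoupling styles can be shown side by side. The output string, labels, and rubric criteria are hypothetical examples, not a real harness API:

```python
# Sketch of the three decoupling styles from the list above.

output = "The refund was approved; the customer will see it in 3-5 days."

# 1. Behavioural assertion: check the behaviour, not the phrasing.
assert "refund" in output.lower() and "approved" in output.lower()

# 2. Multiple acceptable outputs: any of several valid labels passes.
acceptable = {"approved", "refund approved", "approval confirmed"}
label = "approved"
assert label in acceptable

# 3. Rubric-based scoring: score criteria, pass above a threshold.
rubric = {
    "mentions_refund": "refund" in output.lower(),
    "states_outcome": "approved" in output.lower(),
    "gives_timeline": "days" in output,
}
score = sum(rubric.values()) / len(rubric)
assert score >= 0.67  # passes as long as most criteria hold
```

All three pass whether the model says "approved", "refund approved", or a longer sentence with the same content; an exact string comparison against one phrasing would not.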
Limits
Some evals have to be exact-match (classification, structured extraction). For these, migration means adjusting the prompt until the new model produces the exact expected outputs again.
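For contrast with the decoupled styles, an exact-match case looks like this. The `classify` stand-in, the ticket, and the label are hypothetical:

```python
# Sketch of an exact-match eval case for classification, where output
# variance is not acceptable. The classifier here is a stand-in.

def classify(ticket: str) -> str:
    # Stand-in for a model call that must return one fixed label.
    return "billing"

expected = {"I was charged twice this month": "billing"}

for ticket, label in expected.items():
    assert classify(ticket) == label  # exact match, no rubric

print("exact-match eval passed")
```

Here a model bump that changes the label at all is a real regression by definition, so the fix lives in the prompt, not the eval.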
What we won't ship
- Evals that are purely exact-match for tasks where variance is OK.
- Migrations without an eval re-run.
- Bumps without investigation when scores move.
- Skipping the post-bump production monitoring.
Close
Evals survive model bumps when decoupled from specific model behaviour. Behavioural assertions, multiple acceptable outputs, rubric scoring. The team's eval keeps signal across migrations. Skip the discipline and every model bump becomes a fire drill.
Related reading
- Pinning model versions — surrounding pattern.
- Versioning model + prompt as a unit — bundle discipline.
- Eval taxonomy — surrounding context.
We build AI-enabled software and help businesses put AI to work. If you're managing model migrations, we'd love to hear about it. Get in touch.