
Eval cost management

Eval costs scale with the size of the eval set. Teams need to manage that spend like any other engineering cost.

Yash Shah | March 2, 2026 | 2 min read

A team's eval set grew to 800 cases. Running it cost $50 per CI run. With 30 PRs per week, that came to about $1,500 a week, or roughly $6K/month. It was a meaningful line item the team hadn't budgeted for.
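The arithmetic behind that figure is worth keeping in a small script so it can be rerun as the numbers change. This is a minimal sketch, assuming one eval run per PR and a four-week month (neither is stated in the example); the function name is hypothetical.

```python
def monthly_eval_cost(cost_per_run: float, runs_per_week: int,
                      weeks_per_month: float = 4.0) -> float:
    """Estimate monthly eval spend from per-run cost and weekly run count."""
    return cost_per_run * runs_per_week * weeks_per_month


# Figures from the example: $50 per run, 30 PRs per week, 800 cases.
# Assumption: each PR triggers exactly one eval run.
monthly = monthly_eval_cost(cost_per_run=50.0, runs_per_week=30)
per_case = 50.0 / 800  # average cost of one case in the full set

print(f"monthly: ${monthly:,.0f}")   # monthly: $6,000
print(f"per case: ${per_case:.4f}")  # per case: $0.0625
```

PRs that receive several pushes can trigger more than one run, so real spend may sit above this estimate.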

Eval cost management is the discipline that keeps the eval suite affordable as it grows.

The sampling pattern

Most of the eval suite doesn't need to run on every PR. Split it into tiers:

Smoke: 10-20% of cases per PR. Cheap, fast.
Full: nightly + on significant changes. More expensive.
Comprehensive: weekly or quarterly. Most expensive.

With smoke as the default, per-PR cost falls roughly in proportion to the sample: a 10-20% sample costs about 10-20% of a full run.
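One way to pick the smoke tier is to hash each case ID into a stable bucket, so the same subset runs on every PR and results stay comparable between runs. The sketch below assumes cases have string IDs; `in_smoke_set`, `select_cases`, the tier names, and the 15% default are all hypothetical choices, not a prescribed method.

```python
import hashlib


def in_smoke_set(case_id: str, fraction: float = 0.15) -> bool:
    """Return True if the case falls in the smoke tier.

    Hashing the ID maps each case to a stable bucket in [0, 1),
    so selection does not change from run to run.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def select_cases(case_ids: list[str], tier: str,
                 fraction: float = 0.15) -> list[str]:
    """Pick cases for a tier: 'smoke' samples, any other tier runs everything."""
    if tier == "smoke":
        return [c for c in case_ids if in_smoke_set(c, fraction)]
    return list(case_ids)


all_cases = [f"case-{i:04d}" for i in range(800)]  # hypothetical case IDs
smoke = select_cases(all_cases, "smoke")
print(len(smoke), "of", len(all_cases), "cases in the smoke tier")
```

A hash-based sample is only a starting point. A hand-picked list of critical cases can be added to the sampled set so that known-important behaviour is always covered.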

Caching

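Where the eval sends the same prompts repeatedly:

Prefix caching reduces token cost for prompts that share a long common prefix.
Response caching skips the model call entirely for deterministic eval cases.

The cache has to persist between CI runs; a cache that starts cold on every job saves nothing.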
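Response caching can be sketched as a file-backed store keyed on everything that affects the output. Because the prompt is part of the key, editing a prompt produces a miss rather than a stale hit. `ResponseCache` and `fake_model` are hypothetical names for illustration; prefix caching is usually handled by the model provider or serving layer and is not shown.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class ResponseCache:
    """File-backed response cache for deterministic eval cases (illustrative)."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key(model: str, prompt: str, params: dict) -> str:
        # Model, prompt, and parameters all go into the key, so changing
        # any of them results in a cache miss.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, params: dict,
                    call: Callable[[str, str, dict], str]) -> str:
        path = self.directory / f"{self.key(model, prompt, params)}.json"
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = call(model, prompt, params)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response


calls: list[str] = []


def fake_model(model: str, prompt: str, params: dict) -> str:
    """Stand-in for a real model client; records each call it receives."""
    calls.append(prompt)
    return prompt.upper()


cache = ResponseCache(tempfile.mkdtemp())
params = {"temperature": 0}
cache.get_or_call("model-a", "classify: hello", params, fake_model)   # miss
cache.get_or_call("model-a", "classify: hello", params, fake_model)   # hit
cache.get_or_call("model-a", "classify: hello!", params, fake_model)  # prompt changed: miss
print(len(calls))  # 2
```

In CI the cache directory would need to be restored and saved between jobs; the temporary directory here only keeps the sketch self-contained.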

Reviewer ritual

Eval cost gets reviewed monthly:

Total spend.
Cost per run.
Cost per case.
Trend.

Significant moves get investigated. Often a runaway turns out to be a flaky case being re-run, or a model whose pricing has gone up since the suite was configured.
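The four review numbers can be computed from per-run records. The sketch below assumes each run logs its month, cost, and case count; `RunRecord`, `monthly_summary`, `flag_moves`, and the 25% threshold are hypothetical, and the sample records are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    month: str       # ISO year-month, e.g. "2026-01"
    cost_usd: float  # total spend for this eval run
    cases: int       # number of cases executed


def monthly_summary(runs: list[RunRecord]) -> dict[str, dict[str, float]]:
    """Aggregate total spend, cost per run, and cost per case by month."""
    summary: dict[str, dict[str, float]] = {}
    for month in sorted({r.month for r in runs}):
        month_runs = [r for r in runs if r.month == month]
        total = sum(r.cost_usd for r in month_runs)
        cases = sum(r.cases for r in month_runs)
        summary[month] = {
            "total": total,
            "per_run": total / len(month_runs),
            "per_case": total / cases if cases else 0.0,
        }
    return summary


def flag_moves(summary: dict[str, dict[str, float]],
               threshold: float = 0.25) -> list[str]:
    """Return months whose total spend moved more than `threshold`
    relative to the previous month."""
    months = sorted(summary)
    flagged = []
    for prev, cur in zip(months, months[1:]):
        before = summary[prev]["total"]
        if before and abs(summary[cur]["total"] - before) / before > threshold:
            flagged.append(cur)
    return flagged


# Made-up records for illustration only.
runs = [
    RunRecord("2026-01", 50.0, 800),
    RunRecord("2026-01", 50.0, 800),
    RunRecord("2026-02", 50.0, 800),
    RunRecord("2026-02", 90.0, 800),  # illustrative spike
]
summary = monthly_summary(runs)
print(flag_moves(summary))  # ['2026-02']
```

A flagged month is a prompt to look at individual runs, not a verdict; the threshold is a judgment call for each team.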

A real saving

A team's eval optimisation:

Pre-optimisation: $6K/month, 800-case full set per PR.
Post-optimisation: $1K/month, 100-case smoke per PR + nightly full.
Quality picture: equivalent (smoke catches the common regressions).

A single afternoon's work saved $5K/month.

Trade-offs

Smaller per-PR eval = faster CI but slower regression detection.
Smoke set quality matters; design carefully.

Most teams over-eval per PR. Smoke + full is more economical, and the nightly full run still catches what the smoke set misses, at the cost of up to a day's delay.

What we won't ship

Full eval on every PR without justification.

Eval costs that aren't budgeted.

Skipping the monthly cost review.

Caching that doesn't get invalidated when prompts change.

Close

Eval cost management is the engineering of running evals affordably: a smoke + full strategy, caching, and a monthly cost review. With those in place, the eval suite scales without becoming a cost crisis.

