Jaypore Labs

Sampling production traffic for eval

Sampling for eval is sampling for representativeness. The pattern matters.

Yash Shah · March 11, 2026 · 2 min read

One team drew its eval set from "any 100 production cases per week." The set skewed toward the high-volume cases. The long-tail cases, where most failures lived, were under-represented.

Sampling for eval is sampling for representativeness, not random uniformity. A uniform draw mirrors traffic volume, so in a small sample the rare categories barely appear.

The sampling rule

The strategy depends on what you want the eval to capture:

  • Representative. Sample from the production distribution. Good for general quality.
  • Stratified. Sample equally from each category. Good for ensuring coverage.
  • Targeted. Sample from specific failure modes. Good for catching specific issues.

Most teams need a mix.
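
A minimal sketch of how the three strategies might combine, assuming each production case is a dict with hypothetical `id` and `category` keys and that `is_target` is a caller-supplied predicate for known failure modes. None of these names come from a real pipeline; they only illustrate the idea.

```python
import random
from collections import defaultdict

def representative_sample(cases, k, rng):
    """Uniform draw: mirrors production, so high-volume categories dominate."""
    return rng.sample(cases, min(k, len(cases)))

def stratified_sample(cases, per_category, rng):
    """Draw up to `per_category` cases from every category, whatever its volume."""
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)
    picked = []
    for group in by_category.values():
        picked.extend(rng.sample(group, min(per_category, len(group))))
    return picked

def targeted_sample(cases, is_target):
    """Keep every case that matches a known failure mode."""
    return [case for case in cases if is_target(case)]

def mixed_sample(cases, k, per_category, is_target, seed=0):
    """Union of the three strategies, de-duplicated by the assumed `id` key."""
    rng = random.Random(seed)
    combined = (
        representative_sample(cases, k, rng)
        + stratified_sample(cases, per_category, rng)
        + targeted_sample(cases, is_target)
    )
    unique = {}
    for case in combined:
        unique.setdefault(case["id"], case)  # first occurrence wins
    return list(unique.values())
```

With 1,000 cases in one category and 5 in another, the representative draw alone would often miss the small category entirely. The stratified pass guarantees it a floor, and the targeted pass keeps any flagged failure.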

Privacy

Sampled production traffic can contain user data, so sampling carries privacy obligations:

  • Sanitisation runs before storage.
  • The sampling rate is balanced against the privacy budget.
  • The retention period is limited.
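
As a rough illustration of "sanitise before storage" and a bounded retention period, the sketch below redacts text before anything is written and purges records past a cutoff. The two regular expressions are deliberately crude stand-ins for a real PII scrubber, and the 30-day window and the in-memory `store` dict are assumptions made up for the example. Privacy-budget accounting is not modelled here.

```python
import re
from datetime import datetime, timedelta, timezone

# Crude illustrative patterns; a real scrubber would cover far more cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\d{6,}")  # stand-in for phone or account numbers

RETENTION = timedelta(days=30)  # hypothetical retention window

def sanitise(text):
    """Redact obvious identifiers from raw text."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)

def store_sample(store, case_id, raw_text, now):
    """Sanitise first, so the raw text is never written to the store."""
    store[case_id] = {"text": sanitise(raw_text), "stored_at": now}

def purge_expired(store, now):
    """Delete records older than RETENTION; return how many were removed."""
    expired = [cid for cid, rec in store.items()
               if now - rec["stored_at"] > RETENTION]
    for cid in expired:
        del store[cid]
    return len(expired)
```

The ordering is the point: `store_sample` never accepts already-stored raw text, so there is no window in which unsanitised data sits at rest.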

Reviewer ritual

The sampling configuration is reviewed on a regular schedule:

  • Per-category sample rates are set explicitly.
  • The sampled distribution is audited periodically against production.
  • Targeting is adjusted as new failure modes emerge.
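
One way the periodic audit could look in practice is sketched below: compare each category's share of production with its share of the eval set, and flag categories whose eval count falls under a floor. The `category` key and the `min_eval_cases` threshold are assumptions chosen for illustration.

```python
from collections import Counter

def audit_distribution(production, eval_set, min_eval_cases=50):
    """Per-category report of production share, eval share, and coverage."""
    prod_counts = Counter(case["category"] for case in production)
    eval_counts = Counter(case["category"] for case in eval_set)
    prod_total = sum(prod_counts.values())
    eval_total = sum(eval_counts.values()) or 1  # avoid dividing by zero
    report = {}
    for category, n in prod_counts.items():
        report[category] = {
            "production_share": n / prod_total,
            "eval_share": eval_counts[category] / eval_total,
            "eval_count": eval_counts[category],
            # Flag categories that appear in production but are thin in the eval set.
            "under_covered": eval_counts[category] < min_eval_cases,
        }
    return report
```

A reviewer would read the flagged rows as candidates for a higher per-category rate or a new targeting rule.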

A real workflow

A team's sampling:

  • 0.5% of all traffic for representative coverage.
  • Stratified: 50 cases per category per week.
  • Targeted: failures and escalations get 100% sampling (smaller absolute numbers).

The eval set grows representatively.
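
The three rules above could be expressed as a per-request decision, sketched here under several assumptions: requests carry hypothetical `id`, `category`, `failed`, and `escalated` fields; the 0.5% draw is made deterministic by hashing the request id; and the weekly per-category quota is filled first-come rather than randomly, which is a simplification (reservoir sampling would be one alternative).

```python
import hashlib

REPRESENTATIVE_RATE = 0.005   # 0.5% of all traffic
PER_CATEGORY_WEEKLY = 50      # stratified quota per category per week

def hash_fraction(request_id):
    """Map a request id to a stable, roughly uniform value between 0 and 1."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def sampling_reasons(case, weekly_counts):
    """Return the rules under which this case is sampled (empty list: skip).

    `weekly_counts` maps category -> cases already taken this week and is
    updated in place; the caller is assumed to reset it each week.
    """
    reasons = []
    if case["failed"] or case["escalated"]:
        reasons.append("targeted")  # failures and escalations: 100% sampling
    taken = weekly_counts.get(case["category"], 0)
    if taken < PER_CATEGORY_WEEKLY:
        reasons.append("stratified")
        weekly_counts[case["category"]] = taken + 1
    if hash_fraction(case["id"]) < REPRESENTATIVE_RATE:
        reasons.append("representative")
    return reasons
```

Hashing the id instead of calling a random generator means the same request gets the same decision on every replay, which can make the sample easier to audit.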

Trade-offs

  • More sampling: more privacy exposure, more storage, more eval-set growth.
  • Less sampling: less privacy exposure, sparser coverage.

The right balance depends on the team's privacy posture and eval needs.
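
To make the trade-off concrete, a back-of-the-envelope estimate of weekly eval-set growth can help. The function below assumes the default rates of a 0.5% representative draw and 50 cases per category; the traffic, category, and failure figures in the usage note are invented purely for the arithmetic.

```python
def weekly_eval_growth(weekly_requests, categories, failure_count,
                       rate=0.005, per_category=50):
    """Estimate how many cases each rule adds per week.

    The total is an upper bound: a case picked by more than one rule is
    stored once, and small categories may not fill their quota.
    """
    representative = round(weekly_requests * rate)
    stratified = categories * per_category
    targeted = failure_count
    return {
        "representative": representative,
        "stratified": stratified,
        "targeted": targeted,
        "upper_bound": representative + stratified + targeted,
    }
```

For a hypothetical 1,000,000 requests a week, 12 categories, and 800 failures or escalations, this gives 5,000 + 600 + 800, so at most 6,400 stored cases per week. Every one of those is a record to sanitise, retain, and eventually purge.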

What we won't ship

Sampling without privacy sanitisation.

Random sampling alone when stratified would give better coverage.

Sampling without retention policy.

Sampling without periodic audit.

Close

Sampling production traffic for eval is the discipline of building a representative eval set without exposing the team to privacy issues. Stratified, sanitised, retention-bounded. The eval matures because production informs it.

We build AI-enabled software and help businesses put AI to work. If you're sampling for eval, we'd love to hear about it. Get in touch.
