Jaypore Labs
Engineering

End-to-end tests for AI workflows: scope and survival

E2E tests for AI are expensive and brittle. Use them sparingly; design them for survival.

Yash Shah · April 23, 2026 · 2 min read

A team's e2e test suite for an AI workflow grew to 80 tests. It took 45 minutes to run. Half the tests flaked occasionally. The team stopped trusting the suite. Then the suite stopped catching anything because failures were assumed to be flake.

E2E tests for AI are expensive and brittle. The discipline is using them sparingly and designing them for survival.

The thin-slice pattern

Keep e2e coverage as small as possible:

  • One test per critical user-flow.
  • 10-20 e2e tests for most products.
  • Each one tests the entire flow end-to-end.

Most coverage lives in lower layers (unit, integration, eval). E2E catches what the layered tests can't.

Reviewer ritual

During PR review, check that:

  • The e2e flake rate is within budget.
  • Any new e2e test is justified (most testing belongs in lower layers).
  • Failed e2e tests are investigated, not blindly retried.
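The flake-rate check can be mechanised. A minimal sketch, assuming run records of the form (test name, passed on first try, passed after retry) and a 5% flake budget; all names and the threshold are illustrative, not from the team's actual tooling:

```python
from collections import defaultdict

FLAKE_THRESHOLD = 0.05  # assumed budget: flag tests flaking in >5% of runs

def flake_rates(runs):
    """Return {test_name: flake_rate}.

    A "flake" is a run that failed on the first try but passed on retry.
    """
    totals = defaultdict(int)
    flakes = defaultdict(int)
    for name, first_try, after_retry in runs:
        totals[name] += 1
        if not first_try and after_retry:
            flakes[name] += 1
    return {name: flakes[name] / totals[name] for name in totals}

def flaky_tests(runs, threshold=FLAKE_THRESHOLD):
    """Names of tests whose flake rate exceeds the budget, for review."""
    return sorted(n for n, r in flake_rates(runs).items() if r > threshold)
```

Fed from CI run history, this turns "flake rate is acceptable" from a feeling into a number the reviewer can look at.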

A real test

A team's e2e test for a customer-onboarding agent:

  • Real user account created.
  • Real document upload.
  • Real LLM agent run.
  • Assertions on final state.
  • Cleanup at end.

Run it nightly and pre-release, not on every PR (too slow, too costly).

Coverage

What e2e covers:

  • Critical user-flows.
  • Cross-system integration (DB, queue, LLM, frontend).
  • Real-data scenarios (sanitised production samples).

What e2e doesn't cover:

  • Most behaviours (lower layers).
  • Edge cases (use unit/integration).
  • Performance regressions (use perf tests).

Maintenance

E2E maintenance:

  • Review the suite quarterly.
  • Retire stale tests.
  • Investigate and fix flaky tests; retire the ones that can't be fixed.
  • Add a new test when an incident exposes a user-flow gap.

What we won't ship

Broad e2e coverage. Lower layers are cheaper and more reliable.

E2E tests without flake-rate metrics.

Skipping investigation of e2e failures.

E2E tests that test the same thing as integration tests.

Close

E2E tests for AI workflows are expensive and brittle. Use them sparingly. Design them for the user-flow that matters. Maintain them ruthlessly. The team's confidence comes from the layered tests; e2e is the final check.

We build AI-enabled software and help businesses put AI to work. If you're tightening e2e tests, we'd love to hear about it. Get in touch.

Tagged: Testing, AI Engineering, Engineering, Testing for AI, E2E