A team had two integration tests for the same AI feature — one that mocked the LLM with a fixed response (contract test), and one that called the real LLM and checked behaviour (behavioural test). They couldn't decide which was right. The answer is: both are right, for different things.
Contract tests
Contract tests verify the integration's plumbing:
- Tool calls are made in the right order.
- Response shapes match expectations.
- Error handling kicks in when expected.
- Side effects happen as expected.
These tests mock the LLM with predictable responses. They're fast, deterministic, and catch plumbing issues.
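A contract test can be sketched in a few lines. Everything here is illustrative — `summarise_ticket` and the `complete` method stand in for whatever integration code and provider client the team actually has:

```python
# A minimal contract-test sketch. `summarise_ticket` and the client
# interface are hypothetical stand-ins for your own integration code.
from unittest.mock import MagicMock

def summarise_ticket(client, ticket_text):
    # The integration under test: one LLM call, one shaped result.
    reply = client.complete(prompt=f"Summarise: {ticket_text}")
    return {"summary": reply["text"], "model": reply["model"]}

def test_summarise_ticket_contract():
    # Mock the LLM with a fixed, predictable response.
    client = MagicMock()
    client.complete.return_value = {"text": "User cannot log in.", "model": "stub-1"}

    result = summarise_ticket(client, "Login page returns 500 after deploy.")

    # Plumbing assertions: right call, right shape. No judgement on quality.
    client.complete.assert_called_once()
    assert set(result) == {"summary", "model"}
    assert result["summary"] == "User cannot log in."
```

Note what the test does not check: whether the summary is any good. That question belongs to the behavioural suite.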
Behavioural tests
Behavioural tests verify the LLM-powered behaviour:
- Given a real input, does the system produce a useful output?
- Does the agent handle edge cases correctly?
- Does the multi-step flow converge?
These tests call the actual LLM. They're slower, more expensive, and have variance.
When each wins
Contract tests:
- Run on every PR.
- Catch refactoring regressions.
- Verify integration logic.
- Don't catch model-quality regressions.
Behavioural tests:
- Run on a schedule (nightly) or on significant changes.
- Catch model-quality regressions.
- Verify the system end-to-end.
- Don't catch every refactoring regression.
A team needs both. Each catches what the other doesn't.
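One way to wire the split, sketched with pytest markers (marker names and config are illustrative): contract tests run by default on every PR, behavioural tests only when explicitly selected by the nightly job.

```python
import pytest

# Splitting the suites with pytest markers. With the pytest.ini below,
# the default run — the per-PR job — excludes behavioural tests; the
# nightly job selects them with `pytest -m behavioural`.
#
# [pytest]
# markers =
#     contract: mocked-LLM tests, run on every PR
#     behavioural: real-LLM tests, run nightly
# addopts = -m "not behavioural"

@pytest.mark.contract
def test_tool_call_order():
    ...  # mocked-LLM plumbing assertions go here

@pytest.mark.behavioural
def test_real_model_summary():
    ...  # real-LLM behavioural assertions go here
```

The same split works with any runner that supports tags or labels; the point is that the expensive suite is opt-in, not opt-out.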
Reviewer ritual
PR reviews check both:
- Were contract tests updated to reflect changes?
- Were behavioural tests run on significant changes?
- Are eval-set updates needed?
A real test set
A team's setup:
- 80 contract tests run per PR (mocked LLM).
- 30 behavioural tests run nightly.
- Failures in behavioural tests trigger eval-set review.
- Pattern: contract tests catch refactoring; behavioural tests catch quality drift.
Trade-offs
- Contract tests are cheap; behavioural tests are expensive.
- Contract tests are deterministic; behavioural tests have variance.
- Contract tests miss what behavioural tests catch (and vice versa).
The real trade-off is where to invest. Most teams under-invest in behavioural tests; some over-invest in contract tests.
What we won't ship
Test suites with only contract tests. Quality regressions go unnoticed.
Test suites with only behavioural tests. PR cycles slow to a crawl.
Behavioural tests with extreme variance. Test stability matters; flake the tests, lose the signal.
Skipping the integration layer entirely. Unit + e2e is not enough.
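The variance point above can be managed rather than tolerated. One sketch: run a behavioural check several times and require a pass rate, instead of failing on a single bad sample. `run_once` is a hypothetical callable performing one real-LLM check and returning True or False:

```python
# Flake-control sketch: require a pass rate across repeated attempts
# rather than all-or-nothing. `run_once` is a hypothetical callable that
# performs one real-LLM behavioural check and returns True/False.

def pass_rate(run_once, attempts: int = 5) -> float:
    """Fraction of attempts in which the behavioural check passed."""
    passes = sum(1 for _ in range(attempts) if run_once())
    return passes / attempts

def assert_stable(run_once, attempts: int = 5, threshold: float = 0.8):
    rate = pass_rate(run_once, attempts)
    assert rate >= threshold, f"pass rate {rate:.0%} below {threshold:.0%}"
```

The threshold is a budgeted judgement call: set it from observed baseline variance, and treat a falling pass rate as quality drift rather than test noise.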
Close
Integration tests for AI features come in two flavours: contract (cheap, deterministic) and behavioural (expensive, real). Both are needed. Run contract tests on every PR; run behavioural tests on a schedule. Together, the coverage spans plumbing and quality.
Related reading
- The new test pyramid — preceding pattern.
- Behavioural assertions — companion topic.
- Mock LLMs in tests — implementation depth.
We build AI-enabled software and help businesses put AI to work. If you're tightening integration tests, we'd love to hear about it. Get in touch.