A team had two integration tests for the same AI feature — one that mocked the LLM with a fixed response (contract test), and one that called the real LLM and checked behaviour (behavioural test). They couldn't decide which was right. The answer is: both are right, for different things.
Contract tests
Contract tests verify the integration's plumbing:
- Tool calls are made in the right order.
- Response shapes match expectations.
- Error handling kicks in when expected.
- Side effects happen as expected.
These tests mock the LLM with predictable responses. They're fast, deterministic, and catch plumbing issues.
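A contract test can be sketched in a few lines. Everything here is illustrative — `summarise_ticket` and the `complete` method stand in for whatever integration code and provider client the team actually has:

```python
# A minimal contract-test sketch. `summarise_ticket` and the client
# interface are hypothetical stand-ins for your own integration code.
from unittest.mock import MagicMock

def summarise_ticket(client, ticket_text):
    # The integration under test: one LLM call, one shaped result.
    reply = client.complete(prompt=f"Summarise: {ticket_text}")
    return {"summary": reply["text"], "model": reply["model"]}

def test_summarise_ticket_contract():
    # Mock the LLM with a fixed, predictable response.
    client = MagicMock()
    client.complete.return_value = {"text": "User cannot log in.", "model": "stub-1"}

    result = summarise_ticket(client, "Login page returns 500 after deploy.")

    # Plumbing assertions: right call, right shape. No judgement on quality.
    client.complete.assert_called_once()
    assert set(result) == {"summary", "model"}
    assert result["summary"] == "User cannot log in."
```

Note what the test does not check: whether the summary is any good. That question belongs to the behavioural suite.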
Behavioural tests
Behavioural tests verify the LLM-powered behaviour:
- Given a real input, does the system produce a useful output?
- Does the agent handle edge cases correctly?
- Does the multi-step flow converge?
These tests call the actual LLM. They're slower, more expensive, and have variance.
When each wins
Contract tests:
- Run on every PR.
- Catch refactoring regressions.
- Verify integration logic.
- Don't catch model-quality regressions.
Behavioural tests:
- Run on a schedule (nightly) or on significant changes.
- Catch model-quality regressions.
- Verify the system end-to-end.
- Don't catch every refactoring regression.
A team needs both. Each catches what the other doesn't.
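One way to wire the split, sketched with pytest markers (marker names and config are illustrative): contract tests run by default on every PR, behavioural tests only when explicitly selected by the nightly job.

```python
import pytest

# Splitting the suites with pytest markers. With the pytest.ini below,
# the default run — the per-PR job — excludes behavioural tests; the
# nightly job selects them with `pytest -m behavioural`.
#
# [pytest]
# markers =
#     contract: mocked-LLM tests, run on every PR
#     behavioural: real-LLM tests, run nightly
# addopts = -m "not behavioural"

@pytest.mark.contract
def test_tool_call_order():
    ...  # mocked-LLM plumbing assertions go here

@pytest.mark.behavioural
def test_real_model_summary():
    ...  # real-LLM behavioural assertions go here
```

The same split works with any runner that supports tags or labels; the point is that the expensive suite is opt-in, not opt-out.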
Reviewer ritual
PR reviews check both:
- Were contract tests updated to reflect changes?
- Were behavioural tests run on significant changes?
- Are eval-set updates needed?
A real test set
A team's setup:
- 80 contract tests run per PR (mocked LLM).
- 30 behavioural tests run nightly.
- Failures in behavioural tests trigger eval-set review.
- Pattern: contract tests catch refactoring; behavioural tests catch quality drift.
Trade-offs
- Contract tests are cheap; behavioural tests are expensive.
- Contract tests are deterministic; behavioural tests have variance.
- Contract tests miss what behavioural tests catch (and vice versa).
The real trade-off is where to invest. Most teams under-invest in behavioural tests; some over-invest in contract tests.
What we won't ship
Test suites with only contract tests. Quality regressions go unnoticed.
Test suites with only behavioural tests. PR cycles slow to a crawl.
Behavioural tests with extreme variance. Test stability matters; flake the tests, lose the signal.
Skipping the integration layer entirely. Unit + e2e is not enough.
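The variance point above can be managed rather than tolerated. One sketch: run a behavioural check several times and require a pass rate, instead of failing on a single bad sample. `run_once` is a hypothetical callable performing one real-LLM check and returning True or False:

```python
# Flake-control sketch: require a pass rate across repeated attempts
# rather than all-or-nothing. `run_once` is a hypothetical callable that
# performs one real-LLM behavioural check and returns True/False.

def pass_rate(run_once, attempts: int = 5) -> float:
    """Fraction of attempts in which the behavioural check passed."""
    passes = sum(1 for _ in range(attempts) if run_once())
    return passes / attempts

def assert_stable(run_once, attempts: int = 5, threshold: float = 0.8):
    rate = pass_rate(run_once, attempts)
    assert rate >= threshold, f"pass rate {rate:.0%} below {threshold:.0%}"
```

The threshold is a budgeted judgement call: set it from observed baseline variance, and treat a falling pass rate as quality drift rather than test noise.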
Close
Integration tests for AI features come in two flavours: contract (cheap, deterministic) and behavioural (expensive, real). Both are needed. Run contract tests on every PR; run behavioural tests on a schedule. Together, the coverage spans plumbing and quality.
Related reading
- The new test pyramid — preceding pattern.
- Behavioural assertions — companion topic.
- Mock LLMs in tests — implementation depth.
We build AI-enabled software and help businesses put AI to work. If you're tightening integration tests, we'd love to hear about it. Get in touch.