Jaypore Labs

Integration tests for AI features: contract or behavioural?

Two patterns for integration testing AI features. Each one fits a different job.

Yash Shah · April 27, 2026 · 3 min read

A team had two integration tests for the same AI feature: one that mocked the LLM with a fixed response (a contract test), and one that called the real LLM and checked behaviour (a behavioural test). They couldn't decide which was right. Both are, for different things.

Contract tests

Contract tests verify the integration's plumbing:

  • Tool calls are made in the right order.
  • Response shapes match expectations.
  • Error handling kicks in when expected.
  • Side effects happen as expected.

These tests mock the LLM with predictable responses. They're fast, deterministic, and catch plumbing issues.
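
As an illustration, a contract test might look like the sketch below. Everything in it is hypothetical: `FakeLLM`, `run_agent`, the response dictionaries, and the `lookup_order` and `refund` tools stand in for whatever client, agent loop, and tools a real integration uses. The point is only that the model is replaced by scripted responses, so the assertions cover tool-call order, response shape, error handling, and side effects.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLLM:
    """Stand-in for the model client: replays scripted responses in order."""
    scripted: list
    prompts: list = field(default_factory=list)

    def complete(self, prompt: str) -> dict:
        self.prompts.append(prompt)
        return self.scripted.pop(0)


def run_agent(llm, tools: dict, question: str, max_steps: int = 5) -> dict:
    """Hypothetical integration under test: loop until the model answers."""
    calls = []
    prompt = question
    for _ in range(max_steps):
        reply = llm.complete(prompt)
        if reply.get("type") == "answer":
            return {"answer": reply["text"], "tool_calls": calls}
        if reply.get("type") != "tool_call" or reply.get("name") not in tools:
            return {"error": "malformed_response", "tool_calls": calls}
        calls.append(reply["name"])
        # Feed the tool result back to the model as the next prompt.
        prompt = str(tools[reply["name"]](**reply.get("args", {})))
    return {"error": "step_limit", "tool_calls": calls}


def test_tools_called_in_order():
    llm = FakeLLM([
        {"type": "tool_call", "name": "lookup_order", "args": {"order_id": "A1"}},
        {"type": "tool_call", "name": "refund", "args": {"order_id": "A1"}},
        {"type": "answer", "text": "Refund issued."},
    ])
    refunded = []
    tools = {
        "lookup_order": lambda order_id: {"order_id": order_id, "status": "delivered"},
        "refund": lambda order_id: refunded.append(order_id) or "ok",
    }
    result = run_agent(llm, tools, "Refund order A1")
    assert result == {"answer": "Refund issued.", "tool_calls": ["lookup_order", "refund"]}
    assert refunded == ["A1"]  # the side effect happened exactly once


def test_malformed_response_is_handled():
    # The scripted model asks for a tool that does not exist.
    llm = FakeLLM([{"type": "tool_call", "name": "delete_everything"}])
    result = run_agent(llm, {}, "Refund order A1")
    assert result == {"error": "malformed_response", "tool_calls": []}
```

Because the responses are scripted, these tests give the same result on every run and need no network access or API key.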

Behavioural tests

Behavioural tests verify the LLM-powered behaviour:

  • Given a real input, does the system produce a useful output?
  • Does the agent handle edge cases correctly?
  • Does the multi-step flow converge?

These tests call the actual LLM. They're slower and more expensive, and their results vary from run to run.
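
Since real model output rarely matches a fixed string, a behavioural test usually asserts properties of the output instead. The sketch below assumes a hypothetical summarisation feature that returns JSON with a `summary` field; `check_summary`, `run_behavioural_case`, the word limit, and the crude number-grounding check are all illustrative choices, not a prescribed method. The `generate` callable is where the real model client would be plugged in.

```python
import json
from typing import Callable


def check_summary(raw: str, source: str, max_words: int = 60) -> list:
    """Return a list of property violations; an empty list means a pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    summary = data.get("summary", "") if isinstance(data, dict) else ""
    problems = []
    if not summary.strip():
        problems.append("summary is missing or empty")
    if len(summary.split()) > max_words:
        problems.append("summary exceeds the word limit")
    # Crude grounding check (an assumption of this sketch): every number
    # in the summary must also appear somewhere in the source text.
    for token in summary.split():
        digits = token.strip(".,;:()%")
        if digits.isdigit() and digits not in source:
            problems.append(f"number {digits} not found in source")
    return problems


def run_behavioural_case(generate: Callable[[str], str], source: str) -> list:
    """Call the model (via `generate`) on a real input and check the output."""
    return check_summary(generate(source), source)


# In a nightly suite, `generate` would wrap the real model client.
# A stub is used here only so the sketch runs offline.
source_text = "Revenue grew 12% to 340 million in 2025."
stub = lambda text: json.dumps({"summary": "Revenue grew 12% to 340 million."})
assert run_behavioural_case(stub, source_text) == []
```

Property checks like these tolerate harmless rewording while still failing when the output is malformed, too long, or cites a figure the input never contained.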

When each wins

Contract tests:

  • Run on every PR.
  • Catch refactoring regressions.
  • Verify integration logic.
  • Don't catch model-quality regressions.

Behavioural tests:

  • Run on a schedule (nightly) or on significant changes, such as a prompt or model change.
  • Catch model-quality regressions.
  • Verify the system end-to-end.
  • Don't catch every refactoring regression.

A team needs both. Each catches what the other doesn't.
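
One way to get the split described above (contract tests on every PR, behavioural tests on a schedule) is to gate the behavioural suite behind a flag. The sketch below uses the standard-library `unittest` module; the `RUN_BEHAVIOURAL` environment variable and the `make_model_client` factory are assumptions made for the example, not part of any particular CI system or SDK.

```python
import os
import unittest

# Assumption: the nightly job exports RUN_BEHAVIOURAL=1; PR jobs leave it unset.
RUN_BEHAVIOURAL = os.environ.get("RUN_BEHAVIOURAL") == "1"


def make_model_client():
    """Hypothetical factory; a real suite would return its model client here."""
    raise NotImplementedError("wire up the real model client")


class ContractTests(unittest.TestCase):
    """Mocked LLM, no network: cheap enough to run on every PR."""

    def test_response_shape(self):
        reply = {"type": "answer", "text": "ok"}  # canned response
        self.assertEqual(set(reply), {"type", "text"})


@unittest.skipUnless(RUN_BEHAVIOURAL, "behavioural tests run nightly, not per PR")
class BehaviouralTests(unittest.TestCase):
    """Real LLM: slower and costlier, so gated behind the flag."""

    def test_real_model_gives_an_answer(self):
        client = make_model_client()
        self.assertTrue(client.complete("Refund order A1"))
```

Run with `python -m unittest`, a PR job reports the behavioural class as skipped, while the nightly job sets the flag and runs both. Teams using pytest often get the same effect with custom markers.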

Reviewer ritual

PR reviews check both:

  • Were contract tests updated to reflect changes?
  • Were behavioural tests run on significant changes?
  • Are eval set updates needed?

A real test set

A team's setup:

  • 80 contract tests run per PR (mocked LLM).
  • 30 behavioural tests run nightly.
  • Failures in behavioural tests trigger eval-set review.
  • Pattern: contract tests catch refactoring regressions; behavioural tests catch quality drift.

Trade-offs

  • Contract tests are cheap; behavioural tests are expensive.
  • Contract tests are deterministic; behavioural tests have variance.
  • Contract tests miss what behavioural tests catch (and vice versa).

The trade-off is where to invest. Most teams under-invest in behavioural tests; some over-invest in contract tests.

What we won't ship

Test suites with only contract tests. Quality regressions go unnoticed.

Test suites with only behavioural tests. PR cycles slow to a crawl.

Behavioural tests with extreme variance. Test stability matters: once the tests flake, the signal is lost.

Skipping the integration layer entirely. Unit and end-to-end tests alone are not enough.
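
On the variance point above, one common way to keep a behavioural test stable is to sample the model several times and assert a minimum pass rate rather than requiring every single run to pass. The sketch below is a minimal version of that idea; `pass_rate`, `assert_stable`, the sample count, and the 80% threshold are illustrative assumptions that each team would tune for its own feature and budget.

```python
import itertools
from typing import Callable


def pass_rate(generate: Callable[[str], str], check: Callable[[str], bool],
              prompt: str, samples: int = 5) -> float:
    """Fraction of sampled outputs that satisfy the check."""
    passed = sum(1 for _ in range(samples) if check(generate(prompt)))
    return passed / samples


def assert_stable(generate, check, prompt, samples=5, min_rate=0.8):
    """Fail only when the pass rate drops below the agreed threshold."""
    rate = pass_rate(generate, check, prompt, samples)
    assert rate >= min_rate, f"pass rate {rate:.0%} is below the {min_rate:.0%} threshold"
    return rate


# Stub model for the sketch: four good outputs, then one bad one, repeating.
# A real suite would pass a function that calls the model.
outputs = itertools.cycle(["refund issued"] * 4 + ["cannot help"])
rate = assert_stable(lambda prompt: next(outputs),
                     lambda out: "refund" in out,
                     "Refund order A1")
```

A single off sample no longer fails the build, but a sustained drop in quality still does. The cost is more model calls per test, which is one more reason to keep these tests on a schedule rather than on every PR.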

Close

Integration tests for AI features come in two flavours: contract (cheap, deterministic) and behavioural (expensive, real). Both are needed. Run contract tests on every PR; run behavioural tests on a schedule. Together they cover both plumbing and quality.
