
Behavioural assertions: testing 'should-ness'

Some behaviours can't be tested with exact-match. Behavioural assertions are the discipline.

Yash Shah · March 18, 2026 · 2 min read

A team's prose-generation feature couldn't be tested with exact-match assertions. The model produced a different output on each run, and all of them were acceptable. Traditional equality asserts didn't apply. The team needed a new pattern: behavioural assertions.

The should-ness pattern

Behavioural assertions test what the output should be, not what it equals:

  • Should contain certain key information.
  • Should not exceed a length limit.
  • Should be on-topic.
  • Should follow the brand voice.
  • Should cite sources.

These are the qualities the team cares about; the exact wording isn't one of them.
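One way to make that concrete is to write the rubric down as data before writing any checks. A minimal sketch; the entries and the RUBRIC name are illustrative, not the team's real code:

    # Hypothetical rubric for the prose-generation feature. Each entry names
    # a quality the output *should* have and how the team plans to check it.
    RUBRIC = [
        {"should": "contain the key facts from the brief", "via": "programmatic"},
        {"should": "stay under the length budget",         "via": "programmatic"},
        {"should": "stay on-topic",                        "via": "llm-judge"},
        {"should": "follow the brand voice",               "via": "llm-judge"},
        {"should": "cite its sources",                     "via": "programmatic"},
    ]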

Implementation

Two strategies:

Programmatic. Regex, length checks, contains-checks, structural checks. Fast and reliable for narrow assertions.
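A minimal sketch of the programmatic layer in Python. The word budget, the required-facts list, and the citation pattern are assumptions, not the team's actual rules:

    import re

    MAX_WORDS = 250  # assumed length budget; tune per feature

    def programmatic_failures(output: str, required_facts: list[str]) -> list[str]:
        """Run the narrow, deterministic checks. Returns failure messages."""
        failures = []
        if len(output.split()) > MAX_WORDS:
            failures.append(f"should not exceed {MAX_WORDS} words")
        for fact in required_facts:
            if fact.lower() not in output.lower():
                failures.append(f"should contain {fact!r}")
        # Assumed citation convention: bracketed refs like [1] or bare URLs.
        if not re.search(r"\[\d+\]|https?://", output):
            failures.append("should cite at least one source")
        return failures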

LLM-as-judge. A separate model checks the output against the rubric. Slower and noisier, but handles fuzzy assertions.
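A sketch of the judge side. `call_model` is a placeholder for whatever model client the team already uses, and the PASS/FAIL protocol is illustrative:

    JUDGE_PROMPT = """You are a strict reviewer.
    Does the text below satisfy this rubric item: "{rubric}"?
    Answer with exactly PASS or FAIL.

    Text:
    {output}"""

    def judge(output: str, rubric: str, call_model) -> bool:
        """Grade one fuzzy assertion with a separate model."""
        verdict = call_model(JUDGE_PROMPT.format(rubric=rubric, output=output))
        return verdict.strip().upper().startswith("PASS")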

Most teams use a mix.

Reviewer ritual

Behavioural assertions get reviewed:

  • Did the assertion catch real issues, or raise false positives?
  • Are new should-ness dimensions emerging?
  • Are old assertions still relevant?

A real test set

A team's customer-email-generation tests:

  • Programmatic: greeting present, customer name correct, length under 250 words, no profanity, contains specific keywords for the email type.
  • LLM-judge: tone matches brand voice, no factual claims without basis, professionally appropriate.

Together: comprehensive coverage, no exact-match assumption.
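Wired together, one test might look like the pytest sketch below. `generate_email` stands in for the team's real function; `judge` and `call_model` are the placeholders from the implementation section:

    def test_order_confirmation_email():
        output = generate_email(kind="order_confirmation", customer="Priya")

        # Programmatic layer: cheap and deterministic.
        assert output.split()[0].rstrip(",") in {"Hi", "Hello", "Dear"}, \
            "should open with a greeting"
        assert "Priya" in output, "should address the customer by name"
        assert len(output.split()) < 250, "should stay under 250 words"
        assert "order" in output.lower(), "should mention the order"

        # Judge layer: fuzzy and rubric-driven.
        assert judge(output, "tone matches the brand voice", call_model)
        assert judge(output, "makes no factual claims without basis", call_model)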

Trade-offs

Programmatic assertions:

  • Fast, deterministic, cheap.
  • Limited to what regexes and simple structural checks can express.

LLM-as-judge:

  • Handles fuzzy assertions.
  • Slower, noisier, costlier.
  • Needs calibration; see the sketch after this list.

The right mix depends on what the team is testing.
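Calibration can start as a simple agreement check against a small human-labelled set. A sketch, assuming each example is a tuple of (output, rubric item, human verdict):

    def judge_agreement(labelled: list[tuple[str, str, bool]], call_model) -> float:
        """Fraction of human-labelled examples where the judge agrees.
        If this is low, the judge's PASS/FAIL can't be trusted yet."""
        hits = sum(judge(output, rubric, call_model) == human
                   for output, rubric, human in labelled)
        return hits / len(labelled)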

What we won't ship

Exact-match assertions for outputs with legitimate variance.

LLM-as-judge without calibration.

Behavioural assertions that don't catch what they were designed for. Useless tests are worse than no tests.

Skipping the eval-eval review: what does the eval actually catch? Verify.

Close

Behavioural assertions are the new pattern for testing AI outputs. Programmatic for what's testable cheaply; LLM-as-judge for what isn't. Together, they cover the should-ness without forcing exact-match. The team's tests pass when the output is good, not when it's identical.

We build AI-enabled software and help businesses put AI to work. If you're tightening behavioural assertions, we'd love to hear about it. Get in touch.

Tagged: Testing, AI Engineering, Engineering, Testing for AI, Assertions