A team's prose-generation feature couldn't be tested with exact-match assertions. The model produced different outputs on each run, and all of them were acceptable. Traditional asserts didn't apply. The team needed a new pattern: behavioural assertions.
The shouldness pattern
Behavioural assertions test what the output should be, not what it equals:
- Should contain certain key information.
- Should not exceed a length.
- Should be on-topic.
- Should follow the brand voice.
- Should cite sources.
These are the qualities the team cares about; the exact wording isn't one of them.
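A minimal sketch of the pattern in Python with pytest. The `generate` stub, the keyword, and the citation regex are stand-ins, not the team's real model call or rubric:

```python
import re

def generate(prompt: str) -> str:
    # Placeholder for the real model call; returns a canned
    # example so the test below runs as written.
    return "The Q3 incident was resolved in four hours [1]."

def test_summary_shouldness():
    output = generate("Summarise the Q3 incident report.")

    # Should contain certain key information.
    assert "Q3" in output
    # Should not exceed a length.
    assert len(output.split()) <= 250
    # Should cite sources (here: any [n]-style marker).
    assert re.search(r"\[\d+\]", output)
```

No assert pins the exact wording; every assert pins a quality.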
Implementation
Two strategies:
Programmatic. Regex, length checks, contains-checks, structural checks. Fast and reliable for narrow assertions.
LLM-as-judge. A separate model scores the output against a rubric. Slower and noisier, but it handles fuzzy assertions.
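A sketch of the judge side, assuming the OpenAI Python client. The model name, rubric wording, and PASS/FAIL protocol are illustrative choices, not the team's actual setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

RUBRIC = (
    "Answer PASS or FAIL only. PASS if the text is on-topic "
    "for customer support and matches a friendly, professional "
    "brand voice; FAIL otherwise."
)

def judge(output: str) -> bool:
    # One fuzzy assertion per call keeps verdicts easy to audit.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": output},
        ],
    )
    verdict = resp.choices[0].message.content or ""
    return verdict.strip().upper().startswith("PASS")
```

Pinning temperature to zero reduces the noise; it doesn't remove it.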
Most teams use a mix.
Reviewer ritual
Behavioural assertions get reviewed:
- Did the assertion catch real issues, or did it fire on false positives?
- Are new shouldness dimensions emerging?
- Are old assertions still relevant?
A real test set
A team's customer-email-generation tests:
- Programmatic: greeting present, customer name correct, length under 250 words, no profanity, contains specific keywords for the email type.
- LLM-judge: tone matches brand voice, no factual claims without basis, professionally appropriate.
Together: comprehensive coverage, no exact-match assumption.
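Roughly what that test reads as, reusing the `judge` helper from above. `generate_email`, the keyword table, and the profanity list are hypothetical stand-ins for the team's own:

```python
import re

PROFANITY = re.compile(r"\b(damn|hell)\b", re.IGNORECASE)  # illustrative list
KEYWORDS = {"refund": ["refund", "order"], "welcome": ["welcome", "account"]}

def test_refund_email():
    output = generate_email(customer_name="Ada Lovelace", email_type="refund")

    # Programmatic layer.
    assert output.lstrip().lower().startswith(("hi", "hello", "dear"))  # greeting
    assert "Ada" in output                # customer name correct
    assert len(output.split()) < 250      # length under 250 words
    assert not PROFANITY.search(output)   # no profanity
    for kw in KEYWORDS["refund"]:         # keywords for the email type
        assert kw in output.lower()

    # LLM-judge layer: tone, factual grounding, professionalism.
    assert judge(output)
```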
Trade-offs
Programmatic assertions:
- Fast, deterministic, cheap.
- Limited to properties that code can check mechanically.
LLM-as-judge:
- Handles fuzzy assertions.
- Slower, noisier, costlier.
- Needs calibration (see the sketch below).
The right mix depends on what the team is testing.
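Calibration here means checking the judge against human labels before trusting its verdicts. A minimal sketch, assuming a small hand-labelled set and the `judge` helper from earlier; the 90% threshold is an arbitrary example:

```python
def calibrate(judge, labelled):
    """labelled: list of (output_text, human_pass_bool) pairs."""
    agree = sum(judge(text) == label for text, label in labelled)
    return agree / len(labelled)

# Example: don't trust the judge below, say, 90% agreement.
# rate = calibrate(judge, labelled_examples)
# assert rate >= 0.9
```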
What we won't ship
Exact-match assertions for outputs with legitimate variance.
LLM-as-judge without calibration.
Behavioural assertions that don't catch what they were designed for. Useless tests are worse than no tests.
Skipping the eval-eval review. What does the eval catch? Verify.
Close
Behavioural assertions are the new pattern for testing AI outputs. Programmatic for what's testable cheaply; LLM-as-judge for what isn't. Together, they cover the shouldness without forcing exact-match. The team's tests pass when the output is good, not when it's identical.
Related reading
- Golden-set discipline — surrounding pattern.
- LLM-as-judge: when to trust it — implementation depth.
- The new test pyramid — wider context.
We build AI-enabled software and help businesses put AI to work. If you're tightening behavioural assertions, we'd love to hear about it. Get in touch.