Engineering

Eval-driven development

Build the eval before the prompt. The discipline that prevents over-fitting.

Yash ShahMarch 3, 20263 min read

The TDD analogue for AI: write the eval before the prompt. The eval defines what success looks like. The prompt is engineered to pass the eval. The prompt that passes the eval ships.

This is eval-driven development (EDD). It feels backwards. It's the right way.

EDD vs. TDD

TDD: write the test, write the code that passes, refactor.

EDD: write the eval, write the prompt that passes, iterate.

Same discipline, different artifacts. The eval describes what the prompt should do; the prompt is engineered to that description.

Building tests before prompts

The workflow:

Specify what the feature should do.
Build the eval set covering happy path, edge cases, adversarial.
Write the prompt.
Run the eval. Score it.
Iterate the prompt until eval passes.
Ship.

Step 2 is the discipline. Most teams skip it (write the prompt first, build the eval after). EDD inverts this.

Reviewer ritual

PR for EDD work:

Eval set is in the PR.
Eval pass rate is in the PR.
Prompt is in the PR.

Reviewer evaluates: is the eval covering what matters? does the prompt pass? do they together describe a working feature?

A real workflow

A team building a categorisation feature:

Day 1: PM and engineer specify behaviour. 50 eval cases authored.
Day 2: Engineer writes prompt. Eval at 78%.
Day 3: Iteration. Eval at 89%.
Day 4: More iteration + prompt-engineering. Eval at 96%.
Day 5: Edge case work. Eval at 98%. Ships.

The eval is the spec. The prompt converges to it.

What this prevents

EDD prevents:

Prompts over-fit to whatever the engineer was thinking.
Features shipped without a clear definition of working.
Future-engineer confusion about what the feature is supposed to do.

What we won't ship

Features without eval sets.

Eval sets written after the prompt.

Eval sets that just describe what the prompt currently does.

Pass rates below threshold.

Close

Eval-driven development is the discipline of writing the eval first. The eval defines the feature. The prompt converges. The team ships when the eval is clean. Skip EDD and you ship features whose definition is whatever the prompt happens to do.

Eval-driven development

EDD vs. TDD

Building tests before prompts

Reviewer ritual

A real workflow

What this prevents

What we won't ship

Close

Related reading

Determinism harnesses for non-deterministic systems

Multi-agent orchestration: from kitchen brigade to opera

Retry strategies that don't compound errors