A team's RAG eval was end-to-end only. When it failed, they couldn't tell whether retrieval had fetched the wrong documents or synthesis had mangled a good context. They'd spend hours digging.
Evaluating retrieval and synthesis separately shows which stage is failing.
The two-stage eval
Stage 1 — Retrieval eval.
- Input: query.
- Output: retrieved documents.
- Metrics: precision@k, recall@k, MRR (mean reciprocal rank).
- Pass criterion: relevant docs in top-k.
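The stage-1 metrics are small enough to implement directly. A minimal sketch, assuming gold relevance labels are available as a set of doc IDs per query; the data model here is an assumption, not a prescribed format:

```python
from typing import Sequence, Set

def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved doc IDs that are annotated relevant."""
    if k == 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the annotated-relevant docs that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant doc; averaging this over queries gives MRR."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```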
Stage 2 — Synthesis eval.
- Input: query + retrieved documents.
- Output: synthesised answer.
- Metrics: accuracy, citation correctness, completeness.
- Pass criterion: answer matches the expected answer.
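Stage 2 is harder to score mechanically, but citation correctness has a cheap structural check: every ID the answer cites must come from the documents the model was actually given. A sketch, assuming a hypothetical `[doc:ID]` citation marker; the accuracy check is a placeholder exact match that a real suite would replace with an LLM judge or semantic comparison:

```python
import re
from dataclasses import dataclass

@dataclass
class SynthesisScore:
    accurate: bool
    citations_correct: bool

# Assumed citation format; adapt the pattern to whatever the pipeline emits.
CITATION_RE = re.compile(r"\[doc:([\w-]+)\]")

def eval_synthesis(answer: str, expected: str, retrieved_ids: set) -> SynthesisScore:
    # A citation outside the retrieved set is a fabricated source,
    # regardless of whether the prose happens to be right.
    cited = set(CITATION_RE.findall(answer))
    citations_correct = bool(cited) and cited <= retrieved_ids
    # Placeholder accuracy check; swap in an LLM judge or embedding
    # similarity against `expected` for real answers.
    accurate = answer.strip().lower() == expected.strip().lower()
    return SynthesisScore(accurate=accurate, citations_correct=citations_correct)
```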
Each stage tests one thing. An end-to-end eval exercises both but isolates neither.
Tooling
- Retrieval eval: vector-DB tooling provides precision/recall metrics.
- Synthesis eval: the team's standard LLM-eval tooling.
The two-stage approach uses two different evaluators, one per stage; a sketch of how they compose follows.
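A minimal harness shape, reusing the metric functions above. `retriever`, `generator`, and the `case` fields are assumed names, not a prescribed API:

```python
def run_case(case: dict, retriever, generator, k: int = 5) -> dict:
    # Stage 1: query in, documents out, scored against the annotations.
    docs = retriever.search(case["query"], k=k)        # assumed interface
    doc_ids = [d.id for d in docs]
    relevant = set(case["relevant_ids"])
    retrieval = {
        "precision@k": precision_at_k(doc_ids, relevant, k),
        "recall@k": recall_at_k(doc_ids, relevant, k),
        "rr": reciprocal_rank(doc_ids, relevant),
    }
    # Stage 2: query plus the retrieved docs in, answer out.
    answer = generator.answer(case["query"], docs)     # assumed interface
    synthesis = eval_synthesis(answer, case["expected"], set(doc_ids))
    # Recording results per stage is what lets a regression be attributed.
    return {"retrieval": retrieval, "synthesis": synthesis}
```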
Reviewer ritual
PR review for RAG changes reports three sets of metrics:
- Retrieval metrics.
- Synthesis metrics.
- Combined-pipeline metrics.
The team can pinpoint regressions to a stage: if retrieval metrics hold but the combined numbers drop, the regression is in synthesis.
A real implementation
One team's RAG eval suite:
- 200 query-document pairs annotated for retrieval.
- 200 query-answer pairs for synthesis.
- A combined end-to-end test over the full pipeline.
Each stage has its own pass thresholds. Debugging gets faster because failures are localised to a stage.
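What per-stage thresholds can look like as a gate; the numbers below are illustrative, not recommendations:

```python
# Illustrative floors; each team sets its own per-stage numbers.
THRESHOLDS = {
    "retrieval.recall@k": 0.85,
    "synthesis.accuracy": 0.90,
    "synthesis.citations_correct": 0.95,
}

def failing_stages(aggregates: dict) -> list:
    """Names the metrics below their floor, so a red PR check says
    'retrieval regressed' rather than just 'the pipeline regressed'."""
    return [name for name, floor in THRESHOLDS.items()
            if aggregates.get(name, 0.0) < floor]
```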
Trade-offs
- More evals = more maintenance.
- Cleaner debugging compensates.
For RAG features, the two-stage approach is worth the maintenance cost.
What we won't ship
- RAG features without retrieval-specific eval.
- End-to-end-only eval for RAG.
- Skipping the citation-correctness check in synthesis eval.
- Synthesis eval that doesn't account for retrieval quality.
Close
Two-stage RAG evals separate the layers: retrieval has its own metrics, synthesis has its own, and the end-to-end eval combines them. Debugging is targeted; quality improvements are measurable.
Related reading
- Tests for retrieval pipelines — testing context.
- RAG is a public library — surrounding pattern.
- Eval taxonomy — surrounding context.
We build AI-enabled software and help businesses put AI to work. If you're tightening RAG evals, we'd love to hear about it. Get in touch.