Jaypore Labs
Engineering

Evals for retrieval: separating retrieval from synthesis

RAG eval needs two layers: a retrieval-only eval and a synthesis eval. Separating them makes it clear which stage is failing.

Yash Shah · March 18, 2026 · 2 min read

A team's RAG eval was end-to-end. When it failed, the team didn't know whether retrieval was wrong or synthesis was wrong. They'd spend hours digging.

Separating retrieval from synthesis in eval reveals what's failing.

The two-stage eval

Stage 1 — Retrieval eval.

  • Input: query.
  • Output: retrieved documents.
  • Metrics: precision@k, recall@k, MRR (sketched below).
  • Pass criterion: relevant docs in top-k.

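A minimal sketch of the stage-1 metrics, assuming each query carries an annotated set of relevant document IDs and the retriever returns a ranked list of IDs. The function names and signatures are illustrative, not from any particular library:

    def precision_at_k(retrieved_ids, relevant_ids, k):
        # Fraction of the top-k retrieved documents that are relevant.
        top_k = retrieved_ids[:k]
        return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

    def recall_at_k(retrieved_ids, relevant_ids, k):
        # Fraction of the relevant documents that appear in the top k.
        if not relevant_ids:
            return 0.0
        top_k = retrieved_ids[:k]
        return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)

    def reciprocal_rank(retrieved_ids, relevant_ids):
        # 1/rank of the first relevant document; MRR is the mean over all queries.
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                return 1.0 / rank
        return 0.0
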
Stage 2 — Synthesis eval.

  • Input: query + retrieved documents.
  • Output: synthesised answer.
  • Metrics: accuracy, citation correctness, completeness (sketched below).
  • Pass criterion: answer matches expected.

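And a sketch of a stage-2 check. The record fields and the judge_answer callable are assumptions for illustration; judge_answer stands in for however the team's LLM-eval tooling compares an answer against the expected one:

    def citations_correct(cited_ids, retrieved_ids):
        # Every document the answer cites must be in the retrieved set it was given.
        return all(doc_id in retrieved_ids for doc_id in cited_ids)

    def evaluate_synthesis(record, judge_answer):
        # record is assumed to carry: query, answer, expected, cited_ids, retrieved_ids.
        return {
            "answer_ok": judge_answer(record["query"], record["answer"], record["expected"]),
            "citations_ok": citations_correct(record["cited_ids"], record["retrieved_ids"]),
        }
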
Each stage tests one thing. End-to-end eval exercises both but isolates neither.

Tooling

  • Retrieval eval: vector-DB tooling provides precision/recall metrics.
  • Synthesis eval: the team's standard LLM-eval tooling.

The two-stage approach uses two different evaluators.
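
A rough sketch of how the two evaluators compose without sharing logic; run_retrieval_eval and run_synthesis_eval are stand-ins for the vector-DB tooling and the LLM-eval tooling respectively:

    def run_two_stage_eval(retrieval_cases, synthesis_cases,
                           run_retrieval_eval, run_synthesis_eval):
        # Each stage is scored by its own evaluator; the harness only aggregates.
        return {
            "retrieval": run_retrieval_eval(retrieval_cases),   # precision@k, recall@k, MRR
            "synthesis": run_synthesis_eval(synthesis_cases),   # accuracy, citations, completeness
        }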

Reviewer ritual

PR review for RAG changes reports three sets of metrics:

  • Retrieval metrics.
  • Synthesis metrics.
  • Combined-pipeline metrics.

The team can pinpoint regressions to a stage.
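
One way to wire that into CI is a per-stage threshold gate, sketched below; the metric names and numbers are placeholders, not the team's actual thresholds:

    THRESHOLDS = {
        "retrieval": {"recall@5": 0.90, "mrr": 0.75},
        "synthesis": {"accuracy": 0.85, "citation_correctness": 0.95},
        "pipeline": {"accuracy": 0.80},
    }

    def gate(results):
        # results mirrors THRESHOLDS: {stage: {metric: value}} from the eval run.
        failures = []
        for stage, minimums in THRESHOLDS.items():
            for metric, minimum in minimums.items():
                value = results[stage][metric]
                if value < minimum:
                    failures.append(f"{stage}: {metric} = {value:.2f} < {minimum:.2f}")
        return failures  # an empty list means the PR check passes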

A real implementation

The team's RAG eval:

  • 200 query-document pairs annotated for retrieval.
  • 200 query-answer pairs for synthesis.
  • Combined: end-to-end test.

Each stage has its own thresholds. The team's debugging gets faster because the failures are localised.
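
For reference, one plausible shape for the two annotation sets; the field names and values are illustrative only:

    # One retrieval case: the query plus the IDs of the documents that should be retrieved.
    retrieval_case = {
        "query": "how do I rotate an api key",
        "relevant_doc_ids": ["security/key-rotation"],
    }

    # One synthesis case: the query, a fixed known-good context, and the expected answer.
    synthesis_case = {
        "query": "how do I rotate an api key",
        "retrieved_doc_ids": ["security/key-rotation"],
        "expected_answer": "Rotate the key from the settings page.",
    }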

Trade-offs

  • More evals = more maintenance.
  • Cleaner debugging compensates.

For RAG features, the two-stage approach is worth the maintenance cost.

What we won't ship

  • RAG features without retrieval-specific eval.
  • End-to-end-only eval for RAG.
  • Skipping the citation-correctness check in synthesis eval.
  • Synthesis eval that doesn't account for retrieval quality.

Close

Evals for retrieval separate the layers. Retrieval has its own metrics; synthesis has its own. End-to-end combines them. The team's debugging is targeted; quality improvements are measurable.

We build AI-enabled software and help businesses put AI to work. If you're tightening RAG evals, we'd love to hear about it. Get in touch.

Tagged: Evals, RAG, Engineering, Output Testing, Retrieval