Evaluate retrieval and generation separately.
Retrieval metrics:
  Context recall: did retrieval surface the chunks needed to answer?
  Context precision: were the retrieved chunks actually relevant (vs. noise)?
Generation metrics:
  Faithfulness / groundedness: is every claim in the answer supported by the retrieved context (no hallucination)?
  Answer relevance: does the answer actually address the question?
Measure with a labeled eval set (questions + gold chunks + gold answers) and LLM-as-judge for faithfulness. Frameworks: RAGAS or custom judges. Track these as you change chunking / k / model so you don't regress.
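One way to picture LLM-as-judge faithfulness scoring is as a fraction: of the claims the answer makes, how many does the judge find supported by the retrieved context? The sketch below assumes the claims have already been extracted from the answer. The `Judge` signature and the `keyword_judge` stand-in are hypothetical; a real setup would wrap an LLM call behind the same interface, and the chunk text is made up for illustration.

```python
from typing import Callable, Sequence

# A judge takes (claim, context) and returns True when the context supports the claim.
# In practice this would wrap an LLM call; this signature is an assumption of the sketch.
Judge = Callable[[str, str], bool]


def faithfulness(claims: Sequence[str], context_chunks: Sequence[str], judge: Judge) -> float:
    """Fraction of answer claims that the judge finds supported by the retrieved context."""
    if not claims:
        return 0.0  # convention chosen here: an answer with no claims earns no credit
    context = "\n\n".join(context_chunks)
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


def keyword_judge(claim: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: every word of the claim must appear in the context."""
    ctx = context.lower()
    return all(word in ctx for word in claim.lower().split())


# Illustrative data only: one retrieved chunk, one supported claim, one unsupported claim.
chunks = ["BRCA1 c.68_69del is classified Pathogenic under ACMG criteria."]
claims = ["BRCA1 c.68_69del is Pathogenic", "MTHFR is Pathogenic"]
score = faithfulness(claims, chunks, keyword_judge)
print(score)
```

Keeping the judge behind a plain callable means the scoring logic can be unit-tested with a deterministic stand-in, while the LLM-backed judge is swapped in for real eval runs.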
Worked example — one row of the eval set, scored:
The label (ground truth, written by hand)
  Question:     "Is my BRCA1 variant pathogenic?"
  Gold chunk:   report > BRCA1 (the section that SHOULD be retrieved)
  Gold answer:  "Yes, BRCA1 c.68_69del is classified Pathogenic (ACMG)."
What the system actually did
  Retrieved:    [#1] report > BRCA1
                [#2] report > MTHFR
  Answer:       "Your BRCA1 c.68_69del variant is Pathogenic [#1]."
Scores
  Context recall:     1.0. The gold chunk (BRCA1) was in the retrieved set.
  Context precision:  0.5. Only 1 of 2 retrieved chunks was relevant (MTHFR was noise).
  Faithfulness:       1.0. Every claim in the answer is supported by [#1].
  Answer relevance:   1.0. It answers the question that was asked.
Judge verdict: PASS, but context precision is only 0.5. The fix is a knob change:
lower k or add re-ranking so the irrelevant MTHFR chunk drops out, then re-run the
whole eval set to confirm that the other rows did not regress.
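The two retrieval scores in the worked example can be reproduced with plain set arithmetic. This is a minimal sketch that assumes chunks are identified by their section path and that "relevant" means "in the gold set"; the function names are illustrative, not taken from any framework. Some frameworks compute a rank-aware precision instead, which can give a different number for the same row.

```python
from __future__ import annotations


def context_recall(retrieved: list[str], gold: set[str]) -> float:
    """Share of gold chunks that appear anywhere in the retrieved list."""
    if not gold:
        return 1.0  # nothing was required, so nothing was missed
    return len(gold & set(retrieved)) / len(gold)


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are relevant (set-based, ignores rank)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    return hits / len(retrieved)


# The row from the worked example: one gold chunk, two retrieved chunks.
gold = {"report > BRCA1"}
retrieved = ["report > BRCA1", "report > MTHFR"]

recall = context_recall(retrieved, gold)
precision = context_precision(retrieved, gold)
print(recall, precision)  # 1.0 0.5
```

Dropping the MTHFR chunk (for example with k = 1) would raise precision to 1.0 on this row without touching recall, which is what the suggested knob change is aiming for.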
In words you can say out loud: recall asks "did we fetch the right chunk?", precision asks "did we fetch only relevant chunks?", faithfulness asks "is the answer backed by what we fetched?", and relevance asks "did it answer the question?" A real eval set is dozens to hundreds of rows like this.
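Re-running the eval set after a knob change and checking for regressions could look like the sketch below. It assumes each run produces one dict of the four scores per row; the metric key names, the tolerance value, and all the numbers are made up for illustration. The candidate run imagines a lower k that cleans up precision on the first row but drops a needed chunk on the second.

```python
from statistics import mean

# Hypothetical key names for the four scores tracked per eval row.
METRICS = ("context_recall", "context_precision", "faithfulness", "answer_relevance")


def mean_scores(rows):
    """Average each metric over all rows of one eval run."""
    return {m: mean(row[m] for row in rows) for m in METRICS}


def regressions(baseline, candidate, tolerance=0.02):
    """Metrics whose mean dropped by more than `tolerance`, as (before, after) pairs."""
    base, cand = mean_scores(baseline), mean_scores(candidate)
    return {m: (base[m], cand[m]) for m in METRICS if cand[m] < base[m] - tolerance}


# Illustrative numbers only: two rows scored before and after lowering k.
baseline = [
    {"context_recall": 1.0, "context_precision": 0.5, "faithfulness": 1.0, "answer_relevance": 1.0},
    {"context_recall": 1.0, "context_precision": 1.0, "faithfulness": 1.0, "answer_relevance": 1.0},
]
candidate = [
    {"context_recall": 1.0, "context_precision": 1.0, "faithfulness": 1.0, "answer_relevance": 1.0},
    {"context_recall": 0.5, "context_precision": 1.0, "faithfulness": 1.0, "answer_relevance": 1.0},
]

regressed = regressions(baseline, candidate)
print(regressed)  # {'context_recall': (1.0, 0.75)}
```

Here the change fixes the first row's precision but costs recall on the second, which is exactly the kind of trade-off that only shows up when the whole set is re-scored rather than the single row that prompted the change.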