What a genome RAG system should retrieve over

Defensible answer: retrieve over the annotated variants and the clinical report, plus reference knowledge (gene functions, condition descriptions, guideline text). Do not retrieve over raw FASTQ/BAM files: they are huge, sensitive, and contain no text that answers a question directly.

Practically, you chunk:

- per-variant annotation records
- report sections (summary, findings, recommendations)
- curated reference docs about genes/conditions

This keeps the index small (thousands of chunks), keeps the sensitive raw sequence out of the LLM path, and produces precise, citable answers.
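
The chunking scheme above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a definitive implementation: the `Chunk` dataclass, the function names, the input field names (`gene`, `hgvs`, `classification`, `title`, `text`), the `RAW_SUFFIXES` list, and the sample values are all hypothetical placeholders, and a real pipeline would take its fields from whatever annotation tool and report format it actually uses.

```python
from dataclasses import dataclass, field

# Suffixes treated as raw sequence data (illustrative assumption, not exhaustive).
RAW_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".bam", ".cram")


@dataclass
class Chunk:
    chunk_id: str  # stable identifier an answer can cite
    kind: str      # "variant", "report", or "reference"
    text: str      # the text that gets embedded and retrieved
    metadata: dict = field(default_factory=dict)


def variant_chunks(variants):
    """One chunk per annotated variant record."""
    chunks = []
    for v in variants:
        text = (f"Gene {v['gene']}, variant {v['hgvs']}, "
                f"classification: {v['classification']}.")
        chunk_id = f"variant:{v['gene']}:{v['hgvs']}"
        chunks.append(Chunk(chunk_id, "variant", text, {"gene": v["gene"]}))
    return chunks


def report_chunks(report):
    """One chunk per non-empty report section (summary, findings, ...)."""
    return [
        Chunk(f"report:{name}", "report", body, {"section": name})
        for name, body in report.items()
        if body.strip()
    ]


def reference_chunks(docs):
    """One chunk per curated reference document about a gene or condition."""
    return [
        Chunk(f"reference:{d['title']}", "reference", d["text"], {"title": d["title"]})
        for d in docs
    ]


def build_corpus(variants, report, docs, source_paths=()):
    """Combine the three chunk types; refuse any raw sequence file as a source."""
    for path in source_paths:
        if path.lower().endswith(RAW_SUFFIXES):
            raise ValueError(f"raw sequence file must not be indexed: {path}")
    return variant_chunks(variants) + report_chunks(report) + reference_chunks(docs)


# Made-up placeholder inputs, for illustration only.
corpus = build_corpus(
    variants=[{"gene": "GENE1", "hgvs": "c.100A>G", "classification": "likely benign"}],
    report={"summary": "Placeholder summary.", "findings": "Placeholder findings.",
            "recommendations": "Placeholder recommendations."},
    docs=[{"title": "GENE1 overview", "text": "Placeholder description of GENE1."}],
)
for chunk in corpus:
    print(chunk.chunk_id)
```

Giving every chunk a stable `chunk_id` is what makes answers citable: the LLM can be asked to quote the identifiers of the chunks it used. The suffix check is only a coarse guard on the indexing path and does not replace access controls on the raw data itself.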