Chunking decides what a "retrievable unit" is.
Common strategies:

- Fixed-size token windows with overlap: simple, but cuts mid-sentence and mid-table
- Recursive / semantic splitting: break on structure (paragraphs, headings)
- Document-structure-aware chunking: respect the document's own units

For structured genomic data, chunk on natural records (one variant plus its annotations as a unit, one report section as a unit), not arbitrary 512-token windows. A naive window can split a variant from its pathogenicity classification and destroy meaning.
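
A minimal sketch of the difference between a fixed-size window and record-based chunking. Everything here is an assumption made for illustration: the `Variant` record shape, the helper names, and the placeholder values (`GENE_A`, `c.100A>G`) are hypothetical, and whitespace-separated words stand in for real model tokens.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record shape; real annotation schemas will differ.
    gene: str
    hgvs: str
    classification: str
    evidence: str


def fixed_windows(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return windows


def variant_to_chunk(v: Variant) -> str:
    """Render one variant plus its annotations as a single retrievable unit."""
    return (
        f"Gene: {v.gene}\n"
        f"Variant: {v.hgvs}\n"
        f"Classification: {v.classification}\n"
        f"Evidence: {v.evidence}"
    )


def chunk_by_record(variants: list[Variant]) -> list[str]:
    """One chunk per record, regardless of length."""
    return [variant_to_chunk(v) for v in variants]


# Placeholder values, not real data.
example = Variant("GENE_A", "c.100A>G", "likely_pathogenic", "placeholder evidence text")

record_chunks = chunk_by_record([example])
naive_chunks = fixed_windows(variant_to_chunk(example).split(), size=4, overlap=0)
```

With this toy input, `record_chunks` keeps the variant and its classification together, while the 4-word windows in `naive_chunks` place them in different chunks. A larger window or some overlap would hide the problem on this record but not remove it in general, which is the argument for chunking on record boundaries.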
Two rules of thumb:

- Keep chunks self-contained: carry the heading/gene context into the chunk; add overlap only when needed
- Chunk size is a recall/precision tradeoff: bigger chunks improve recall but dilute precision and burn context window
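
One way to make chunks self-contained is to prefix each one with the context it would otherwise lose. The sketch below assumes a report whose section headings are lines starting with `## `; that convention, the function name `chunk_sections`, and the bracketed prefix format are all hypothetical choices for illustration.

```python
def chunk_sections(report: str, context: str) -> list[str]:
    """Split a report on heading lines and prefix every chunk with shared context.

    Assumption: a heading is any line that starts with "## ".
    """
    # Text before the first heading is collected under a "(preamble)" label.
    sections: list[tuple[str, list[str]]] = [("(preamble)", [])]
    for line in report.splitlines():
        if line.startswith("## "):
            sections.append((line[3:].strip(), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip sections with no text
            chunks.append(f"[{context} > {heading}]\n{body}")
    return chunks


# Placeholder report text, not real data.
report = "## Findings\nVariant detected.\n\n## Interpretation\nLikely relevant.\n"
chunks = chunk_sections(report, context="Gene: GENE_A")
```

Each chunk now names its gene and section even when retrieved alone, so overlap between neighbouring chunks is needed less often. The cost is a few repeated tokens per chunk, which is usually small next to the body text.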