The RAG pipeline end to end — CrashBytes Training

Retrieval-Augmented Generation grounds an LLM's answer in retrieved source text instead of relying on the model's parametric memory — so answers stay current, citable, and scoped to your data.

It runs in two phases: build-time and query-time. Each document is indexed once, and then many questions are answered against that index. (Build-time is sometimes called "offline", which is confusing on mobile: here it means "done ahead of time", not "without a network".)

Analogy: an open-book exam. Build-time is making your study guide before the test — read the book, rip it into one-idea notecards, label each so you can find it by meaning, and organize them in a box. Query-time is taking the test — a question appears, you flip to the few relevant cards, and write your answer citing them.

BUILD-TIME  (once per document — make the study guide)
  1. Ingestion   — load the source documents
  2. Chunking    — split them into retrievable units (the notecards)
  3. Embedding   — turn each chunk into a vector so it's findable by meaning
  4. Indexing    — store the vectors for fast similarity search (the card box)

QUERY-TIME  (once per question — take the test)
  5. Retrieval   — embed the query, find the top-k most similar chunks (optionally re-rank)
  6. Generation  — put the retrieved chunks + the query in the prompt; the LLM answers, ideally with citations

ACROSS BOTH
  7. Evaluation / observability — measure faithfulness and retrieval quality, log everything
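
The stages above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a production design: the "embedding" is a bag-of-words count vector standing in for a learned embedding model, the "index" is a plain list rather than a vector store, and every function name (chunk, embed, build_index, retrieve, build_prompt) is made up for illustration. Stage 6 stops at assembling the prompt, since no particular LLM is assumed.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=12):
    """Stage 2: split text into fixed-size word windows (a simplification;
    real pipelines often split on headings, sentences, or sections)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Stage 3 stand-in: token counts instead of a learned embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents):
    """Stages 1-4 (build-time): ingest, chunk, embed, and store
    (chunk_id, text, vector) rows. The list plays the role of the card box."""
    index = []
    for doc_id, text in documents.items():
        for n, piece in enumerate(chunk(text)):
            index.append((f"{doc_id}#{n}", piece, embed(piece)))
    return index


def retrieve(index, query, k=2):
    """Stage 5 (query-time): embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(chunk_id, text) for chunk_id, text, _ in ranked[:k]]


def build_prompt(query, hits):
    """Stage 6 input: retrieved chunks labeled by id, so the answer can cite them."""
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in hits)
    return f"Answer using only the sources below and cite their ids.\n{context}\nQuestion: {query}"


# Build once, then answer as many questions as needed against the same index.
docs = {
    "a": "Chunking splits documents into retrievable units",
    "b": "Embedding turns each chunk into a vector",
    "c": "Indexing stores the vectors for fast similarity search",
}
index = build_index(docs)
question = "how are vectors stored for search"
print(build_prompt(question, retrieve(index, question, k=1)))
```

Note that build_index runs once while retrieve and build_prompt run per question, which is the build-time / query-time split in miniature.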

Build-time (1–4) happens ahead of time, once per document; query-time (5–6) runs live, on every question; 7 wraps both, so you can tell a retrieval failure (the right chunk never surfaced) from a generation failure (the model mishandled a chunk it was given).
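
One way to make that distinction concrete is to log, for each test question, which chunks were retrieved, and compare them with the chunk known to hold the answer. The sketch below assumes you have a small labeled evaluation set (a "gold" chunk id per question) and some judgment of whether the final answer was correct; the field names and the diagnose and recall_at_k helpers are hypothetical, not part of any library.

```python
def diagnose(retrieved_ids, gold_chunk_id, answer_is_correct):
    """Classify one logged question (assumes a known gold chunk id)."""
    if answer_is_correct:
        return "ok"
    if gold_chunk_id not in retrieved_ids:
        return "retrieval_failure"   # the right chunk never surfaced
    return "generation_failure"      # the chunk was in the prompt, the answer was still wrong


def recall_at_k(logs):
    """Fraction of questions whose gold chunk appeared among the retrieved chunks."""
    if not logs:
        return 0.0
    hits = sum(1 for row in logs if row["gold"] in row["retrieved"])
    return hits / len(logs)


# Hypothetical log rows: one per evaluated question.
logs = [
    {"retrieved": ["a#0", "c#0"], "gold": "c#0", "correct": True},
    {"retrieved": ["a#0", "b#0"], "gold": "c#0", "correct": False},
    {"retrieved": ["c#0", "b#0"], "gold": "c#0", "correct": False},
]
labels = [diagnose(r["retrieved"], r["gold"], r["correct"]) for r in logs]
print(labels, recall_at_k(logs))
```

A retrieval failure points at chunking, embedding, or k; a generation failure points at the prompt or the model, so the two call for different fixes.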

Build-time vs query-time is about WHEN a stage runs, not WHERE. Where each stage runs, on the device or on a server, is a separate decision (see On-Device & Mobile). In the genomics app, build-time and retrieval both run on the phone so the raw genome never leaves it; only generation might call out to a server, and then it sends just the question plus the retrieved snippets.
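
The when/where split can be written down as two independent fields per stage. The sketch below is illustrative only: the Stage class, the PLAN list, and the outbound_payload helper are hypothetical names, and placing generation on a server is just one possible choice, mirroring the "only generation might call out" case described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    when: str   # "build" or "query"
    where: str  # "device" or "server"


# Assumed placement: everything on the phone except generation.
PLAN = [
    Stage("ingestion", "build", "device"),
    Stage("chunking", "build", "device"),
    Stage("embedding", "build", "device"),
    Stage("indexing", "build", "device"),
    Stage("retrieval", "query", "device"),
    Stage("generation", "query", "server"),
]


def off_device(plan):
    """Names of the stages that run anywhere other than the device."""
    return [stage.name for stage in plan if stage.where != "device"]


def outbound_payload(question, snippets):
    """The only data handed to a remote generator: the question and the
    retrieved snippets. The source document itself is never included."""
    return {"question": question, "snippets": list(snippets)}


print(off_device(PLAN))
print(outbound_payload("example question", ["retrieved snippet 1", "retrieved snippet 2"]))
```

Changing a stage's where field does not change its when field, which is the point: moving generation on-device would leave the build-time / query-time split untouched.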

Genomics example: ingestion loads the VCF-derived clinical report, retrieval finds the relevant variant/annotation sections, and generation answers the patient's question citing those sections.