Re-ranking

A first-stage retriever optimizes for recall cheaply: its job is to get the right chunk somewhere into the top 50. A re-ranker (usually a cross-encoder that reads the query and the chunk together) then re-scores those candidates for precision, pushing the best ones into the top 5 that actually go into the prompt.
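The two-stage flow can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `bow_cosine` and `phrase_overlap` are hypothetical toy scorers standing in for a real embedding model and a real cross-encoder, and the name `retrieve_then_rerank` and its parameters are assumptions made for this example.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List

# A scorer takes (query, chunk) and returns a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def bow_cosine(query: str, chunk: str) -> float:
    """Toy first-stage scorer (stand-in for embeddings): bag-of-words cosine."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    dot = sum(q[t] * c[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


def phrase_overlap(query: str, chunk: str) -> float:
    """Toy re-rank scorer (stand-in for a cross-encoder): fraction of adjacent
    query word pairs that also appear, in the same order, in the chunk."""
    q = query.lower().split()
    c = chunk.lower().split()
    pairs = list(zip(q, q[1:]))
    if not pairs:
        return 0.0
    chunk_pairs = set(zip(c, c[1:]))
    return sum(p in chunk_pairs for p in pairs) / len(pairs)


def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    k_retrieve: int = 50,
    k_final: int = 5,
) -> List[str]:
    # Stage 1: cheap scoring over every chunk, keep a generous shortlist (recall).
    shortlist = sorted(chunks, key=lambda ch: first_stage(query, ch), reverse=True)[:k_retrieve]
    # Stage 2: more careful scoring on the shortlist only (precision).
    reranked = sorted(shortlist, key=lambda ch: reranker(query, ch), reverse=True)
    return reranked[:k_final]


if __name__ == "__main__":
    chunks = [
        "strategy invalidation cache",
        "a cache invalidation strategy for mobile clients",
        "unrelated notes about battery life",
    ]
    top = retrieve_then_rerank(
        "cache invalidation strategy", chunks, bow_cosine, phrase_overlap,
        k_retrieve=2, k_final=1,
    )
    print(top)
```

In this toy data the first stage ranks the scrambled chunk highest because it only counts shared words, while the second stage, which looks at the query and chunk jointly, promotes the chunk that contains the actual phrase. A real cross-encoder does the same kind of correction with far richer signals.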

Re-ranking improves answer quality but adds latency and compute. On mobile, you either use a small re-ranker, run re-ranking server-side, or skip it and rely on good hybrid retrieval.

Mental model:
Bi-encoder (embeddings): recall at scale
Cross-encoder (re-ranker): precision on a shortlist
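One way to see why the split falls this way is to count model forward passes per query. The sketch below assumes chunk embeddings are precomputed offline, so a bi-encoder only encodes the query at search time, whereas a cross-encoder needs one pass per (query, chunk) pair. The function name and the corpus size in the usage line are hypothetical, and the cost of the vector similarity search itself is ignored.

```python
def model_calls_per_query(num_chunks: int, shortlist: int) -> dict:
    """Rough count of model forward passes needed to answer one query."""
    return {
        # Encode the query once; chunk embeddings were computed ahead of time.
        "bi_encoder_only": 1,
        # Every (query, chunk) pair must be read together by the model.
        "cross_encoder_only": num_chunks,
        # One query encoding, then one cross-encoder pass per shortlisted chunk.
        "two_stage": 1 + min(shortlist, num_chunks),
    }


if __name__ == "__main__":
    # Hypothetical corpus of 100,000 chunks with a shortlist of 50.
    print(model_calls_per_query(100_000, 50))
```

The cross-encoder cost grows with the corpus, while the two-stage cost is capped by the shortlist size, which is why the expensive model is reserved for the shortlist.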