Offline eval is the lab; production observability is the field. You log what real users ask, catch failures live, and feed them back into the eval set.
Log one trace per query. Write the row locally first (the app is on-device, and keeping the record on the phone keeps PHI on the phone):

  question     the query text
  retrieved    [{chunk_id, score}, ...]: what retrieval returned
  top_score    best score; low = poor-coverage flag
  answer       the generated answer
  citations    which chunks the answer used
  config       {embed_model, k, chunk_version, llm}: to attribute regressions
  latency_ms
  feedback     thumb +1 / -1; reported
  judge        faithfulness score from a sampled LLM-as-judge

Local-first storage:

  On the phone: the FULL trace (incl. question + answer) in SQLite, next to the index. It is the user's own data and powers their history.
  To the server: only DE-IDENTIFIED telemetry (chunk ids, scores, config, latency, thumb, judge flags). Never the question or answer text.

Logging is a threat surface, so the outbound guard applies to logs too.

How you track low accuracy. Combine signals, since no single one is enough:

  Explicit feedback: thumbs up/down, "report this answer"
  Implicit signals: the user rephrases, abandons, or ignores the citation
  LLM-as-judge: faithfulness check on a SAMPLE of live answers (no labels)
  Retrieval health: top_score under a threshold = low coverage / should-abstain
  Abstention rate: spikes mean missing content

Close the loop: filter the failures (thumbs-down, low-coverage, judge-flagged), review them (expert review for anything clinical), and turn them into new eval rows. Low-coverage questions point to content gaps; a bad-config slice points to a regression to roll back. Production failures are your best labeled data.
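The trace, the PHI split, and the failure filter can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual implementation. The table name `traces`, the function names, the allowlist `TELEMETRY_FIELDS`, and both thresholds are hypothetical. It also assumes that higher scores are better and that feedback is stored as +1 or -1.

```python
import json
import sqlite3

# Hypothetical allowlist: only these fields may leave the device.
# Question and answer text are deliberately absent.
TELEMETRY_FIELDS = ("retrieved", "top_score", "config",
                    "latency_ms", "feedback", "judge")

# Assumed thresholds; tune them against your own retriever and judge.
LOW_COVERAGE_THRESHOLD = 0.35
JUDGE_THRESHOLD = 0.5


def open_store(path=":memory:"):
    """Open the on-device SQLite store that holds the full traces."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS traces "
               "(id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
    return db


def log_trace(db, trace):
    """Write the FULL trace locally, including question and answer."""
    cur = db.execute("INSERT INTO traces (body) VALUES (?)",
                     (json.dumps(trace),))
    db.commit()
    return cur.lastrowid


def to_telemetry(trace):
    """Build the de-identified record by allowlist, never by blocklist."""
    return {k: trace[k] for k in TELEMETRY_FIELDS if k in trace}


def failure_reasons(trace):
    """Return the failure signals this trace trips (empty list = healthy)."""
    reasons = []
    if trace.get("feedback") == -1:
        reasons.append("thumbs_down")
    top = trace.get("top_score")
    if top is not None and top < LOW_COVERAGE_THRESHOLD:
        reasons.append("low_coverage")
    judge = trace.get("judge")
    if judge is not None and judge < JUDGE_THRESHOLD:
        reasons.append("judge_flagged")
    return reasons


def review_queue(db):
    """List (trace id, reasons) for every stored trace that needs review."""
    queue = []
    for row_id, body in db.execute("SELECT id, body FROM traces ORDER BY id"):
        reasons = failure_reasons(json.loads(body))
        if reasons:
            queue.append((row_id, reasons))
    return queue


# Usage: one healthy trace and one failing trace (made-up values).
db = open_store()
log_trace(db, {"question": "example question", "answer": "example answer",
               "retrieved": [{"chunk_id": "c1", "score": 0.82}],
               "top_score": 0.82, "citations": ["c1"],
               "config": {"embed_model": "m", "k": 5,
                          "chunk_version": "v1", "llm": "l"},
               "latency_ms": 410, "feedback": 1, "judge": 0.9})
failing = {"question": "another question", "answer": "another answer",
           "retrieved": [{"chunk_id": "c7", "score": 0.12}],
           "top_score": 0.12, "citations": [],
           "config": {"embed_model": "m", "k": 5,
                      "chunk_version": "v1", "llm": "l"},
           "latency_ms": 530, "feedback": -1, "judge": None}
log_trace(db, failing)
print(review_queue(db))       # traces to review and turn into eval rows
print(to_telemetry(failing))  # the only payload that would go to the server
```

The allowlist is the design choice that matters here: a field added to the trace later stays on the phone unless someone explicitly adds it to `TELEMETRY_FIELDS`, so the default fails closed.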
Tools like Langfuse, LangSmith, Arize Phoenix, and Helicone give you tracing and metrics off the shelf — but the PHI split (what you log, and where) is yours to enforce.