Measuring real impact, not demo-ware

A demo that answers three cherry-picked questions proves nothing. Define an offline eval set (questions + gold chunks + gold answers) and track context recall/precision and faithfulness as you change the system.
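A minimal sketch of the retrieval half of such an eval set, assuming chunks are identified by string IDs and the retriever is a function from question to a list of chunk IDs. The names `EvalCase`, `context_recall`, `context_precision`, and `evaluate` are illustrative, not from any particular library. Faithfulness is left out because it usually needs a judge (human or model) comparing the generated answer against the retrieved text.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One offline eval item (hypothetical schema)."""
    question: str
    gold_chunk_ids: frozenset[str]  # chunks a correct answer must draw on
    gold_answer: str                # kept for answer-level checks such as faithfulness


def context_recall(retrieved: Sequence[str], gold: frozenset[str]) -> float:
    """Fraction of gold chunks that the retriever returned."""
    if not gold:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & gold) / len(gold)


def context_precision(retrieved: Sequence[str], gold: frozenset[str]) -> float:
    """Fraction of distinct retrieved chunks that are gold."""
    unique = set(retrieved)
    if not unique:
        return 0.0
    return len(unique & gold) / len(unique)


def evaluate(
    cases: Sequence[EvalCase],
    retrieve: Callable[[str], Sequence[str]],
) -> dict[str, float]:
    """Average both metrics over the eval set (raises on an empty set)."""
    recalls, precisions = [], []
    for case in cases:
        retrieved = retrieve(case.question)
        recalls.append(context_recall(retrieved, case.gold_chunk_ids))
        precisions.append(context_precision(retrieved, case.gold_chunk_ids))
    return {
        "context_recall": mean(recalls),
        "context_precision": mean(precisions),
    }
```

Running `evaluate` on the same cases before and after each change to chunking, embeddings, or reranking gives a number to compare instead of an impression from a handful of demo questions.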

In production, track privacy-preserving signals: abstain rate, citation-click/expand rate, thumbs-up/down, and task completion. Tie the feature to a user outcome ("users understand their report without a 2-week genetic-counselor wait") and measure that. Be ready to say "here's how I'd know it's working, and here's how I'd know it regressed."
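As a rough illustration, the production signals can be aggregated from an event log that records only event types, never question or answer text. The event names below (`answer_shown`, `abstained`, `citation_click`, `thumbs_up`, `thumbs_down`, `task_completed`) and the function `summarize_signals` are assumptions made for this sketch, not an existing schema.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Mapping, Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when there is no data."""
    return numerator / denominator if denominator else None


def summarize_signals(events: Iterable[Mapping[str, str]]) -> dict[str, Optional[float]]:
    """Aggregate content-free events into the signals listed above.

    Each event is a mapping with a "type" key (hypothetical schema).
    No user text is read or stored.
    """
    counts = Counter(event["type"] for event in events)
    answered = counts["answer_shown"]
    abstained = counts["abstained"]
    requests = answered + abstained
    votes = counts["thumbs_up"] + counts["thumbs_down"]
    return {
        # Share of requests where the system declined to answer.
        "abstain_rate": _ratio(abstained, requests),
        # Citation clicks or expands per answer shown (can exceed 1.0).
        "citation_clicks_per_answer": _ratio(counts["citation_click"], answered),
        # Share of explicit votes that were positive.
        "thumbs_up_share": _ratio(counts["thumbs_up"], votes),
        # Share of requests that ended in the user finishing the task.
        "task_completion_rate": _ratio(counts["task_completed"], requests),
    }
```

Comparing these numbers week over week, or between a release and its predecessor, is one way to back up both halves of the claim: what "working" looks like and what a regression looks like.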