Redefining RAG Evaluation: A Set-Based Approach (Beyond nDCG)
Most “RAG in production” stories skip the most basic question: If I change my retriever / reranker / K / embedding model, how much better does my system actually get? The default answer is still: run nDCG/MAP/MRR and squint at the numbers. That was fine when your “user” was a human scrolling 10 blue links. In RAG, the “user” is the LLM. It gets a fixed set of passages shoved into a prompt. It does not scroll. Anything past the context cutoff effectively doesn’t exist. So the real retrieval question is: ...
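As a quick illustration of the gap between rank-weighted scores and what the LLM actually sees, here's a minimal sketch. The `ndcg_at_k` and `set_recall_at_k` helpers and the toy relevance lists are illustrative assumptions, not the metric this post builds toward: two retrievers that hand the model the same four passages get different nDCG scores but identical set-level recall.

```python
# Minimal sketch: rank-weighted nDCG vs. a set-based view of the same top-K window.
import math

def dcg_at_k(relevances, k):
    # Standard DCG: graded relevance discounted by log2 of the rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize against the ideal (descending-relevance) ordering.
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

def set_recall_at_k(relevances, k, total_relevant):
    # Set view: order inside the top-k window is irrelevant; only membership counts.
    return sum(1 for rel in relevances[:k] if rel > 0) / total_relevant

# Two retrievers return the same 4 passages to the prompt, in different orders.
# 1 = relevant to the question, 0 = not; the corpus holds 2 relevant passages.
run_a = [1, 1, 0, 0]   # relevant passages ranked first
run_b = [0, 0, 1, 1]   # same set, relevant passages ranked last
K = 4

print(ndcg_at_k(run_a, K), ndcg_at_k(run_b, K))
# ~1.00 vs ~0.57: nDCG says run A is clearly "better"
print(set_recall_at_k(run_a, K, 2), set_recall_at_k(run_b, K, 2))
# 1.0 vs 1.0: the LLM receives exactly the same set of passages either way
```

The point of the toy numbers: once the top-K window is fixed and stuffed into the prompt, the ordering nDCG rewards is invisible to the model, which is why a set-based framing is the more honest lens here.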