Redefining RAG Evaluation: A Set-Based Approach (Beyond nDCG)

Most “RAG in production” stories skip the most basic question: if I change my retriever / reranker / K / embedding model, how much better does my system actually get? The default answer is still: run nDCG/MAP/MRR and squint at the numbers. That was fine when your “user” was a human scrolling 10 blue links. In RAG, the “user” is the LLM. It gets a fixed set of passages shoved into a prompt. It does not scroll. Anything past the context cutoff effectively doesn’t exist. So the real retrieval question is: ...

November 19, 2025 · 13 min · Etienne D.
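
For a sense of what “set-based” means in practice, here is a minimal sketch, assuming gold evidence is stored as passage IDs per question and that `retrieve_top_k` is a stand-in for your own retriever (neither is the post’s actual code):

```python
# Set-based evaluation sketch: instead of rank-weighted scores (nDCG/MAP/MRR),
# ask whether the fixed top-K set handed to the LLM contains the needed evidence.
from typing import Callable


def evidence_coverage(
    questions: dict[str, set[str]],                   # question -> required passage IDs
    retrieve_top_k: Callable[[str, int], list[str]],  # hypothetical retriever stand-in
    k: int = 5,
) -> dict[str, float]:
    full_hits, coverage = 0, 0.0
    for question, gold_ids in questions.items():
        retrieved = set(retrieve_top_k(question, k))
        covered = gold_ids & retrieved
        coverage += len(covered) / len(gold_ids)  # per-question evidence coverage
        full_hits += int(covered == gold_ids)     # all evidence made it into the set
    n = len(questions)
    return {
        "full_evidence@k": full_hits / n,     # share of questions fully coverable from top-K
        "evidence_coverage@k": coverage / n,  # average fraction of evidence in top-K
    }
```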

How Proper Names Behave in Text Embedding Space

If dense retrieval is “semantic”, why does it work on proper names? When I was building a RAG system over a client’s scientific papers, I noticed something odd. The dense retriever was… kind of good at proper names. In this domain, that’s not supposed to be easy. You often get queries like “Which works by [AUTHOR] on [TOPIC]?” and, in practice, dense retrieval alone wasn’t enough; hybrid (BM25 + dense) clearly did better. But even before I added BM25, the dense model already showed a real preference for the right author over impostors. Not perfect, but definitely more than random. ...

November 13, 2025 · 11 min · Etienne D.
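
The excerpt doesn’t spell out how the hybrid combination was done; as a generic illustration, here is reciprocal rank fusion over a BM25 ranking and a dense ranking, with `bm25_search` and `dense_search` as hypothetical placeholders for your own retrievers:

```python
# Reciprocal rank fusion: merge ranked lists of doc IDs into a single ranking.
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # k dampens the weight of top ranks
    return sorted(scores, key=scores.get, reverse=True)


# bm25_ids = bm25_search("works by [AUTHOR] on [TOPIC]", top_n=50)    # hypothetical
# dense_ids = dense_search("works by [AUTHOR] on [TOPIC]", top_n=50)  # hypothetical
# fused = reciprocal_rank_fusion([bm25_ids, dense_ids])
```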

The Polite Saboteur: How “Hello” Messes With Your RAG (A Little)

TL;DR I poked modern embedding models (text-embedding-3-large / -small) with real conversational fluff and watched the cosine similarity to the clean query drop; the drop was bigger in French than in English, and bigger in the small model than the large. This lines up with what RAG benchmarks like MTRAG and CRAG already scream: multi-turn chatter and retrieval noise absolutely wreck downstream QA if you’re not careful. You can fight this in three ways: ...

November 2, 2025 · 8 min · Etienne Dallaire
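
To rerun the basic measurement on your own queries, a small probe along these lines is enough. It uses the OpenAI embeddings API (expects OPENAI_API_KEY in the environment); the example sentences are illustrative, not the post’s test data:

```python
# Compare a clean query with a "polite" variant in embedding space.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])


clean = "What is the refund policy for damaged items?"
polite = ("Hello! Hope you're doing well. Quick question: "
          "what is the refund policy for damaged items? Thanks a lot!")

vecs = embed([clean, polite])
cos = float(vecs[0] @ vecs[1] / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1])))
print(f"cosine(clean, polite) = {cos:.3f}")  # expect a value noticeably below 1.0
```

Swapping in text-embedding-3-large, or a French version of both strings, reproduces the model-size and language comparison described in the excerpt.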