Redefining RAG Evaluation: A Set-Based Approach (Beyond nDCG)

Moving from ranked lists to prompt sets: a better way to benchmark RAG.

TL;DR

  • The Problem: Traditional metrics like nDCG assume a human scrolling a ranked list. In RAG, the “user” is an LLM consuming a fixed, unordered set of passages.
  • The Solution: A new metric, RA-nWG@K (Rarity-Aware Normalized Weighted Gain). It answers: “Of the best possible evidence in the entire corpus, how much did we successfully pack into the K prompt slots?”
  • The Mechanics: It weights passages by utility (how helpful they are) and rarity (missing a unique fact hurts more than missing redundant ones), ignoring internal ranking order.
  • The Diagnostics: New KPIs PROC and %PROC separate retrieval failures (the answer wasn’t in the candidate pool) from reranking failures (the answer was there, but we didn’t select it).

Most “RAG in production” stories skip the most basic question:

If I change my retriever / reranker / K / embedding model, how much did my system actually get better?

The default answer is still: run nDCG/MAP/MRR and squint at the numbers. That was fine when your “user” was a human scrolling 10 blue links.

In RAG, the “user” is the LLM.

  • It gets a fixed set of passages shoved into a prompt.
  • It does not scroll.
  • Anything past the context cutoff effectively doesn’t exist.

So the real retrieval question is:

Given K slots in the prompt, how much of this query’s best available evidence did we actually pack into those slots?

Rank metrics don’t answer that. They answer: “how pretty is your DCG curve if a human eyeball walks the list?” This post describes a family of metrics designed for the set-consumption regime instead:

  • RA-nWG@K – Rarity-Aware Normalized Weighted Gain: “How good is the actual top-K set we fed the LLM compared to an omniscient oracle on this corpus?”
  • PROC@K – Pool-Restricted Oracle Ceiling: “How good could we have done if we picked the best K-subset from this retrieval pool?”
  • %PROC@K – “Given that ceiling, how much did our actual top-K realize?” (reranker/selection efficiency).

The core assumptions:

  1. The LLM consumes a set of passages.
  2. Order inside that set is a second-order effect.
  3. High-utility evidence is scarce, and missing it hurts more than polishing the rank curve.

In production RAG, you pick a budget K (effectively, a token budget):

  • The retriever gives you a candidate pool (e.g., Top-50 or Top-100).
  • You choose K passages from that pool to stuff into the prompt.
  • The LLM sees one big context, not a ranked list UI.

Inside that K, the first question isn’t “how smooth is the DCG curve?”, it’s:

Under budget K, did we include the rare, decisive passages and enough solid support?

Recent work like Trappolini et al. (Redefining Retrieval Evaluation in the Era of LLMs, 2025) makes the same move: evaluate the utility of the set the LLM actually consumes, rather than the trajectory of a human scrolling the list. They also show that once you condition on which passages are in the context, removing positional discounting has little impact on answer accuracy.

My approach here:

  • Core metric: order-agnostic set quality, normalized per query against a label set.
  • Retrieval vs reranking: teased apart with PROC and %PROC.
  • Distractors and harm: tracked separately, because their impact is model-dependent and moving fast.

2. A metric that matches that model: RA-nWG@K

RA-nWG@K answers, per query:

“How well did our actual top-K set do versus the best K-set an omniscient system could have built from the human-labeled corpus?”

Formally:

RA-nWG@K = (utility of the actual top-K you fed the LLM) ÷ (utility of the global oracle top-K from the labeled corpus)

Interpretation:

  • 1.0 – “For this query, our K-set is as good as the omniscient best-possible K-set.”
  • 0.5 – “We captured half the achievable utility.”

It’s built in three steps.

2.1 Grade passages by utility, not just “relevance”

For each query, you need graded labels on passages:

  • 5 – decisive / clearly contains the key elements needed to answer
  • 4 – highly relevant, substantial information
  • 3 – partially useful / related context
  • 2 / 1 – weak or junk; near-irrelevant, noise, or hard distractors

Think: “If the model saw only this passage, how intrinsically helpful would it be for this query?”

Map these to base utilities:

  • b₅ = 1.0
  • b₄ = 0.5
  • b₃ = 0.1
  • b₂ = b₁ = 0

No ranks yet. Just intrinsic usefulness.

2.2 Make rarity explicit (without letting 3s replace 5s)

Not all grades have the same opportunity cost:

  • If grade-5 passages are rare, missing one is catastrophic.
  • If grade-3 passages are abundant, retrieving ten of them is not a substitute for missing the only 5.

RA-nWG makes this explicit by weighting grades based on how prevalent they are in the corpus for that query:

  1. For each grade g, compute its prevalence p_g in the labeled corpus for that query.

  2. Define a “rarity score” r_g = b_g / p_g^α (α ≈ 1 by default).

  3. Set weights relative to grade-5, with caps to enforce dominance:

    • w₅ = 1.0
    • w₄ = min(r₄ / r₅, cap₄)
    • w₃ = min(r₃ / r₅, cap₃)
    • w₂ = w₁ = 0

    with something like cap₄ = 1.0, cap₃ = 0.25.

If there are no grade-5s at all in the pool, fall back to a fixed conservative scheme (e.g., w₄=1, w₃=0.2) so the metric doesn’t collapse.

Intuition:

  • When 5s are rare, their relative weight dominates.
  • 3s and 4s are capped and can never “pretend” to be a decisive 5, no matter how scarce they are.

Now every passage d has a rarity-aware weight w(d) = w_{g(d)}.
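To make steps 1–3 concrete, here is a minimal sketch of the weighting for a hypothetical label distribution (one grade-5 in a pool of ten); the variable names and the example distribution are illustrative, not from the paper:

```python
from collections import Counter

# Hypothetical labeled pool for one query: one rare grade-5, some 4s and 3s.
grades = [5, 4, 4, 3, 3, 3, 3, 2, 1, 1]

base = {5: 1.0, 4: 0.5, 3: 0.1, 2: 0.0, 1: 0.0}   # base utilities b_g
alpha, cap4, cap3 = 1.0, 1.0, 0.25                  # defaults from the text

counts = Counter(grades)
N = len(grades)
p = {g: counts[g] / N for g in base}                # prevalence p_g
r = {g: base[g] / (p[g] ** alpha) if p[g] else 0.0 for g in base}  # rarity r_g

w = {
    5: 1.0,
    4: min(r[4] / r[5], cap4),  # capped: a common 4 can't outweigh a rare 5
    3: min(r[3] / r[5], cap3),
    2: 0.0,
    1: 0.0,
}
print(w)  # w4 ≈ 0.25, w3 ≈ 0.025 for this distribution
```

Note how the lone grade-5 (prevalence 0.1) dominates: the twice-as-common grade-4s end up at a quarter of its weight, and the abundant grade-3s are nearly negligible.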

2.3 Compare your top-K to the global oracle top-K

Now define, per query:

  • Global pool $\mathcal{U}$ – all labeled passages for this query.
  • Your top-K set $S_K$ – the K passages you actually fed to the LLM.
  • Global oracle top-K $\mathcal{O}_K$ – the K passages in $\mathcal{U}$ with the highest weights.

Then:

  • Observed utility
    $G_{\text{obs}}(K) = \sum_{d \in S_K} w(d)$
  • Global oracle utility
    $G_{\text{oracle}}(K) = \sum_{d \in \mathcal{O}_K} w(d)$

Finally:

$\text{RA-nWG}@K = G_{\text{obs}}(K) / G_{\text{oracle}}(K)$ (or NA if $G_{\text{oracle}}(K) = 0$).

Because this is done per query, you can macro-average RA-nWG@K across queries without conflating “this query has lots of grade-5s” with “this system is good.”

You can (and should) pair RA-nWG@K with:

  • N-Recall4+@K – fraction of available grade ≥4 passages that appear in the top-K, normalized by $\min(K,\,R_{4+})$.
  • N-Recall5@K – same, focused on grade-5s only.
  • Harm@K – fraction of top-K passages labeled as harmful/junk (e.g., grade 1, or ≤2 depending on rubric).

Together, that answers:

  • “How close are we to oracle utility?” (RA-nWG@K)
  • “How much of the best evidence did we cover?” (N-Recall)
  • “How dirty is the context?” (Harm@K)
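The companion KPIs are straightforward to compute from graded labels; here is a minimal sketch, with illustrative function names and a hypothetical query:

```python
def n_recall(topk_labels, pool_labels, K, min_grade):
    """Fraction of the available grade >= min_grade passages that made it
    into the top-K, normalized by min(K, number available)."""
    available = sum(1 for g in pool_labels if g >= min_grade)
    if available == 0:
        return None  # NA: nothing of this grade exists for the query
    captured = sum(1 for g in topk_labels if g >= min_grade)
    return captured / min(K, available)

def harm_at_k(topk_labels, K, max_harm_grade=2):
    """Fraction of the K slots holding junk/harmful passages (grade <= threshold)."""
    return sum(1 for g in topk_labels if g <= max_harm_grade) / K

pool = [5, 4, 4, 3, 3, 2, 1]   # all labeled passages for this query
topk = [4, 3, 2, 5]            # what we actually fed the LLM (K=4)

print(n_recall(topk, pool, K=4, min_grade=4))  # 2 of 3 grade>=4 captured
print(harm_at_k(topk, K=4))                    # 1 junk passage in 4 slots
```

Passing `min_grade=5` gives N-Recall5@K, since 5 is the top grade.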

3. Why rank metrics fail the RAG reality test

Classical rank metrics like nDCG/MAP/MRR bake in assumptions that don’t survive contact with long-context LLMs and RAG.

3.1 Monotone position discount vs long-context behavior

Rank metrics assume:

  • position 1 ≫ 2 ≫ 3 ≫ … in a smooth, monotone way.

That’s true for humans scanning a SERP. LLMs are weirder.

Empirical work like Lost in the Middle shows:

  • Many models perform best on information near the beginning and end of a long context,
  • with a dip in the middle, not a nice monotone decay.

On the theory side:

  • Standard RoPE introduces distance-dependent decay in how precisely attention can distinguish positions as they get further apart.
  • Naive long-context hacks (linear scaling, interpolation) can introduce aliasing/crowding: far-out positions get squashed into a compressed region of the embedding space.
  • Hence all the NTK-aware RoPE variants, LongRoPE, etc., trying to fight that.

In other words, “position 1 is twice as important as position 2, three times as important as position 3, and so on” is just not how these systems behave.

Trappolini et al. push this further in practice: in their RAG setups, removing positional discounting from the metric barely changed correlation with answer quality once you conditioned on which passages were in the context. The set mattered; the exact rank weighting mostly did not.

Hard-coding a fixed, monotone discount curve into your metric in 2025 is wishful thinking. RA-nWG sidesteps the whole mess by being order-agnostic at its core – you can still study positional quirks, but you don’t bake them into the definition of “good retrieval” for everyone.

3.2 Label-distribution confounding (“The missing 5”)

Rank metrics also silently confound system quality with how many good documents exist for each query.

Consider:

  • Query A – exactly one decisive passage (grade-5), plus many grade-3 “okay but not enough” passages.
  • Query B – ten different grade-5 passages in a redundant corpus.

Take a system that:

  • often finds a grade-5 for Query B,
  • regularly misses the single grade-5 for Query A and only retrieves grade-3s there.

Intuitively, it’s doing much worse on Query A: missing the only decisive passage is a bigger sin than picking one of many 5s.

Rank metrics mostly see:

  • “Nice DCG on B, worse DCG on A,” then average and call it a day.
  • They don’t normalize for how much high-utility evidence was even available per query.

RA-nWG doesn’t have this problem:

  • Both Query A and Query B get their own oracle G_oracle(K), computed from their labeled corpus slice.
  • Your RA-nWG@K is G_obs(K) / G_oracle(K) for each query.
  • Only then do you macro-average across queries.

You’re no longer benchmarking your label distribution. You’re benchmarking how close your actual top-K came to the best that was achievable for each query.
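A toy sketch of the per-query normalization at K=1, using fixed illustrative weights (w5=1.0, w3=0.1) rather than the full rarity scheme:

```python
def ra_nwg(topk_w, oracle_w):
    """Ratio of observed set utility to the per-query oracle utility."""
    return sum(topk_w) / sum(oracle_w)

# Query A: one decisive grade-5 exists, but we only retrieved a grade-3.
score_a = ra_nwg(topk_w=[0.1], oracle_w=[1.0])

# Query B: many grade-5s exist and we found one of them.
score_b = ra_nwg(topk_w=[1.0], oracle_w=[1.0])

print(score_a, score_b)  # A is scored harshly, B is perfect
```

Because each query is normalized against its own oracle, missing the only decisive passage on Query A scores 0.1, while picking any one of the redundant 5s on Query B scores 1.0; macro-averaging then weights both queries equally.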

3.3 Distractors: real, but a moving target

Traditional IR mostly treats non-relevant docs as harmless padding: the user just scrolls past them.

In RAG, “non-relevant” can mean:

  • A near-miss that looks very plausible but is wrong.
  • Content that anchors the model on the wrong entity/date/claim.
  • Toxic or adversarial passages that the model might happily quote.

There’s a growing pile of work showing that hard distractors can torpedo answer accuracy even when the right evidence is in the context.

The catch is: distractor sensitivity is model-dependent and changing fast.

  • Newer, larger, better-prompted models are already much less brittle than the baselines in early RAG papers.
  • The impact of junk depends heavily on formatting, task, language, and the exact generator.

So instead of baking one fixed “distractor penalty” into the core metric, a more stable approach is:

  • Use RA-nWG@K to measure presence and utility of good evidence.
  • Track a separate Harm@K (fraction of top-K passages with “junk/harmful” labels) if you care about brittleness.
  • If you need a more detailed order- and harm-aware score for a specific deployment, something UDCG-like is a great secondary diagnostic.

4. Where is the headroom? PROC and %PROC

RA-nWG@K tells you how good your actual top-K is versus the global oracle. It doesn’t tell you why it isn’t 1.0:

  • Did the retriever fail to pull the right evidence into the pool at all?
  • Or did the reranker/selection logic fail to choose the right K from a good pool?

That’s where PROC@K and %PROC@K come in.

4.1 PROC@K – retrieval ceiling from a given pool

For each query, define:

  • U – the global labeled corpus slice.
  • P – the retrieval pool: the documents you actually fetched (e.g., Top-50 from dense + BM25).
  • S_K – the actual top-K you fed to the LLM (S_K ⊆ P).

We already had the global oracle:

  • G_oracle(K) = max over S ⊆ U, |S|=K of Σ w(d).

Now define a pool-restricted oracle:

  • G_pool(K) = max over S ⊆ P, |S|=K of Σ w(d).

This gives you:

  • RA-nWG@K = G_obs(K) / G_oracle(K) (global, “how close to omniscient best?”)

  • PROC@K = G_pool(K) / G_oracle(K) (retrieval ceiling, “how much of the global best could we achieve given this pool?”)

PROC@K is purely about retrieval quality:

  • If PROC@K is low, the right stuff isn’t making it into the pool. No amount of reranking can fix that.

4.2 %PROC@K – reranker / selection efficiency

You can now factor RA-nWG@K into:

  • global ceiling → pool ceiling → actual realized score.

Define:

%PROC@K = RA-nWG@K / PROC@K (when PROC@K > 0)

Interpretation:

  • %PROC@K answers: “Given that this pool could support PROC@K (as a fraction of the global oracle), how much of that potential did our actual top-K selection realize?”

Put together:

  • Retrieval quality ⇒ PROC@K
  • Reranking/selection quality ⇒ %PROC@K
  • Combined effect ⇒ RA-nWG@K
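The factorization can be sketched in a few lines; `diagnose` and the weight lists below are illustrative names, and the per-passage weights are assumed to come from the rarity-aware scheme of section 2:

```python
def oracle_gain(weights, K):
    """Best achievable weight sum over any K-subset: take the K largest."""
    return sum(sorted(weights, reverse=True)[:K])

def diagnose(global_w, pool_w, topk_w, K):
    """Factor set quality into retrieval ceiling (PROC@K) and selection
    efficiency (%PROC@K).
    global_w: weights of all labeled passages for the query
    pool_w:   weights of the passages the retriever actually fetched
    topk_w:   weights of the K passages fed to the LLM"""
    g_oracle = oracle_gain(global_w, K)
    if g_oracle == 0:
        return None  # NA query
    g_pool = oracle_gain(pool_w, K)
    g_obs = sum(topk_w)
    ra_nwg = g_obs / g_oracle
    proc = g_pool / g_oracle
    pct_proc = ra_nwg / proc if proc > 0 else None
    return {"RA-nWG@K": ra_nwg, "PROC@K": proc, "%PROC@K": pct_proc}

# Hypothetical query: retrieval never surfaced the rare 1.0-weight passage,
# and selection also skipped one of the 0.25s it did have.
global_w = [1.0, 0.25, 0.25, 0.05, 0.0, 0.0]
pool_w   = [0.25, 0.25, 0.05, 0.0]
topk_w   = [0.25, 0.05]

print(diagnose(global_w, pool_w, topk_w, K=2))
```

In this example PROC@2 ≈ 0.4 (the pool caps you well below the global oracle, so the retriever is the bottleneck), while %PROC@2 ≈ 0.6 shows the selection step also left pool potential on the table.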

The diagnostic matrix:

  • Low PROC@K

    • Retrieved pool can’t support a good K-set.
    • Fix retriever: embeddings, hybridization, query rewriting, ANN recall, etc.
  • High PROC@K, low %PROC@K

    • Retrieved pool could support a strong RA-nWG@K, but your selection logic isn’t using that potential.
    • Fix reranker / selection: cross-encoder, scoring features, filters, near-duplicate handling.

5. Simple Python example for RA-nWG@K

Here’s a minimal, self-contained example of computing RA-nWG@K for a single query:

from collections import Counter

def compute_weights(grades, alpha=1.0, cap4=1.0, cap3=0.25):
    """
    grades: list of integers in {1,2,3,4,5} for *all* labeled passages for this query.
    Returns: dict grade -> weight w_g.
    """
    base = {5: 1.0, 4: 0.5, 3: 0.1, 2: 0.0, 1: 0.0}
    N = len(grades)
    counts = Counter(grades)

    # prevalence per grade in the labeled corpus slice
    p = {g: counts[g] / N for g in base.keys()}

    # handle "no grade-5" as a special case
    if counts[5] == 0:
        return {5: 1.0, 4: 1.0, 3: 0.2, 2: 0.0, 1: 0.0}

    # rarity scores
    r = {g: (base[g] / (p[g] ** alpha)) if p[g] > 0 else 0.0 for g in base}

    w5 = 1.0
    w4 = min(r[4] / r[5], cap4) if r[5] > 0 else 0.0
    w3 = min(r[3] / r[5], cap3) if r[5] > 0 else 0.0

    return {5: w5, 4: w4, 3: w3, 2: 0.0, 1: 0.0}

def ra_nwg_at_k(global_labels, topk_labels, K, alpha=1.0):
    """
    global_labels: list of grades for ALL labeled passages (corpus slice for this query)
    topk_labels: list of grades for the K passages you actually fed the LLM
    K: cut-off (len(topk_labels) should be <= K; pad with worst if needed)
    """
    if not global_labels:
        return None  # NA

    weights = compute_weights(global_labels, alpha=alpha)
    
    # observed gain: sum of weights in actual top-K
    G_obs = sum(weights[g] for g in topk_labels)

    # global oracle gain: take the K largest weights from the corpus
    all_weights = sorted((weights[g] for g in global_labels), reverse=True)
    K_eff = min(K, len(all_weights))
    if K_eff == 0:
        return None  # NA

    G_oracle = sum(all_weights[:K_eff])

    if G_oracle == 0:
        return None  # NA

    return G_obs / G_oracle

# Example:
global_labels = [5, 4, 4, 3, 3, 3, 2, 1]   # entire labeled pool for this query
topk_labels   = [4, 3, 3, 3]               # what we actually fed into the prompt (K=4)
K = 4

score = ra_nwg_at_k(global_labels, topk_labels, K)
print("RA-nWG@4 =", score)
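To aggregate over a query set, macro-average the per-query scores while skipping NA queries (those whose oracle gain is zero); a minimal self-contained sketch:

```python
def macro_average(per_query_scores):
    """Macro-average RA-nWG@K across queries; None marks NA queries,
    which are excluded rather than counted as zero."""
    valid = [s for s in per_query_scores if s is not None]
    return sum(valid) / len(valid) if valid else None

# e.g. three scored queries and one NA query
print(macro_average([0.8, 0.5, 0.2, None]))  # -> 0.5
```

Excluding NA queries matters: counting them as zero would punish the system for queries where no positively weighted evidence exists at all.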

6. Formal definitions

Setup (per query q)

$$ \textbf{Labels: } g \in \{1,2,3,4,5\} $$

Utility grading scale

  • 5 = responds clearly / contains the key elements
  • 4 = highly relevant, substantial information
  • 3 = partially relevant; related notions but insufficient
  • 2 = weak relevance; tangential allusions
  • 1 = distractors and harm
$$ \textbf{Pool size: } N \quad \text{(graded passages for } q\text{)} $$

$$ \textbf{top-K: } \mathrm{TopK}(q) $$

Base utilities (stationary, order-free)

$$ b_5=1.0,\quad b_4=0.5,\quad b_3=0.1,\quad b_2=b_1=0. $$

Counts, proportions, rarity

$$ n_g = \#\{\text{passages of grade } g\},\qquad p_g = \frac{n_g}{N}. $$

If $p_g=0$, treat $r_g=0$.

Rarity score (alpha = 1 by default)

$$ r_g = \frac{b_g}{p_g^\alpha},\quad \alpha=1. $$

We set $\alpha=1$ by default for a proportional, interpretable prevalence correction; $\alpha=0$ disables rarity weighting entirely. Caps ($\mathrm{cap}_4{=}1.0$, $\mathrm{cap}_3{=}0.25$) enforce grade-5 dominance and bound how much scarcity can compensate for lower grades.

Weight normalization (relative to grade-5) with caps

Defaults: $\mathrm{cap}_4 = 1.0$, $\mathrm{cap}_3 = 0.25$.

$$ w_5 = 1,\qquad w_4 = \min\!\Big(\frac{r_4}{r_5},\ \mathrm{cap}_4\Big),\qquad w_3 = \min\!\Big(\frac{r_3}{r_5},\ \mathrm{cap}_3\Big),\qquad w_2 = w_1 = 0. $$

Fallback when no grade-5 exists in the pool ($n_5 = 0$)

$$ \text{If } n_5=0:\quad w_5=1,\; w_4=1,\; w_3=0.2,\; w_2=w_1=0. $$

This fallback is applied only when $n_5=0$, preventing undefined normalization by $r_5$ and keeping the metric informative on 0×grade-5 queries.

Observed and ideal gains at cut K

$$ G_{\mathrm{obs}}(K) = \sum_{d \in \mathrm{TopK}(q)} w_{g(d)}. $$

$$ G_{\mathrm{ideal}}(K) = \max_{S \subseteq \text{pool},\, |S|=K} \ \sum_{d \in S} w_{g(d)} \;=\; \sum_{i=1}^{K} w_{g_i^\star} \quad \text{(take the K highest } w_g \text{ in the pool).} $$

Main metric: RA-nWG@K (rarity-aware, normalized within-query, set-based)

$$ \text{RA-nWG}@K \;=\; \begin{cases} \dfrac{G_{\mathrm{obs}}(K)}{G_{\mathrm{ideal}}(K)}, & \text{if } G_{\mathrm{ideal}}(K) > 0,\\[4pt] \mathrm{NA}, & \text{otherwise.} \end{cases} $$

Complementary coverage and precision KPIs

$$ R_{4+} = n_4 + n_5, \qquad R_{5} = n_5. $$

$$ G_{4+}(K) = \sum_{d \in \mathrm{TopK}(q)} \mathbf{1}[g(d)\ge 4], \qquad G_{5}(K) = \sum_{d \in \mathrm{TopK}(q)} \mathbf{1}[g(d) = 5]. $$

$$ \text{N-Recall}_{4+}(K) = \begin{cases} \dfrac{G_{4+}(K)}{\min\{K, R_{4+}\}}, & R_{4+} > 0,\\[4pt] \text{NA}, & \text{otherwise,} \end{cases} \qquad \text{N-Recall}_{5}(K) = \begin{cases} \dfrac{G_{5}(K)}{\min\{K, R_{5}\}}, & R_{5} > 0,\\[4pt] \text{NA}, & \text{otherwise.} \end{cases} $$

$$ \text{Precision}_{4+}(K) = \frac{G_{4+}(K)}{K}, \qquad \text{Harm}(K) = \frac{1}{K}\sum_{d \in \mathrm{TopK}(q)} \mathbf{1}[g(d)\le 2]. $$

In a real system, you’d:

  • use the same weights to compute PROC@K (by restricting the oracle to the retrieval pool), and
  • divide RA-nWG@K by PROC@K to get %PROC@K.

But the core idea stays simple:

  • RA-nWG@K – “How close is this K-set to what an omniscient system could have done on this corpus?”
  • PROC@K – “What’s the best we could have done given this retrieval pool?”
  • %PROC@K – “How much of that pool potential did our actual top-K selection realize?”

I go into more detail in the preprint (https://arxiv.org/abs/2511.09545); if you have ideas on how to improve this setup, I’d genuinely like to hear them.
