Turning Noise into Signal: Building RAG "Silver Sets" with Plackett-Luce and Lock-DAGs

Treating the LLM as a flawed sensor to build stable ground truth.

TL;DR

  • The Problem: building RAG evaluation sets (“Golden Sets”) requires ranking relevant documents. Humans are expensive; LLMs are cheap but noisy and often inconsistent ($A > B > C > A$).
  • The Fix: treat the LLM not as an oracle, but as a noisy sensor in an active learning loop.
  • The Math: sample small batches of documents ($k=5$) and fit an online Plackett–Luce model to estimate global scores and, crucially, uncertainty variances.
  • The Guardrails: only “lock” a pairwise ranking ($A$ is better than $B$) when their confidence intervals strictly separate. This builds a Directed Acyclic Graph (DAG).
  • The Result: a stable, topologically sorted Top-20 list that resolves contradictions automatically, stopping when the ranking stabilizes.

Evaluation is the bottleneck of RAG (Retrieval-Augmented Generation). To know if your retrieval system is working, you need a “Golden Set”—a query and a definitive list of the top relevant documents.

Traditionally, you have two options: pay humans to label thousands of query-document pairs (high quality, high cost, slow), or ask an LLM to rank the whole list (fast, cheap, but prone to hallucination and order bias).

This post is an engine-room tour of Stage 6 of rag-gs, our automated pipeline for generating synthetic evaluation datasets. By the time we hit Stage 6, we have a specific query and a large pool of potentially relevant candidate documents. Our goal is to distill that noisy pool into a stable, high-quality Top-20 ranking—our “Silver Set”—without a human in the loop.

Here is how we solved the problem of LLM inconsistency using probabilistic modeling and graph theory.

The Problem: LLMs Are Noisy Sensors

If you ask GPT-4 to rank 50 documents in one pass, it often fails due to context window limits and “lost-in-the-middle” phenomena. If you ask it to rank them in pairs, you run into the Condorcet Paradox: the model might say $A > B$, $B > C$, but $C > A$.

Cycles are fatal for ranking. You cannot build a ground truth on circular logic.

Instead of trying to prompt the noise away, we accepted it. We reframed the ranking task as an active learning optimization problem. We treat the LLM as a noisy sensor providing evidence. We aggregate that evidence until the signal overwhelms the noise.

The Mathematical Core

To aggregate noisy listwise comparisons, we use the Plackett–Luce (PL) model.

The PL model assumes that the probability of selecting item $i$ first from a set $S$ is proportional to its underlying “skill” or score parameter $\theta_i$:

$$ P(i \mid S) = \frac{e^{\theta_i}}{\sum_{j \in S} e^{\theta_j}} $$

We perform Bayesian updates on these parameters. However, knowing the score isn’t enough; we need to know how confident we are in that score.

Modeling Uncertainty

We approximate the posterior distribution of the scores as Gaussian. We track the mean parameter $\mu_i$ and the variance $\sigma^2_i$.

We estimate the variance using the inverse of the Fisher Information, which essentially tracks how much “information” (wins and losses) we’ve observed for a specific document. Roughly speaking, every comparison document $A$ participates in adds information, win or lose: after 100 comparisons, its uncertainty $\sigma_A$ is small even if its score $\mu_A$ sits in the middle of the pack.
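A minimal sketch of such an online update, assuming a plain stochastic-gradient step on the PL log-likelihood and a diagonal Fisher accumulator (function name, learning rate, and the `1.0` prior in the variance are all hypothetical choices, not the exact rag-gs update):

```python
import math

def pl_update(mu: dict, fisher: dict, ranking: list, lr: float = 0.1) -> dict:
    """One online update from an observed ranking (best -> worst).

    mu:     item -> current score estimate
    fisher: item -> accumulated Fisher information (an observation-count proxy)
    Returns a variance estimate per item: inverse Fisher info (plus a prior).
    """
    for pos, winner in enumerate(ranking[:-1]):
        stage = ranking[pos:]                     # items still in the running
        m = max(mu[j] for j in stage)
        exps = {j: math.exp(mu[j] - m) for j in stage}
        z = sum(exps.values())
        for j in stage:
            p = exps[j] / z                       # softmax pick probability
            grad = (1.0 if j == winner else 0.0) - p
            mu[j] += lr * grad                    # gradient step on log-likelihood
            fisher[j] += p * (1.0 - p)            # Fisher info of this pick
    return {j: 1.0 / (1.0 + fisher[j]) for j in mu}
```

After a single observed ranking $A \succ B \succ C$, the means already order the items correctly and every participant’s variance drops below the prior.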

Using a standard confidence interval (e.g., $z=1.96$ for $\approx 95\%$), we calculate the bounds:

  • Lower Confidence Bound (LCB): $\mu_i - 1.96\sigma_i$
  • Upper Confidence Bound (UCB): $\mu_i + 1.96\sigma_i$

This uncertainty quantification is the engine of our stability.

The Mechanism: A Trace

Let’s look at a concrete micro-example to see how the system handles the transition from noise to order.

Imagine we have a pool of documents: $\{A, B, C, D, E\}$. Initially, they all have equal scores and high uncertainty (wide variance).

Iteration 1: Sampling & Ranking

We use an acquisition function (balancing score maximization and uncertainty reduction) to select a batch of 5 items.

  • Batch: $[A, B, C, D, E]$
  • LLM output: “A is best, followed by C, B, E, then D.” ($A \succ C \succ B \succ E \succ D$)
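The acquisition step above can be sketched as a “score plus uncertainty bonus” rule (a UCB-style heuristic; the dict-based signature and $\beta$ parameter are assumptions for illustration, not the exact rag-gs acquisition function):

```python
def select_batch(mu: dict, sigma: dict, k: int = 5, beta: float = 1.0) -> list:
    """Pick the k items maximizing mu + beta * sigma (UCB-style acquisition).

    High-mu items refine the top of the ranking; high-sigma items reduce
    uncertainty where we know the least. beta trades off the two goals.
    """
    acq = {i: mu[i] + beta * sigma[i] for i in mu}
    return sorted(acq, key=acq.get, reverse=True)[:k]
```

A mediocre-but-uncertain document can outrank a strong-but-settled one in the batch, which is exactly the behavior you want while uncertainty is still high.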

Iteration 2: The Update

We update the PL model.

  • $A$ gets a score boost; its variance drops.
  • $D$ gets a score penalty; its variance drops.
  • $C$ and $B$ are in the middle.

Iteration 3: The “Lock” Logic

This is where rag-gs differs from standard sorting. We don’t just trust the scores; we look at the bounds.

  • Scenario 1 (Strong signal): $A$ has a very high LCB, and $D$ has a very low UCB. The intervals do not overlap. We LOCK the edge $A \to D$ in our DAG. This relationship is now immutable.
  • Scenario 2 (Weak signal): $C$ has a higher score than $B$, but their confidence intervals overlap heavily. We DO NOT lock this edge. We treat their relative ordering as unknown.
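The lock test itself is a one-line comparison of the bounds; this sketch assumes locked edges live in a plain set (names hypothetical):

```python
def maybe_lock(mu: dict, sigma: dict, locked: set, a, b, z: float = 1.96) -> bool:
    """Lock edge a -> b iff a's lower confidence bound exceeds b's upper bound."""
    lcb_a = mu[a] - z * sigma[a]
    ucb_b = mu[b] + z * sigma[b]
    if lcb_a > ucb_b:
        locked.add((a, b))   # intervals strictly separate: a beats b, immutably
        return True
    return False             # intervals overlap: keep sampling
```

With the scores from the trace, $A \to D$ locks while $C$ vs. $B$ stays open.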

Handling Contradictions

If a new LLM call suggests $D > A$ (contradicting our locked $A \to D$), we reject the new edge as noise because the accumulated evidence for $A \to D$ is statistically significant. If we haven’t locked the edge yet, we sample more until the uncertainty reduces enough to make a decision.

Stopping Criterion

We continue this loop—sample, rank, update, lock—until the Top-20 items in the DAG are stable.

Specifically, we stop when the set of the top 20 documents has not changed for $T$ consecutive iterations. This ensures we don’t waste tokens refining the order of document #45 vs #46, which is irrelevant for a Top-20 “Silver Set.”
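The outer loop can be sketched as follows, where `step` and `topk_fn` are hypothetical callbacks standing in for one sample–rank–update–lock iteration and the current DAG-derived ranking:

```python
def run_until_stable(step, topk_fn, k: int = 20, patience: int = 3,
                     max_iters: int = 500) -> int:
    """Loop until the Top-k set is unchanged for `patience` consecutive iterations.

    step():    one iteration (sample batch, call LLM, update model, lock edges)
    topk_fn(): the current ranking of document ids, best first
    Returns the number of iterations performed.
    """
    prev, streak = None, 0
    for it in range(max_iters):
        step()
        top = frozenset(topk_fn()[:k])    # order within the set doesn't matter yet
        streak = streak + 1 if top == prev else 0
        prev = top
        if streak >= patience:
            return it + 1
    return max_iters
```

Comparing frozensets rather than ordered lists is deliberate: churn below rank 20, or swaps among still-unlocked pairs outside the set, never resets the counter.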

Conclusion

By combining the probabilistic “softness” of Plackett–Luce with the “hard” consistency constraints of a DAG, rag-gs produces evaluation sets that are:

  • Transitive: no cycles of the form $A > B > C > A$.
  • Robust: outlier hallucinations are washed out by accumulated evidence.
  • Calibrated: we know when the model is unsure, and we pay for more compute (samples) only where it matters.

It is not a perfect substitute for a domain expert carefully reading 100 documents. But as a “Silver Set” generator for rapid RAG iteration, it provides a rigorous, mathematical baseline that far exceeds single-shot LLM ranking.

References & Further Reading

  • Plackett–Luce model: Luce, R. D. (1959). Individual Choice Behavior: A Theoretical Analysis.
  • Pairwise ranking: Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs.
  • Active learning for ranking: Burges, C. J. et al. (2005). Learning to rank using gradient descent.
  • Bayesian skill rating: Herbrich, R., Minka, T., & Graepel, T. (2007). TrueSkill™: A Bayesian Skill Rating System.