When Small Talk Hurts Retrieval: Quantifying Conversational Noise in RAG

Do we really need query rewriting? Fixing the "Politeness Penalty" without the latency.

TL;DR

  • I poked OpenAI’s embedding models (text-embedding-3-large / -small) with realistic conversational fluff and watched the cosine similarity to a short reference answer drop — the drop was larger in French than in English, and larger in the small model than in the large one.

  • This lines up with what RAG benchmarks like MTRAG and CRAG already scream: multi-turn chatter and retrieval noise absolutely wreck downstream QA if you’re not careful.

  • You can fight this in three ways:

    1. Beat the noise out with regex (strip fillers before embedding: cheap, crude, effective-ish).
    2. Rewrite queries with an LLM (much smarter, but adds cost and latency, even with a small model).
    3. Tell the embedding model what you actually want via instruction-tuned embeddings and input_type="query"/"document" (Voyage, BGE-M3, etc.).

…so, do we still need query rewriting?


We tend to assume embedding models look past wording and capture the underlying intent of a query.

Reality: your user types

“Hey! Quick question, hope you’re doing well, could forests actually regulate the climate, or is that just a myth?”

and your retriever has to somehow ignore half of that sentence to do its job.

That extra chatter is conversational noise: greetings, apologies, hedging, and social padding. From an information-retrieval perspective, most of it is useless. But transformers don’t get to say “nah”: every token is processed, and every token ends up influencing the final embedding to some degree.

Meanwhile, RAG research has already shown that noise is not just an aesthetic problem (The Power of Noise, MTRAG, CRAG, …).

So I wanted to look at the microscopic end of this story:

How much does real human politeness move a modern embedding vector?


A Small Experiment: The “Politeness Penalty”

I ran a simple test with OpenAI’s text-embedding-3-large and text-embedding-3-small (general-purpose embedding models meant for semantic search and RAG).

Setup

  • Core question in French and English:

    “Can forests really regulate the climate?”

  • Three variants per language:

    1. Level 0, no noise: the bare question.
    2. Level 2, moderate noise: short greeting + “quick question…” preamble.
    3. Level 4, high noise: long, conversational preamble (“sorry for the kinda random question, I’m on the train…”).
  • For each variant, I computed the cosine similarity between the noisy query and a short, straight-to-the-point answer sentence.

So I’m measuring how far the query embedding moves when the only thing you change is human chit-chat.
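A minimal sketch of this measurement, assuming an `OPENAI_API_KEY` in the environment; the variant texts and the reference answer below are illustrative paraphrases, not the exact strings from the experiment:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def embed(text, model="text-embedding-3-large"):
    """One embedding call; swap in text-embedding-3-small to compare."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    return client.embeddings.create(model=model, input=text).data[0].embedding

def politeness_penalty():
    # Short, straight-to-the-point answer sentence (illustrative).
    answer = "Forests regulate the climate by storing carbon and recycling water."
    variants = {
        "0: no noise": "Can forests really regulate the climate?",
        "2: moderate": "Hi! Quick question: can forests really regulate the climate?",
        "4: high": ("Hello! Sorry for the kinda random question, I'm on the train... "
                    "could forests actually regulate the climate, or is that a myth?"),
    }
    answer_vec = embed(answer)
    for label, query in variants.items():
        print(label, round(cosine(embed(query), answer_vec), 3))
```

The only thing that changes across the variants is the padding, so any similarity drop is attributable to it.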

Results

| Language | Noise Level | Example (Simplified) | Cosine Sim (Large) | Cosine Sim (Small) |
|---|---|---|---|---|
| 🇫🇷 French | 0: No noise | “Les forêts peuvent-elles réguler le climat ?” | 0.818 | 0.936 |
| 🇫🇷 French | 2: Moderate | “Salut ! Petite question rapide… selon la science…” | 0.653 | 0.757 |
| 🇫🇷 French | 4: High noise | “Hello ! Désolé pour la question un peu random…” | 0.522 | 0.559 |
| 🇬🇧 English | 0: No noise | “Can forests really regulate the climate?” | 0.828 | 0.908 |
| 🇬🇧 English | 2: Moderate | “Hi! Quick question: according to science…” | 0.749 | 0.806 |
| 🇬🇧 English | 4: High noise | “Hello! Sorry for the kinda random question…” | 0.593 | 0.634 |

Average drop from no-noise → high-noise:

| Language | Δ (Large) | Δ (Small) |
|---|---|---|
| 🇫🇷 French | −0.296 | −0.377 |
| 🇬🇧 English | −0.235 | −0.274 |

For this specific query and setup, we see:

  • Conversational padding always drags similarity down.
  • The small model is consistently more sensitive than the large one.
  • The drop is bigger in French than in English for both models.

This is not a full benchmark. But it’s a realistic example that says: if your users talk like humans, your “perfect” retrieval is operating with a slightly bent compass.


Why Might French Suffer More?

We don’t have OpenAI’s training-data histogram, so this part is necessarily a hypothesis.

We do know from multilingual benchmarks (MTEB, XTREME, etc.) and documentation that:

  • Models generally perform best on high-resource languages (English), with lower robustness on underrepresented ones.
  • BGE-M3, for example, is explicitly advertised as a multilingual embedding model that supports 100+ languages, but the docs still emphasize different retrieval behaviors and tuning per language/domain. (Hugging Face)

A plausible story consistent with this:

The models have seen far more noisy, informal English (forums, chat logs, Q&A) than noisy French. In English, phrases like “Hi, quick question” behave closer to “soft stopwords” for retrieval tasks. In French, similar fillers (“Désolé pour la question…”) may be treated as more semantically meaningful, so they tug the embedding further away from the core topic.

It’s not proven by this one toy experiment, but the pattern matches what we see in larger multilingual evaluations: English is usually more robust under perturbation; other languages are in catch-up mode.


Three Ways to Fight Conversational Noise

Now to the engineering question: what do we do about this?

Because “tell users to stop being polite” is… not a UX strategy.

1. Beat the Noise Out With Regex

The blunt instrument:

  • Maintain a small list of fillers: "hi", "hello", "quick question", "hope you're doing well", "désolé pour la question", etc.
  • Strip them before embedding using regex or a lightweight NLP pre-processor.
  • Optional: also trim emojis and common discourse markers.
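The steps above can be sketched as follows; the filler list is illustrative and would need to grow (and be split per language) in practice:

```python
import re

# Illustrative filler list; a production list would be much longer.
FILLERS = [
    r"\bhi\b", r"\bhello\b", r"\bhey\b",
    r"\bquick question[,:]?", r"\bhope you'?re doing well[,.!]?",
    r"\bsalut\b", r"\bd[ée]sol[ée] pour la question( un peu random)?[,.]?",
]
FILLER_RE = re.compile(r"(?:%s)\s*" % "|".join(FILLERS), re.IGNORECASE)

def strip_fillers(query: str) -> str:
    """Remove greeting/filler phrases before embedding."""
    cleaned = FILLER_RE.sub("", query)
    # Drop stray leading punctuation, then collapse leftover whitespace.
    cleaned = re.sub(r"^\W+", "", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()
```

One pass over the string, no model calls: exactly the cheap, crude, effective-ish tool described above.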

Pros:

  • Cheap: one pass over the string; no extra model calls.
  • Fast: effectively free in latency compared to embedding / LLM steps.
  • Predictable: you control exactly what gets removed.

Cons:

  • Brittle: users are inventive; you’ll miss tons of variants.
  • Easy to overshoot and remove meaningful content (“sorry for the late payment” vs a generic apology).
  • Doesn’t resolve deeper conversational issues: anaphora, topic drift, “the second one we talked about.”

People probably do this in production; it’s not pretty, but as a first line of defense, it’s fine.


2. LLM Query Rewriting: Smart, But Not Free

The smarter play is query rewriting with an LLM:

Take the raw conversational history + the latest user message → ask a small LLM to rewrite it into a clean, standalone search query.

This is now standard in many “advanced RAG” write-ups: step-back prompting, multi-query rewriting, conversation consolidation, etc.

Example behavior:

“Hey, quick question, remember the second product we talked about earlier, how does its battery life compare to the one from 2022?” → “Compare the battery life of <the second product> and <the 2022 model>.”
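A minimal sketch of the rewriting step, assuming the OpenAI API and an arbitrary small model; the prompt template is illustrative, not battle-tested:

```python
REWRITE_PROMPT = """You rewrite the user's latest message into a standalone search query.
Resolve pronouns using the conversation history. Drop greetings and fillers.
Do not add any information that is not in the conversation.

Conversation:
{history}

Latest message:
{message}

Standalone query:"""

def build_rewrite_prompt(history: list[str], message: str) -> str:
    """Pure prompt assembly, separated out so it can be tested cheaply."""
    return REWRITE_PROMPT.format(history="\n".join(history), message=message)

def rewrite_query(history: list[str], message: str, model="gpt-4o-mini") -> str:
    """One extra LLM round-trip before retrieval; this is the latency cost."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_rewrite_prompt(history, message)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

Note the forward pass sits on the critical path: retrieval cannot start until the rewrite returns.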

Pros:

  • Actually understands the conversation: resolves “it”, “the second one”, “that guy”.
  • Can remove fillers semantically instead of with dumb string rules.
  • Can generate multiple reformulations (multi-query RAG) to reduce brittleness.

Cons (people hand-wave these way too often):

  • Latency:

    • Even a small LLM adds a new forward pass before retrieval.
    • In many production reports, query rewriting + retrieval + reranking together are a major contributor to end-to-end latency.
  • Cost:

    • You pay tokens for the conversation history + the rewritten output.
    • On high-traffic systems, that’s non-trivial compared to a single embedding call.
  • Failure modes:

    • A bad rewrite can inject hallucinated details that weren’t in the original user query, which then get “locked in” by retrievers.

Rewriting is powerful, and in multi-turn settings it’s often necessary to even make the query well-defined. But it is not free, and pretending it’s free is how you end up with a RAG stack that feels slow and expensive.


3. Tell the Embedding Model What You Actually Want

The third line of defense is to stop treating embeddings as black boxes and explicitly condition them for retrieval.

This is where instruction-tuned embeddings and input_type-style APIs come in.

Voyage: input_type="query" / "document"

Voyage’s embedding models (voyage-3, voyage-3.5, etc.) expose an input_type argument: "query" or "document".

Under the hood:

  • input_type="query": model is told “represent this as a retrieval query.”
  • input_type="document": model is told “represent this as retrievable content.”
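Assuming the `voyageai` Python client and a `VOYAGE_API_KEY` in the environment, a thin wrapper that makes the role explicit might look like:

```python
def embed_for_retrieval(texts, role, model="voyage-3.5"):
    """Embed texts with the retrieval role made explicit."""
    if role not in ("query", "document"):
        raise ValueError("role must be 'query' or 'document'")
    import voyageai  # pip install voyageai
    vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
    return vo.embed(texts, model=model, input_type=role).embeddings

# Queries and documents are embedded asymmetrically, on purpose:
# query_vecs = embed_for_retrieval(["can forests regulate the climate?"], "query")
# doc_vecs   = embed_for_retrieval(["Forests store carbon and drive rainfall."], "document")
```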

Docs and public evals show that:

  • These models are explicitly optimized for retrieval quality on MTEB-style benchmarks.
  • Voyage claims voyage-3 and 3.5 outperform OpenAI’s v3 embeddings on a range of domains, including multilingual and code.

We don’t have a paper saying “this guarantees immunity to greetings,” but instruction-tuning for retrieval is designed to focus representations on the part of the text that matters for the task, and to be less sensitive to superficial phrasing.

BGE-M3: Instruction-tuned, multi-functional

On the open-source side, BGE-M3 is a multi-function embedding model that supports:

  • dense retrieval,
  • lexical matching, and
  • multi-vector interaction, across many languages.

Earlier BGE models recommended adding explicit instructions to queries (“retrieve passages that answer the following question”), similar in spirit to input_type="query". BGE-M3 relaxes this requirement but still benefits from being used in a retrieval-aware way.
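On the usage side, a sketch with the `FlagEmbedding` package (downloads the model weights on first run); the dense/lexical weighting is a tunable assumption, not an official recommendation:

```python
def hybrid_score(dense_sim, lexical_sim, w_dense=0.6, w_lexical=0.4):
    """Blend BGE-M3's dense and lexical relevance signals; weights are a knob."""
    return w_dense * dense_sim + w_lexical * lexical_sim

def score_pair(query, passage):
    """Requires `pip install FlagEmbedding`; downloads BAAI/bge-m3 on first use."""
    from FlagEmbedding import BGEM3FlagModel
    model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
    q = model.encode([query], return_dense=True, return_sparse=True)
    p = model.encode([passage], return_dense=True, return_sparse=True)
    dense = float(q["dense_vecs"][0] @ p["dense_vecs"][0])  # normalized, so ~cosine
    lexical = model.compute_lexical_matching_score(
        q["lexical_weights"][0], p["lexical_weights"][0])
    return hybrid_score(dense, lexical)
```

The lexical channel is one reason to expect some robustness here: a greeting contributes few meaningful lexical matches against a topical passage.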

These models don’t magically erase noise, but:

You’re at least telling the model what job it is doing: “summarize this chat” vs “search with this text” are different tasks, and the embedding should reflect that.

Used properly, that should reduce how much a polite preamble gets to tug the vector away from the real intent.


So… Do We Still Need Query Rewriting?

Time to circle back to the TL;DR question.

Given:

  • measurable cosine drift from conversational padding (especially in small and non-English embeddings),
  • strong evidence from MTRAG/CRAG that interactional + informational noise hurts end-to-end RAG,
  • and the existence of better embedding models (Voyage, BGE-M3…) that already bake in some robustness,

is query rewriting still worth it once you have regex cleaning and instruction-tuned embeddings?

My honest answer right now:

I don’t know yet, at least not in a way that’s calibrated against quality ⟷ latency ⟷ cost for a specific production setup.

On paper:

  • Regex + instruction-tuned embeddings are the cheapest, lowest-latency moves.
  • LLM rewriting is likely to give the best semantic cleanup, especially in multi-turn dialogues and for ambiguous queries, but you pay for it in both tokens and wall-clock time.

The next step is experiments that look like this:

  1. Fix a corpus + evaluation set

  2. Compare pipelines:

    • Baseline: raw conversational query → embedding → retrieval.
    • Regex only.
    • Instruction-tuned embeddings only.
    • Regex + instruction-tuned.
    • Regex + instruction-tuned + LLM rewriting.
  3. Measure:

    • retrieval metrics,
    • answer quality / truthfulness,
    • latency per request,
    • cost per 1k queries.
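The comparison loop itself doesn’t need to be fancy. A skeleton, under the assumption that each pipeline is a callable returning ranked doc ids (a hypothetical interface for this sketch):

```python
import time

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant docs found in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def evaluate(pipeline, eval_set, k=5):
    """eval_set: list of (query, relevant_ids) pairs; pipeline: query -> ranked ids."""
    recalls, latencies = [], []
    for query, relevant_ids in eval_set:
        t0 = time.perf_counter()
        retrieved = pipeline(query)
        latencies.append(time.perf_counter() - t0)
        recalls.append(recall_at_k(retrieved, relevant_ids, k))
    return {
        "recall@k": sum(recalls) / len(recalls),
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```

Run each of the five pipelines above through `evaluate` on the same eval set, add per-query token cost, and the quality ⟷ latency ⟷ cost trade-off becomes a table instead of a vibe.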

Until I’ve run that properly, all I can say is:

  • Conversational noise is definitely not free.
  • Modern embeddings and simple cleaning definitely help.
  • Query rewriting is probably still necessary for some classes of queries (multi-turn, ambiguous, long-tail)…

…but I still need to test exactly what the gains are, and whether they make sense through the lens of quality, latency, and cost for the kinds of systems I actually care about.
