How Proper Names Behave in Text Embedding Space

Quantifying the fragile "name signal" in modern vector spaces.

TL;DR

  • The Surprise: Dense “semantic” retrievers work better on proper names than expected. In a RAG system over scientific papers, author names showed real signal—not perfect, but far from random.
  • The Experiment: Built a diagnostic with synthetic authors (to avoid memorization) and measured name margin (how much the model prefers the right author) vs topic margin. Result: names carry ~50% of topic’s separation power across 6,000 test runs.
  • The Fragility: Replace real names with gibberish IDs (ID-AB12F3) and the name margin collapses by ~70%. Light formatting (case, accents, initials) barely hurts. The signal isn’t “semantic understanding”—it’s surface form + subword familiarity.
  • The Takeaway: Proper names work as high-weight lexical anchors in embedding space: powerful when the string matches, dangerous when it mutates. For rare entities or precise attribution, you’ll want hybrid retrieval (dense + BM25) or metadata filters.

If dense retrieval is “semantic”, why does it work on proper names?

When I was building a RAG system over a client’s scientific papers, I noticed something odd.

The dense retriever was… kind of good at proper names.

In this domain, that’s not supposed to be easy. You often want things like “Which works by [AUTHOR] on [TOPIC]?” and, in practice, dense retrieval alone wasn’t enough — hybrid (BM25 + dense) clearly did better. But even before I added BM25, the dense model already showed a real preference for the right author over impostors. Not perfect, but definitely more than random.

That surprised me, because it didn’t match the neat mental picture I had of embeddings as “pure semantic meaning in a vector.” Proper names always felt like the awkward cousins of semantics: rare, spiky, messy, often out-of-vocabulary.

My first guess was the boring one: “These are well-known researchers; the model probably saw their names during training and just memorized some association like ’this string → this field’.” But that explanation didn’t fully satisfy me, and it didn’t tell me how fragile that signal was.

So I did the obvious thing for someone who wasn’t convinced by their own story: I started poking the vectors.

A Tiny Lab for Measuring the “Name Effect”

Seeing dense retrieval do surprisingly well on author names, I wanted to put numbers on it. Not full RAG metrics yet, just: how much does the model care about the author name, how does that compare to topic, and what happens if I deliberately break the names?

I set up a small diagnostic task around queries like:

Which papers by [AUTHOR] are about [TOPIC]?

For each query, I built a tiny bundle of candidates that mix and match author and topic:

| ID | Author | Topic |
|----|--------|-------|
| C1 | Correct author | Correct topic |
| C2 | Wrong author (impostor) | Same topic |
| C3 | Correct author | Different topic |
| C4 | Wrong author | Different topic |
Then I did the standard dense-retrieval thing: embed the query, embed the four candidates, and look at the cosine similarity between the query and each candidate. From those four scores, I defined three simple margins:

  • the name margin – how much closer C1 is than “same topic, wrong author” (C2),
  • the topic margin – how much closer C1 is than “same author, wrong topic” (C3),
  • the both margin – how much closer C1 is than “wrong author, wrong topic” (C4).

Big name margin → the model really uses the author. Big topic margin → it really uses the topic. Big both margin → the completely-wrong candidate is safely far away.
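The margin computation itself is tiny. Here's a minimal sketch in plain Python (the function names are mine; `cosine` stands in for whatever similarity your embedding client returns):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margins(sims):
    """Turn the four query-candidate scores into the three margins.

    `sims` maps candidate IDs to similarity against the query:
    C1 = correct author + topic, C2 = wrong author / same topic,
    C3 = same author / wrong topic, C4 = both wrong.
    """
    return {
        "name":  sims["C1"] - sims["C2"],   # Δ_name
        "topic": sims["C1"] - sims["C3"],   # Δ_topic
        "both":  sims["C1"] - sims["C4"],   # Δ_both
    }
```

With illustrative scores like `{"C1": 0.80, "C2": 0.63, "C3": 0.50, "C4": 0.31}`, a positive margin on each key means the correct candidate beats that kind of impostor.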

I also wanted to avoid two traps: overfitting to one lucky query, and accidentally measuring “does the model remember this famous researcher from pretraining?” instead of “how does it handle names in general?”. So I used synthetic authors.

For each language (English and French), I generated about 100 first names and 100 last names, then shuffled and recombined them on every run, so names like “Alice Dupont” or “Lucas Martin” look perfectly normal but aren’t tied to any real-world publication history. Topics came from real arXiv subject categories, so the questions still sound like actual scientific queries, just with made-up authors attached.

I repeated this a lot — roughly 6,000 runs across queries, languages, and random shuffles. Each run gets fresh impostors; the margins I show below are averages over all of that. The goal was to capture how these models behave in general, not how they treat any single name–topic pair.
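The recombination step can be sketched in a few lines. The name pools below are hypothetical stand-ins (the actual runs used ~100 first and ~100 last names per language):

```python
import random

# Hypothetical pools; the real experiment used ~100 of each per language.
FIRST_NAMES = ["Alice", "Lucas", "Manon", "Michel", "Claire", "Hugo"]
LAST_NAMES = ["Dupont", "Martin", "Bernard", "Moreau", "Lefevre", "Garnier"]

def sample_authors(rng):
    """Draw a (true author, impostor) pair for one run.

    Fresh recombination every run keeps names plausible-looking while
    guaranteeing there is no real publication history to memorize.
    """
    true_author = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    impostor = true_author
    while impostor == true_author:  # the impostor must be a different string
        impostor = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    return true_author, impostor
```

Seeding the `random.Random` instance per run is what makes the shuffles reproducible across the ~6,000 repetitions.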

What the Clean Setup Says About Names vs Topic

Here’s what those margins look like in the clean, unperturbed Base condition:

| Lang | Model | Δ_name | Δ_topic | Δ_both |
|------|-------|--------|---------|--------|
| EN | OpenAI text-embed-3L | 0.175 | 0.305 | 0.486 |
| EN | Voyage 3.5 | 0.160 | 0.298 | 0.464 |
| FR | OpenAI text-embed-3L | 0.139 | 0.260 | 0.407 |
| FR | Voyage 3.5 | 0.164 | 0.277 | 0.447 |

Each Δ is just “how much better the correct author+topic scores than a certain kind of impostor”:

  • Δ_name (name signal) = score(C1) − score(C2) → How much the model prefers the right author over a wrong author when the topic is the same.

  • Δ_topic (topic signal) = score(C1) − score(C3) → How much it prefers the right topic over the wrong one when the author is the same.

  • Δ_both (everything wrong) = score(C1) − score(C4) → How far it pushes away candidates that are wrong on both author and topic.

What jumps out:

  • All margins are nicely positive → the correct (author, topic) candidate sits clearly closer to the query than the impostors.
  • Topic margins are bigger than name margins in every case → topic is still the main driver, which is reassuring.
  • The “both wrong” margin is largest, as you’d expect when both author and topic are off.

If you take the ratio Δ_name / Δ_topic, you get roughly 0.53–0.59 across models and languages. In other words:

In this setup, proper names carry about half as much separation power as the topic.

So even in a synthetic world with made-up authors, dense “semantic” embeddings are clearly picking up a strong name signal — not as big as topic, but very far from negligible.

That’s the nice version.

The real test is what happens when you attack the name signal on purpose. If you:

  • replace authors with generic masks,
  • turn them into random-looking IDs,
  • or just lightly abuse orthography and layout,

…do the name margins stay healthy, or do they fall off a cliff?

Breaking the Names: How Much Does Δ_name Survive?

I kept the same tiny lab (queries + C1–C4 bundles) and started applying different transformations to the author field and surrounding text. For each ablation, I recomputed the name margin Δ_name and looked at how much it changed relative to the Base condition:

ΔΔ% = (Δ_name(ablated) − Δ_name(base)) / Δ_name(base)
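In code, that relative change is a one-liner (`rel_change` is my name for it, not one from the paper):

```python
def rel_change(ablated, base):
    """ΔΔ% = (Δ_name(ablated) − Δ_name(base)) / Δ_name(base), in percent."""
    return 100.0 * (ablated - base) / base
```

So an ablation that cuts a base margin of 0.175 down to an illustrative 0.040 would report roughly −77%.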

Here’s the impact on English (French tells the same story, with slightly different numbers):

Impact vs Base — EN (ΔΔ% change in Δ_name, 15 runs)

| Condition | OpenAI ΔΔ_name % | Voyage ΔΔ_name % |
|-----------|------------------|------------------|
| hard_name_mask | −100.0% | −100.0% |
| gibberish_name | −76.9% | −68.0% |
| edit_distance_near_miss | −69.3% | −64.4% |
| remove_label | −3.0% | +15.7% |
| strip_diacritics | −0.0% | +0.0% |
| initials_form | −11.0% | −8.5% |
| name_order_inversion | −3.6% | −3.2% |
| case_punct_perturb | −3.0% | −7.4% |
| author_position_shift | +8.4% | −6.2% |
| unicode_normalization_stress | −6.9% | −15.1% |

Ablation Families: What Really Hurts the Name Signal?

The first family is the identity-destroying ones – the brutal ablations that do exactly what you’d expect. In the “hard name mask” condition, I replace all author strings in a bundle with the same generic token, like AUTHOR_007. It’s basically a negative control: if the name margin survived that, the whole diagnostic would be suspect. It doesn’t survive. Δ_name drops by 100% on both models; the margin collapses by construction, which is exactly what we want as a sanity check.

More interesting is what happens with gibberish IDs. Here, each author is replaced by a stable random-looking token such as ID-AB12F3. The same gibberish string appears in the query and the correct candidate, and different gibberish in the impostors, so identity linkage is preserved. What disappears is the linguistic form: no “Manon”, no “Michel”, just token fragments like ID, -, AB, 12, F3. Under this ablation, Δ_name drops by roughly 70%. There is still a small edge for “same gibberish string” over “different gibberish string”, but most of the original separation is gone. A similar story shows up with near-miss edits: for impostor authors, I lightly corrupt the true name by one to three character edits, while C1 keeps the clean version. That’s enough to drag Δ_name down by about 65–70% as well. Tiny spelling changes are enough to make impostors dangerously close in embedding space.
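The near-miss corruption might look something like this minimal sketch (function name and edit policy are mine; the actual ablation applies one to three character edits to impostor names only):

```python
import random
import string

def near_miss(name, rng, n_edits=2):
    """Lightly corrupt a name with a few single-character substitutions.

    A sketch of the `edit_distance_near_miss` idea: impostor copies of
    the true author get 1-3 character edits while C1 keeps the clean
    form. Only letters are touched; spaces are left intact.
    """
    chars = list(name)
    letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
    k = min(n_edits, len(letter_positions))
    for i in rng.sample(letter_positions, k):
        old = chars[i]
        pool = string.ascii_lowercase if old.islower() else string.ascii_uppercase
        chars[i] = rng.choice([c for c in pool if c != old])
    return "".join(chars)
```

Two substitutions turn “Alice Dupont” into something like “Alife Dupoct”: still visually close, yet far enough in token space to drag the margin down.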

The takeaway from this family is that the strong name signal in the Base condition is not some deep, semantic understanding of invented authors. It is mostly a combination of exact string matching and the fact that realistic names are built from familiar subword pieces the model has seen a lot during pretraining. As soon as you remove that structure (masking, gibberish, hard edits), the name margin collapses.

The second family covers light orthography and formatting, the kind of things normal text pipelines do all the time: changing case, tweaking punctuation, stripping accents, using initials, swapping “First Last” and “Last, First”, and playing Unicode normalization games. Concretely, this includes conditions like initials_form, name_order_inversion, case_punct_perturb, strip_diacritics, and unicode_normalization_stress. Here the numbers are much calmer: most Δ_name changes live in the −3% to −12% range, diacritics are literally 0% in EN and tiny in FR, and even initials or name-order inversions only shave off a small slice of the margin. Modern embedders are, in practice, quite robust to this boring formatting noise. They don’t care much whether you write “Alice Dupont”, “DUPONT, Alice”, “A. Dupont”, or “Alice Dupont” without accents. That’s good news if you’re doing normal normalization – lowercasing, accent stripping, mild punctuation cleanup – you’re probably not quietly destroying your name signal.
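For concreteness, here is how a few of these formatting conditions might be implemented (a minimal sketch; function names are mine, and `unicodedata` does the accent stripping):

```python
import unicodedata

def strip_diacritics(name):
    """'Émilie Lefèvre' -> 'Emilie Lefevre' (strip_diacritics condition)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def initials_form(name):
    """'Alice Dupont' -> 'A. Dupont' (initials_form condition)."""
    first, *rest = name.split()
    return f"{first[0]}. {' '.join(rest)}"

def invert_order(name):
    """'Alice Dupont' -> 'Dupont, Alice' (name_order_inversion condition)."""
    first, *rest = name.split()
    return f"{' '.join(rest)}, {first}"
```

None of these change the subword material much, which is presumably why the embedders shrug them off.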

The third family is layout and structure, which lives in a more awkward middle ground. Here I change where the name appears and whether it has an explicit label. In the remove_label condition, I delete “Author:” / “Auteur :” tags; in author_position_shift, I move the author segment to the very front of the candidate. These manipulations don’t behave as cleanly as the others. For remove_label, Δ_name barely moves in English (around −3% for one model and +15–16% for the other), and in French it sometimes even increases noticeably. For author_position_shift, OpenAI EN sees a modest +8.4% boost in Δ_name, while Voyage EN drops by −6.2% under the same change. The picture is similar, but not identical, in French.

Here the takeaway is more nuanced: document structure clearly affects how much the author field “counts” in the embedding, but the effect is model- and language-specific. Some models seem to like seeing the author early; some care more about explicit labels; training corpora in English and French don’t lay things out in the same way. If you aggressively reformat documents or mix layouts from different sources, you’re poking this structural sensitivity whether you mean to or not.

The Plot Twist Nobody Asked For

The contrast between real-looking names and gibberish IDs was the most revealing part. When I swapped “Alice Dupont” for something like ID-AB12F3, the name margin didn’t vanish, but it collapsed hard. Δ_name dropped by roughly 70%: the model still gave a small boost to “same gibberish string” over “different gibberish string”, but the strong separation you get with natural names was mostly gone. That suggests the “name understanding” here isn’t some deep notion of identity; it’s largely built on surface form and tokenization – capital letters, plausible syllables, typical name patterns – plus a thinner layer of exact-match bias that survives even when the text degenerates into random-looking IDs.

Seeing that pattern, I went hunting through the literature for anyone who’d poked at names and embeddings in a similar way, to check whether this “surface form + exact match” story lined up with what people had already measured.

What the Literature Says About Names, Embeddings, and Retrieval

The pattern fits pretty well with what others have seen. BERT-style models are known to be oddly brittle around proper names: swap one person or location name for another of the same type and performance can crater, with analyses tying this to tokenization and frequency effects on named entities [1]. Dense retrievers that look great on mainstream QA often lose badly to BM25 on rare-entity questions like those in EntityQuestions, where simple fact queries such as “Where was Arve Furset born?” expose poor generalization beyond common entities and seen patterns [2]. Follow-up work either bolts lexical signals back onto dense models—SPAR, for example, augments a dense retriever with a BM25-like lexical component and recovers performance on entity-heavy benchmarks [3]—or tries to make entities less “random token salad” via knowledge-infused pretraining, as in ERNIE and K-BERT [4, 5]. More recent RAG robustness work shows similarly large drops when you just change how a query is written—style, formality, grammaticality—without changing what it’s actually asking, with Recall@5 drops of up to ~40% for less formal or ungrammatical queries [6].

My take, as a working hypothesis built on top of all this and the experiments in this post, is:

Proper names absolutely carry signal in dense embeddings—but mostly as high-weight lexical anchors, not as “deep” semantic objects. For anything that leans on rare entities, IDs, or precise attribution, you’ll usually want help from sparse retrieval, metadata filters, or graph-style RAG on top of the dense layer.

They’re rare, distinctive strings with familiar subword structure and strong co-occurrence patterns. That makes them powerful anchors in vector space: great when the string lines up, dangerous when it mutates, turns into gibberish, or gets buried in noisy context. The Δ-margins and gibberish-name experiments above are essentially measuring that anchoring effect directly. If you want the full experimental details, they’re written up in my paper [7]: https://arxiv.org/abs/2511.09545

References

  1. Balasubramanian et al. What’s in a Name? Are BERT Named Entity Representations just as Good for any other Name? RepL4NLP, 2020. (ACL Anthology)
  2. Sciavolino et al. Simple Entity-Centric Questions Challenge Dense Retrievers (EntityQuestions). EMNLP, 2021. (arXiv)
  3. Chen et al. Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One? Findings of EMNLP, 2022. (ACL Anthology)
  4. Sun et al. ERNIE: Enhanced Representation through Knowledge Integration / ERNIE 3.0: Knowledge Enhanced Pre-training. (arXiv)
  5. Liu et al. K-BERT: Enabling Language Representation with Knowledge Graph. AAAI, 2020. (arXiv)
  6. Cao et al. Out of Style: RAG’s Fragility to Linguistic Variation. arXiv, 2025.
  7. Etienne Dallaire. Practical RAG Evaluation: A Rarity-Aware Set-Based Metric and Cost-Latency-Quality Trade-offs. arXiv:2511.09545, 2025. (arXiv)