[Paper] Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Published: April 23, 2026 at 01:25 PM EDT
5 min read
Source: arXiv - 2604.21882v1

Overview

Large language models (LLMs) are praised for their ability to recall factual information, but how they store and retrieve that knowledge can be surprisingly fragile. This paper introduces RedirectQA, a new benchmark that probes whether LLMs truly understand facts about entities or merely memorize a single “canonical” name. By systematically swapping an entity’s surface form—its aliases, abbreviations, misspellings, etc.—the authors reveal how sensitive LLMs are to the way a question is phrased.

Key Contributions

  • RedirectQA dataset: Built from Wikipedia redirects, it links Wikidata triples to a rich taxonomy of surface forms (canonical names, aliases, abbreviations, spelling variants, common errors).
  • Comprehensive evaluation: Tested 13 popular LLMs (including GPT‑3.5, LLaMA, and PaLM) on the same factual triples expressed with different surface forms.
  • Surface‑conditioned memorization analysis: Quantified how answer correctness varies when only the entity name changes, uncovering systematic patterns across surface‑form categories.
  • Frequency‑based insights: Showed that both the overall frequency of an entity in training data and the frequency of a specific surface form influence accuracy, with entity frequency often providing an extra boost.
  • Practical diagnostic framework: Provides a reusable methodology for developers to audit LLMs’ factual robustness beyond verbatim memorization.

Methodology

  1. Data Construction

    • Extracted factual triples (entity → relation → object) from Wikidata.
    • Mapped each entity to its Wikipedia page and collected all redirect pages, which naturally encode alternative surface forms.
    • Categorized each surface form into five buckets: canonical, alias, abbreviation, spelling variant, and erroneous form.
  2. Prompt Design

    • For every triple, generated a short QA prompt (“What is the capital of [surface form]?”) where the only variable was the surface form.
    • Kept the rest of the prompt identical to isolate the effect of the name change.
  3. Model Evaluation

    • Queried each LLM in a zero‑shot setting (no fine‑tuning) and recorded the generated answer.
    • Applied a simple string‑matching and fuzzy‑matching post‑processor to decide if the answer was correct.
  4. Analysis

    • Measured consistency: the proportion of surface forms that yielded the same correct answer for a given entity.
    • Conducted regression analyses to tease apart the impact of entity frequency (how often the entity appears in pre‑training data) vs. surface‑form frequency.
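The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual code: the prompt template, the containment-plus-fuzzy matching rule, and the 0.85 threshold are assumptions standing in for the paper's unspecified post-processor.

```python
from difflib import SequenceMatcher

def is_correct(answer: str, gold: str, threshold: float = 0.85) -> bool:
    """String containment first, then a fuzzy ratio as a fallback."""
    a, g = answer.strip().lower(), gold.strip().lower()
    if g in a:
        return True
    return SequenceMatcher(None, a, g).ratio() >= threshold

def consistency(answers_by_form: dict[str, str], gold: str) -> float:
    """Fraction of surface forms whose answer matches the gold object."""
    hits = [is_correct(ans, gold) for ans in answers_by_form.values()]
    return sum(hits) / len(hits)

# Toy example: answers an LLM might return for three surface forms
# of the same entity (the third shows a typical alias failure).
answers = {
    "United States": "Washington, D.C.",
    "USA": "The capital is Washington, D.C.",
    "America": "New York",
}
print(consistency(answers, "Washington, D.C."))  # 2 of 3 forms correct
```

The same per-entity consistency scores could then feed the regression step, with entity frequency and surface-form frequency as predictors.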

Results & Findings

| Surface-form category | Avg. accuracy (across models) | Consistency drop vs. canonical |
| --- | --- | --- |
| Canonical name | 78 % | — (baseline) |
| Alias | 65 % | –13 pts |
| Abbreviation | 60 % | –18 pts |
| Spelling variant | 71 % | –7 pts |
| Erroneous form | 48 % | –30 pts |

  • Lexical variation matters: Models handle minor orthographic tweaks (e.g., “United States” → “United‑States”) relatively well, but struggle with larger lexical shifts such as abbreviations (“USA”) or less‑common aliases (“America”).
  • Frequency effect: Entities that appear frequently in the pre‑training corpus are recalled more reliably, even when presented with rare surface forms. Conversely, a high‑frequency surface form can compensate for a low‑frequency entity to some extent.
  • No pure invariance: The same fact is not always retrieved consistently across surface forms, undermining the assumption that LLMs store facts in a fully abstract, name‑agnostic way.

Practical Implications

  • Robust QA systems: When building chatbots or search assistants, developers should anticipate that users will refer to the same entity in many ways. Adding synonym expansion or surface‑form normalization can dramatically improve answer reliability.
  • Prompt engineering: Simple tweaks—using the most common name or adding clarifying context—can boost factual accuracy without any model changes.
  • Model selection: The study provides a quick sanity‑check for choosing an LLM based on how tolerant it is to name variation, which is crucial for multilingual or domain‑specific applications where aliases abound.
  • Safety & compliance: In regulated domains (e.g., medical or financial advice), inconsistent recall of facts could lead to misinformation. Incorporating surface‑form diversity into evaluation pipelines helps surface hidden brittleness before deployment.
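The surface-form normalization idea above can be as simple as an alias lookup applied before the query reaches the model. The table below is a hand-made stand-in for illustration; a real system would populate it from the same kind of Wikipedia redirect data the paper uses.

```python
# Hypothetical alias table; in practice this would be built from
# Wikipedia redirect dumps rather than hard-coded.
ALIASES = {
    "usa": "United States",
    "united-states": "United States",
    "america": "United States",
    "u.s.": "United States",
}

def normalize_entity(mention: str) -> str:
    """Map a user-supplied surface form to its canonical name."""
    return ALIASES.get(mention.strip().lower(), mention)

def build_prompt(mention: str) -> str:
    """Rewrite the question with the canonical name before querying the LLM."""
    return f"What is the capital of {normalize_entity(mention)}?"

print(build_prompt("USA"))  # asks about "United States", not "USA"
```

Unknown mentions fall through unchanged, so normalization never makes a query worse than the raw input.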

Limitations & Future Work

  • Zero‑shot focus: The experiments only considered models without any task‑specific fine‑tuning. Future work should explore whether instruction‑tuned or retrieval‑augmented models exhibit higher surface‑form invariance.
  • English‑centric data: RedirectQA relies on English Wikipedia redirects, so the findings may not directly transfer to other languages with different naming conventions.
  • Answer verification: The evaluation uses string matching, which can miss correct paraphrases. More sophisticated semantic matching could refine accuracy estimates.
  • Beyond entities: Extending the methodology to relational phrases (e.g., “the 44th president”) or to non‑entity facts would broaden the understanding of non‑verbatim memorization.

Bottom line: This paper shows that LLMs’ factual memory is a delicate dance between the entity itself and the way we name it. For developers building reliable AI products, accounting for surface‑form diversity isn’t just a nice‑to‑have—it’s a practical necessity.

Authors

  • Yuto Nishida
  • Naoki Shikoda
  • Yosuke Kishinami
  • Ryo Fujii
  • Makoto Morishita
  • Hidetaka Kamigaito
  • Taro Watanabe

Paper Information

  • arXiv ID: 2604.21882v1
  • Categories: cs.CL
  • Published: April 23, 2026
  • PDF: Download PDF