[Paper] NanoKnow: How to Know What Your Language Model Knows

Published: February 23, 2026 at 01:37 PM EST
4 min read
Source: arXiv - 2602.20122v1

Overview

The paper introduces NanoKnow, a new benchmark that lets researchers and engineers tease apart what large language models (LLMs) actually “know” from their training data versus what they can retrieve from external sources. By leveraging the fully open‑source nanochat family of small LLMs—whose pre‑training corpora are publicly available—the authors can label each question as either “seen” (the answer appears in the training set) or “unseen.” This makes it possible to study the interplay between parametric knowledge (stored in model weights) and retrieved evidence in a way that was previously impossible for closed‑source models.

Key Contributions

  • NanoKnow dataset: A split of Natural Questions and SQuAD questions into seen vs. unseen based on whether the answer occurs in nanochat’s pre‑training data.
  • Transparent evaluation framework: Enables clean separation of parametric knowledge and external evidence for any model that can be queried with or without retrieved context.
  • Empirical insights: Systematic experiments on eight nanochat checkpoints reveal how answer frequency, external evidence, and irrelevant context affect closed‑book and open‑book performance.
  • Open‑source release: All data, scripts, and evaluation code are publicly available on GitHub, encouraging reproducibility and community extensions.
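The seen/unseen split described above can be sketched in a few lines: label a question "seen" if any of its gold answer strings occurs verbatim in the pre-training corpus. This is an illustrative reimplementation under assumed input formats (`questions` as dicts with an `answers` list, `corpus_docs` as raw text), not the paper's actual release scripts.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for robust substring matching."""
    return " ".join(text.lower().split())

def partition_seen_unseen(questions, corpus_docs):
    """Split QA pairs into 'seen'/'unseen' by checking whether any gold
    answer string appears verbatim in the pre-training corpus.
    (Hypothetical input format; the paper's scripts may differ.)"""
    corpus = "\n".join(normalize(doc) for doc in corpus_docs)
    seen, unseen = [], []
    for q in questions:
        if any(normalize(a) in corpus for a in q["answers"]):
            seen.append(q)
        else:
            unseen.append(q)
    return seen, unseen

questions = [
    {"question": "What is the capital of France?", "answers": ["Paris"]},
    {"question": "What is the name of my cat?", "answers": ["Zanzibar"]},
]
docs = ["Paris is the capital of France."]
seen, unseen = partition_seen_unseen(questions, docs)
# "Paris" occurs in the corpus, so that question lands in the Seen split
```

Note that this mirrors the paper's binary labeling, including its known blind spot: a paraphrased answer that never appears verbatim still counts as "unseen."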

Methodology

  1. Data Partitioning – The authors scan nanochat’s pre‑training corpus to check if each answer string appears. Questions whose answers are found become the Seen split; the rest become Unseen.
  2. Model Checkpoints – Eight checkpoints of nanochat (varying in size and training steps) are evaluated.
  3. Evaluation Modes
    • Closed‑book: The model answers the question with no external context.
    • Open‑book: The model is given retrieved passages (relevant or deliberately noisy) as additional input.
  4. Metrics – Exact‑match and F1 scores are reported per split, and the impact of answer frequency, passage relevance, and passage position is analyzed.
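The exact-match and F1 metrics in step 4 follow the standard SQuAD-style definitions: normalize both strings (lowercase, strip punctuation and articles), then compare exactly (EM) or by token overlap (F1). A minimal sketch:

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_toks = normalize_answer(prediction).split()
    gold_toks = normalize_answer(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))            # 1
print(round(f1_score("Eiffel Tower in Paris", "eiffel tower"), 2))  # 0.67
```

Per-split averages of these two scores are what the Seen/Unseen comparisons in the results section report.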

The pipeline is deliberately simple: retrieve passages (or feed none), prepend them to the prompt, and let the model generate an answer. This design keeps the focus on the knowledge source rather than on sophisticated retrieval or prompting tricks.
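That pipeline reduces to a small prompt-assembly function: prepend retrieved passages (if any) to the question and let the model complete the answer. The template wording below is a hypothetical stand-in, not the paper's exact prompt format.

```python
def build_prompt(question, passages=None):
    """Assemble a closed-book (question only) or open-book
    (passages prepended) prompt. Template wording is illustrative."""
    parts = []
    if passages:
        for i, p in enumerate(passages, 1):
            parts.append(f"Passage {i}: {p}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)

closed = build_prompt("Who wrote Hamlet?")
open_book = build_prompt(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy written by William Shakespeare."],
)
```

Keeping the pipeline this plain means any performance difference between the two modes can be attributed to the knowledge source rather than to prompt engineering.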

Results & Findings

| Finding | What the numbers show |
| --- | --- |
| Answer frequency matters | Closed-book accuracy correlates strongly with how often the answer string appears in the pre-training data; frequently seen answers are recalled far more reliably. |
| External evidence helps | Providing relevant retrieved passages lifts performance on the Unseen split, narrowing the gap between Seen and Unseen questions. |
| Parametric and external knowledge are complementary | Even with perfect evidence, models still do better on Seen questions, indicating that stored knowledge and retrieved text each contribute uniquely. |
| Irrelevant context hurts | Adding non-relevant passages degrades accuracy; the damage grows with the number of distractors and is worse when irrelevant text appears earlier in the prompt. |

Overall, the experiments demonstrate that LLMs are not pure “knowledge bases”—they rely on a blend of memorized facts and on‑the‑fly retrieval, and both can be sabotaged by noisy inputs.
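The frequency analysis behind the first finding can be reproduced in miniature by bucketing questions by answer-string frequency and computing closed-book accuracy per bucket. The input format here (`(answer_frequency, is_correct)` pairs) and the bucket thresholds are assumptions for illustration, not the paper's exact setup.

```python
from collections import defaultdict

def accuracy_by_frequency(records, buckets=(0, 1, 10, 100)):
    """Group questions by how often their answer appears in the
    pre-training corpus and report accuracy per frequency bucket.
    `records` is a list of (answer_frequency, is_correct) pairs
    (hypothetical format)."""
    grouped = defaultdict(list)
    for freq, correct in records:
        # assign each question the largest bucket threshold <= its frequency
        label = max(b for b in buckets if freq >= b)
        grouped[label].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(grouped.items())}

records = [(0, False), (0, False), (5, True), (5, False), (200, True), (150, True)]
print(accuracy_by_frequency(records))  # {0: 0.0, 1: 0.5, 100: 1.0}
```

A monotone increase across buckets, as in this toy data, is the frequency-accuracy correlation the paper reports.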

Practical Implications

  • Better debugging tools – Developers can use NanoKnow‑style splits to pinpoint whether a model’s mistake stems from missing training data or from poor retrieval, guiding targeted data augmentation.
  • Informed prompting – Knowing that early irrelevant context hurts performance suggests placing the most relevant evidence near the end of the prompt or using explicit separators.
  • Hybrid systems design – The complementary nature of parametric and external knowledge encourages architectures that combine a compact, high‑capacity LLM with a lightweight retrieval component, rather than relying on one alone.
  • Data‑centric development – For domain‑specific applications (e.g., medical or legal assistants), measuring answer frequency in the pre‑training set can help estimate how much additional fine‑tuning or curated data is needed.
  • Evaluation standards – NanoKnow provides a reproducible benchmark for “knowledge‑aware” LLMs, which could become a standard test for future open‑source models.
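The "informed prompting" point above suggests a concrete heuristic: sort retrieved passages so the highest-scoring evidence sits last, closest to the question. This is a sketch of that heuristic under an assumed `(passage, relevance_score)` input format, not a method prescribed by the paper.

```python
def order_passages_for_prompt(passages_with_scores):
    """Order retrieved passages in ascending relevance so the most
    relevant evidence lands at the end of the prompt, nearest the
    question -- motivated by the finding that irrelevant text early
    in the prompt does the most damage. (Heuristic sketch.)"""
    ranked = sorted(passages_with_scores, key=lambda x: x[1])
    return [p for p, _ in ranked]

passages = [("noise A", 0.1), ("gold passage", 0.9), ("noise B", 0.3)]
print(order_passages_for_prompt(passages))
# ['noise A', 'noise B', 'gold passage']
```

Dropping low-scoring passages entirely, rather than just demoting them, is the other obvious lever the distractor results point toward.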

Limitations & Future Work

  • Scale mismatch – NanoKnow is built on small nanochat models; results may not transfer directly to billion‑parameter LLMs where memorization dynamics differ.
  • Binary seen/unseen labeling – The current split treats any occurrence of the answer string as “seen,” ignoring nuances like paraphrasing or contextual understanding.
  • Retrieval quality – Experiments use simple retrieval pipelines; more sophisticated retrievers could change the balance between parametric and external knowledge.
  • Future directions suggested by the authors include extending the benchmark to larger models, incorporating graded “knowledge difficulty” scores, and exploring training objectives that explicitly align parametric and retrieved knowledge.

Authors

  • Lingwei Gu
  • Nour Jedidi
  • Jimmy Lin

Paper Information

  • arXiv ID: 2602.20122v1
  • Categories: cs.CL, cs.AI, cs.IR, cs.LG
  • Published: February 23, 2026