[Paper] NanoKnow: How to Know What Your Language Model Knows

Published: February 23, 2026 at 01:37 PM EST
4 min read
Source: arXiv - 2602.20122v1

Overview

The paper introduces NanoKnow, a new benchmark that lets researchers and engineers tease apart what large language models (LLMs) actually “know” from their training data versus what they can retrieve from external sources. By leveraging the fully open‑source nanochat family of small LLMs—whose pre‑training corpora are publicly available—the authors can label each question as either “seen” (the answer appears in the training set) or “unseen.” This makes it possible to study the interplay between parametric knowledge (stored in model weights) and retrieved evidence in a way that was previously impossible for closed‑source models.

Key Contributions

  • NanoKnow dataset: A split of Natural Questions and SQuAD questions into seen vs. unseen based on whether the answer occurs in nanochat’s pre‑training data.
  • Transparent evaluation framework: Enables clean separation of parametric knowledge and external evidence for any model that can be queried with or without retrieved context.
  • Empirical insights: Systematic experiments on eight nanochat checkpoints reveal how answer frequency, external evidence, and irrelevant context affect closed‑book and open‑book performance.
  • Open‑source release: All data, scripts, and evaluation code are publicly available on GitHub, encouraging reproducibility and community extensions.
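The seen/unseen split described above can be sketched in a few lines: label a question "seen" if any of its gold answer strings occurs verbatim in the pre-training corpus. This is an illustrative reimplementation under assumed input formats (`questions` as dicts with an `answers` list, `corpus_docs` as raw text), not the paper's actual release scripts.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for robust substring matching."""
    return " ".join(text.lower().split())

def partition_seen_unseen(questions, corpus_docs):
    """Split QA pairs into 'seen'/'unseen' by checking whether any gold
    answer string appears verbatim in the pre-training corpus.
    (Hypothetical input format; the paper's scripts may differ.)"""
    corpus = "\n".join(normalize(doc) for doc in corpus_docs)
    seen, unseen = [], []
    for q in questions:
        if any(normalize(a) in corpus for a in q["answers"]):
            seen.append(q)
        else:
            unseen.append(q)
    return seen, unseen

questions = [
    {"question": "What is the capital of France?", "answers": ["Paris"]},
    {"question": "What is the name of my cat?", "answers": ["Zanzibar"]},
]
docs = ["Paris is the capital of France."]
seen, unseen = partition_seen_unseen(questions, docs)
# "Paris" occurs in the corpus, so that question lands in the Seen split
```

Note that this mirrors the paper's binary labeling, including its known blind spot: a paraphrased answer that never appears verbatim still counts as "unseen."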

Methodology

  1. Data Partitioning – The authors scan nanochat’s pre‑training corpus to check if each answer string appears. Questions whose answers are found become the Seen split; the rest become Unseen.
  2. Model Checkpoints – Eight checkpoints of nanochat (varying in size and training steps) are evaluated.
  3. Evaluation Modes
    • Closed‑book: The model answers the question with no external context.
    • Open‑book: The model is given retrieved passages (relevant or deliberately noisy) as additional input.
  4. Metrics – Exact‑match and F1 scores are reported per split, and the impact of answer frequency, passage relevance, and passage position is analyzed.
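The exact-match and F1 metrics in step 4 follow the standard SQuAD-style definitions: normalize both strings (lowercase, strip punctuation and articles), then compare exactly (EM) or by token overlap (F1). A minimal sketch:

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_toks = normalize_answer(prediction).split()
    gold_toks = normalize_answer(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))            # 1
print(round(f1_score("Eiffel Tower in Paris", "eiffel tower"), 2))  # 0.67
```

Per-split averages of these two scores are what the Seen/Unseen comparisons in the results section report.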

The pipeline is deliberately simple: retrieve passages (or feed none), prepend them to the prompt, and let the model generate an answer. This design keeps the focus on the knowledge source rather than on sophisticated retrieval or prompting tricks.
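That pipeline reduces to a small prompt-assembly function: prepend retrieved passages (if any) to the question and let the model complete the answer. The template wording below is a hypothetical stand-in, not the paper's exact prompt format.

```python
def build_prompt(question, passages=None):
    """Assemble a closed-book (question only) or open-book
    (passages prepended) prompt. Template wording is illustrative."""
    parts = []
    if passages:
        for i, p in enumerate(passages, 1):
            parts.append(f"Passage {i}: {p}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)

closed = build_prompt("Who wrote Hamlet?")
open_book = build_prompt(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy written by William Shakespeare."],
)
```

Keeping the pipeline this plain means any performance difference between the two modes can be attributed to the knowledge source rather than to prompt engineering.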

Results & Findings

| Finding | What the numbers show |
| --- | --- |
| Answer frequency matters | Closed-book accuracy correlates strongly with how often the answer string appears in the pre-training data; frequently seen answers are recalled far more reliably. |
| External evidence helps | Providing relevant retrieved passages lifts performance on the Unseen split, narrowing the gap between Seen and Unseen questions. |
| Parametric and external knowledge are complementary | Even with perfect evidence, models still do better on Seen questions, indicating that stored knowledge and retrieved text each contribute uniquely. |
| Irrelevant context hurts | Adding non-relevant passages degrades accuracy; the damage grows with the number of distractors and is worse when irrelevant text appears earlier in the prompt. |

Overall, the experiments demonstrate that LLMs are not pure “knowledge bases”—they rely on a blend of memorized facts and on‑the‑fly retrieval, and both can be sabotaged by noisy inputs.
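The frequency analysis behind the first finding can be reproduced in miniature by bucketing questions by answer-string frequency and computing closed-book accuracy per bucket. The input format here (`(answer_frequency, is_correct)` pairs) and the bucket thresholds are assumptions for illustration, not the paper's exact setup.

```python
from collections import defaultdict

def accuracy_by_frequency(records, buckets=(0, 1, 10, 100)):
    """Group questions by how often their answer appears in the
    pre-training corpus and report accuracy per frequency bucket.
    `records` is a list of (answer_frequency, is_correct) pairs
    (hypothetical format)."""
    grouped = defaultdict(list)
    for freq, correct in records:
        # assign each question the largest bucket threshold <= its frequency
        label = max(b for b in buckets if freq >= b)
        grouped[label].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(grouped.items())}

records = [(0, False), (0, False), (5, True), (5, False), (200, True), (150, True)]
print(accuracy_by_frequency(records))  # {0: 0.0, 1: 0.5, 100: 1.0}
```

A monotone increase across buckets, as in this toy data, is the frequency-accuracy correlation the paper reports.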

Practical Implications

  • Better debugging tools – Developers can use NanoKnow‑style splits to pinpoint whether a model’s mistake stems from missing training data or from poor retrieval, guiding targeted data augmentation.
  • Informed prompting – Knowing that early irrelevant context hurts performance suggests placing the most relevant evidence near the end of the prompt or using explicit separators.
  • Hybrid systems design – The complementary nature of parametric and external knowledge encourages architectures that combine a compact, high‑capacity LLM with a lightweight retrieval component, rather than relying on one alone.
  • Data‑centric development – For domain‑specific applications (e.g., medical or legal assistants), measuring answer frequency in the pre‑training set can help estimate how much additional fine‑tuning or curated data is needed.
  • Evaluation standards – NanoKnow provides a reproducible benchmark for “knowledge‑aware” LLMs, which could become a standard test for future open‑source models.
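The "informed prompting" point above suggests a concrete heuristic: sort retrieved passages so the highest-scoring evidence sits last, closest to the question. This is a sketch of that heuristic under an assumed `(passage, relevance_score)` input format, not a method prescribed by the paper.

```python
def order_passages_for_prompt(passages_with_scores):
    """Order retrieved passages in ascending relevance so the most
    relevant evidence lands at the end of the prompt, nearest the
    question -- motivated by the finding that irrelevant text early
    in the prompt does the most damage. (Heuristic sketch.)"""
    ranked = sorted(passages_with_scores, key=lambda x: x[1])
    return [p for p, _ in ranked]

passages = [("noise A", 0.1), ("gold passage", 0.9), ("noise B", 0.3)]
print(order_passages_for_prompt(passages))
# ['noise A', 'noise B', 'gold passage']
```

Dropping low-scoring passages entirely, rather than just demoting them, is the other obvious lever the distractor results point toward.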

Limitations & Future Work

  • Scale mismatch – NanoKnow is built on small nanochat models; results may not transfer directly to billion‑parameter LLMs where memorization dynamics differ.
  • Binary seen/unseen labeling – The current split treats any occurrence of the answer string as “seen,” ignoring nuances like paraphrasing or contextual understanding.
  • Retrieval quality – Experiments use simple retrieval pipelines; more sophisticated retrievers could change the balance between parametric and external knowledge.
  • Future directions suggested by the authors include extending the benchmark to larger models, incorporating graded “knowledge difficulty” scores, and exploring training objectives that explicitly align parametric and retrieved knowledge.

Authors

  • Lingwei Gu
  • Nour Jedidi
  • Jimmy Lin

Paper Information

  • arXiv ID: 2602.20122v1
  • Categories: cs.CL, cs.AI, cs.IR, cs.LG
  • Published: February 23, 2026