[Paper] Improving Parametric Knowledge Access in Reasoning Language Models
Source: arXiv - 2602.22193v1
Overview
The paper investigates how large language models (LLMs) retrieve factual knowledge that is stored in their parameters. While recent “reasoning” models excel at step‑by‑step problem solving (e.g., math), they often skip the kind of internal reasoning that could improve pure fact recall (e.g., “Canberra is the capital of Australia”). The authors show that a tiny prompting tweak can already boost knowledge recall, and then they fine‑tune models with reinforcement learning (RL) to explicitly reason over their own parametric knowledge, achieving sizable gains across several QA benchmarks.
Key Contributions
- Empirical finding: Standard reasoning‑trained LLMs do not automatically generate the most effective knowledge‑retrieval reasoning traces. Adding a “think step‑by‑step” cue improves factual recall without hurting math performance.
- RL‑based training recipe: Introduce a lightweight reinforcement‑learning fine‑tuning stage that rewards the model for reaching correct final answers on world‑knowledge QA (TriviaQA).
- Cross‑task transfer: The RL‑trained model shows consistent improvements on four additional datasets (Natural Questions +4.2%, HotpotQA +2.1%, SimpleQA +0.6%, StrategyQA +3.0%).
- Analysis of under‑optimization: Demonstrate that existing reasoning models are under‑optimized for parametric knowledge access, and that a modest amount of task‑specific RL can close that gap.
Methodology
- Baseline models: Use publicly available reasoning LLMs that were previously fine‑tuned with RL on math‑oriented tasks.
- Prompt engineering test: Compare two prompts—plain question vs. “Think step‑by‑step, then answer”—to quantify the effect of an explicit reasoning cue on factual QA.
- Reinforcement learning fine‑tuning:
  - Reward signal: binary correctness of the final answer on TriviaQA (a large, verifiable knowledge benchmark).
  - Policy: the language model's token‑generation distribution, conditioned on the "think step‑by‑step" prompt.
  - Optimization: Proximal Policy Optimization (PPO), run briefly (on roughly 1% of the original training data volume).
- Evaluation: Measure exact‑match accuracy on TriviaQA (in‑distribution) and four out‑of‑distribution QA datasets to assess transfer.
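The binary correctness reward and the exact‑match evaluation above can be sketched as a simple normalized string comparison. This is a minimal illustration of the general exact‑match convention for short‑answer QA; the specific normalization steps are assumptions, not details from the paper:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace
    (a common convention for exact-match QA scoring)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def reward(prediction: str, gold_answers: list[str]) -> float:
    """Binary reward: 1.0 if the prediction exactly matches any
    accepted gold alias after normalization, else 0.0."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0
```

For example, `reward("Canberra.", ["Canberra", "Canberra, Australia"])` returns 1.0, while an incorrect answer earns 0.0; averaging this reward over a dataset gives the exact‑match accuracy used for evaluation.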
The approach is deliberately simple: a single RL pass on a factual QA task, keeping the rest of the model unchanged.
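As a reference point for the optimization step, PPO's clipped surrogate objective (the algorithm named above) can be written in a few lines of NumPy. This is the generic textbook formula, not code from the paper; with a binary answer‑correctness reward, the advantages reduce to reward minus a baseline:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions
    under the current and behavior policies.
    advantages: per-sample advantage estimates.
    """
    ratio = np.exp(logp_new - logp_old)          # probability ratio r_t
    clipped = np.clip(ratio, 1 - eps, 1 + eps)   # clip r_t to [1-eps, 1+eps]
    # Take the pessimistic (lower) bound per sample, then average.
    return np.minimum(ratio * advantages, clipped * advantages).mean()
```

The clipping is what keeps a short fine‑tuning pass stable: once the new policy's probability ratio drifts outside `[1-eps, 1+eps]`, the gradient incentive to push it further is removed.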
Results & Findings
| Dataset | Baseline (no RL) | + “think step‑by‑step” prompt | After RL on TriviaQA |
|---|---|---|---|
| TriviaQA | 68.4% | 71.2% (+2.8) | 78.3% (+9.9) |
| Natural Questions | 45.1% | 45.3% | 49.3% (+4.2) |
| HotpotQA | 62.0% | 62.1% | 64.1% (+2.1) |
| SimpleQA | 78.5% | 78.6% | 79.2% (+0.6) |
| StrategyQA | 55.0% | 55.1% | 58.0% (+3.0) |
Key takeaways
- The “think step‑by‑step” cue alone yields a meaningful bump on in‑distribution factual recall (+2.8 points on TriviaQA), suggesting the model already possesses the relevant reasoning ability but needs the right trigger.
- RL fine‑tuning on a single knowledge‑heavy task (TriviaQA) transfers to other QA domains, indicating that the model learns a more generalizable reasoning policy for accessing its stored facts.
- Improvements are achieved with a relatively small computational budget, making the method practical for existing LLM deployments.
Practical Implications
- Better knowledge‑driven assistants: Deployments that rely on LLMs for factual answers (e.g., customer support bots, documentation search) can adopt the simple “think step‑by‑step” prompt to raise accuracy without any model changes.
- Low‑cost fine‑tuning: Companies can run a short RL fine‑tuning job on a proprietary knowledge base (or an open benchmark) to teach their models to reason over internal facts, improving reliability without needing external retrieval systems.
- Hybrid retrieval‑augmented pipelines: Even when using external search, a model that can internally reason about its parametric knowledge may reduce the number of required retrieval calls, lowering latency and API costs.
- Safety & hallucination reduction: By encouraging explicit reasoning before answering, the model is less likely to emit unsupported statements, a step toward more trustworthy AI assistants.
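The deployment‑side prompt tweak requires no model changes at all. A minimal sketch of wrapping a question and recovering the short answer from the reasoning trace; the exact instruction wording and the `Final answer:` marker convention are illustrative assumptions, not taken from the paper:

```python
def build_prompt(question: str) -> str:
    """Wrap a raw user question with an explicit reasoning cue.

    The paper reports only that a 'think step-by-step' style cue
    helps; this particular wording is an assumption.
    """
    return (
        "Think step-by-step, then answer.\n"
        f"Question: {question}\n"
        "End your response with 'Final answer: <answer>'."
    )

def extract_final_answer(response: str) -> str:
    """Pull the short answer out of a reasoning trace, using the
    'Final answer:' marker requested in the prompt above."""
    marker = "Final answer:"
    idx = response.rfind(marker)
    return response[idx + len(marker):].strip() if idx != -1 else response.strip()
```

In a pipeline, `build_prompt` replaces the plain question sent to the model, and `extract_final_answer` strips the reasoning chain before the answer is shown to the user or scored.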
Limitations & Future Work
- Reward simplicity: The binary correctness reward does not penalize overly verbose or irrelevant reasoning chains; more nuanced rewards (e.g., chain‑of‑thought fidelity) could further improve quality.
- Domain coverage: The RL fine‑tuning was performed on English trivia; extending to multilingual or highly specialized domains (medical, legal) may require domain‑specific reward design.
- Scalability: While the method works with modest compute, scaling to the largest LLMs (hundreds of billions of parameters) could introduce stability challenges in PPO.
- Long‑form reasoning: The study focuses on short QA; future work could explore whether the same training regime benefits open‑ended generation tasks such as summarization or code explanation.
Bottom line: A tiny prompt tweak plus a short RL fine‑tuning pass can unlock a language model’s latent ability to reason over its own stored facts, delivering measurable gains across a suite of knowledge‑intensive tasks. For developers building AI‑powered products, this translates into higher answer accuracy, fewer hallucinations, and a cost‑effective path to more trustworthy systems.
Authors
- Melody Ma
- John Hewitt
Paper Information
- arXiv ID: 2602.22193v1
- Categories: cs.CL
- Published: February 25, 2026