[Paper] LLMs Know More About Numbers than They Can Say

Published: February 7, 2026 at 11:15 PM EST
5 min read
Source: arXiv - 2602.07812v1

Overview

Recent research shows that large language models (LLMs) often know the magnitude of numbers internally, even when they stumble on simple comparison questions like “Which is larger, 5.7 × 10² or 580?”. By probing the hidden states of several open‑source LLMs, the authors demonstrate that a single linear read‑out can recover a number’s log‑magnitude with surprisingly low error. The gap between this hidden knowledge and the model’s verbalized answer points to a new frontier for improving numerical reasoning in LLMs.

Key Contributions

  • Hidden‑state probe: A linear projection of a single hidden layer reliably encodes the logarithm of a numeral’s magnitude (≈ 2.3 % relative error on synthetic data, ≈ 19 % on scientific text).
  • Ranking signal: After processing a pair of numbers, the model’s hidden state contains enough information for a linear classifier to predict which is larger with > 90 % accuracy.
  • Performance paradox: When asked to verbalize the comparison, the same models only achieve 50–70 % accuracy, revealing a disconnect between internal representation and output generation.
  • Fine‑tuning with auxiliary loss: Adding the classifier’s log‑loss as an auxiliary objective during fine‑tuning improves verbalized ranking accuracy by 3.22 % over the base model.
  • Open‑source focus: Experiments are conducted on several smaller, publicly available LLMs, making the findings reproducible for the community.
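To make “mixed notation” concrete: the overview’s motivating pair (5.7 × 10² vs. 580) only becomes comparable once both numerals are normalized to a common value. The toy parser below illustrates the idea — the function name, the regex, and the notation coverage are my own illustration, not the paper’s actual preprocessing:

```python
import re
from fractions import Fraction

# Map Unicode superscript digits (as in "10²") to ASCII digits.
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹⁻", "0123456789-")

def parse_numeral(s: str) -> float:
    """Toy parser for decimal, scientific, and fraction notation.

    Illustrative only; the paper's notation coverage may differ.
    """
    s = s.replace(" ", "").replace("×", "x")
    # Scientific notation, e.g. "5.7×10²" or "5.7x10^2".
    m = re.fullmatch(r"([-+]?\d*\.?\d+)[x*]10([⁰¹²³⁴⁵⁶⁷⁸⁹⁻]+|\^?[-+]?\d+)", s)
    if m:
        exponent = int(m.group(2).lstrip("^").translate(SUPERSCRIPTS))
        return float(m.group(1)) * 10.0 ** exponent
    if "/" in s:  # simple fractions, e.g. "3/4"
        return float(Fraction(s))
    return float(s)

print(parse_numeral("5.7 × 10²"))                         # 570.0 (up to float rounding)
print(parse_numeral("580") > parse_numeral("5.7 × 10²"))  # True
```

A model that answers the comparison wrong is failing at exactly this normalization step in its output — even though, per the paper, its hidden states already encode the normalized magnitudes.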

Methodology

  1. Dataset construction

    • Synthetic: Randomly generated numeral pairs in mixed notation (decimal, scientific, fraction, etc.).
    • Real‑world: Numeral pairs extracted from scientific papers, preserving the natural distribution of notation.
  2. Probing hidden states

    • For each model, the authors identified a middle transformer layer whose activations were most linearly correlated with the log‑magnitude of the current token.
    • A simple linear regression (single weight vector + bias) was trained to map these activations to the true log‑value.
  3. Ranking classifier

    • After the model reads both numbers, the final hidden state is fed to a linear binary classifier that predicts “first > second”.
    • Accuracy is measured on held‑out numeral pairs.
  4. Evaluation of verbalized answers

    • The same models are prompted with natural‑language comparison questions and their textual answers are parsed to extract the chosen larger number.
  5. Auxiliary‑loss fine‑tuning

    • During further training, the classifier’s cross‑entropy loss is added to the standard language‑model loss, encouraging the model to align its internal magnitude signal with its output generation.

All experiments are run on open‑source models (e.g., LLaMA‑derived variants) to keep the work transparent and extensible.

Results & Findings

| Metric | Synthetic text | Scientific papers |
| --- | --- | --- |
| Log‑magnitude reconstruction error | 2.3 % (relative) | 19.06 % |
| Ranking classifier accuracy | > 90 % | > 90 % |
| Verbalized comparison accuracy (base model) | 50–70 % | 50–70 % |
| Verbalized accuracy after auxiliary‑loss fine‑tuning | +3.22 % over base | +3.22 % over base |

Key takeaways

  • The hidden layers already contain a robust numeric magnitude representation, even for mixed notation.
  • The models can compare numbers internally with high reliability, but this knowledge rarely surfaces in the generated text.
  • Encouraging the model to expose its internal ranking signal during training yields measurable gains in the final answer quality.

Practical Implications

  • Better numerical reasoning APIs: Developers can attach lightweight linear probes to existing LLMs to obtain accurate magnitude estimates without full fine‑tuning, enabling fast “numeric sense” checks in downstream applications (e.g., data validation, spreadsheet assistants).
  • Improved prompting strategies: Knowing that models retain magnitude information internally suggests that prompting techniques that force the model to “think step‑by‑step” (e.g., chain‑of‑thought) may help surface the hidden knowledge.
  • Fine‑tuning recipes: Adding a simple auxiliary loss that rewards correct internal ranking can be incorporated into any fine‑tuning pipeline, offering a low‑cost way to boost numeric reliability for domain‑specific LLMs (finance, scientific literature, engineering).
  • Debugging toolkits: Probes can serve as diagnostics for model interpretability—if a model’s hidden states fail to encode magnitudes, that may explain poor arithmetic performance and guide model selection.
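The auxiliary‑loss recipe from the fine‑tuning bullets amounts to a weighted sum of the standard language‑model cross‑entropy and the ranking head’s cross‑entropy. The sketch below shows only that combination on toy logits — the `lam` weight, the two‑way ranking head, and all shapes are illustrative assumptions, not details from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(1)
# Toy stand-ins: next-token logits over a small vocab, plus a linear
# ranking head on the final hidden state with classes {second, first}.
lm_logits = rng.normal(size=(8, 100))
lm_targets = rng.integers(0, 100, size=8)
rank_logits = rng.normal(size=(8, 2))
rank_labels = rng.integers(0, 2, size=8)  # 1 if the first number is larger

lam = 0.5  # auxiliary-loss weight -- a tunable hyperparameter, not from the paper
total_loss = (cross_entropy(lm_logits, lm_targets)
              + lam * cross_entropy(rank_logits, rank_labels))
print(f"total loss: {total_loss:.3f}")
```

In a real pipeline, `total_loss` is what gets backpropagated, so gradients from the ranking term push the model to keep its internal magnitude signal consistent with the text it generates.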

Overall, the work points to a practical recipe: probe → align → expose, turning latent numeric competence into trustworthy, user‑visible behavior.

Limitations & Future Work

  • Scope of numerals: The study focuses on magnitude comparison; other numeric operations (addition, subtraction, unit conversion) remain untested.
  • Model size: Experiments are limited to smaller open‑source models; it is unclear how the findings scale to the largest commercial LLMs.
  • Domain bias: The higher reconstruction error on scientific papers suggests that noisy, context‑rich text can degrade the probe’s fidelity.
  • Auxiliary loss impact: While the auxiliary loss improves verbalized accuracy, the gain is modest (≈ 3 %). Future work could explore richer multi‑task objectives or curriculum learning to more tightly couple internal representations with output generation.

By extending probing to a broader set of numeric tasks and larger models, the community can further close the gap between what LLMs know and what they say.

Authors

  • Fengting Yuchi
  • Li Du
  • Jason Eisner

Paper Information

  • arXiv ID: 2602.07812v1
  • Categories: cs.CL
  • Published: February 8, 2026