[Paper] LLMs Know More About Numbers than They Can Say

Published: February 7, 2026 at 11:15 PM EST
5 min read
Source: arXiv - 2602.07812v1

Overview

Recent research shows that large language models (LLMs) often know the magnitude of numbers internally, even when they stumble on simple comparison questions like “Which is larger, 5.7 × 10² or 580?”. By probing the hidden states of several open‑source LLMs, the authors demonstrate that a single linear read‑out can recover a number’s log‑magnitude with surprisingly low error. The gap between this hidden knowledge and the model’s verbalized answer points to a new frontier for improving numerical reasoning in LLMs.

Key Contributions

  • Hidden‑state probe: A linear projection of a single hidden layer reliably encodes the logarithm of a numeral’s magnitude (≈ 2.3 % relative error on synthetic data, ≈ 19 % on scientific text).
  • Ranking signal: After processing a pair of numbers, the model’s hidden state contains enough information for a linear classifier to predict which is larger with > 90 % accuracy.
  • Performance paradox: When asked to verbalize the comparison, the same models only achieve 50–70 % accuracy, revealing a disconnect between internal representation and output generation.
  • Fine‑tuning with auxiliary loss: Adding the classifier’s log‑loss as an auxiliary objective during fine‑tuning improves verbalized ranking accuracy by 3.22 % over the base model.
  • Open‑source focus: Experiments are conducted on several smaller, publicly available LLMs, making the findings reproducible for the community.
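To make “mixed notation” concrete: the overview’s motivating pair (5.7 × 10² vs. 580) only becomes comparable once both numerals are normalized to a common value. The toy parser below illustrates the idea — the function name, the regex, and the notation coverage are my own illustration, not the paper’s actual preprocessing:

```python
import re
from fractions import Fraction

# Map Unicode superscript digits (as in "10²") to ASCII digits.
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹⁻", "0123456789-")

def parse_numeral(s: str) -> float:
    """Toy parser for decimal, scientific, and fraction notation.

    Illustrative only; the paper's notation coverage may differ.
    """
    s = s.replace(" ", "").replace("×", "x")
    # Scientific notation, e.g. "5.7×10²" or "5.7x10^2".
    m = re.fullmatch(r"([-+]?\d*\.?\d+)[x*]10([⁰¹²³⁴⁵⁶⁷⁸⁹⁻]+|\^?[-+]?\d+)", s)
    if m:
        exponent = int(m.group(2).lstrip("^").translate(SUPERSCRIPTS))
        return float(m.group(1)) * 10.0 ** exponent
    if "/" in s:  # simple fractions, e.g. "3/4"
        return float(Fraction(s))
    return float(s)

print(parse_numeral("5.7 × 10²"))                         # 570.0 (up to float rounding)
print(parse_numeral("580") > parse_numeral("5.7 × 10²"))  # True
```

A model that answers the comparison wrong is failing at exactly this normalization step in its output — even though, per the paper, its hidden states already encode the normalized magnitudes.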

Methodology

  1. Dataset construction

    • Synthetic: Randomly generated numeral pairs in mixed notation (decimal, scientific, fraction, etc.).
    • Real‑world: Numeral pairs extracted from scientific papers, preserving the natural distribution of notation.
  2. Probing hidden states

    • For each model, the authors identified a middle transformer layer whose activations were most linearly correlated with the log‑magnitude of the current token.
    • A simple linear regression (single weight vector + bias) was trained to map these activations to the true log‑value.
  3. Ranking classifier

    • After the model reads both numbers, the final hidden state is fed to a linear binary classifier that predicts “first > second”.
    • Accuracy is measured on held‑out numeral pairs.
  4. Evaluation of verbalized answers

    • The same models are prompted with natural‑language comparison questions and their textual answers are parsed to extract the chosen larger number.
  5. Auxiliary‑loss fine‑tuning

    • During further training, the classifier’s cross‑entropy loss is added to the standard language‑model loss, encouraging the model to align its internal magnitude signal with its output generation.

All experiments are run on open‑source models (e.g., LLaMA‑derived variants) to keep the work transparent and extensible.

Results & Findings

| Metric | Synthetic text | Scientific papers |
| --- | --- | --- |
| Log‑magnitude reconstruction error | 2.3 % (relative) | 19.06 % |
| Ranking classifier accuracy | > 90 % | > 90 % |
| Verbalized comparison accuracy (base model) | 50–70 % | 50–70 % |
| Verbalized accuracy after auxiliary‑loss fine‑tuning | +3.22 % over base | +3.22 % over base |

Key takeaways

  • The hidden layers already contain a robust numeric magnitude representation, even for mixed notation.
  • The models can compare numbers internally with high reliability, but this knowledge rarely surfaces in the generated text.
  • Encouraging the model to expose its internal ranking signal during training yields measurable gains in the final answer quality.

Practical Implications

  • Better numerical reasoning APIs: Developers can attach lightweight linear probes to existing LLMs to obtain accurate magnitude estimates without full fine‑tuning, enabling fast “numeric sense” checks in downstream applications (e.g., data validation, spreadsheet assistants).
  • Improved prompting strategies: Knowing that models retain magnitude information internally suggests that prompting techniques that force the model to “think step‑by‑step” (e.g., chain‑of‑thought) may help surface the hidden knowledge.
  • Fine‑tuning recipes: Adding a simple auxiliary loss that rewards correct internal ranking can be incorporated into any fine‑tuning pipeline, offering a low‑cost way to boost numeric reliability for domain‑specific LLMs (finance, scientific literature, engineering).
  • Debugging toolkits: Probes can serve as diagnostics for model interpretability—if a model’s hidden states fail to encode magnitudes, that may explain poor arithmetic performance and guide model selection.
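The auxiliary‑loss recipe from the fine‑tuning bullets amounts to a weighted sum of the standard language‑model cross‑entropy and the ranking head’s cross‑entropy. The sketch below shows only that combination on toy logits — the `lam` weight, the two‑way ranking head, and all shapes are illustrative assumptions, not details from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(1)
# Toy stand-ins: next-token logits over a small vocab, plus a linear
# ranking head on the final hidden state with classes {second, first}.
lm_logits = rng.normal(size=(8, 100))
lm_targets = rng.integers(0, 100, size=8)
rank_logits = rng.normal(size=(8, 2))
rank_labels = rng.integers(0, 2, size=8)  # 1 if the first number is larger

lam = 0.5  # auxiliary-loss weight -- a tunable hyperparameter, not from the paper
total_loss = (cross_entropy(lm_logits, lm_targets)
              + lam * cross_entropy(rank_logits, rank_labels))
print(f"total loss: {total_loss:.3f}")
```

In a real pipeline, `total_loss` is what gets backpropagated, so gradients from the ranking term push the model to keep its internal magnitude signal consistent with the text it generates.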

Overall, the work points to a practical recipe: probe → align → expose, turning latent numeric competence into trustworthy, user‑visible behavior.

Limitations & Future Work

  • Scope of numerals: The study focuses on magnitude comparison; other numeric operations (addition, subtraction, unit conversion) remain untested.
  • Model size: Experiments are limited to smaller open‑source models; it is unclear how the findings scale to the largest commercial LLMs.
  • Domain bias: The higher reconstruction error on scientific papers suggests that noisy, context‑rich text can degrade the probe’s fidelity.
  • Auxiliary loss impact: While the auxiliary loss improves verbalized accuracy, the gain is modest (≈ 3 %). Future work could explore richer multi‑task objectives or curriculum learning to more tightly couple internal representations with output generation.

By extending probing to a broader set of numeric tasks and larger models, the community can further close the gap between what LLMs know and what they say.

Authors

  • Fengting Yuchi
  • Li Du
  • Jason Eisner

Paper Information

  • arXiv ID: 2602.07812v1
  • Categories: cs.CL
  • Published: February 8, 2026