[Paper] The Representational Geometry of Number
Source: arXiv - 2602.06843v1
Overview
The paper The Representational Geometry of Number investigates how large language models (LLMs) internally organize numeric concepts across tasks. By treating numbers as points in a high‑dimensional space, the authors show that while task‑specific embeddings occupy distinct subspaces, the relationships between numbers (e.g., “larger than”, “even vs. odd”) remain remarkably consistent. This finding bridges two competing ideas in cognitive science, shared conceptual manifolds versus orthogonal task spaces, and offers a concrete, mechanistic explanation for how LLMs can both generalize and specialize.
Key Contributions
- Relational‑first perspective: Proposes that shared structure lives in the relations among concepts rather than in the concepts themselves.
- Empirical evidence with numbers: Demonstrates that magnitude and parity are encoded along stable linear directions across multiple downstream tasks (e.g., arithmetic, classification, reasoning).
- Subspace decomposition: Shows that each task’s number embeddings reside in a distinct, low‑dimensional subspace, yet these subspaces are linearly transformable into one another.
- Linear mapping analysis: Introduces a simple linear‑regression framework to quantify how well one task’s subspace can be mapped onto another, revealing high fidelity (R² > 0.9 in most cases).
- Mechanistic account: Provides a unified explanation for how LLMs balance shared relational knowledge with task‑specific flexibility, offering a new lens for interpretability research.
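The subspace-mapping contribution can be sketched on synthetic data. The snippet below builds two hypothetical "task" embeddings that share a common numeric structure but differ by a linear transform, fits a linear map between them with ordinary least squares, and scores it with R². All names, shapes, and noise levels are illustrative assumptions, not the paper's code:

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)

# Toy setup: 20 "numbers", each embedded in two hypothetical task spaces
# that differ by a random linear transform plus a little noise.
numbers = np.arange(20, dtype=float)
base = np.stack([numbers, numbers % 2, rng.normal(size=20)], axis=1)  # shared structure
task_a = base @ rng.normal(size=(3, 3))
task_b = base @ rng.normal(size=(3, 3)) + 0.01 * rng.normal(size=(20, 3))

# Fit a linear map W so that task_a @ W ~= task_b, then score it with R^2.
W, *_ = lstsq(task_a, task_b, rcond=None)
pred = task_a @ W
ss_res = np.sum((task_b - pred) ** 2)
ss_tot = np.sum((task_b - task_b.mean(axis=0)) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 of cross-task linear map: {r2:.3f}")
```

Because the two toy spaces really are linear images of one shared structure, the fitted map recovers nearly all of the variance, mirroring the R² > 0.9 the authors report.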
Methodology
- Model selection: The authors fine‑tuned several popular transformer‑based language models (e.g., GPT‑2, LLaMA) on a suite of number‑centric tasks: magnitude comparison, parity detection, arithmetic word problems, and numeric reasoning.
- Embedding extraction: For each task, they collected the hidden‑state vectors corresponding to numeric tokens (e.g., “seven”, “42”) from the final transformer layer.
- Subspace identification: Using Principal Component Analysis (PCA) on each task’s embeddings, they isolated the top few components that captured > 95 % of variance, defining a task‑specific subspace.
- Relational probing: Linear probes were trained to predict scalar magnitude and binary parity from the embeddings. The direction vectors (weights) of these probes served as “relational axes.”
- Cross‑task mapping: Pairwise linear regression models were fitted to map embeddings from one task’s subspace to another’s. Mapping quality was assessed via reconstruction error and cosine similarity of relational axes.
- Visualization: t‑SNE and 2‑D PCA plots illustrated that numbers form consistent geometric patterns (e.g., a monotonic line for magnitude) even when the absolute coordinates differ across tasks.
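The probing step can be illustrated with a small sketch: fit a closed-form ridge probe (a standard probing choice; the paper does not specify its regularizer) to two simulated tasks that share a magnitude direction, then compare the recovered relational axes by cosine similarity. The embedding generator and all constants here are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16
u = rng.normal(size=dim)
u /= np.linalg.norm(u)  # shared "magnitude" direction (a toy assumption)

numbers = np.arange(1, 51, dtype=float)
target = (numbers - numbers.mean()) / numbers.std()  # standardized magnitude

def embed_task(noise=0.05):
    """Toy task embedding: magnitude along u plus task-specific noise."""
    return np.outer(target, u) + noise * rng.normal(size=(len(numbers), dim))

def ridge_probe(X, y, lam=10.0):
    """Closed-form ridge linear probe; its weight vector is the relational axis."""
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return w / np.linalg.norm(w)

axis_a = ridge_probe(embed_task(), target)
axis_b = ridge_probe(embed_task(), target)
cos = abs(axis_a @ axis_b)
print(f"cosine similarity of magnitude axes across tasks: {cos:.3f}")
```

Even though each task adds its own noise, the probe directions land close to the shared axis, which is the kind of alignment the paper's cross-task comparison measures.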
Results & Findings
- Stable relational axes: The probe weights for magnitude and parity were nearly identical (cosine similarity > 0.98) across all tasks, indicating a shared relational geometry.
- Distinct subspaces: Each task’s embeddings occupied a unique subspace, with minimal overlap in the raw coordinate space (average subspace angle ≈ 45°).
- High‑fidelity linear transforms: Mapping one task’s subspace to another recovered > 90 % of variance, and the relational axes remained aligned after transformation.
- Task‑specific nuances: While magnitude and parity were universal, more complex relations (e.g., “multiple of three”) showed weaker but still linear‑transformable patterns, suggesting a hierarchy of relational stability.
- Model‑agnostic behavior: The phenomena held across model sizes (from 124 M to 7 B parameters) and across both encoder‑only and decoder‑only architectures.
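The "distinct subspaces" finding can be made concrete by computing principal angles between two column spaces. Below is a minimal numpy-only sketch, assuming two toy 3-dimensional subspaces that share exactly one direction (the shared relational axis) and are otherwise random:

```python
import numpy as np

rng = np.random.default_rng(2)

def principal_angles_deg(A, B):
    """Principal angles between the column spaces of A and B, in degrees."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, -1.0, 1.0)))

dim, k = 32, 3
# Two hypothetical task subspaces sharing one direction but otherwise distinct.
shared = rng.normal(size=(dim, 1))
sub_a = np.hstack([shared, rng.normal(size=(dim, k - 1))])
sub_b = np.hstack([shared, rng.normal(size=(dim, k - 1))])

angles = principal_angles_deg(sub_a, sub_b)  # ascending: shared axis first
print("principal angles (deg):", np.round(angles, 1))
```

The shared direction yields a near-zero angle while the task-specific directions stay nearly orthogonal, echoing the paper's picture of aligned relational axes inside otherwise distinct subspaces.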
Practical Implications
- Transfer learning shortcuts: Because relational structure is preserved, developers can fine‑tune a model on a cheap proxy task (e.g., parity detection) and then reuse the learned embeddings for more expensive numeric‑reasoning tasks by applying a learned linear map.
- Debugging & interpretability tools: Linear probes for magnitude/parity can serve as lightweight sanity checks when deploying LLMs in finance, scientific computing, or education platforms.
- Modular system design: Engineers can build “task adapters”—small linear layers that re‑orient the shared relational space to the needs of a downstream application, reducing the need for full model re‑training.
- Safety & bias mitigation: Since the relational geometry is stable, systematic errors (e.g., mis‑ranking large numbers) can be identified and corrected at the relational level rather than hunting through task‑specific weights.
- Cross‑modal extensions: The same geometric principles could be applied to non‑numeric concepts (e.g., dates, units, code tokens), opening avenues for more coherent multi‑task LLM deployments.
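A minimal sketch of the "learn once, remap linearly" adapter idea, assuming frozen proxy-task and target-task embeddings that share an underlying numeric structure. The adapter is just a least-squares linear map fitted on a few anchor numbers; every name and dimension here is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

d = 12
numbers = np.arange(1, 41, dtype=float)
# Shared relational structure: magnitude and parity, embedded in d dims.
shared = np.stack([numbers, numbers % 2], axis=1) @ rng.normal(size=(2, d))

proxy = shared @ rng.normal(size=(d, d))   # frozen proxy-task embeddings
target = shared @ rng.normal(size=(d, d))  # frozen target-task embeddings

# Fit the adapter on the first 20 anchor numbers, apply it to the rest.
train = slice(0, 20)
adapter, *_ = np.linalg.lstsq(proxy[train], target[train], rcond=None)

err = np.linalg.norm(proxy[20:] @ adapter - target[20:]) / np.linalg.norm(target[20:])
print(f"relative error on held-out numbers: {err:.2e}")
```

Because both embedding sets are linear images of the same low-rank structure, the adapter generalizes to numbers it never saw, which is the property that would make such task adapters cheap to train.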
Limitations & Future Work
- Scope limited to numbers: While numbers provide a clean testbed, it remains unclear how well the relational‑geometry hypothesis scales to abstract or highly contextual concepts.
- Linear assumption: The analysis hinges on linear mappings; non‑linear transformations might be required for more complex relational structures.
- Static probing: Probes were trained post‑hoc; integrating relational constraints directly into the training objective could yield stronger guarantees.
- Dataset bias: The tasks used relatively simple, synthetic prompts; real‑world numeric language (e.g., financial reports) may introduce noise that disrupts the clean geometry.
- Future directions: Extending the framework to multimodal models (vision‑language), exploring hierarchical relational axes, and designing training regimes that explicitly preserve relational geometry across tasks.
Authors
- Zhimin Hu
- Lanhao Niu
- Sashank Varma
Paper Information
- arXiv ID: 2602.06843v1
- Categories: cs.CL, cs.AI
- Published: February 6, 2026