[Paper] Arithmetic Pedagogy for Language Models
Source: arXiv - 2606.05106v1
Overview
The paper explores whether teaching language models arithmetic the way humans are taught, through step‑by‑step pedagogical methods, can give small models strong calculation abilities. By turning an Indonesian classroom technique (GASING) into a training recipe, the authors show that an 86 M‑parameter GPT‑2 can learn basic arithmetic without any reinforcement learning, achieving >80 % accuracy on unseen problems and rivaling much larger models.
Key Contributions
- Pedagogical‑driven data generation: Convert the GASING left‑to‑right arithmetic procedure into natural‑language Chain‑of‑Thought (CoT) supervision, creating a curriculum that mirrors human teaching.
- Training a tiny model from scratch: Use a modest GPT‑2 (86 M) with an Indonesian syllabic‑agglutinative tokenizer, trained only with next‑token prediction. No RLHF, reward modeling, or external tool use.
- Learning‑phase analysis: Identify three distinct phases (token memorization, procedural pathway formation, and associative “mental‑arithmetic” emergence) through loss curves and probing.
- Mechanistic interpretability: Apply attention‑masking on the CoT information graph, residual‑stream probing, and logit‑lens inspection to pinpoint where and how the model stores intermediate results.
- Competitive performance: Demonstrate that the small model reaches >80 % accuracy on held‑out arithmetic tasks and competes with much larger LLMs that were fine‑tuned with more complex methods.
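To make the first two contributions concrete, here is a minimal sketch of how a left‑to‑right addition trace could be generated and then turned into next‑token prediction pairs. It rests on several assumptions: the English step wording, the function names `left_to_right_addition_cot` and `next_token_pairs`, and the whitespace split standing in for the TOBA tokenizer are all illustrative and are not taken from the paper, whose actual GASING templates are written in Indonesian.

```python
def left_to_right_addition_cot(a: int, b: int) -> str:
    """Build a natural-language trace that adds a and b from the
    highest place value down to the units (illustrative wording only)."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")
    width = max(len(str(a)), len(str(b)))
    a_digits = str(a).zfill(width)
    b_digits = str(b).zfill(width)

    steps = [f"Problem: {a} + {b}."]
    running_total = 0
    for position, (da, db) in enumerate(zip(a_digits, b_digits)):
        place = 10 ** (width - 1 - position)  # 100, 10, 1, ...
        part_a, part_b = int(da) * place, int(db) * place
        partial = part_a + part_b
        new_total = running_total + partial
        steps.append(
            f"Place {place}: {part_a} + {part_b} = {partial}; "
            f"running total {running_total} + {partial} = {new_total}."
        )
        running_total = new_total
    steps.append(f"Answer: {running_total}.")
    return "\n".join(steps)


def next_token_pairs(tokens):
    """Shift the sequence by one position: the model reads tokens[i]
    and is trained to predict tokens[i + 1]."""
    return tokens[:-1], tokens[1:]


trace = left_to_right_addition_cot(23, 47)
print(trace)

# Assumption: a whitespace split stands in for the real tokenizer.
tokens = trace.split()
inputs, targets = next_token_pairs(tokens)
print(inputs[:3], "->", targets[:3])
```

Because the intermediate sums are part of the text, the ordinary next‑token objective supervises every step of the procedure and not only the final answer.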
Methodology
- Curriculum design (GASING → CoT):
  - Each arithmetic problem (e.g., “23 + 47”) is broken down into a left‑to‑right sequence of elementary operations (add units, carry, etc.).
  - The step‑by‑step reasoning is written in natural language, forming a Chain‑of‑Thought that the model will see as part of the training text.
- Tokenizer & model:
  - A custom TOBA tokenizer respects Indonesian syllable‑agglutination, reducing token fragmentation for numeric expressions.
  - A standard decoder‑only GPT‑2 architecture (12 layers, 86 M parameters) is initialized from random weights.
- Training regime:
  - Pure next‑token prediction on the generated CoT dataset (≈1 M arithmetic examples).
  - No reinforcement learning, no external calculators, and no curriculum‑level weighting beyond the natural ordering of the CoT steps.
- Analysis toolkit:
  - Attention‑masking interventions: Temporarily block attention links that correspond to earlier CoT steps to see if the model still produces correct answers.
  - Residual‑stream probing: Train lightweight probes on hidden states to predict intermediate sums, revealing where the model stores partial results.
  - Logit‑lens inspection: Visualize the model’s internal “thoughts” by projecting each layer’s hidden states through the output (unembedding) layer into token space.
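The attention‑masking intervention can be illustrated with a small sketch that builds a causal mask and then blocks the answer position from attending to one intermediate CoT step. The token layout, the positions, and the helper names `causal_mask` and `block_attention` are hypothetical; the paper's actual intervention operates on its CoT information graph inside the trained model, which this toy example does not reproduce.

```python
def causal_mask(n):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]


def block_attention(mask, query_positions, key_span):
    """Return a copy of mask in which the given query positions can no
    longer attend to key positions in the half-open span [start, end)."""
    start, end = key_span
    blocked = [row[:] for row in mask]  # leave the baseline mask untouched
    for q in query_positions:
        for k in range(start, end):
            blocked[q][k] = False
    return blocked


# Hypothetical layout: positions 0-2 hold the problem, positions 3-4 hold
# one intermediate CoT step, and position 5 is the final answer token.
baseline = causal_mask(6)
intervened = block_attention(baseline, query_positions=[5], key_span=(3, 5))

print(baseline[5])    # [True, True, True, True, True, True]
print(intervened[5])  # [True, True, True, False, False, True]
```

In a real transformer, entries marked `False` would typically be set to a large negative value before the attention softmax. Comparing answer accuracy under `baseline` and `intervened` then indicates how much the answer depends on reading that step directly.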
Results & Findings
| Metric | Value |
|---|---|
| Held‑out arithmetic accuracy | > 80 % (vs. ~70 % for a baseline GPT‑2 trained on raw text) |
| Comparison to larger models* | Comparable to 1.3 B‑parameter LLMs fine‑tuned with RLHF on similar tasks |
| Learning phases | 1️⃣ Token memorization (first 10 % of steps) → 2️⃣ Procedural pathway formation (next 30 %) → 3️⃣ Associative retrieval (“mental arithmetic”) (final 60 %) |
| Mechanistic insight | After phase 2, the model can answer correctly even when intermediate CoT steps are masked, indicating it has internalized a compact representation of the arithmetic algorithm. |
*The paper reports that the small model matches or exceeds the performance of larger models that rely on generic pre‑training plus CoT prompting, highlighting the efficiency of a targeted curriculum.
Practical Implications
- Low‑resource arithmetic assistants: Developers can embed a tiny, self‑contained arithmetic module in edge devices (mobile apps, IoT) without needing heavyweight APIs or external calculators.
- Curriculum‑style training: The approach suggests a repeatable recipe, namely translating domain‑specific procedural knowledge into CoT text and training (or fine‑tuning) a modest model on it. This could be applied to other step‑wise tasks (e.g., unit conversion, simple physics, code debugging).
- Interpretability‑by‑design: By aligning training data with human pedagogical steps, the resulting model’s reasoning traces are more transparent, easing debugging and compliance checks in regulated industries.
- Cost‑effective development: Training from scratch on a focused dataset requires far less compute than large‑scale RLHF pipelines, making it accessible to startups and research labs with limited GPU budgets.
Limitations & Future Work
- Scope of arithmetic: The study focuses on basic operations (addition, subtraction, multiplication) with relatively small integers; scaling to multi‑digit division, fractions, or higher‑level math remains untested.
- Language specificity: The curriculum and tokenizer are tailored to Indonesian; reproducing the gains in other languages may require custom tokenizers and culturally appropriate pedagogical scripts.
- Generalization to unseen problem formats: While the model handles held‑out numeric values, it may struggle with novel phrasing or mixed‑modal inputs (e.g., tables, spoken language).
- Future directions: Extending the pedagogical pipeline to other domains (algorithmic reasoning, data‑structure manipulation), exploring hybrid training with modest RLHF to push accuracy further, and investigating how the learned procedural pathways transfer when the model is later fine‑tuned for broader NLP tasks.
Authors
- Andhika Bernard Lumbantobing
- Hokky Situngkir
Paper Information
- arXiv ID: 2606.05106v1
- Categories: cs.CL, cs.AI, cs.CY
- Published: June 3, 2026