[Paper] Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Source: arXiv - 2602.11149v1
Overview
The paper investigates a surprising twist in fine‑tuning large language models for chain‑of‑thought (CoT) reasoning: re‑using a small set of high‑quality examples many times can be more effective than feeding the model a massive, one‑pass dataset. By keeping the total number of gradient updates constant, the authors show that training for many epochs on a tiny dataset yields substantially higher reasoning accuracy on challenging benchmarks, without causing catastrophic forgetting.
Key Contributions
- Empirical discovery that, under a fixed update budget, repetition beats scaling: training for 128 epochs on 400 examples outperforms a single epoch on 51 200 examples by 12–26 percentage points on AIME’24/25 and GPQA.
- Token‑level accuracy as a reliable stopping signal: the point where training loss plateaus (full memorization of the data) aligns with the peak in downstream reasoning performance.
- Practical recipe for CoT supervised fine‑tuning (SFT): train on a curated small dataset, monitor token accuracy, and stop once it saturates—eliminating the need for costly data collection and large‑scale training.
- Conceptual framing of the “repetition advantage” as a new research problem, inviting the community to explain why full memorization can coincide with better generalization in LLMs.
Methodology
- Model & Baselines – Experiments use the 7‑billion‑parameter Olmo3 model. Two training regimes are compared:
  - Large‑scale: 1 epoch over 51 200 unique CoT examples (≈ 1 × 10⁹ tokens).
  - Repetition: 128 epochs over a tiny set of 400 curated CoT examples (≈ 8 × 10⁶ tokens per epoch).
  Both regimes perform the same total number of gradient updates and process roughly the same token budget (≈ 1 × 10⁹ tokens).
- Data – The small dataset is hand‑selected for diversity and correctness; the large dataset is automatically harvested from existing CoT corpora.
- Training Loop – Standard supervised fine‑tuning with cross‑entropy loss. The authors track token‑level accuracy (the percentage of target tokens predicted correctly) after each epoch.
- Evaluation – After fine‑tuning, the models are tested on two high‑difficulty reasoning benchmarks:
  - AIME’24/25 (advanced competition math problems).
  - GPQA (graduate‑level, “Google‑proof” multiple‑choice questions in physics, chemistry, and biology).
  Accuracy is measured as the proportion of correctly answered questions.
- Analysis – Correlate token‑level accuracy curves with downstream benchmark performance to identify the “saturation point”.
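As a concrete illustration of the metric being tracked, token‑level accuracy compares the model's greedy (argmax) predictions against the target tokens, skipping positions that should not be scored. The sketch below is a minimal pure‑Python version; the function name and the `IGNORE_INDEX = -100` masking convention are common SFT‑pipeline assumptions, not details from the paper.

```python
# Minimal sketch of the token-level accuracy metric tracked during SFT.
# Assumptions (not from the paper): greedy argmax predictions, and
# prompt/padding positions masked with the conventional IGNORE_INDEX = -100.

IGNORE_INDEX = -100  # "do not score" label, as used in many SFT pipelines

def token_accuracy(predictions, targets):
    """Fraction of non-ignored target tokens predicted exactly."""
    correct = 0
    total = 0
    for pred, tgt in zip(predictions, targets):
        if tgt == IGNORE_INDEX:
            continue  # skip masked prompt/padding tokens
        total += 1
        if pred == tgt:
            correct += 1
    return correct / total if total else 0.0

# Example: 3 of the 4 scored tokens match (the -100 position is skipped).
preds   = [12, 7, 99, 3, 5]
targets = [12, 7, -100, 3, 8]
print(token_accuracy(preds, targets))  # 0.75
```

In the paper's recipe, this number is computed once per epoch on the training set itself; the repetition regime drives it toward ≈ 99 %, i.e. near‑full memorization.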
Results & Findings
| Training regime | Epochs | #Samples | Token‑accuracy (final) | AIME accuracy ↑ | GPQA accuracy ↑ |
|---|---|---|---|---|---|
| Large‑scale | 1 | 51 200 | ~92 % | Baseline | Baseline |
| Repetition | 128 | 400 | ~99 % (full memorization) | +12–26 pp | +12–26 pp |
- Performance boost: The repeated‑small‑dataset model consistently outperforms the large‑scale baseline across both benchmarks, with gains ranging from 12 to 26 percentage points.
- No catastrophic forgetting: Despite heavy memorization of the tiny dataset, the model retains its general language abilities (no drop in perplexity on a held‑out language modeling set).
- Token‑accuracy as a proxy: The point where token‑accuracy plateaus (≈ 99 %) aligns with the peak in reasoning accuracy, providing a cheap early‑stopping criterion.
- Robustness: The phenomenon holds across different random seeds, optimizer settings, and even when the tiny dataset is shuffled or slightly perturbed.
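The comparison in the table above is only meaningful because the update budget is matched: holding batch composition fixed, the epoch count for the small set is simply the large set's size divided by the small set's. A trivial bookkeeping sketch with the paper's numbers (identical batch size in both runs is an assumption):

```python
# Matched-budget bookkeeping for the two regimes reported above.
large_examples = 51_200   # large-scale regime: one epoch
small_examples = 400      # repetition regime

# Epochs needed so both regimes perform the same number of gradient
# updates (assuming the same batch size in both runs).
epochs_small = large_examples // small_examples
print(epochs_small)  # 128

# Total examples seen is then identical in both regimes.
assert epochs_small * small_examples == large_examples
```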
Practical Implications
- Cost‑effective fine‑tuning: Companies can achieve state‑of‑the‑art reasoning performance without gathering millions of annotated CoT examples or renting massive GPU clusters. A curated set of a few hundred high‑quality examples, trained for many epochs, suffices.
- Faster iteration cycles: Since the dataset is tiny, data‑pipeline overhead (pre‑processing, deduplication, storage) shrinks dramatically, enabling rapid experimentation and A/B testing.
- Simplified data strategy: Teams can focus on quality over quantity—investing effort in curating diverse, well‑explained reasoning traces rather than scaling raw data.
- Monitoring & stopping: Implement a simple token‑accuracy monitor during SFT; once it reaches a plateau, stop training. This eliminates the need for costly validation runs on large reasoning benchmarks during development.
- Potential for domain‑specific CoT: The same approach could be applied to specialized domains (e.g., legal reasoning, medical diagnostics) where high‑quality annotated examples are scarce but highly valuable.
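The monitoring-and-stopping recipe above can be as simple as halting once token accuracy has improved by less than a small tolerance over a window of recent epochs. Here is a hypothetical sketch; the class name, window size, and tolerance are illustrative choices, not values from the paper.

```python
# Hypothetical early-stopping monitor: stop SFT once token accuracy
# plateaus. Window size and tolerance are illustrative defaults.

class PlateauMonitor:
    def __init__(self, window=3, tol=0.001):
        self.window = window   # epochs over which improvement is measured
        self.tol = tol         # minimum gain that still counts as progress
        self.history = []

    def update(self, token_acc):
        """Record this epoch's token accuracy; return True to stop training."""
        self.history.append(token_acc)
        if len(self.history) <= self.window:
            return False  # not enough history to judge a plateau
        gain = self.history[-1] - self.history[-1 - self.window]
        return gain < self.tol

# Usage: feed in per-epoch token accuracies; the monitor fires once the
# curve saturates near full memorization.
monitor = PlateauMonitor(window=3, tol=0.001)
for epoch, acc in enumerate([0.80, 0.95, 0.99, 0.991, 0.9915, 0.9916, 0.9917]):
    if monitor.update(acc):
        print(f"stop at epoch {epoch}")
        break
```

A windowed check is preferable to comparing consecutive epochs only, since per-epoch accuracy can fluctuate slightly even after the curve has effectively flattened.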
Limitations & Future Work
- Dataset size ceiling: The study focuses on a 400‑sample set; it remains unclear how the repetition advantage scales when the dataset grows to a few thousand examples.
- Model size dependency: Experiments are limited to a 7 B‑parameter model. Larger models (e.g., 30 B, 70 B) may exhibit different dynamics.
- Generalization beyond CoT: The findings are specific to chain‑of‑thought reasoning tasks; whether repetition helps in other fine‑tuning regimes (e.g., instruction following, code generation) is an open question.
- Theoretical understanding: The “repetition advantage” challenges conventional learning theory. Future work should aim to explain why full memorization can coincide with better out‑of‑distribution reasoning performance.
- Potential over‑fitting to style: Repeated exposure to a narrow set of reasoning styles might bias the model toward those patterns, possibly limiting creativity on novel problem formats.
The authors call for deeper investigations into the training dynamics that enable this phenomenon, and for community‑wide benchmarks that can systematically probe the trade‑off between data repetition and scaling.
Authors
- Dawid J. Kopiczko
- Sagar Vaze
- Tijmen Blankevoort
- Yuki M. Asano
Paper Information
- arXiv ID: 2602.11149v1
- Categories: cs.CL
- Published: February 11, 2026