[Paper] Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Source: arXiv:2602.11149v1
Overview
The paper investigates a surprising twist in fine‑tuning large language models for chain‑of‑thought (CoT) reasoning:
- Re‑using a small set of high‑quality examples many times can be more effective than feeding the model a massive, one‑pass dataset.
- By keeping the total number of gradient updates constant, the authors show that training for many epochs on a tiny dataset yields substantially higher reasoning accuracy on challenging benchmarks, without causing catastrophic forgetting.
Key Contributions
- Empirical discovery: Under a fixed update budget, repetition beats scaling: training for 128 epochs on 400 examples outperforms a single epoch on 51 200 examples by 12–26 percentage points on AIME ’24/’25 and GPQA.
- Token‑level accuracy as a reliable stopping signal: The point where training loss plateaus (i.e., full memorization of the data) aligns with the peak in downstream reasoning performance.
- Practical recipe for CoT supervised fine‑tuning (SFT): Train on a curated small dataset, monitor token accuracy, and stop once it saturates—eliminating the need for costly data collection and large‑scale training.
- Conceptual framing of the “repetition advantage”: Introduces a new research problem, inviting the community to explain why full memorization can coincide with better generalization in LLMs.
Methodology
Model & Baselines – Experiments use the 7‑billion‑parameter Olmo3 model. Two training regimes are compared:
- Large‑scale: 1 epoch over 51 200 unique CoT examples (≈ 1 × 10⁹ tokens).
- Repetition: 128 epochs over a tiny set of 400 curated CoT examples (≈ 8 × 10⁶ tokens).
Both regimes see the same number of training samples in total (51 200, since 400 × 128 = 51 200) and therefore perform the same number of gradient updates, processing roughly 1 × 10⁹ tokens each.
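The budget accounting behind this comparison can be sketched in a few lines. This is a minimal check, assuming one gradient update per example (no gradient accumulation) and using the approximate token counts quoted above:

```python
# Update-budget accounting for the two regimes (assumption: one
# gradient update per example seen, no gradient accumulation).
small_examples, small_epochs = 400, 128
large_examples, large_epochs = 51_200, 1

small_updates = small_examples * small_epochs   # 400 * 128 = 51,200
large_updates = large_examples * large_epochs   # 51,200 * 1 = 51,200
assert small_updates == large_updates           # identical update budgets

# Approximate token throughput, using the per-pass counts above.
small_tokens_per_pass = 8e6                     # ≈ 8 × 10⁶ tokens per epoch
small_tokens_total = small_tokens_per_pass * small_epochs
print(small_updates, large_updates, small_tokens_total)
```

Running this confirms both regimes perform 51 200 updates, with the repetition regime processing on the order of 10⁹ tokens in total.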
Data –
- Small dataset: Hand‑selected for diversity and correctness.
- Large dataset: Automatically harvested from existing CoT corpora.
Training Loop – Standard supervised fine‑tuning with cross‑entropy loss. The authors track token‑level accuracy (percentage of tokens predicted correctly) after each epoch.
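The token-level accuracy metric tracked after each epoch can be sketched as follows. This is a hypothetical pure-Python helper (the paper does not give an implementation); `pad_id=-100` follows the common convention of marking positions excluded from the loss:

```python
def token_accuracy(pred_ids, target_ids, pad_id=-100):
    """Fraction of supervised target tokens predicted exactly.

    Hypothetical helper illustrating the metric: positions whose
    target equals `pad_id` are excluded, mirroring the usual
    ignore-index convention in cross-entropy training.
    """
    correct = total = 0
    for p, t in zip(pred_ids, target_ids):
        if t == pad_id:
            continue  # skip padding / masked positions
        total += 1
        correct += int(p == t)
    return correct / total if total else 0.0

# Example: 3 of the 4 supervised tokens match; the padded position is skipped.
print(token_accuracy([5, 9, 2, 7, 1], [5, 9, 2, 3, -100]))  # 0.75
```

In practice this would be computed from the argmax of the model's logits over the training batch, but the counting logic is the same.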
Evaluation – After fine‑tuning, the models are tested on two high‑difficulty reasoning benchmarks:
- AIME ’24/’25 – Competition mathematics problems from the American Invitational Mathematics Examination.
- GPQA – Graduate‑level, "Google‑proof" multiple‑choice questions in biology, physics, and chemistry.
Accuracy is measured as the proportion of correctly answered questions.
Analysis – Correlate token‑level accuracy curves with downstream benchmark performance to identify the “saturation point.”
Results & Findings
| Training regime | Epochs | # Samples | Token‑accuracy (final) | AIME accuracy ↑ | GPQA accuracy ↑ |
|---|---|---|---|---|---|
| Large‑scale | 1 | 51 200 | ~92 % | Baseline | Baseline |
| Repetition | 128 | 400 | ~99 % (full memorization) | +12–26 pp | +12–26 pp |
- Performance boost – The repeated‑small‑dataset model consistently outperforms the large‑scale baseline on both benchmarks, with gains of 12–26 percentage points.
- No catastrophic forgetting – Despite heavy memorization of the tiny dataset, the model retains its general language abilities (no increase in perplexity on a held‑out language‑modeling set).
- Token‑accuracy as a proxy – The point where token‑accuracy plateaus (≈ 99 %) aligns with the peak in reasoning accuracy, offering a cheap early‑stopping criterion.
- Robustness – The phenomenon holds across different random seeds, optimizer settings, and even when the tiny dataset is shuffled or slightly perturbed.
Practical Implications
Cost‑effective fine‑tuning
Companies can achieve state‑of‑the‑art reasoning performance without gathering millions of annotated CoT examples or renting massive GPU clusters. A curated set of a few hundred high‑quality examples, trained for many epochs, is sufficient.
Faster iteration cycles
Because the dataset is tiny, data‑pipeline overhead (pre‑processing, deduplication, storage) shrinks dramatically, enabling rapid experimentation and A/B testing.
Simplified data strategy
Teams can focus on quality over quantity, investing effort in curating diverse, well‑explained reasoning traces rather than scaling raw data.
Monitoring & early stopping
Implement a simple token‑accuracy monitor during supervised fine‑tuning (SFT); once accuracy plateaus, stop training. This eliminates the need for costly validation runs on large reasoning benchmarks during development.
Potential for domain‑specific CoT
The same approach can be applied to specialized domains (e.g., legal reasoning, medical diagnostics) where high‑quality annotated examples are scarce but highly valuable.
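The monitoring-and-early-stopping recipe above can be sketched as a simple plateau check. This is one plausible criterion, not the paper's exact procedure; the `target` and `eps` thresholds are illustrative assumptions (the ~99 % saturation figure comes from the results table):

```python
def should_stop(history, window=3, eps=1e-3, target=0.99):
    """Stop once per-epoch token accuracy saturates.

    Hypothetical criterion: stop if accuracy has reached `target`,
    or if it improved by less than `eps` over the last `window`
    epochs. `history` is the list of token accuracies so far.
    """
    if not history:
        return False
    if history[-1] >= target:          # near-full memorization reached
        return True
    if len(history) < window + 1:      # not enough epochs to judge a plateau
        return False
    return history[-1] - history[-1 - window] < eps

# Accuracy climbs, then crosses the saturation threshold.
accs = [0.62, 0.78, 0.88, 0.95, 0.985, 0.9901]
print(should_stop(accs))  # True
```

Because token accuracy is computed on the training set itself, this check adds essentially no cost compared to running full reasoning benchmarks during development.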
Limitations & Future Work
- Dataset size ceiling – The study uses a 400‑sample set, leaving it unclear how the repetition advantage scales to datasets of a few thousand examples.
- Model size dependency – Experiments are limited to a 7 B‑parameter model; larger models (e.g., 30 B, 70 B) may exhibit different dynamics.
- Generalization beyond CoT – Findings are specific to CoT reasoning tasks. It remains an open question whether repetition helps in other fine‑tuning regimes such as instruction following or code generation.
- Theoretical understanding – The “repetition advantage” challenges conventional learning theory. Future work should aim to explain why full memorization can coincide with better out‑of‑distribution reasoning performance.
- Potential over‑fitting to style – Repeated exposure to a narrow set of reasoning styles might bias the model toward those patterns, possibly limiting creativity on novel problem formats.
The authors call for deeper investigations into the training dynamics that enable this phenomenon and for community‑wide benchmarks that can systematically probe the trade‑off between data repetition and scaling.
Authors
- Dawid J. Kopiczko
- Sagar Vaze
- Tijmen Blankevoort
- Yuki M. Asano
Paper Information
| Field | Details |
|---|---|
| arXiv ID | 2602.11149v1 |
| Categories | cs.CL |
| Published | February 11, 2026 |