[Paper] Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
Source: arXiv - 2602.07824v1
Overview
The paper “Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre‑training” proposes a systematic way to turn raw scientific text—often noisy and hard to learn from—into high‑quality training material for large language models (LLMs). By introducing a ten‑level taxonomy (L0‑L9) that describes how models and data can co‑evolve, the authors demonstrate that carefully refined data can give a measurable boost to foundation models, even when the underlying model architecture stays the same.
Key Contributions
- Data Darwinism taxonomy (L0‑L9): A structured framework that maps the progressive enrichment of data—from raw dumps (L0) to fully reasoned, cognitively complete documents (L5) and beyond.
- Darwin‑Science corpus: A 900‑billion‑token scientific dataset built through the first six taxonomy levels (L0‑L5), showing how generative refinement and cognitive completion can be automated with frontier LLMs.
- Contamination‑free baselines: Creation of “daVinci‑origin” 3B‑ and 7B‑parameter models trained from scratch without any scientific content, ensuring a fair comparison against the refined corpus.
- Empirical gains: After 600 B tokens of continued pre‑training on Darwin‑Science, the models improve by +2.12 / +2.95 points on a broad benchmark suite and +5.60 / +8.40 points on domain‑specific tasks.
- Open‑source release: The authors publish both the refined corpus and the baseline models, inviting the community to experiment with co‑evolutionary data pipelines.
Methodology
- Taxonomy Definition (L0‑L9):
  - L0–L3 cover raw collection, cleaning, deduplication, and basic formatting.
  - L4 – Generative Refinement: a strong LLM rewrites sentences, resolves ambiguities, and injects missing logical steps.
  - L5 – Cognitive Completion: the model adds explicit reasoning chains, definitions, and citations, turning terse scientific prose into "self‑explanatory" text.
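The level structure above can be sketched in code. This is a minimal, hypothetical encoding — the level names and the exact L1–L3 ordering are assumptions based on the summary, not the paper's own identifiers; only L0–L5 are characterized here.

```python
from enum import IntEnum

class DarwinLevel(IntEnum):
    """Hypothetical encoding of the Data Darwinism taxonomy (L0-L5 only;
    L6-L9 are not detailed in this summary)."""
    L0_RAW = 0        # raw collection (dumps of papers)
    L1_CLEANED = 1    # cleaning / noise removal
    L2_DEDUPED = 2    # deduplication
    L3_FORMATTED = 3  # basic formatting, ready for tokenization
    L4_REFINED = 4    # generative refinement by a strong LLM
    L5_COMPLETED = 5  # cognitive completion: reasoning chains, citations

def requires_llm(level: DarwinLevel) -> bool:
    # Per the taxonomy, only L4 and L5 depend on a frontier model.
    return level >= DarwinLevel.L4_REFINED
```

Encoding the levels as an ordered enum makes the key cost boundary explicit: everything up to L3 is conventional data engineering, while L4 and L5 require LLM inference.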
- Corpus Construction (Darwin‑Science):
  - Start with a massive dump of scientific papers (arXiv, PubMed, etc.).
  - Apply the L0–L3 pipeline for noise removal and tokenization.
  - Run the L4 and L5 stages iteratively, prompting a frontier LLM (e.g., GPT‑4‑class) to rewrite and augment each document.
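The L4/L5 stages above amount to two chained rewriting passes per document. A minimal sketch, assuming a generic `llm` callable standing in for a frontier-model API — the prompts are illustrative, not the paper's actual prompts:

```python
def refine_document(doc: str, llm) -> str:
    """Sketch of the L4 -> L5 stages; `llm` maps a prompt string to a completion."""
    # L4 - generative refinement: rewrite for clarity, fill logical gaps.
    doc = llm("Rewrite the following scientific text, resolving ambiguities "
              "and making every logical step explicit:\n\n" + doc)
    # L5 - cognitive completion: add reasoning chains, definitions, citations.
    doc = llm("Augment the text with explicit reasoning chains, definitions "
              "of key terms, and citations where claims need support:\n\n" + doc)
    return doc

# Illustration with a dummy "model" that just tags the document line:
refined = refine_document("Terse abstract.",
                          lambda p: "[refined] " + p.splitlines()[-1])
```

In practice each pass would be batched against an API and the output validated before it enters the training pool, since (as the paper notes) unguarded generation can hallucinate.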
- Baseline Model Training:
  - Train two "origin" models (3B and 7B parameters) from scratch on a generic, non‑scientific corpus, guaranteeing zero exposure to the scientific domain.
- Continued Pre‑training:
  - Continue training the origin models for 600 B tokens on the Darwin‑Science corpus.
  - Evaluate after each taxonomy level to isolate the contributions of L4 and L5.
- Evaluation:
  - Use a suite of 20+ downstream benchmarks (general‑purpose and scientific) to measure zero‑shot and few‑shot performance.
  - Compare against the untouched origin models and against a version trained on the same raw data without L4/L5 processing.
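The comparison step reduces to per-benchmark score deltas against the origin baseline, averaged into the headline numbers. A small sketch — the benchmark names and scores here are made up for illustration:

```python
def benchmark_deltas(refined_scores: dict, baseline_scores: dict):
    """Per-benchmark improvement over the untouched origin model,
    plus the average delta of the kind reported in the results table."""
    deltas = {name: round(refined_scores[name] - baseline_scores[name], 2)
              for name in baseline_scores}
    avg = round(sum(deltas.values()) / len(deltas), 2)
    return deltas, avg

# Hypothetical scores (not from the paper):
baseline = {"sci_qa": 61.0, "summarize": 48.5}
refined = {"sci_qa": 66.6, "summarize": 51.1}
deltas, avg = benchmark_deltas(refined, baseline)
```

The same computation, run once on the full suite and once on the domain-aligned subset, yields the two delta columns in the results table below.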
Results & Findings
| Model | Continued Pre‑training Tokens | Avg. Benchmark Δ (General) | Domain‑Aligned Δ |
|---|---|---|---|
| 3B origin (baseline) | – | 0 | 0 |
| 3B + Darwin‑Science (L0‑L5) | 600 B | +2.12 | +5.60 |
| 7B origin (baseline) | – | 0 | 0 |
| 7B + Darwin‑Science (L0‑L5) | 600 B | +2.95 | +8.40 |
- Incremental gain from L4/L5: Adding the generative refinement (L4) and cognitive completion (L5) stages contributed an extra +1.36 points on average, confirming that higher‑level processing unlocks latent value in the data.
- Robustness across tasks: Gains were consistent across QA, summarization, and code‑generation tasks that involve scientific reasoning, indicating that the refined data improves both factual recall and reasoning ability.
- No contamination effect: Because the baselines never saw scientific text, the observed improvements can be attributed solely to the quality of the Darwin‑Science corpus, not to accidental data leakage.
Practical Implications
- Better domain‑specific LLMs with existing architectures: Companies can boost the performance of their in‑house models on scientific, medical, or technical domains simply by applying the Data Darwinism pipeline to their corpora, without redesigning the model.
- Cost‑effective data engineering: The taxonomy provides a step‑by‑step recipe that can be automated with existing LLM APIs, turning raw PDFs or LaTeX sources into high‑utility training material at scale.
- Reduced need for massive model scaling: A 3B‑parameter model trained on refined data rivals larger, less‑focused models on domain tasks, opening doors for edge‑deployment or low‑resource environments.
- Improved attribution and reproducibility: By explicitly generating citations and reasoning traces during L5 processing, the resulting datasets are more transparent, helping organizations meet compliance and audit requirements.
- Foundation for “co‑evolutionary” AI pipelines: The concept that better models produce better data, which in turn yields better models, can be iterated—future releases could use the newly trained model to further refine the corpus, creating a virtuous loop.
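The co-evolutionary loop described in the last point can be sketched abstractly. The `train` and `refine` callables are placeholders for a full training run and a corpus-refinement pass, respectively — nothing here corresponds to a concrete API in the paper:

```python
def coevolve(corpus, model, train, refine, generations=3):
    """Sketch of the virtuous loop: each generation's model refines the
    corpus that trains the next generation."""
    for _ in range(generations):
        corpus = refine(model, corpus)  # better model -> better data
        model = train(model, corpus)    # better data -> better model
    return model, corpus
```

The loop makes the framework's central claim operational: data quality and model quality are coupled state, improved alternately rather than in isolation.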
Limitations & Future Work
- Reliance on strong LLMs for L4/L5: The refinement stages currently need a frontier model (e.g., GPT‑4) to produce high‑quality rewrites, which may be costly or unavailable for some teams.
- Potential hallucinations in generated reasoning: While cognitive completion adds explicit chains of thought, it can also introduce fabricated explanations if the prompting is not carefully controlled.
- Scope limited to scientific literature: The taxonomy is demonstrated on a 900 B‑token scientific corpus; applying it to other domains (legal, code, social media) may require domain‑specific prompt engineering.
- Evaluation focused on benchmark scores: Real‑world downstream impact (e.g., improved literature review pipelines, faster hypothesis generation) remains to be quantified.
- Future directions: The authors plan to extend the taxonomy to L6‑L9 (interactive feedback loops, self‑supervised curriculum learning) and to explore automated quality checks that detect hallucinated reasoning before data enters the training pool.
The Data Darwinism framework offers a pragmatic roadmap for turning massive, messy data dumps into high‑impact training material. By making the data‑model co‑evolution explicit, it empowers developers to extract more mileage from existing model sizes and to accelerate the creation of domain‑aware AI assistants.
Authors
- Yiwei Qin
- Zhen Huang
- Tiantian Mi
- Weiye Si
- Chenyang Zhou
- Qipeng Guo
- Siyuan Feng
- Pengfei Liu
Paper Information
- arXiv ID: 2602.07824v1
- Categories: cs.AI, cs.CL
- Published: February 8, 2026