[Paper] Data Science and Technology Towards AGI Part I: Tiered Data Management
Source: arXiv - 2602.09003v1
Overview
The paper proposes a tiered data‑management framework that treats data as a first‑class partner in the journey toward artificial general intelligence (AGI). Instead of the current “more data = bigger model” mindset, the authors argue for a data‑model co‑evolution where large language models (LLMs) help curate, score, and edit the data they later train on. The result is a systematic way to balance data quality, acquisition cost, and training benefit across the entire LLM lifecycle.
Key Contributions
- L0‑L4 tier taxonomy: Defines five data tiers ranging from raw, uncurated text (L0) to highly verified, structured knowledge (L4).
- Model‑in‑the‑loop pipelines: Shows how LLMs can be used for automatic quality scoring, deduplication, and content editing at each tier.
- Lifecycle‑aware allocation: Maps specific tiers to pre‑training, mid‑training, and alignment phases, enabling cost‑effective data usage.
- Empirical validation: Demonstrates that tier‑aware data selection improves training efficiency (fewer compute hours) and downstream performance (higher benchmark scores).
- Open resources: Releases tiered datasets and the associated processing toolkit for the community to reproduce and extend the work.
Methodology
Tier Definition
- L0 – Raw web crawls, logs, or any unfiltered text.
- L1 – Lightly filtered corpora (basic language detection, profanity removal).
- L2 – Moderately curated data with automated quality scores (e.g., perplexity‑based relevance, toxicity filters).
- L3 – Human‑reviewed or model‑edited passages that meet stricter factuality and coherence criteria.
- L4 – Structured knowledge bases, verified facts, and domain‑specific manuals.
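The five-tier taxonomy lends itself to a simple programmatic representation. The sketch below is illustrative only: the enum names, document fields, and thresholds are hypothetical, not taken from the paper's released toolkit.

```python
from enum import IntEnum

class DataTier(IntEnum):
    """The paper's L0-L4 taxonomy, from raw text to verified knowledge."""
    L0_RAW = 0        # unfiltered web crawls, logs
    L1_FILTERED = 1   # basic language detection, profanity removal
    L2_SCORED = 2     # automated quality scores (perplexity, toxicity)
    L3_REVIEWED = 3   # human-reviewed or model-edited passages
    L4_VERIFIED = 4   # structured knowledge bases, verified facts

def assign_tier(doc: dict) -> DataTier:
    """Rule-based tier assignment; field names and cutoffs are hypothetical."""
    if doc.get("verified_source"):
        return DataTier.L4_VERIFIED
    if doc.get("human_reviewed") or doc.get("model_edited"):
        return DataTier.L3_REVIEWED
    if doc.get("quality_score", 0.0) >= 0.7 and doc.get("toxicity", 1.0) < 0.1:
        return DataTier.L2_SCORED
    if doc.get("lang") == "en" and not doc.get("profane", False):
        return DataTier.L1_FILTERED
    return DataTier.L0_RAW
```

In practice the L2 cutoff would come from a calibrated quality model rather than a fixed constant.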
Model‑in‑the‑Loop Operations
- Scoring: An LLM estimates a “usefulness” score for each document using prompting (e.g., “How likely is this snippet to improve factual recall?”).
- Editing: The same model rewrites low‑quality sentences, resolves ambiguities, or adds missing citations.
- Deduplication & Filtering: Embedding‑based similarity search removes near‑duplicates across tiers.
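The embedding-based deduplication step can be sketched as a greedy cosine-similarity filter. This is a minimal illustration with plain Python lists standing in for real embeddings; the paper's pipeline would use an approximate-nearest-neighbor index at corpus scale, and the 0.95 threshold is an assumption.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedup(embeddings: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal: keep a document only if no
    already-kept document exceeds the similarity threshold."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

The greedy pass is O(n·k) in the number of kept documents, which is why production systems replace the inner loop with an ANN lookup.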
Training Phase Mapping
- Pre‑training: Primarily consumes L0‑L2 data to learn broad language patterns.
- Mid‑training (or “curriculum learning”): Introduces L3 data to sharpen factual accuracy and reduce hallucinations.
- Alignment / RLHF: Leverages L4 data for instruction‑following and safety‑critical behavior.
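The phase-to-tier mapping above can be sketched as a small scheduler. The mapping mirrors the lifecycle description in the summary; the function and field names are hypothetical.

```python
# Phase -> allowed tiers, following the lifecycle mapping described above.
PHASE_TIERS = {
    "pretraining": {0, 1, 2},  # broad language patterns from L0-L2
    "midtraining": {3},        # L3 curriculum for factuality
    "alignment": {4},          # L4 for instruction-following and safety
}

def select_batch(docs: list[dict], phase: str) -> list[dict]:
    """Filter a document stream down to the tiers allowed in this phase."""
    allowed = PHASE_TIERS[phase]
    return [d for d in docs if d["tier"] in allowed]
```

A real scheduler would additionally weight the mixture within each phase rather than hard-filtering, but the hard filter shows the core idea.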
Evaluation Pipeline
- Build tiered subsets from a large public corpus.
- Train identical model architectures with either (a) flat data mixing or (b) tier‑aware scheduling.
- Compare compute‑to‑performance curves on standard LLM benchmarks (e.g., MMLU, TruthfulQA, and OpenAI’s Evals).
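The two conditions being compared, flat mixing versus tier-aware scheduling, amount to two different orderings of the same data. A minimal sketch of that contrast, with the actual training loop and mixture weights abstracted away:

```python
import random

def flat_mix(tiered: dict[int, list], seed: int = 0) -> list:
    """Baseline (a): pool all tiers and shuffle uniformly."""
    docs = [d for tier_docs in tiered.values() for d in tier_docs]
    random.Random(seed).shuffle(docs)
    return docs

def tier_aware_schedule(tiered: dict[int, list]) -> list:
    """Approach (b): present lower tiers first, reserving
    higher-quality tiers for later in training."""
    return [d for tier in sorted(tiered) for d in tiered[tier]]
```

Both functions consume identical data, so any compute-to-performance difference in the experiment is attributable to ordering alone.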
Results & Findings
| Metric | Flat‑mix Baseline | Tier‑aware Approach |
|---|---|---|
| Training compute (GPU‑hours) | 1.00× (reference) | 0.78× (22 % reduction) |
| MMLU accuracy (average) | 62.3 % | 64.7 % (+2.4 pts) |
| TruthfulQA (truthfulness) | 48.1 % | 52.9 % (+4.8 pts) |
| Instruction following (Evals) | 71.5 % | 74.2 % (+2.7 pts) |
- Efficiency: By feeding higher‑quality tiers later in training, the model extracts more benefit from each GPU‑hour.
- Performance: Tier‑aware curricula consistently improve factuality and instruction compliance without increasing model size.
- Scalability: The framework works with corpora ranging from 10 B to 300 B tokens, indicating it can be applied to both research‑scale and industry‑scale pipelines.
Practical Implications
- Cost‑Effective Scaling: Companies can keep pre‑training data inexpensive (L0‑L2) while reserving premium, curated data (L3‑L4) for the phases that matter most, reducing overall data acquisition spend.
- Continuous Improvement Loops: Deploy an LLM in production, collect user interactions, feed them back into the L3/L4 pipelines for automated editing, and re‑train—creating a virtuous data‑model feedback cycle.
- Regulatory & Safety Benefits: By isolating high‑risk, verified knowledge in L4, organizations can more easily audit the data that influences safety‑critical behavior, aiding compliance with emerging AI regulations.
- Tooling Integration: The released processing toolkit can be plugged into existing data pipelines (e.g., Hugging Face Datasets, Apache Beam) to automate tier assignment and model‑in‑the‑loop refinement.
- Domain Adaptation: Specialized sectors (healthcare, finance) can populate L4 with proprietary, vetted documents, while still leveraging massive public L0‑L2 corpora for general language competence.
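The continuous improvement loop described above can be sketched as a single iteration: collect production interactions, model-edit them, and promote the results into L3/L4. The function names and the promotion rule are assumptions for illustration, not the paper's pipeline.

```python
from typing import Callable

def feedback_cycle(production_logs: list[str],
                   edit_fn: Callable[[str], str],
                   review_fn: Callable[[str], bool]) -> list[dict]:
    """One turn of the data-model feedback loop: LLM-edit each
    interaction, then promote it to L4 if it passes review, else L3."""
    promoted = []
    for interaction in production_logs:
        edited = edit_fn(interaction)           # e.g. LLM rewrite, citation fix
        tier = 4 if review_fn(edited) else 3    # verified items reach L4
        promoted.append({"text": edited, "tier": tier})
    return promoted
```

Repeating this cycle before each retraining run is what turns deployment traffic into curated L3/L4 data.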
Limitations & Future Work
- Quality Scoring Dependence: The approach assumes the LLM’s self‑scoring is reliable; biased or under‑trained models could mis‑rank data, propagating errors.
- Human Oversight Cost: Moving data to L3/L4 still requires manual review or higher‑quality prompts, which may be expensive for niche domains.
- Static Tier Boundaries: The current L0‑L4 definitions are fixed; future work could explore dynamic tiering where data migrates between tiers as model capabilities evolve.
- Broader Modalities: The paper focuses on text; extending tiered management to multimodal data (images, code, audio) remains an open challenge.
- Long‑Term Co‑evolution: While the study demonstrates short‑term gains, a full longitudinal analysis of data‑model co‑evolution across multiple model generations is left for future research.
Authors
- Yudong Wang
- Zixuan Fu
- Hengyu Zhao
- Chen Zhao
- Chuyue Zhou
- Xinle Lin
- Hongya Lyu
- Shuaikang Xue
- Yi Yi
- Yingjiao Wang
- Zhi Zheng
- Yuzhou Zhang
- Jie Zhou
- Chaojun Xiao
- Xu Han
- Zhiyuan Liu
- Maosong Sun
Paper Information
- arXiv ID: 2602.09003v1
- Categories: cs.AI, cs.CL
- Published: February 9, 2026