[Paper] Data Science and Technology Towards AGI Part I: Tiered Data Management
Source: arXiv - 2602.09003v1
Overview
The paper proposes a tiered data‑management framework that treats data as a first‑class partner in the journey toward artificial general intelligence (AGI). Instead of the current “more data = bigger model” mindset, the authors argue for a data‑model co‑evolution where large language models (LLMs) help curate, score, and edit the data they later train on. The result is a systematic way to balance data quality, acquisition cost, and training benefit across the entire LLM lifecycle.
Key Contributions
- L0‑L4 tier taxonomy: Defines five data tiers ranging from raw, uncurated text (L0) to highly verified, structured knowledge (L4).
- Model‑in‑the‑loop pipelines: Shows how LLMs can be used for automatic quality scoring, deduplication, and content editing at each tier.
- Lifecycle‑aware allocation: Maps specific tiers to pre‑training, mid‑training, and alignment phases, enabling cost‑effective data usage.
- Empirical validation: Demonstrates that tier‑aware data selection improves training efficiency (fewer compute hours) and downstream performance (higher benchmark scores).
- Open resources: Releases tiered datasets and the associated processing toolkit for the community to reproduce and extend the work.
Methodology
Tier Definition
- L0 – Raw web crawls, logs, or any unfiltered text.
- L1 – Lightly filtered corpora (basic language detection, profanity removal).
- L2 – Moderately curated data with automated quality scores (e.g., perplexity‑based relevance, toxicity filters).
- L3 – Human‑reviewed or model‑edited passages that meet stricter factuality and coherence criteria.
- L4 – Structured knowledge bases, verified facts, and domain‑specific manuals.
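The five-tier taxonomy lends itself to a simple programmatic representation. The sketch below is illustrative only: the enum names, document fields, and thresholds are hypothetical, not taken from the paper's released toolkit.

```python
from enum import IntEnum

class DataTier(IntEnum):
    """The paper's L0-L4 taxonomy, from raw text to verified knowledge."""
    L0_RAW = 0        # unfiltered web crawls, logs
    L1_FILTERED = 1   # basic language detection, profanity removal
    L2_SCORED = 2     # automated quality scores (perplexity, toxicity)
    L3_REVIEWED = 3   # human-reviewed or model-edited passages
    L4_VERIFIED = 4   # structured knowledge bases, verified facts

def assign_tier(doc: dict) -> DataTier:
    """Rule-based tier assignment; field names and cutoffs are hypothetical."""
    if doc.get("verified_source"):
        return DataTier.L4_VERIFIED
    if doc.get("human_reviewed") or doc.get("model_edited"):
        return DataTier.L3_REVIEWED
    if doc.get("quality_score", 0.0) >= 0.7 and doc.get("toxicity", 1.0) < 0.1:
        return DataTier.L2_SCORED
    if doc.get("lang") == "en" and not doc.get("profane", False):
        return DataTier.L1_FILTERED
    return DataTier.L0_RAW
```

In practice the L2 cutoff would come from a calibrated quality model rather than a fixed constant.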
Model‑in‑the‑Loop Operations
- Scoring: An LLM estimates a “usefulness” score for each document using prompting (e.g., “How likely is this snippet to improve factual recall?”).
- Editing: The same model rewrites low‑quality sentences, resolves ambiguities, or adds missing citations.
- Deduplication & Filtering: Embedding‑based similarity search removes near‑duplicates across tiers.
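The embedding-based deduplication step can be sketched as a greedy cosine-similarity filter. This is a minimal illustration with plain Python lists standing in for real embeddings; the paper's pipeline would use an approximate-nearest-neighbor index at corpus scale, and the 0.95 threshold is an assumption.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedup(embeddings: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal: keep a document only if no
    already-kept document exceeds the similarity threshold."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

The greedy pass is O(n·k) in the number of kept documents, which is why production systems replace the inner loop with an ANN lookup.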
Training Phase Mapping
- Pre‑training: Primarily consumes L0‑L2 data to learn broad language patterns.
- Mid‑training (or “curriculum learning”): Introduces L3 data to sharpen factual accuracy and reduce hallucinations.
- Alignment / RLHF: Leverages L4 data for instruction‑following and safety‑critical behavior.
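The phase-to-tier mapping above can be sketched as a small scheduler. The mapping mirrors the lifecycle description in the summary; the function and field names are hypothetical.

```python
# Phase -> allowed tiers, following the lifecycle mapping described above.
PHASE_TIERS = {
    "pretraining": {0, 1, 2},  # broad language patterns from L0-L2
    "midtraining": {3},        # L3 curriculum for factuality
    "alignment": {4},          # L4 for instruction-following and safety
}

def select_batch(docs: list[dict], phase: str) -> list[dict]:
    """Filter a document stream down to the tiers allowed in this phase."""
    allowed = PHASE_TIERS[phase]
    return [d for d in docs if d["tier"] in allowed]
```

A real scheduler would additionally weight the mixture within each phase rather than hard-filtering, but the hard filter shows the core idea.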
Evaluation Pipeline
- Build tiered subsets from a large public corpus.
- Train identical model architectures with either (a) flat data mixing or (b) tier‑aware scheduling.
- Compare compute‑to‑performance curves on standard LLM benchmarks (e.g., MMLU, TruthfulQA, and OpenAI’s Evals).
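The two conditions being compared, flat mixing versus tier-aware scheduling, amount to two different orderings of the same data. A minimal sketch of that contrast, with the actual training loop and mixture weights abstracted away:

```python
import random

def flat_mix(tiered: dict[int, list], seed: int = 0) -> list:
    """Baseline (a): pool all tiers and shuffle uniformly."""
    docs = [d for tier_docs in tiered.values() for d in tier_docs]
    random.Random(seed).shuffle(docs)
    return docs

def tier_aware_schedule(tiered: dict[int, list]) -> list:
    """Approach (b): present lower tiers first, reserving
    higher-quality tiers for later in training."""
    return [d for tier in sorted(tiered) for d in tiered[tier]]
```

Both functions consume identical data, so any compute-to-performance difference in the experiment is attributable to ordering alone.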
Results & Findings
| Metric | Flat‑mix Baseline | Tier‑aware Approach |
|---|---|---|
| Training compute (GPU‑hours) | 1.00× (reference) | 0.78× (22 % reduction) |
| MMLU accuracy (average) | 62.3 % | 64.7 % (+2.4 pts) |
| TruthfulQA (truthfulness) | 48.1 % | 52.9 % (+4.8 pts) |
| Instruction following (Evals) | 71.5 % | 74.2 % (+2.7 pts) |
- Efficiency: By feeding higher‑quality tiers later in training, the model extracts more benefit from each GPU‑hour.
- Performance: Tier‑aware curricula consistently improve factuality and instruction compliance without increasing model size.
- Scalability: The framework works with corpora ranging from 10 B to 300 B tokens, indicating it can be applied to both research‑scale and industry‑scale pipelines.
Practical Implications
- Cost‑Effective Scaling: Companies can keep pre‑training data inexpensive (L0‑L2) while reserving premium, curated data (L3‑L4) for the phases that matter most, reducing overall data acquisition spend.
- Continuous Improvement Loops: Deploy an LLM in production, collect user interactions, feed them back into the L3/L4 pipelines for automated editing, and re‑train—creating a virtuous data‑model feedback cycle.
- Regulatory & Safety Benefits: By isolating high‑risk, verified knowledge in L4, organizations can more easily audit the data that influences safety‑critical behavior, aiding compliance with emerging AI regulations.
- Tooling Integration: The released processing toolkit can be plugged into existing data pipelines (e.g., Hugging Face Datasets, Apache Beam) to automate tier assignment and model‑in‑the‑loop refinement.
- Domain Adaptation: Specialized sectors (healthcare, finance) can populate L4 with proprietary, vetted documents, while still leveraging massive public L0‑L2 corpora for general language competence.
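The continuous improvement loop described above can be sketched as a single iteration: collect production interactions, model-edit them, and promote the results into L3/L4. The function names and the promotion rule are assumptions for illustration, not the paper's pipeline.

```python
from typing import Callable

def feedback_cycle(production_logs: list[str],
                   edit_fn: Callable[[str], str],
                   review_fn: Callable[[str], bool]) -> list[dict]:
    """One turn of the data-model feedback loop: LLM-edit each
    interaction, then promote it to L4 if it passes review, else L3."""
    promoted = []
    for interaction in production_logs:
        edited = edit_fn(interaction)           # e.g. LLM rewrite, citation fix
        tier = 4 if review_fn(edited) else 3    # verified items reach L4
        promoted.append({"text": edited, "tier": tier})
    return promoted
```

Repeating this cycle before each retraining run is what turns deployment traffic into curated L3/L4 data.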
Limitations & Future Work
- Quality Scoring Dependence: The approach assumes the LLM’s self‑scoring is reliable; biased or under‑trained models could mis‑rank data, propagating errors.
- Human Oversight Cost: Moving data to L3/L4 still requires manual review or higher‑quality prompts, which may be expensive for niche domains.
- Static Tier Boundaries: The current L0‑L4 definitions are fixed; future work could explore dynamic tiering where data migrates between tiers as model capabilities evolve.
- Broader Modalities: The paper focuses on text; extending tiered management to multimodal data (images, code, audio) remains an open challenge.
- Long‑Term Co‑evolution: While the study demonstrates short‑term gains, a full longitudinal analysis of data‑model co‑evolution across multiple model generations is left for future research.
Authors
- Yudong Wang
- Zixuan Fu
- Hengyu Zhao
- Chen Zhao
- Chuyue Zhou
- Xinle Lin
- Hongya Lyu
- Shuaikang Xue
- Yi Yi
- Yingjiao Wang
- Zhi Zheng
- Yuzhou Zhang
- Jie Zhou
- Chaojun Xiao
- Xu Han
- Zhiyuan Liu
- Maosong Sun
Paper Information
- arXiv ID: 2602.09003v1
- Categories: cs.AI, cs.CL
- Published: February 9, 2026