[Paper] Data Science and Technology Towards AGI Part I: Tiered Data Management

Published: February 9, 2026 at 01:47 PM EST
5 min read
Source: arXiv - 2602.09003v1

Overview

The paper proposes a tiered data‑management framework that treats data as a first‑class partner in the journey toward artificial general intelligence (AGI). Instead of the current “more data = bigger model” mindset, the authors argue for a data‑model co‑evolution where large language models (LLMs) help curate, score, and edit the data they later train on. The result is a systematic way to balance data quality, acquisition cost, and training benefit across the entire LLM lifecycle.

Key Contributions

  • L0‑L4 tier taxonomy: Defines five data tiers ranging from raw, uncurated text (L0) to highly verified, structured knowledge (L4).
  • Model‑in‑the‑loop pipelines: Shows how LLMs can be used for automatic quality scoring, deduplication, and content editing at each tier.
  • Lifecycle‑aware allocation: Maps specific tiers to pre‑training, mid‑training, and alignment phases, enabling cost‑effective data usage.
  • Empirical validation: Demonstrates that tier‑aware data selection improves training efficiency (fewer compute hours) and downstream performance (higher benchmark scores).
  • Open resources: Releases tiered datasets and the associated processing toolkit for the community to reproduce and extend the work.

Methodology

  1. Tier Definition

    • L0 – Raw web crawls, logs, or any unfiltered text.
    • L1 – Lightly filtered corpora (basic language detection, profanity removal).
    • L2 – Moderately curated data with automated quality scores (e.g., perplexity‑based relevance, toxicity filters).
    • L3 – Human‑reviewed or model‑edited passages that meet stricter factuality and coherence criteria.
    • L4 – Structured knowledge bases, verified facts, and domain‑specific manuals.
  2. Model‑in‑the‑Loop Operations

    • Scoring: An LLM estimates a “usefulness” score for each document using prompting (e.g., “How likely is this snippet to improve factual recall?”).
    • Editing: The same model rewrites low‑quality sentences, resolves ambiguities, or adds missing citations.
    • Deduplication & Filtering: Embedding‑based similarity search removes near‑duplicates across tiers.
  3. Training Phase Mapping

    • Pre‑training: Primarily consumes L0‑L2 data to learn broad language patterns.
    • Mid‑training (or “curriculum learning”): Introduces L3 data to sharpen factual accuracy and reduce hallucinations.
    • Alignment / RLHF: Leverages L4 data for instruction‑following and safety‑critical behavior.
  4. Evaluation Pipeline

    • Build tiered subsets from a large public corpus.
    • Train identical model architectures with either (a) flat data mixing or (b) tier‑aware scheduling.
    • Compare compute‑to‑performance curves on standard LLM benchmarks (e.g., MMLU, TruthfulQA, and OpenAI’s Evals).
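To make the methodology concrete, here is a minimal sketch of tier assignment and lifecycle-aware allocation. The function names, quality-signal fields, and the 0.5 score threshold are illustrative assumptions, not the paper's released toolkit:

```python
# Illustrative sketch of tier assignment and phase mapping.
# Field names ("verified_kb", "human_reviewed", "quality_score") and the
# threshold are hypothetical stand-ins for the paper's quality signals.

def assign_tier(doc: dict) -> int:
    """Map a document to a tier L0-L4 based on available quality signals."""
    if doc.get("verified_kb"):          # structured, verified knowledge
        return 4
    if doc.get("human_reviewed"):       # human- or model-edited passages
        return 3
    score = doc.get("quality_score")    # e.g. perplexity-based relevance
    if score is not None:
        return 2 if score >= 0.5 else 1
    return 0                            # raw, unfiltered text

# Lifecycle-aware allocation: which tiers feed which training phase.
PHASE_TIERS = {
    "pre_training": {0, 1, 2},
    "mid_training": {3},
    "alignment": {4},
}

def select_for_phase(corpus: list, phase: str) -> list:
    """Keep only the documents whose tier is allowed in the given phase."""
    allowed = PHASE_TIERS[phase]
    return [doc for doc in corpus if assign_tier(doc) in allowed]
```

In a real pipeline the quality signals would come from the model-in-the-loop scoring and editing steps described above; the routing logic itself stays this simple.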

Results & Findings

| Metric | Flat-mix Baseline | Tier-aware Approach |
| --- | --- | --- |
| Training compute (GPU-hours) | 1.00× (reference) | 0.78× (22 % reduction) |
| MMLU accuracy (average) | 62.3 % | 64.7 % (+2.4 pts) |
| TruthfulQA (truthfulness) | 48.1 % | 52.9 % (+4.8 pts) |
| Instruction following (Evals) | 71.5 % | 74.2 % (+2.7 pts) |
  • Efficiency: By feeding higher‑quality tiers later in training, the model extracts more “bang per buck” from each GPU hour.
  • Performance: Tier‑aware curricula consistently improve factuality and instruction compliance without increasing model size.
  • Scalability: The framework works with corpora ranging from 10 B to 300 B tokens, indicating it can be applied to both research‑scale and industry‑scale pipelines.
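The "higher-quality tiers later in training" idea behind these efficiency gains can be sketched as a simple curriculum schedule. The linear interpolation and the 70/30 split between L3 and L4 below are toy assumptions, not the paper's actual mixing weights:

```python
def tier_mix(progress: float) -> dict:
    """Return tier sampling weights as training progresses (0.0 -> 1.0).

    Early training leans on cheap L0-L2 data; later steps shift weight
    toward curated L3 and verified L4 data. The linear interpolation and
    the 0.7/0.3 split are illustrative assumptions.
    """
    assert 0.0 <= progress <= 1.0
    low = 1.0 - progress          # weight on raw/lightly filtered tiers
    high = progress               # weight on curated/verified tiers
    return {
        "L0-L2": round(low, 2),
        "L3": round(high * 0.7, 2),
        "L4": round(high * 0.3, 2),
    }
```

At the start of training this yields all-L0-L2 sampling; by the end, sampling is dominated by curated data, which is the shape of schedule the results above reward.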

Practical Implications

  • Cost‑Effective Scaling: Companies can keep pre‑training data inexpensive (L0‑L2) while reserving premium, curated data (L3‑L4) for the phases that matter most, reducing overall data acquisition spend.
  • Continuous Improvement Loops: Deploy an LLM in production, collect user interactions, feed them back into the L3/L4 pipelines for automated editing, and re‑train—creating a virtuous data‑model feedback cycle.
  • Regulatory & Safety Benefits: By isolating high‑risk, verified knowledge in L4, organizations can more easily audit the data that influences safety‑critical behavior, aiding compliance with emerging AI regulations.
  • Tooling Integration: The released processing toolkit can be plugged into existing data pipelines (e.g., Hugging Face Datasets, Apache Beam) to automate tier assignment and model‑in‑the‑loop refinement.
  • Domain Adaptation: Specialized sectors (healthcare, finance) can populate L4 with proprietary, vetted documents, while still leveraging massive public L0‑L2 corpora for general language competence.
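The continuous-improvement loop above can be sketched as a small routing function. The editor callable, its `(edited_text, confidence)` return shape, and the acceptance threshold are all hypothetical stand-ins for a model-in-the-loop editor, not an API from the released toolkit:

```python
def refine_for_l3(interactions: list, edit_fn, accept_threshold: float = 0.8) -> list:
    """Feed production interactions back into the L3 pipeline.

    edit_fn is a stand-in for a model-in-the-loop editor that returns
    (edited_text, confidence); only edits the model is confident in are
    promoted into the curated tier. The threshold is an assumption.
    """
    curated = []
    for text in interactions:
        edited, confidence = edit_fn(text)
        if confidence >= accept_threshold:
            curated.append(edited)
    return curated
```

Re-training on the promoted data then closes the loop: deployed model output improves the data, which improves the next model.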

Limitations & Future Work

  • Quality Scoring Dependence: The approach assumes the LLM’s self‑scoring is reliable; biased or under‑trained models could mis‑rank data, propagating errors.
  • Human Oversight Cost: Moving data to L3/L4 still requires manual review or higher‑quality prompts, which may be expensive for niche domains.
  • Static Tier Boundaries: The current L0‑L4 definitions are fixed; future work could explore dynamic tiering where data migrates between tiers as model capabilities evolve.
  • Broader Modalities: The paper focuses on text; extending tiered management to multimodal data (images, code, audio) remains an open challenge.
  • Long‑Term Co‑evolution: While the study demonstrates short‑term gains, a full longitudinal analysis of data‑model co‑evolution across multiple model generations is left for future research.

Authors

  • Yudong Wang
  • Zixuan Fu
  • Hengyu Zhao
  • Chen Zhao
  • Chuyue Zhou
  • Xinle Lin
  • Hongya Lyu
  • Shuaikang Xue
  • Yi Yi
  • Yingjiao Wang
  • Zhi Zheng
  • Yuzhou Zhang
  • Jie Zhou
  • Chaojun Xiao
  • Xu Han
  • Zhiyuan Liu
  • Maosong Sun

Paper Information

  • arXiv ID: 2602.09003v1
  • Categories: cs.AI, cs.CL
  • Published: February 9, 2026