[Paper] Hierarchical Dataset Selection for High-Quality Data Sharing
Source: arXiv - 2512.10952v1
Overview
Modern machine learning models thrive on large, high‑quality training data, but in practice data lives in many separate repositories—think public datasets, corporate data lakes, or cross‑institutional collaborations. This paper formalizes dataset selection: choosing whole datasets (rather than individual samples) from a heterogeneous pool to maximize downstream performance while respecting budget constraints. The authors introduce DaSH (Dataset Selection via Hierarchies), a method that leverages the natural hierarchy of data sources (e.g., collections, institutions) to make smarter, faster selection decisions.
Key Contributions
- Task definition: Formalizes “dataset selection” as a distinct problem from traditional sample‑level data selection, emphasizing the importance of source‑level relevance.
- DaSH algorithm: Proposes a hierarchical utility model that simultaneously evaluates individual datasets and their parent groups, enabling efficient generalization from a few observations.
- Empirical gains: Demonstrates up to 26.2 % higher accuracy on two multi‑domain benchmarks (Digit‑Five, DomainNet) compared with state‑of‑the‑art data selection baselines.
- Sample‑efficient exploration: Shows DaSH needs far fewer exploration steps to converge on a high‑utility subset, cutting down computational and labeling costs.
- Robustness analysis: Provides ablations confirming DaSH works well even when relevant datasets are scarce or resources are extremely limited.
Methodology
- Problem setup – Assume a large pool of datasets, each belonging to a higher‑level group (e.g., a university, a public repository). The goal is to pick a subset of datasets under a fixed budget (e.g., total number of samples, compute time); an illustrative formulation of this objective appears after this list.
- Hierarchical utility model – DaSH learns two utility scores:
  - Group utility captures how promising a whole collection is (e.g., “medical imaging labs”).
  - Dataset utility refines this by estimating the value of each individual dataset within a selected group.
The model is trained online: after a small batch of datasets is sampled and evaluated on a downstream task, DaSH updates its utility estimates using a bandit‑style feedback loop.
- Selection strategy – At each iteration DaSH first picks the most promising groups (balancing exploration against exploitation) and then selects the top‑scoring datasets inside those groups. This two‑stage approach dramatically reduces the search space compared with flat, sample‑level selectors (see the code sketch after this list).
- Budget enforcement – The algorithm stops once the cumulative cost of selected datasets reaches the pre‑specified budget, ensuring practical feasibility.
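To make the problem setup concrete, here is a minimal formalization of the selection objective. The notation (groups G, per‑group dataset pools, cost function c, budget B) is ours and may differ from the symbols used in the paper.

```latex
% Illustrative formulation of budget-constrained dataset selection
% (our notation; the paper's exact symbols may differ).
% Choose a subset S of whole datasets, drawn from hierarchically grouped
% pools, that maximizes downstream validation performance under budget B.
\[
S^{*} \;=\; \operatorname*{arg\,max}_{S \,\subseteq\, \bigcup_{g \in \mathcal{G}} \mathcal{D}_g}
\ \mathrm{Perf}\!\left(f_{\theta(S)},\, D_{\mathrm{val}}\right)
\quad \text{subject to} \quad \sum_{D \in S} c(D) \le B,
\]
where $\theta(S)$ denotes the model parameters obtained by training on the union of the selected datasets.
```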
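The hierarchical utility model and two‑stage selection strategy map naturally onto a two‑level bandit loop. The sketch below is a hypothetical illustration of that idea, not the authors' implementation: the function names, the UCB‑style optimism bonus, and the per‑round parameters (`n_groups`, `n_per_group`) are our own assumptions standing in for whatever utility estimator and acquisition rule DaSH actually uses.

```python
import math
from collections import defaultdict

# Hypothetical sketch of a DaSH-style two-stage selection loop.
# All names and the UCB-style scoring are illustrative assumptions,
# not the authors' implementation.

def ucb_score(mean, count, total_rounds, c=1.0):
    """Optimistic utility estimate: exploit the running mean, explore rarely tried arms."""
    if count == 0:
        return float("inf")
    return mean + c * math.sqrt(math.log(total_rounds + 1) / count)

def select_datasets(groups, budget, evaluate, cost, n_groups=2, n_per_group=1):
    """Select whole datasets under a budget.

    groups:   dict mapping group_id -> list of dataset ids (the hierarchy)
    evaluate: callable(dataset_id) -> observed downstream utility (e.g. val-accuracy gain)
    cost:     callable(dataset_id) -> cost of ingesting the dataset (e.g. number of samples)
    """
    g_mean, g_count = defaultdict(float), defaultdict(int)  # group-level utility stats
    d_mean, d_count = defaultdict(float), defaultdict(int)  # dataset-level utility stats
    selected, spent, rounds = [], 0.0, 0

    while spent < budget:
        rounds += 1
        progress = False
        # Stage 1: rank groups by optimistic group utility (exploration vs. exploitation).
        ranked = sorted(groups, key=lambda g: ucb_score(g_mean[g], g_count[g], rounds), reverse=True)
        for g in ranked[:n_groups]:
            # Stage 2: rank the unselected datasets inside the chosen group.
            cands = [d for d in groups[g] if d not in selected]
            cands.sort(key=lambda d: ucb_score(d_mean[d], d_count[d], rounds), reverse=True)
            for d in cands[:n_per_group]:
                if spent + cost(d) > budget:      # budget enforcement
                    return selected
                reward = evaluate(d)              # bandit-style feedback from the downstream task
                selected.append(d)
                spent += cost(d)
                progress = True
                # Online updates of dataset- and group-level utility estimates.
                d_count[d] += 1
                d_mean[d] += (reward - d_mean[d]) / d_count[d]
                g_count[g] += 1
                g_mean[g] += (reward - g_mean[g]) / g_count[g]
        if not progress:
            break  # pool exhausted before the budget was reached
    return selected
```

The structural points this reflects are the ones described above: rewards from evaluating a sampled dataset update both dataset‑level and group‑level utility estimates, and selection stops as soon as the cumulative cost would exceed the budget.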
Results & Findings
| Benchmark | Best baseline | DaSH | Accuracy improvement | Exploration steps |
|---|---|---|---|---|
| Digit‑Five | 71.3 % | 89.5 % | +26.2 % | ~30 % of baseline |
| DomainNet | 62.1 % | 78.4 % | +16.3 % | ~35 % of baseline |
- Higher final performance: DaSH consistently outperforms both naive random selection and sophisticated sample‑level selectors.
- Faster convergence: The hierarchical approach reaches near‑optimal performance after only a fraction of the selections required by flat methods.
- Robustness: Even when the pool contains many low‑quality or irrelevant datasets, DaSH avoids them early, preserving the budget for higher‑utility sources.
Practical Implications
- Cross‑institutional collaborations: Organizations can automatically identify which partner datasets are worth ingesting, saving weeks of manual curation.
- Data marketplace integration: Platforms that sell or share datasets can embed DaSH to recommend bundles that maximize a buyer’s model performance under a cost ceiling.
- Continuous learning pipelines: In production systems that periodically ingest new data sources, DaSH can act as a gatekeeper, ensuring only beneficial datasets are added without human oversight.
- Resource‑constrained training: For edge‑AI or on‑premise setups where compute and storage are limited, DaSH helps allocate those scarce resources to the most impactful data.
Limitations & Future Work
- Assumption of clear hierarchy: DaSH relies on a predefined grouping of datasets; in messy real‑world catalogs, constructing such hierarchies may be non‑trivial.
- Scalability to millions of datasets: While exploration steps are reduced, the current experiments involve hundreds of datasets; handling truly massive pools may need additional indexing or distributed implementations.
- Static utility estimation: The model treats utility as stationary during selection; future work could incorporate concept drift where dataset relevance changes over time.
- Extension to multimodal data: The paper focuses on image classification benchmarks; applying DaSH to text, audio, or multimodal datasets will require modality‑specific utility signals.
Bottom line: DaSH offers a pragmatic, hierarchy‑aware framework for picking whole datasets under budget constraints, delivering sizable accuracy boosts while cutting down the trial‑and‑error overhead that plagues current data‑selection pipelines. Developers building data‑centric AI systems can leverage this approach to automate and scale the often‑manual task of curating high‑quality training data.
Authors
- Xiaona Zhou
- Yingyan Zeng
- Ran Jin
- Ismini Lourentzou
Paper Information
- arXiv ID: 2512.10952v1
- Categories: cs.LG, cs.AI
- Published: December 11, 2025