[Paper] Hierarchical Dataset Selection for High-Quality Data Sharing
Source: arXiv - 2512.10952v1
Overview
Modern machine learning models thrive on large, high‑quality training data, but in practice data lives in many separate repositories—think public datasets, corporate data lakes, or cross‑institutional collaborations. This paper formalizes dataset selection: choosing whole datasets (rather than individual samples) from a heterogeneous pool to maximize downstream performance while respecting budget constraints. The authors introduce DaSH (Dataset Selection via Hierarchies), a method that leverages the natural hierarchy of data sources (e.g., collections, institutions) to make smarter, faster selection decisions.
Key Contributions
- Task definition: Formalizes “dataset selection” as a distinct problem from traditional sample‑level data selection, emphasizing the importance of source‑level relevance.
- DaSH algorithm: Proposes a hierarchical utility model that simultaneously evaluates individual datasets and their parent groups, enabling efficient generalization from a few observations.
- Empirical gains: Demonstrates up to 26.2 % higher accuracy on two multi‑domain benchmarks (Digit‑Five, DomainNet) compared with state‑of‑the‑art data selection baselines.
- Sample‑efficient exploration: Shows DaSH needs far fewer exploration steps to converge on a high‑utility subset, cutting down computational and labeling costs.
- Robustness analysis: Provides ablations confirming DaSH works well even when relevant datasets are scarce or resources are extremely limited.
Methodology
- Problem setup – Assume a large pool of datasets, each belonging to a higher‑level group (e.g., a university, a public repository). The goal is to pick a subset of datasets under a fixed budget (e.g., total number of samples, compute time); an illustrative formulation of this objective appears after this list.
- Hierarchical utility model – DaSH learns two utility scores:
  - Group utility captures how promising a whole collection is (e.g., “medical imaging labs”).
  - Dataset utility refines this by estimating the value of each individual dataset within a selected group.
The model is trained online: after a small batch of datasets is sampled and evaluated on a downstream task, DaSH updates its utility estimates using a bandit‑style feedback loop.
- Selection strategy – At each iteration DaSH first picks the most promising groups (balancing exploration against exploitation) and then selects the top‑scoring datasets inside those groups. This two‑stage approach dramatically reduces the search space compared with flat, sample‑level selectors (see the code sketch after this list).
- Budget enforcement – The algorithm stops once the cumulative cost of selected datasets reaches the pre‑specified budget, ensuring practical feasibility.
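To make the problem setup concrete, here is a minimal formalization of the selection objective. The notation (groups G, per‑group dataset pools, cost function c, budget B) is ours and may differ from the symbols used in the paper.

```latex
% Illustrative formulation of budget-constrained dataset selection
% (our notation; the paper's exact symbols may differ).
% Choose a subset S of whole datasets, drawn from hierarchically grouped
% pools, that maximizes downstream validation performance under budget B.
\[
S^{*} \;=\; \operatorname*{arg\,max}_{S \,\subseteq\, \bigcup_{g \in \mathcal{G}} \mathcal{D}_g}
\ \mathrm{Perf}\!\left(f_{\theta(S)},\, D_{\mathrm{val}}\right)
\quad \text{subject to} \quad \sum_{D \in S} c(D) \le B,
\]
where $\theta(S)$ denotes the model parameters obtained by training on the union of the selected datasets.
```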
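The hierarchical utility model and two‑stage selection strategy map naturally onto a two‑level bandit loop. The sketch below is a hypothetical illustration of that idea, not the authors' implementation: the function names, the UCB‑style optimism bonus, and the per‑round parameters (`n_groups`, `n_per_group`) are our own assumptions standing in for whatever utility estimator and acquisition rule DaSH actually uses.

```python
import math
from collections import defaultdict

# Hypothetical sketch of a DaSH-style two-stage selection loop.
# All names and the UCB-style scoring are illustrative assumptions,
# not the authors' implementation.

def ucb_score(mean, count, total_rounds, c=1.0):
    """Optimistic utility estimate: exploit the running mean, explore rarely tried arms."""
    if count == 0:
        return float("inf")
    return mean + c * math.sqrt(math.log(total_rounds + 1) / count)

def select_datasets(groups, budget, evaluate, cost, n_groups=2, n_per_group=1):
    """Select whole datasets under a budget.

    groups:   dict mapping group_id -> list of dataset ids (the hierarchy)
    evaluate: callable(dataset_id) -> observed downstream utility (e.g. val-accuracy gain)
    cost:     callable(dataset_id) -> cost of ingesting the dataset (e.g. number of samples)
    """
    g_mean, g_count = defaultdict(float), defaultdict(int)  # group-level utility stats
    d_mean, d_count = defaultdict(float), defaultdict(int)  # dataset-level utility stats
    selected, spent, rounds = [], 0.0, 0

    while spent < budget:
        rounds += 1
        progress = False
        # Stage 1: rank groups by optimistic group utility (exploration vs. exploitation).
        ranked = sorted(groups, key=lambda g: ucb_score(g_mean[g], g_count[g], rounds), reverse=True)
        for g in ranked[:n_groups]:
            # Stage 2: rank the unselected datasets inside the chosen group.
            cands = [d for d in groups[g] if d not in selected]
            cands.sort(key=lambda d: ucb_score(d_mean[d], d_count[d], rounds), reverse=True)
            for d in cands[:n_per_group]:
                if spent + cost(d) > budget:      # budget enforcement
                    return selected
                reward = evaluate(d)              # bandit-style feedback from the downstream task
                selected.append(d)
                spent += cost(d)
                progress = True
                # Online updates of dataset- and group-level utility estimates.
                d_count[d] += 1
                d_mean[d] += (reward - d_mean[d]) / d_count[d]
                g_count[g] += 1
                g_mean[g] += (reward - g_mean[g]) / g_count[g]
        if not progress:
            break  # pool exhausted before the budget was reached
    return selected
```

The structural points this reflects are the ones described above: rewards from evaluating a sampled dataset update both dataset‑level and group‑level utility estimates, and selection stops as soon as the cumulative cost would exceed the budget.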
Results & Findings
| Benchmark | Best baseline | DaSH | Accuracy improvement | Exploration steps |
|---|---|---|---|---|
| Digit‑Five | 71.3 % | 89.5 % | +26.2 % | ~30 % of baseline |
| DomainNet | 62.1 % | 78.4 % | +16.3 % | ~35 % of baseline |
- Higher final performance: DaSH consistently outperforms both naive random selection and sophisticated sample‑level selectors.
- Faster convergence: The hierarchical approach reaches near‑optimal performance after only a fraction of the selections required by flat methods.
- Robustness: Even when the pool contains many low‑quality or irrelevant datasets, DaSH avoids them early, preserving the budget for higher‑utility sources.
Practical Implications
- Cross‑institutional collaborations: Organizations can automatically identify which partner datasets are worth ingesting, saving weeks of manual curation.
- Data marketplace integration: Platforms that sell or share datasets can embed DaSH to recommend bundles that maximize a buyer’s model performance under a cost ceiling.
- Continuous learning pipelines: In production systems that periodically ingest new data sources, DaSH can act as a gatekeeper, ensuring only beneficial datasets are added without human oversight.
- Resource‑constrained training: For edge‑AI or on‑premise setups where compute and storage are limited, DaSH helps allocate those scarce resources to the most impactful data.
Limitations & Future Work
- Assumption of clear hierarchy: DaSH relies on a predefined grouping of datasets; in messy real‑world catalogs, constructing such hierarchies may be non‑trivial.
- Scalability to millions of datasets: While exploration steps are reduced, the current experiments involve hundreds of datasets; handling truly massive pools may need additional indexing or distributed implementations.
- Static utility estimation: The model treats utility as stationary during selection; future work could incorporate concept drift where dataset relevance changes over time.
- Extension to multimodal data: The paper focuses on image classification benchmarks; applying DaSH to text, audio, or multimodal datasets will require modality‑specific utility signals.
Bottom line: DaSH offers a pragmatic, hierarchy‑aware framework for picking whole datasets under budget constraints, delivering sizable accuracy boosts while cutting down the trial‑and‑error overhead that plagues current data‑selection pipelines. Developers building data‑centric AI systems can leverage this approach to automate and scale the often‑manual task of curating high‑quality training data.
Authors
- Xiaona Zhou
- Yingyan Zeng
- Ran Jin
- Ismini Lourentzou
Paper Information
- arXiv ID: 2512.10952v1
- Categories: cs.LG, cs.AI
- Published: December 11, 2025