[Paper] A Dataset is Worth 1 MB
Source: arXiv - 2602.23358v1
Overview
The paper introduces PLADA (Pseudo‑Labels as Data), a radical way to share training data without ever sending pixels. By assuming every client already holds a large, generic image collection (e.g., ImageNet), the server only needs to ship a compact list of reference‑image IDs paired with pseudo‑labels for the new task's classes. The result: a complete training signal that fits into a sub‑megabyte payload while still delivering strong classification performance.
Key Contributions
- Pixel‑free dataset transmission – eliminates the need to ship raw images, reducing communication to pure label metadata.
- Reference‑dataset pruning – a systematic selection algorithm that extracts the subset of reference images most semantically aligned with the target task.
- Ultra‑compact payload – demonstrates end‑to‑end training with less than 1 MB of transmitted data across ten benchmark datasets.
- Empirical validation – shows that PLADA matches or exceeds the accuracy of traditional dataset distillation and full‑dataset baselines despite the massive compression.
- Framework‑agnostic design – works with any downstream model architecture because the “data” are just labeled examples from the shared reference pool.
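The sub‑megabyte claim is easy to sanity‑check with back‑of‑envelope arithmetic. The sketch below is illustrative only: the 4‑byte image‑ID and 2‑byte label encodings are assumptions for the sake of the estimate, not details from the paper.

```python
def payload_bytes(num_classes, images_per_class, id_bytes=4, label_bytes=2):
    """Estimate the size of a label-only 'dataset': one (image ID, label)
    pair per selected reference image, at the assumed fixed widths."""
    return num_classes * images_per_class * (id_bytes + label_bytes)

# e.g. 100 target classes, 500 selected reference images per class:
size = payload_bytes(100, 500)
print(size / 1e6, "MB")  # prints 0.3 MB -- comfortably under 1 MB
```

Even a generous selection budget (50,000 pseudo‑labeled images) stays well below the 1 MB figure reported for the ten benchmarks.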
Methodology
- Assumption: Every client already stores a large, unlabeled reference corpus (e.g., ImageNet‑1K).
- Task definition: The server receives a new classification problem (target dataset) and extracts its class semantics.
- Semantic matching: Using a pre‑trained feature extractor, the method computes similarity between each reference image and the target class prototypes.
- Pruning & labeling: For each target class, the algorithm selects the top‑k most similar reference images and assigns them the target class label (the “pseudo‑label”).
- Transmission: Only the list of selected image IDs and their new labels is sent – a few hundred kilobytes at most.
- Local training: Clients load the corresponding reference images from their local store, apply the received pseudo‑labels, and train any model they prefer (CNN, transformer, etc.).
The core idea is that the visual content is already present on the client; the server merely tells the client which images to treat as examples for each new class.
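The matching‑and‑pruning pipeline above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: in practice the feature vectors would come from a pre‑trained extractor, and the names (`select_pseudo_labels`, `ref_features`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_pseudo_labels(ref_features, class_prototypes, k):
    """For each target class, pick the k reference images most similar to the
    class prototype and assign them that label. The returned list of
    (image_id, label) pairs is the entire transmitted payload."""
    payload = []
    for label, proto in class_prototypes.items():
        ranked = sorted(ref_features,
                        key=lambda img_id: cosine(ref_features[img_id], proto),
                        reverse=True)
        payload.extend((img_id, label) for img_id in ranked[:k])
    return payload

# Toy reference pool with 2-D "features" standing in for extractor embeddings.
ref_features = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.9, 0.1]}
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
payload = select_pseudo_labels(ref_features, prototypes, k=1)
print(payload)  # [('img_a', 'cat'), ('img_b', 'dog')]
```

The client side then simply looks up each `image_id` in its local store and trains on the pair `(image, label)` as if it were ordinary labeled data.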
Results & Findings
| Target Dataset | Payload Size | Top‑1 Accuracy (PLADA) | Baseline (Full Data) |
|---|---|---|---|
| CIFAR‑10 | 0.8 MB | 93.2 % | 94.5 % |
| Flowers102 | 0.9 MB | 88.7 % | 90.1 % |
| Stanford Cars | 0.7 MB | 84.3 % | 85.6 % |
| … (7 more) | <1 MB each | within 1–2 % of full‑data | — |
- Across ten diverse benchmarks, PLADA consistently stays within 1–2 % of the accuracy achieved when transmitting the entire training set.
- Compared to state‑of‑the‑art dataset distillation, PLADA reduces the payload by 5–10× while delivering comparable or better performance.
- Ablation studies confirm that the pruning step (selecting semantically relevant images) is the primary driver of both compression and accuracy gains.
Practical Implications
- Edge & IoT deployments: Devices with limited bandwidth (e.g., drones, smartphones, embedded sensors) can receive new visual tasks without heavy downloads.
- Federated learning ecosystems: A central coordinator can broadcast task updates to thousands of participants by sending only label indices, dramatically cutting network load.
- Rapid prototyping: Teams can experiment with new classification problems by re‑using a shared image pool, avoiding the logistics of curating and distributing fresh datasets.
- Cost savings: Cloud providers can lower egress charges for dataset serving, and organizations can reduce storage duplication across sites.
- Privacy‑preserving pipelines: Since raw pixels never leave the client, PLADA aligns with scenarios where data residency rules prohibit moving images across borders.
Limitations & Future Work
- Dependency on a universal reference corpus: The approach assumes all clients have the same large unlabeled dataset; maintaining such a corpus may be impractical for niche domains.
- Semantic gap for highly specialized tasks: When target classes have little visual overlap with the reference set, pruning may struggle to find suitable proxies, hurting accuracy.
- Label noise risk: Pseudo‑labels are inferred from similarity, which can introduce mislabeled examples; future work could incorporate confidence weighting or active verification.
- Extension beyond classification: The current formulation focuses on image classification; adapting PLADA to detection, segmentation, or multimodal tasks remains an open challenge.
Overall, PLADA opens a compelling avenue for ultra‑lightweight dataset distribution, turning the classic “data‑heavy” paradigm on its head and offering a practical tool for developers building scalable, bandwidth‑constrained AI services.
Authors
- Elad Kimchi Shoshani
- Leeyam Gabay
- Yedid Hoshen
Paper Information
- arXiv ID: 2602.23358v1
- Categories: cs.LG, cs.CV
- Published: February 26, 2026