[Paper] Diverse Image Priors for Black-box Data-free Knowledge Distillation

Published: April 28, 2026 at 12:02 PM EDT
5 min read
Source: arXiv - 2604.25794v1

Overview

The paper tackles a tough problem: how to train a lightweight “student” model when you can only query a proprietary “teacher” model for its top‑1 label and you have no access to the original training data. This black‑box, data‑free knowledge distillation scenario is increasingly common in privacy‑sensitive or edge‑computing deployments. The authors introduce Diverse Image Priors Knowledge Distillation (DIP‑KD), a three‑stage pipeline that synthesizes varied visual inputs, sharpens their differences with contrastive learning, and finally distills richer soft‑probability signals using a specially designed “primer” student.
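
To make the access constraint concrete, the sketch below (PyTorch-style Python, not code from the paper) shows the only interface the rest of the pipeline is allowed to touch: a query that returns the teacher's top-1 label and nothing else. The wrapper class and its `top1` method are illustrative names.

```python
import torch

class BlackBoxTeacher:
    """Illustrative wrapper: callers may only request hard top-1 labels;
    logits, features, and gradients stay hidden. In a real deployment the
    teacher would sit behind a hosted API rather than a local model object."""

    def __init__(self, model: torch.nn.Module):
        self.model = model.eval()

    @torch.no_grad()
    def top1(self, images: torch.Tensor) -> torch.Tensor:
        # Hard class indices only -- no soft probabilities are exposed.
        return self.model(images).argmax(dim=1)
```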

Key Contributions

  • Diverse Image Priors (DIP): A generative routine that creates synthetic images covering a broad range of visual patterns and semantics, mitigating the homogeneity problem of earlier synthetic‑data KD methods.
  • Contrastive Enhancement: Introduces a contrastive loss that forces the synthetic samples to be mutually distinctive, boosting the teacher’s informative responses.
  • Primer Student Architecture: A lightweight auxiliary student that first learns from the teacher’s hard top‑1 predictions, then produces soft logits that guide the final student model, effectively extracting richer knowledge from a black‑box teacher.
  • Comprehensive Evaluation: Experiments on 12 diverse benchmarks (image classification, fine‑grained tasks, and robustness tests) show DIP‑KD outperforms prior data‑free KD approaches by a sizable margin.
  • Ablation Study on Diversity: Demonstrates that increasing synthetic data diversity directly correlates with higher student accuracy, confirming the central hypothesis.

Methodology

  1. Synthetic Prior Generation

    • Starts from random noise and iteratively optimizes a generator to produce images that trigger high‑confidence predictions across many teacher classes.
    • Uses class‑agnostic and class‑conditional objectives so that both generic visual textures and class‑specific semantics appear in the generated set (one way to realize this objective is sketched right after this list).
  2. Contrastive Learning Layer

    • Treats each generated image as an anchor and its augmented versions as positives, while other images in the batch act as negatives.
    • A contrastive loss (e.g., InfoNCE) pushes the embeddings of different synthetic samples apart, encouraging the teacher to return more varied predictions across the pool (see the InfoNCE sketch below this list).
  3. Primer Student Distillation

    • A tiny “primer” network first receives the teacher’s hard top‑1 label for each synthetic image and learns a rough mapping.
    • The primer then produces soft probability vectors that approximate the teacher’s hidden confidence distribution.
    • The final student model is trained with the standard KD loss (KL divergence) on these soft targets, gaining access to richer information than the original black‑box interface provides (both training loops are sketched below, after the pipeline note).
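
Because the teacher is a black box, gradients cannot flow through it during prior generation. One common way to realize the objective in step 1 is to back-propagate through a local surrogate network instead (the primer is a natural choice); the sketch below follows that assumption, and every name in it (`generator`, `primer`, `num_classes`, the conditional generator signature) is illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def generator_step(generator, primer, optimizer, batch_size, num_classes, z_dim=128):
    """One illustrative update of the prior generator.

    The black-box teacher provides no gradients, so a local surrogate (here
    the primer) stands in for it: the generator is pushed to produce images
    that the surrogate classifies confidently as randomly sampled target
    classes, spreading the synthetic pool across many categories.
    """
    z = torch.randn(batch_size, z_dim)                       # random noise seeds
    targets = torch.randint(0, num_classes, (batch_size,))   # class-conditional targets
    images = generator(z, targets)                           # assumed conditional generator
    logits = primer(images)
    loss = F.cross_entropy(logits, targets)                  # reward confident hits on the targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```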

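Step 2 amounts to a standard InfoNCE objective over embeddings of the generated batch; below is a minimal sketch, assuming each synthetic image is paired with one augmented view. The embedding network and temperature are free choices here and may differ from the paper's settings.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, temperature=0.1):
    """InfoNCE over a batch of synthetic-image embeddings.

    anchor_emb, positive_emb: (B, D) embeddings of each generated image and of
    one augmented view of it. Every other image in the batch acts as a negative,
    so minimizing this loss keeps the synthetic samples mutually distinctive.
    """
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.t() / temperature                      # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```
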
The whole pipeline is iterative: after a few epochs of student training, the generator is refreshed with the updated student’s embeddings, further diversifying the synthetic pool.
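
Step 3 reduces to two standard training loops: fit the primer on the teacher's hard labels, then distill the final student against the primer's softened outputs with a temperature-scaled KL divergence. A minimal sketch follows; the temperature value and function names are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def primer_step(primer, images, teacher_top1, optimizer):
    """Fit the primer to the teacher's hard top-1 labels -- the only signal
    the black-box interface exposes."""
    loss = F.cross_entropy(primer(images), teacher_top1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def student_step(student, primer, images, optimizer, temperature=4.0):
    """Distill the final student from the primer's soft probabilities using the
    standard temperature-scaled KL-divergence KD loss."""
    with torch.no_grad():
        soft_targets = F.softmax(primer(images) / temperature, dim=1)
    log_probs = F.log_softmax(student(images) / temperature, dim=1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```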

Results & Findings

| Dataset | Teacher (Acc.) | Student (No KD) | Student (Prior SOTA) | Student (DIP‑KD) |
|---|---|---|---|---|
| CIFAR‑100 | 93.2% | 68.1% | 73.4% | 78.9% |
| ImageNet‑Subset (100 classes) | 78.5% | 55.2% | 60.1% | 66.3% |
| Tiny-ImageNet | 71.0% | 44.8% | 49.7% | 55.2% |
  • Across all 12 benchmarks, DIP‑KD improves student accuracy by 5–9% over the previous best data‑free KD methods.
  • Ablation experiments reveal that removing the contrastive module drops performance by ~2.3%, while using a single‑type prior (only class‑conditional) reduces accuracy by ~3.1%.
  • The primer student contributes an additional ~1.8% gain, confirming that extracting soft probabilities—even indirectly—helps the final student.

Practical Implications

  • Secure Model Deployment: Companies can now compress proprietary vision models for edge devices without exposing training data or internal logits, staying compliant with privacy regulations.
  • Rapid Prototyping: Developers can generate a compact student model on‑the‑fly by simply querying the hosted teacher API, accelerating iteration cycles for mobile or IoT applications.
  • Cross‑Domain Transfer: Since the synthetic priors are not tied to any specific dataset, the same pipeline can be reused when the target domain changes (e.g., from medical imaging to autonomous driving) as long as the teacher’s API remains accessible.
  • Cost Reduction: Eliminating the need for large labeled datasets cuts data collection and annotation expenses, especially valuable for niche domains where data is scarce or expensive.

Limitations & Future Work

  • Computational Overhead: Generating diverse priors and running contrastive updates adds a non‑trivial pre‑training cost compared to classic KD.
  • Dependence on Teacher Confidence: If the teacher’s top‑1 predictions are highly deterministic (low entropy), extracting useful soft signals via the primer becomes harder.
  • Scalability to Very Large Class Spaces: The current approach was validated up to ~1000 classes; extending to models with tens of thousands of categories may require more sophisticated prior sampling strategies.
  • Future Directions: The authors suggest exploring adaptive prior budgets (fewer synthetic images for high‑confidence teachers) and integrating self‑supervised vision transformers as the primer to further boost soft‑logit quality.

DIP‑KD demonstrates that, even when you’re limited to a black‑box API and no data, clever synthesis and contrastive tricks can still unlock most of the teacher’s knowledge—opening a practical path for secure, data‑free model compression.

Authors

  • Tri-Nhan Vo
  • Dang Nguyen
  • Trung Le
  • Kien Do
  • Sunil Gupta

Paper Information

  • arXiv ID: 2604.25794v1
  • Categories: cs.LG, cs.CV
  • Published: April 28, 2026
  • PDF: Download PDF