[Paper] Improving Diversity in Black-box Few-shot Knowledge Distillation
Source: arXiv - 2604.25795v1
Overview
The paper tackles a real‑world bottleneck in knowledge distillation (KD): compressing a large, high‑performing model (the teacher) into a lightweight model (the student) when you can only query the teacher as a black box and have just a handful of labeled images. By introducing a clever way to generate diverse synthetic data on the fly, the authors dramatically improve the student’s accuracy in this “few‑shot, black‑box” setting.
Key Contributions
- Adaptive data‑generation loop: A GAN‑based pipeline that continuously selects high‑confidence synthetic images (as judged by the teacher) and feeds them back into the adversarial training process.
- Diversity‑driven sampling: The selection strategy explicitly encourages a varied set of synthetic samples, addressing the common pitfall of mode collapse in prior few‑shot KD methods.
- State‑of‑the‑art performance: Empirical gains over existing few‑shot KD baselines on seven benchmark image classification datasets (CIFAR‑10/100, Tiny‑ImageNet, etc.).
- Open‑source implementation: Full code released, enabling reproducibility and easy integration into existing pipelines.
Methodology
- Problem setting – The teacher model is a black box (only forward passes are allowed) and only a handful of real images per class (5–20 in the experiments) are available.
- Generator‑Discriminator pair – A conditional GAN is trained to synthesize images conditioned on class labels.
- Teacher‑guided selection – After each generator update, a batch of synthetic images is passed through the teacher. Images that receive high confidence (i.e., the teacher’s softmax probability for the intended class exceeds a threshold) are selected.
- On‑the‑fly diversity boost – Selected images are immediately injected into the discriminator’s training set, forcing the generator to produce new high‑confidence samples rather than repeatedly reproducing the same modes.
- Student training – The student learns from two sources: (a) the limited real images and (b) the growing pool of high‑confidence synthetic images, using the usual KD loss (soft‑target cross‑entropy) together with a standard classification loss.
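The student objective described above can be sketched as a weighted blend of temperature-scaled soft-target cross-entropy and hard-label cross-entropy. This is a minimal pure-Python illustration of that standard KD loss; the temperature `T` and mixing weight `alpha` are hypothetical hyper-parameter values, not numbers from the paper:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.7):
    """Standard KD objective: soft-target cross-entropy against the teacher's
    temperature-softened outputs, blended with hard-label cross-entropy.
    T and alpha are illustrative values."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Soft-target term; the T^2 factor keeps gradient scale comparable across T
    soft_ce = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    # Ordinary classification term on the ground-truth label
    hard_ce = -math.log(softmax(student_logits)[true_label])
    return alpha * (T * T) * soft_ce + (1 - alpha) * hard_ce
```

For synthetic images, the "label" is the class the generator was conditioned on, so the same loss applies to both real and generated samples.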
The loop repeats: generate → filter → train discriminator → update generator → distill to student. Because the teacher’s confidence acts as a quality filter, the synthetic set stays both accurate and diverse without ever needing internal teacher gradients.
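The generate → filter → accumulate portion of this loop can be sketched as follows. Everything here is a toy stand-in: `teacher_predict` mimics a black-box API that returns class probabilities, `generate_batch` mimics the conditional generator, and the confidence threshold of 0.8 is a hypothetical value, not the paper's setting:

```python
import random

random.seed(0)
NUM_CLASSES = 3
THRESHOLD = 0.8  # hypothetical teacher-confidence threshold

def teacher_predict(image):
    """Stand-in for the black-box teacher: returns a probability vector.
    Toy rule: confidence on the intended class equals the sample's 'quality'."""
    label, quality = image
    probs = [(1.0 - quality) / (NUM_CLASSES - 1)] * NUM_CLASSES
    probs[label] = quality
    return probs

def generate_batch(n):
    """Stand-in for the conditional GAN: (intended_label, quality) pairs."""
    return [(random.randrange(NUM_CLASSES), random.random()) for _ in range(n)]

def filter_high_confidence(batch):
    """Selection step: keep samples whose teacher probability for the
    intended class exceeds the threshold (only forward passes needed)."""
    return [img for img in batch
            if teacher_predict(img)[img[0]] >= THRESHOLD]

pool = []                       # growing set of high-confidence synthetic images
for step in range(5):           # generate -> filter -> accumulate
    pool.extend(filter_high_confidence(generate_batch(32)))
# `pool` would feed both the discriminator update and the student's KD step
```

The key point the sketch captures is that the filter only ever calls the teacher's forward pass, never its gradients, so the whole loop stays black-box.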
Results & Findings
| Dataset | # Real Images per Class | Teacher Acc. | Student Acc. (Prev. SOTA) | Student Acc. (Div‑BFKD) |
|---|---|---|---|---|
| CIFAR‑10 | 10 | 94.5% | 78.2% | 82.6% |
| CIFAR‑100 | 5 | 76.3% | 45.1% | 49.8% |
| Tiny‑ImageNet | 20 | 68.9% | 38.4% | 42.7% |
| … (4 more) | – | – | – | – |
- Diversity matters: Ablation studies show that removing the adaptive selection step reduces accuracy by 3–5 percentage points, confirming that varied synthetic data is a key driver.
- Efficiency: The GAN training converges within a few thousand iterations; overall runtime is comparable to prior few‑shot KD methods despite the extra selection step.
- Robustness: The approach works across different teacher architectures (ResNet‑101, EfficientNet‑B4) and student sizes, indicating broad applicability.
Practical Implications
- Edge AI deployment: Developers can now compress a powerful cloud model into a tiny on‑device model using only a handful of collected images, without needing access to the teacher’s weights or gradients.
- Privacy‑preserving distillation: Since the teacher is treated as a black box, proprietary models can be shared as APIs while still enabling downstream compression.
- Rapid prototyping: The on‑the‑fly generation loop eliminates the need for a large synthetic dataset pre‑generation step, letting teams iterate quickly when data is scarce.
- Tooling integration: The released code can be slotted into existing PyTorch pipelines; the selection threshold is a single hyper‑parameter that can be tuned with a validation set of just a few images.
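Tuning that single threshold can be as simple as a grid search against a tiny validation set. The sketch below uses a made-up accuracy curve purely to show the shape of the search; in practice `student_val_accuracy` would train (or partially train) a student at each threshold and score it on the few held-out images:

```python
def student_val_accuracy(threshold):
    """Hypothetical proxy for validation accuracy as a function of the
    selection threshold: too low admits noisy samples, too high starves
    the synthetic pool. The quadratic form is illustrative only."""
    return 1.0 - (threshold - 0.85) ** 2

# Small grid of candidate confidence thresholds
candidates = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
best = max(candidates, key=student_val_accuracy)
```

Because only one hyper-parameter is searched, even an exhaustive grid stays cheap relative to the distillation itself.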
Limitations & Future Work
- Reliance on teacher confidence: If the teacher is over‑confident on out‑of‑distribution samples, the selection filter may admit low‑quality images, potentially harming the student.
- Scalability to very high‑resolution data: The current GAN architecture targets 32×32–64×64 images; extending to ImageNet‑scale resolutions would require more sophisticated generators.
- Few‑shot regime threshold: The method assumes at least a minimal number of real images per class (≈5). Investigating performance under extreme one‑shot conditions remains open.
- Broader modalities: Future work could explore applying the diversity‑driven black‑box distillation to NLP or speech models, where synthetic data generation poses different challenges.
Authors
- Tri‑Nhan Vo
- Dang Nguyen
- Kien Do
- Sunil Gupta
Paper Information
- arXiv ID: 2604.25795v1
- Categories: cs.CV, cs.LG
- Published: April 28, 2026