[Paper] Improving Diversity in Black-box Few-shot Knowledge Distillation
Source: arXiv - 2604.25795v1
Overview
The paper tackles a real‑world bottleneck in knowledge distillation (KD): compressing a large, high‑performing model (the teacher) into a lightweight model (the student) when you can only query the teacher as a black box and have just a handful of labeled images. By introducing a clever way to generate diverse synthetic data on the fly, the authors dramatically improve the student’s accuracy in this “few‑shot, black‑box” setting.
Key Contributions
- Adaptive data‑generation loop: A GAN‑based pipeline that continuously selects high‑confidence synthetic images (as judged by the teacher) and feeds them back into the adversarial training process.
- Diversity‑driven sampling: The selection strategy explicitly encourages a varied set of synthetic samples, addressing the common pitfall of mode collapse in prior few‑shot KD methods.
- State‑of‑the‑art performance: Empirical gains over existing few‑shot KD baselines on seven benchmark image classification datasets (CIFAR‑10/100, Tiny‑ImageNet, etc.).
- Open‑source implementation: Full code released, enabling reproducibility and easy integration into existing pipelines.
Methodology
- Problem setting – The teacher model is a black box (only forward passes are allowed) and only a handful of real images per class (5–20 in the experiments) are available.
- Generator‑Discriminator pair – A conditional GAN is trained to synthesize images conditioned on class labels.
- Teacher‑guided selection – After each generator update, a batch of synthetic images is passed through the teacher. Images that receive high confidence (i.e., the teacher’s softmax probability for the intended class exceeds a threshold) are selected.
- On‑the‑fly diversity boost – Selected images are immediately injected into the discriminator’s training set, forcing the generator to produce new high‑confidence samples rather than repeatedly reproducing the same modes.
- Student training – The student learns from two sources: (a) the limited real images and (b) the growing pool of high‑confidence synthetic images, using the usual KD loss (soft‑target cross‑entropy) together with a standard classification loss.
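The student objective described above can be sketched as a weighted blend of temperature-scaled soft-target cross-entropy and hard-label cross-entropy. This is a minimal pure-Python illustration of that standard KD loss; the temperature `T` and mixing weight `alpha` are hypothetical hyper-parameter values, not numbers from the paper:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.7):
    """Standard KD objective: soft-target cross-entropy against the teacher's
    temperature-softened outputs, blended with hard-label cross-entropy.
    T and alpha are illustrative values."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Soft-target term; the T^2 factor keeps gradient scale comparable across T
    soft_ce = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    # Ordinary classification term on the ground-truth label
    hard_ce = -math.log(softmax(student_logits)[true_label])
    return alpha * (T * T) * soft_ce + (1 - alpha) * hard_ce
```

For synthetic images, the "label" is the class the generator was conditioned on, so the same loss applies to both real and generated samples.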
The loop repeats: generate → filter → train discriminator → update generator → distill to student. Because the teacher’s confidence acts as a quality filter, the synthetic set stays both accurate and diverse without ever needing internal teacher gradients.
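The generate → filter → accumulate portion of this loop can be sketched as follows. Everything here is a toy stand-in: `teacher_predict` mimics a black-box API that returns class probabilities, `generate_batch` mimics the conditional generator, and the confidence threshold of 0.8 is a hypothetical value, not the paper's setting:

```python
import random

random.seed(0)
NUM_CLASSES = 3
THRESHOLD = 0.8  # hypothetical teacher-confidence threshold

def teacher_predict(image):
    """Stand-in for the black-box teacher: returns a probability vector.
    Toy rule: confidence on the intended class equals the sample's 'quality'."""
    label, quality = image
    probs = [(1.0 - quality) / (NUM_CLASSES - 1)] * NUM_CLASSES
    probs[label] = quality
    return probs

def generate_batch(n):
    """Stand-in for the conditional GAN: (intended_label, quality) pairs."""
    return [(random.randrange(NUM_CLASSES), random.random()) for _ in range(n)]

def filter_high_confidence(batch):
    """Selection step: keep samples whose teacher probability for the
    intended class exceeds the threshold (only forward passes needed)."""
    return [img for img in batch
            if teacher_predict(img)[img[0]] >= THRESHOLD]

pool = []                       # growing set of high-confidence synthetic images
for step in range(5):           # generate -> filter -> accumulate
    pool.extend(filter_high_confidence(generate_batch(32)))
# `pool` would feed both the discriminator update and the student's KD step
```

The key point the sketch captures is that the filter only ever calls the teacher's forward pass, never its gradients, so the whole loop stays black-box.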
Results & Findings
| Dataset | # Real Images per Class | Teacher Acc. | Student Acc. (Prev. SOTA) | Student Acc. (Div‑BFKD) |
|---|---|---|---|---|
| CIFAR‑10 | 10 | 94.5% | 78.2% | 82.6% |
| CIFAR‑100 | 5 | 76.3% | 45.1% | 49.8% |
| Tiny‑ImageNet | 20 | 68.9% | 38.4% | 42.7% |
| … (4 more) | – | – | – | – |
- Diversity matters: Ablation studies show that removing the adaptive selection step reduces accuracy by 3–5 percentage points, confirming that varied synthetic data is a key driver.
- Efficiency: The GAN training converges within a few thousand iterations; overall runtime is comparable to prior few‑shot KD methods despite the extra selection step.
- Robustness: The approach works across different teacher architectures (ResNet‑101, EfficientNet‑B4) and student sizes, indicating broad applicability.
Practical Implications
- Edge AI deployment: Developers can now compress a powerful cloud model into a tiny on‑device model using only a handful of collected images, without needing access to the teacher’s weights or gradients.
- Privacy‑preserving distillation: Since the teacher is treated as a black box, proprietary models can be shared as APIs while still enabling downstream compression.
- Rapid prototyping: The on‑the‑fly generation loop eliminates the need for a large synthetic dataset pre‑generation step, letting teams iterate quickly when data is scarce.
- Tooling integration: The released code can be slotted into existing PyTorch pipelines; the selection threshold is a single hyper‑parameter that can be tuned with a validation set of just a few images.
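Tuning that single threshold can be as simple as a grid search against a tiny validation set. The sketch below uses a made-up accuracy curve purely to show the shape of the search; in practice `student_val_accuracy` would train (or partially train) a student at each threshold and score it on the few held-out images:

```python
def student_val_accuracy(threshold):
    """Hypothetical proxy for validation accuracy as a function of the
    selection threshold: too low admits noisy samples, too high starves
    the synthetic pool. The quadratic form is illustrative only."""
    return 1.0 - (threshold - 0.85) ** 2

# Small grid of candidate confidence thresholds
candidates = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
best = max(candidates, key=student_val_accuracy)
```

Because only one hyper-parameter is searched, even an exhaustive grid stays cheap relative to the distillation itself.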
Limitations & Future Work
- Reliance on teacher confidence: If the teacher is over‑confident on out‑of‑distribution samples, the selection filter may admit low‑quality images, potentially harming the student.
- Scalability to very high‑resolution data: The current GAN architecture targets 32×32–64×64 images; extending to ImageNet‑scale resolutions would require more sophisticated generators.
- Few‑shot regime threshold: The method assumes at least a minimal number of real images per class (≈5). Investigating performance under extreme one‑shot conditions remains open.
- Broader modalities: Future work could explore applying the diversity‑driven black‑box distillation to NLP or speech models, where synthetic data generation poses different challenges.
Authors
- Tri‑Nhan Vo
- Dang Nguyen
- Kien Do
- Sunil Gupta
Paper Information
- arXiv ID: 2604.25795v1
- Categories: cs.CV, cs.LG
- Published: April 28, 2026