[Paper] Improving Diversity in Black-box Few-shot Knowledge Distillation

Published: April 28, 2026 at 12:03 PM EDT
4 min read

Source: arXiv - 2604.25795v1

Overview

The paper tackles a real‑world bottleneck in knowledge distillation (KD): compressing a large, high‑performing model (the teacher) into a lightweight model (the student) when you can only query the teacher as a black box and have just a handful of labeled images. By introducing a clever way to generate diverse synthetic data on the fly, the authors dramatically improve the student’s accuracy in this “few‑shot, black‑box” setting.

Key Contributions

  • Adaptive data‑generation loop: A GAN‑based pipeline that continuously selects high‑confidence synthetic images (as judged by the teacher) and feeds them back into the adversarial training process.
  • Diversity‑driven sampling: The selection strategy explicitly encourages a varied set of synthetic samples, addressing the common pitfall of mode collapse in prior few‑shot KD methods.
  • State‑of‑the‑art performance: Empirical gains over existing few‑shot KD baselines on seven benchmark image classification datasets (CIFAR‑10/100, Tiny‑ImageNet, etc.).
  • Open‑source implementation: Full code released, enabling reproducibility and easy integration into existing pipelines.

Methodology

  1. Problem setting – The teacher model is a black box (only forward passes are allowed), and only a small number of real images per class (e.g., 10–50) are available.
  2. Generator‑Discriminator pair – A conditional GAN is trained to synthesize images conditioned on class labels.
  3. Teacher‑guided selection – After each generator update, a batch of synthetic images is passed through the teacher. Images that receive high confidence (i.e., the teacher’s softmax probability for the intended class exceeds a threshold) are selected.
  4. On‑the‑fly diversity boost – Selected images are immediately injected into the discriminator’s training set, forcing the generator to produce new high‑confidence samples rather than repeatedly reproducing the same modes.
  5. Student training – The student learns from two sources: (a) the limited real images and (b) the growing pool of high‑confidence synthetic images, using the usual KD loss (soft‑target cross‑entropy) together with a standard classification loss.
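The teacher-guided selection in step 3 reduces to a per-sample confidence check on the teacher's outputs. A minimal NumPy sketch (the function name and threshold value are illustrative, not taken from the paper):

```python
import numpy as np

def select_confident(teacher_probs, intended_labels, tau=0.9):
    """Return indices of synthetic samples whose teacher softmax
    probability for the intended class exceeds the threshold tau.

    teacher_probs:   (batch, num_classes) probabilities from the teacher
    intended_labels: (batch,) class labels the generator was conditioned on
    """
    conf = teacher_probs[np.arange(len(intended_labels)), intended_labels]
    return np.nonzero(conf > tau)[0]
```

Because only the teacher's output probabilities are needed, this check works against a pure black-box API with no access to weights or gradients.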

The loop repeats: generate → filter → train discriminator → update generator → distill to student. Because the teacher’s confidence acts as a quality filter, the synthetic set stays both accurate and diverse without ever needing internal teacher gradients.
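The distillation step combines a soft-target cross-entropy against the teacher's probabilities with a standard classification loss on the real labels. A minimal NumPy sketch; the temperature T and mixing weight alpha are assumed hyperparameters, not values from the paper, and since the black-box teacher returns probabilities rather than logits, they are used as-is:

```python
import numpy as np

def softmax(z, T=1.0):
    # Numerically stable temperature-scaled softmax.
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kd_loss(student_logits, teacher_probs, labels, T=4.0, alpha=0.7):
    # Soft-target term: cross-entropy between the teacher's probabilities
    # and the temperature-softened student predictions (scaled by T^2).
    p_soft = softmax(student_logits, T)
    soft = -np.mean(np.sum(teacher_probs * np.log(p_soft + 1e-12), axis=1)) * T * T
    # Hard term: standard cross-entropy on the few real labels.
    p = softmax(student_logits)
    hard = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return alpha * soft + (1 - alpha) * hard
```

In a framework implementation the same objective would be expressed with differentiable ops (e.g., PyTorch's `F.kl_div` and `F.cross_entropy`) so gradients flow to the student.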

Results & Findings

Dataset          Real Images/Class   Teacher Acc.   Student Acc. (Prev. SOTA)   Student Acc. (Div‑BFKD)
CIFAR‑10         10                  94.5%          78.2%                       82.6%
CIFAR‑100        5                   76.3%          45.1%                       49.8%
Tiny‑ImageNet    20                  68.9%          38.4%                       42.7%
… (4 more)
  • Diversity matters: Ablation studies show that removing the adaptive selection step drops accuracy by 3–5 pts, confirming that varied synthetic data is a key driver.
  • Efficiency: The GAN training converges within a few thousand iterations; overall runtime is comparable to prior few‑shot KD methods despite the extra selection step.
  • Robustness: The approach works across different teacher architectures (ResNet‑101, EfficientNet‑B4) and student sizes, indicating broad applicability.

Practical Implications

  • Edge AI deployment: Developers can now compress a powerful cloud model into a tiny on‑device model using only a handful of collected images, without needing access to the teacher’s weights or gradients.
  • Privacy‑preserving distillation: Since the teacher is treated as a black box, proprietary models can be shared as APIs while still enabling downstream compression.
  • Rapid prototyping: The on‑the‑fly generation loop eliminates the need for a large synthetic dataset pre‑generation step, letting teams iterate quickly when data is scarce.
  • Tooling integration: The released code can be slotted into existing PyTorch pipelines; the selection threshold is a single hyper‑parameter that can be tuned with a validation set of just a few images.
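Tuning that single threshold can be a plain grid search over a tiny validation set. A sketch with a hypothetical `evaluate` callback that trains the student for a given tau and returns its validation accuracy (the values below are illustrative stand-ins, not results from the paper):

```python
def pick_threshold(taus, evaluate):
    """Grid-search the confidence threshold; `evaluate(tau)` is a
    hypothetical callback returning student validation accuracy."""
    return max(taus, key=evaluate)

# Illustrative stand-in: pretend accuracies already measured for three taus.
measured = {0.80: 0.71, 0.90: 0.78, 0.95: 0.74}
best_tau = pick_threshold(measured, measured.get)
```

With only a handful of candidate values and a few validation images, this search stays cheap relative to the GAN training itself.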

Limitations & Future Work

  • Reliance on teacher confidence: If the teacher is over‑confident on out‑of‑distribution samples, the selection filter may admit low‑quality images, potentially harming the student.
  • Scalability to very high‑resolution data: The current GAN architecture targets 32×32–64×64 images; extending to ImageNet‑scale resolutions would require more sophisticated generators.
  • Few‑shot regime threshold: The method assumes at least a minimal number of real images per class (≈5). Investigating performance under extreme one‑shot conditions remains open.
  • Broader modalities: Future work could explore applying the diversity‑driven black‑box distillation to NLP or speech models, where synthetic data generation poses different challenges.

Authors

  • Tri‑Nhan Vo
  • Dang Nguyen
  • Kien Do
  • Sunil Gupta

Paper Information

  • arXiv ID: 2604.25795v1
  • Categories: cs.CV, cs.LG
  • Published: April 28, 2026
  • PDF: Download PDF