[Paper] HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression
Source: arXiv - 2512.09886v1
Overview
The paper introduces HPM‑KD, a new framework that makes knowledge distillation (KD) far more practical for real‑world model compression. By automating hyper‑parameter tuning, smoothing the teacher‑student capacity gap, and leveraging multiple teachers efficiently, HPM‑KD delivers models up to 15× smaller that retain roughly 85% of the teacher's accuracy in the reported experiments, without the usual trial‑and‑error overhead.
Key Contributions
- Adaptive Configuration Manager – a meta‑learning layer that automatically selects KD hyper‑parameters (learning rates, loss weights, etc.), removing the need for manual grid‑search.
- Progressive Distillation Chain – builds a cascade of intermediate “mid‑size” student models, automatically determining how many steps are needed to bridge the capacity gap between a large teacher and a tiny student.
- Attention‑Weighted Multi‑Teacher Ensemble – learns per‑sample attention scores to combine logits from several teachers, ensuring the most relevant teacher influences each training example (see the sketch below).
- Meta‑Learned Temperature Scheduler – dynamically adjusts the softmax temperature during training, improving the quality of the softened teacher signals.
- Parallel Processing Pipeline – distributes teacher inference and student updates across multiple GPUs/CPU cores with load‑balancing, cutting overall training time by ~30‑40%.
- Shared Optimization Memory – caches optimizer states across experiments, enabling rapid re‑use when fine‑tuning or re‑running distillation with different configurations.
All six components are open‑sourced in the DeepBridge library, ready for plug‑and‑play integration.
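To make the ensemble contribution concrete, here is a minimal PyTorch sketch of per‑sample attention‑weighted fusion of teacher logits. It is an illustration under simplifying assumptions, not the paper's implementation: it assumes all teachers share one label space, and it scores teachers from their logits alone, whereas HPM‑KD's attention network also consumes the raw inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherAttentionFusion(nn.Module):
    """Per-sample attention over the logits of K teachers (illustrative sketch)."""

    def __init__(self, num_teachers: int, num_classes: int, hidden: int = 64):
        super().__init__()
        # Small MLP that scores each teacher from the concatenated teacher logits.
        self.scorer = nn.Sequential(
            nn.Linear(num_teachers * num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_teachers),
        )

    def forward(self, teacher_logits: torch.Tensor) -> torch.Tensor:
        # teacher_logits: (batch, num_teachers, num_classes)
        b, k, c = teacher_logits.shape
        scores = self.scorer(teacher_logits.reshape(b, k * c))   # (batch, K)
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)        # (batch, K, 1)
        # Weighted sum of teacher logits -> one soft target per sample.
        return (weights * teacher_logits).sum(dim=1)             # (batch, num_classes)
```

At training time the fused output would serve as the soft target for the student, e.g. `fused = fusion(torch.stack([t(x) for t in teachers], dim=1))` with the teachers run under `torch.no_grad()`.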
Methodology
- Meta‑Learning for Configuration – Before the actual KD run, a lightweight meta‑learner samples a few candidate hyper‑parameter sets, evaluates each on a short validation run, and updates a Bayesian optimizer that predicts the best configuration for the full run.
- Progressive Chain Construction – Starting from the large teacher, the system automatically inserts intermediate student models whose capacity is chosen to keep the teacher‑student gap below a predefined threshold. Each intermediate model becomes the teacher for the next step, forming a “progressive ladder” (see the sketch after this list).
- Dynamic Multi‑Teacher Fusion – For each training sample, an attention network consumes the raw inputs and the teachers’ logits, outputting a soft weight vector. The weighted sum of logits forms the final soft target for the student.
- Temperature Scheduling – A small recurrent network predicts the optimal temperature at each epoch based on training dynamics (e.g., loss curvature), replacing the static temperature used in classic KD.
- Parallel Execution – Teacher forward passes are batched and dispatched to idle GPUs/CPU cores. A scheduler monitors queue lengths and redistributes work to avoid bottlenecks.
- Shared Memory Optimizer – Optimizer moments (e.g., Adam’s first/second‑order moments) are stored in a shared cache. When a new student model re‑uses a previously trained teacher’s representation, the cache is consulted, accelerating convergence.
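The following sketch, referenced from the Progressive Chain Construction step above, shows one way intermediate capacities could be chosen so that no single teacher‑to‑student step exceeds a capacity‑gap threshold. The geometric‑interpolation rule and the parameter‑count budgets are illustrative assumptions, not the paper's exact heuristic.

```python
import math

def build_progressive_chain(teacher_params: float,
                            student_params: float,
                            max_gap_ratio: float = 4.0) -> list[float]:
    """Return a ladder of parameter budgets from teacher to student such that
    each consecutive step compresses by at most `max_gap_ratio`.
    Illustrative heuristic only, using geometric interpolation."""
    total_ratio = teacher_params / student_params
    # Number of distillation steps needed so each step's ratio <= max_gap_ratio.
    n_steps = max(1, math.ceil(math.log(total_ratio) / math.log(max_gap_ratio)))
    step_ratio = total_ratio ** (1.0 / n_steps)
    # Capacities from teacher -> mid-size students -> final student.
    return [teacher_params / step_ratio ** i for i in range(n_steps + 1)]

# Example: a 36M-parameter teacher distilled down to a 0.9M-parameter student
# (the CIFAR-100 setting from the results table) yields a 3-step ladder of
# roughly [36M, 10.5M, 3.1M, 0.9M] with max_gap_ratio=4.
print(build_progressive_chain(36e6, 0.9e6))
```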
The overall training loop remains a standard PyTorch nn.Module forward‑backward pass, so developers can drop HPM‑KD into existing pipelines with minimal code changes.
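As a concrete illustration of that drop‑in claim, here is a minimal distillation step in plain PyTorch: cross‑entropy on the hard labels plus a temperature‑scaled KL term toward fused teacher logits (e.g., produced by the attention module sketched earlier, with the temperature supplied by whatever scheduler is in use). This is a hedged sketch of the standard KD loss HPM‑KD builds on, not the DeepBridge API.

```python
import torch
import torch.nn.functional as F

def kd_step(student, optimizer, x, labels, fused_teacher_logits,
            temperature: float = 4.0, alpha: float = 0.7):
    """One distillation step: hard-label cross-entropy plus a temperature-scaled
    KL term toward the fused teacher logits (assumed detached / no_grad)."""
    optimizer.zero_grad()
    student_logits = student(x)

    # Hard-label loss on the ground truth.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Softened distributions; the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (standard KD scaling).
    t = temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(fused_teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```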
Results & Findings
| Dataset | Teacher (params) | Student (params) | Compression | Accuracy Retention* | Training‑time reduction |
|---|---|---|---|---|---|
| CIFAR‑10 | ResNet‑110 (1.7 M params) | 0.12 M (MobileNet‑V2‑0.5x) | 14× | 85 % of teacher (≈93 % → 79 %) | –32 % |
| CIFAR‑100 | WideResNet‑28‑10 (36 M) | 0.9 M (ShuffleNet‑V2) | 10× | 84 % of teacher (≈78 % → 66 %) | –38 % |
| Tabular (UCI) | Gradient Boosted Trees (500 M leaves) | 0.05 M MLP | 12× | 86 % of teacher (≈92 % → 79 %) | –30 % |
*Accuracy retention is measured as the percentage of the teacher’s original test accuracy that the compressed student still achieves.
Ablation studies show that each component contributes positively: removing the progressive chain drops retention by ~0.6 pp, disabling the attention‑weighted ensemble loses ~0.4 pp, and skipping the meta‑learned temperature costs ~0.2 pp. The adaptive configuration manager alone eliminates up to 90 % of hyper‑parameter search time.
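For intuition about where that search‑time saving comes from, the sketch below shows a cheap configuration search that evaluates a few sampled hyper‑parameter sets on short proxy runs instead of an exhaustive grid. Random sampling stands in for the paper's Bayesian optimizer, and the search space, names, and `proxy_run` hook are all hypothetical.

```python
import random
from typing import Any, Callable, Dict

# Hypothetical search space for KD hyper-parameters (names are illustrative).
SEARCH_SPACE = {
    "temperature": [2.0, 4.0, 8.0],
    "alpha_kd":    [0.3, 0.5, 0.7, 0.9],  # weight on the distillation loss term
    "lr":          [1e-3, 3e-4, 1e-4],
}

def search_config(proxy_run: Callable[[Dict[str, Any]], float],
                  n_trials: int = 8) -> Dict[str, Any]:
    """Evaluate a few randomly sampled configurations with a cheap proxy run
    (e.g., a few-epoch distillation on a validation split) and keep the best.
    Random sampling stands in for a Bayesian optimizer in this sketch."""
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: random.choice(choices) for name, choices in SEARCH_SPACE.items()}
        loss = proxy_run(cfg)          # short validation-loss estimate
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg
```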
Practical Implications
- Faster Model Shipping – Developers can now generate ultra‑lightweight inference models (e.g., for edge devices, mobile apps, or IoT) without spending weeks tuning KD hyper‑parameters.
- Multi‑Teacher Ensembles Made Viable – The attention‑weighted fusion lets you benefit from several high‑performing teachers (e.g., a vision transformer + a CNN) while keeping the final model tiny, opening doors for hybrid‑knowledge transfer.
- Resource‑Efficient Training – Parallel pipelines and shared optimizer states reduce GPU‑hour costs, which is especially valuable for startups or teams with limited cloud budgets.
- Plug‑and‑Play Integration – Because HPM‑KD lives as a thin wrapper around standard PyTorch training loops, existing CI/CD pipelines for model updates can adopt it with a few configuration files.
- Open‑Source Availability – The DeepBridge implementation means you can inspect, extend, or benchmark the framework against your own proprietary teachers, fostering reproducibility and community contributions.
In short, HPM‑KD turns knowledge distillation from a research curiosity into a production‑ready compression tool.
Limitations & Future Work
- Scalability to Very Large Datasets – Experiments are limited to CIFAR‑scale vision and modest tabular data; the authors note that the progressive chain may need additional heuristics for ImageNet‑scale tasks.
- Teacher Diversity Assumption – The attention mechanism assumes teachers produce logits of compatible dimensionality; handling heterogeneous output spaces (e.g., classification + detection) remains an open challenge.
- Meta‑Learning Overhead – While the configuration manager removes manual tuning, its initial meta‑learning phase still consumes a small fraction of total compute, which could be prohibitive in ultra‑low‑budget settings.
- Future Directions – Extending HPM‑KD to self‑supervised pre‑training, exploring neural architecture search for intermediate students, and integrating hardware‑aware latency constraints directly into the progressive chain.
Overall, HPM‑KD offers a compelling step forward for developers who need high‑compression models without the usual engineering headaches, while leaving room for further scaling and specialization.
Authors
- Gustavo Coelho Haase
- Paulo Henrique Dourado da Silva
Paper Information
- arXiv ID: 2512.09886v1
- Categories: cs.LG, stat.AP
- Published: December 10, 2025
- PDF: https://arxiv.org/pdf/2512.09886v1