[Paper] HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression
Source: arXiv - 2512.09886v1
Overview
The paper introduces HPM‑KD, a new framework that makes knowledge distillation (KD) far more practical for real‑world model compression. By automating hyper‑parameter tuning, smoothing the teacher‑student capacity gap, and leveraging multiple teachers efficiently, HPM‑KD delivers models up to 15× smaller that retain roughly 85% of the teacher's accuracy in the reported experiments, without the usual trial‑and‑error overhead.
Key Contributions
- Adaptive Configuration Manager – a meta‑learning layer that automatically selects KD hyper‑parameters (learning rates, loss weights, etc.), removing the need for manual grid‑search.
- Progressive Distillation Chain – builds a cascade of intermediate “mid‑size” student models, automatically determining how many steps are needed to bridge the capacity gap between a large teacher and a tiny student.
- Attention‑Weighted Multi‑Teacher Ensemble – learns per‑sample attention scores to combine logits from several teachers, ensuring the most relevant teacher influences each training example (see the sketch below).
- Meta‑Learned Temperature Scheduler – dynamically adjusts the softmax temperature during training, improving the quality of the softened teacher signals.
- Parallel Processing Pipeline – distributes teacher inference and student updates across multiple GPUs/CPU cores with load‑balancing, cutting overall training time by ~30‑40%.
- Shared Optimization Memory – caches optimizer states across experiments, enabling rapid re‑use when fine‑tuning or re‑running distillation with different configurations.
All six components are open‑sourced in the DeepBridge library, ready for plug‑and‑play integration.
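To make the ensemble contribution concrete, here is a minimal PyTorch sketch of per‑sample attention‑weighted fusion of teacher logits. It is an illustration under simplifying assumptions, not the paper's implementation: it assumes all teachers share one label space, and it scores teachers from their logits alone, whereas HPM‑KD's attention network also consumes the raw inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherAttentionFusion(nn.Module):
    """Per-sample attention over the logits of K teachers (illustrative sketch)."""

    def __init__(self, num_teachers: int, num_classes: int, hidden: int = 64):
        super().__init__()
        # Small MLP that scores each teacher from the concatenated teacher logits.
        self.scorer = nn.Sequential(
            nn.Linear(num_teachers * num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_teachers),
        )

    def forward(self, teacher_logits: torch.Tensor) -> torch.Tensor:
        # teacher_logits: (batch, num_teachers, num_classes)
        b, k, c = teacher_logits.shape
        scores = self.scorer(teacher_logits.reshape(b, k * c))   # (batch, K)
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)        # (batch, K, 1)
        # Weighted sum of teacher logits -> one soft target per sample.
        return (weights * teacher_logits).sum(dim=1)             # (batch, num_classes)
```

At training time the fused output would serve as the soft target for the student, e.g. `fused = fusion(torch.stack([t(x) for t in teachers], dim=1))` with the teachers run under `torch.no_grad()`.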
Methodology
- Meta‑Learning for Configuration – Before the actual KD run, a lightweight meta‑learner samples a few candidate hyper‑parameter sets, evaluates each on a short validation run, and updates a Bayesian optimizer that predicts the best configuration for the full run.
- Progressive Chain Construction – Starting from the large teacher, the system automatically inserts intermediate student models whose capacity is chosen to keep the teacher‑student gap below a predefined threshold. Each intermediate model becomes the teacher for the next step, forming a “progressive ladder” (see the sketch after this list).
- Dynamic Multi‑Teacher Fusion – For each training sample, an attention network consumes the raw inputs and the teachers’ logits, outputting a soft weight vector. The weighted sum of logits forms the final soft target for the student.
- Temperature Scheduling – A small recurrent network predicts the optimal temperature at each epoch based on training dynamics (e.g., loss curvature), replacing the static temperature used in classic KD.
- Parallel Execution – Teacher forward passes are batched and dispatched to idle GPUs/CPU cores. A scheduler monitors queue lengths and redistributes work to avoid bottlenecks.
- Shared Memory Optimizer – Optimizer moments (e.g., Adam’s first/second‑order moments) are stored in a shared cache. When a new student model re‑uses a previously trained teacher’s representation, the cache is consulted, accelerating convergence.
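The following sketch, referenced from the Progressive Chain Construction step above, shows one way intermediate capacities could be chosen so that no single teacher‑to‑student step exceeds a capacity‑gap threshold. The geometric‑interpolation rule and the parameter‑count budgets are illustrative assumptions, not the paper's exact heuristic.

```python
import math

def build_progressive_chain(teacher_params: float,
                            student_params: float,
                            max_gap_ratio: float = 4.0) -> list[float]:
    """Return a ladder of parameter budgets from teacher to student such that
    each consecutive step compresses by at most `max_gap_ratio`.
    Illustrative heuristic only, using geometric interpolation."""
    total_ratio = teacher_params / student_params
    # Number of distillation steps needed so each step's ratio <= max_gap_ratio.
    n_steps = max(1, math.ceil(math.log(total_ratio) / math.log(max_gap_ratio)))
    step_ratio = total_ratio ** (1.0 / n_steps)
    # Capacities from teacher -> mid-size students -> final student.
    return [teacher_params / step_ratio ** i for i in range(n_steps + 1)]

# Example: a 36M-parameter teacher distilled down to a 0.9M-parameter student
# (the CIFAR-100 setting from the results table) yields a 3-step ladder of
# roughly [36M, 10.5M, 3.1M, 0.9M] with max_gap_ratio=4.
print(build_progressive_chain(36e6, 0.9e6))
```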
The overall training loop remains a standard PyTorch nn.Module forward‑backward pass, so developers can drop HPM‑KD into existing pipelines with minimal code changes.
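As a concrete illustration of that drop‑in claim, here is a minimal distillation step in plain PyTorch: cross‑entropy on the hard labels plus a temperature‑scaled KL term toward fused teacher logits (e.g., produced by the attention module sketched earlier, with the temperature supplied by whatever scheduler is in use). This is a hedged sketch of the standard KD loss HPM‑KD builds on, not the DeepBridge API.

```python
import torch
import torch.nn.functional as F

def kd_step(student, optimizer, x, labels, fused_teacher_logits,
            temperature: float = 4.0, alpha: float = 0.7):
    """One distillation step: hard-label cross-entropy plus a temperature-scaled
    KL term toward the fused teacher logits (assumed detached / no_grad)."""
    optimizer.zero_grad()
    student_logits = student(x)

    # Hard-label loss on the ground truth.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Softened distributions; the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (standard KD scaling).
    t = temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(fused_teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```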
Results & Findings
| Dataset | Teacher (params) | Student (params) | Compression | Accuracy Retention* | Training‑time reduction |
|---|---|---|---|---|---|
| CIFAR‑10 | ResNet‑110 (1.7 M params) | 0.12 M (MobileNet‑V2‑0.5x) | 14× | 85 % of teacher (≈93 % → 79 %) | –32 % |
| CIFAR‑100 | WideResNet‑28‑10 (36 M) | 0.9 M (ShuffleNet‑V2) | 10× | 84 % of teacher (≈78 % → 66 %) | –38 % |
| Tabular (UCI) | Gradient Boosted Trees (500 M leaves) | 0.05 M MLP | 12× | 86 % of teacher (≈92 % → 79 %) | –30 % |
*Accuracy retention is measured as the percentage of the teacher’s original test accuracy that the compressed student still achieves.
Ablation studies show that each component contributes positively: removing the progressive chain drops retention by ~0.6 pp, disabling the attention‑weighted ensemble loses ~0.4 pp, and skipping the meta‑learned temperature costs ~0.2 pp. The adaptive configuration manager alone eliminates up to 90 % of hyper‑parameter search time.
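For intuition about where that search‑time saving comes from, the sketch below shows a cheap configuration search that evaluates a few sampled hyper‑parameter sets on short proxy runs instead of an exhaustive grid. Random sampling stands in for the paper's Bayesian optimizer, and the search space, names, and `proxy_run` hook are all hypothetical.

```python
import random
from typing import Any, Callable, Dict

# Hypothetical search space for KD hyper-parameters (names are illustrative).
SEARCH_SPACE = {
    "temperature": [2.0, 4.0, 8.0],
    "alpha_kd":    [0.3, 0.5, 0.7, 0.9],  # weight on the distillation loss term
    "lr":          [1e-3, 3e-4, 1e-4],
}

def search_config(proxy_run: Callable[[Dict[str, Any]], float],
                  n_trials: int = 8) -> Dict[str, Any]:
    """Evaluate a few randomly sampled configurations with a cheap proxy run
    (e.g., a few-epoch distillation on a validation split) and keep the best.
    Random sampling stands in for a Bayesian optimizer in this sketch."""
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: random.choice(choices) for name, choices in SEARCH_SPACE.items()}
        loss = proxy_run(cfg)          # short validation-loss estimate
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg
```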
Practical Implications
- Faster Model Shipping – Developers can now generate ultra‑lightweight inference models (e.g., for edge devices, mobile apps, or IoT) without spending weeks tuning KD hyper‑parameters.
- Multi‑Teacher Ensembles Made Viable – The attention‑weighted fusion lets you benefit from several high‑performing teachers (e.g., a vision transformer + a CNN) while keeping the final model tiny, opening doors for hybrid‑knowledge transfer.
- Resource‑Efficient Training – Parallel pipelines and shared optimizer states reduce GPU‑hour costs, which is especially valuable for startups or teams with limited cloud budgets.
- Plug‑and‑Play Integration – Because HPM‑KD lives as a thin wrapper around standard PyTorch training loops, existing CI/CD pipelines for model updates can adopt it with a few configuration files.
- Open‑Source Availability – The DeepBridge implementation means you can inspect, extend, or benchmark the framework against your own proprietary teachers, fostering reproducibility and community contributions.
In short, HPM‑KD turns knowledge distillation from a research curiosity into a production‑ready compression tool.
Limitations & Future Work
- Scalability to Very Large Datasets – Experiments are limited to CIFAR‑scale vision and modest tabular data; the authors note that the progressive chain may need additional heuristics for ImageNet‑scale tasks.
- Teacher Diversity Assumption – The attention mechanism assumes teachers produce logits of compatible dimensionality; handling heterogeneous output spaces (e.g., classification + detection) remains an open challenge.
- Meta‑Learning Overhead – While the configuration manager removes manual tuning, its initial meta‑learning phase still consumes a small fraction of total compute, which could be prohibitive in ultra‑low‑budget settings.
- Future Directions – Extending HPM‑KD to self‑supervised pre‑training, exploring neural architecture search for intermediate students, and integrating hardware‑aware latency constraints directly into the progressive chain.
Overall, HPM‑KD offers a compelling step forward for developers who need high‑compression models without the usual engineering headaches, while leaving room for further scaling and specialization.
Authors
- Gustavo Coelho Haase
- Paulo Henrique Dourado da Silva
Paper Information
- arXiv ID: 2512.09886v1
- Categories: cs.LG, stat.AP
- Published: December 10, 2025
- PDF: https://arxiv.org/pdf/2512.09886v1