[Paper] Model Merging via Multi-Teacher Knowledge Distillation
Source: arXiv - 2512.21288v1
Overview
The paper tackles a practical problem that many engineers face when re‑using pretrained models: how to merge several fine‑tuned models into a single, versatile model without retraining from scratch. While model‑merging promises a lightweight alternative to full multi‑task learning, the authors show that existing heuristics lack solid theoretical backing and can be fragile. By introducing a new generalization theory and a concrete algorithm (SAMerging), they turn model merging into a principled, high‑performing technique that works across vision and NLP tasks.
Key Contributions
- Flatness‑aware PAC‑Bayes bound for model merging – a novel generalization guarantee that explicitly accounts for the heterogeneity of the original tasks.
- Cross‑task heterogeneity term – a formal measure of how mismatched the fine‑tuned model priors are with respect to the target multi‑task distribution.
- Re‑casting merging as multi‑teacher knowledge distillation – shows that minimizing the KL‑divergence between a student and multiple teachers directly tightens the PAC‑Bayes bound.
- SAMerging algorithm – combines Sharpness‑Aware Minimization (SAM) with multi‑teacher distillation on a small pool of unlabeled data to find a flat, well‑generalizing merged model.
- State‑of‑the‑art empirical results – beats prior merging baselines on several vision (e.g., CIFAR‑100, ImageNet‑R) and NLP (e.g., GLUE) benchmarks.
- Open‑source implementation – code released at https://github.com/arshandalili/SAMerging.
Methodology
Theoretical foundation
- The authors start from the PAC‑Bayes framework, which bounds the test error of a stochastic predictor in terms of its flatness (how sensitive the loss is to parameter perturbations).
- They extend this to the model merging scenario, deriving a bound that contains a cross‑task heterogeneity factor. Intuitively, the more the original fine‑tuned models differ in their underlying data distributions, the larger this term becomes.
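As a rough illustration only (not the paper's exact statement), a classical McAllester‑style PAC‑Bayes bound with a placeholder heterogeneity term \( \Delta_{\text{het}} \) has the shape:

\[ \mathbb{E}_{\theta\sim Q}\,\mathcal{L}(\theta) \;\le\; \mathbb{E}_{\theta\sim Q}\,\widehat{\mathcal{L}}(\theta) \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}} \;+\; \Delta_{\text{het}} \]

where \(Q\) is the stochastic merged model, \(P\) is a prior built from the fine‑tuned checkpoints, \(n\) is the sample size, and \(\Delta_{\text{het}}\) stands in for the paper's cross‑task heterogeneity factor; the expectation over perturbed parameters \(\theta\sim Q\) is what links the bound to flatness.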
From theory to algorithm
- The bound is minimized when the merged model (the student) closely matches the predictive distributions of all fine‑tuned models (the teachers).
- This leads to a multi‑teacher knowledge distillation objective: minimize the average KL‑divergence between the student's and each teacher's predictive distributions (softmax outputs) on a small, unlabeled dataset.
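A minimal PyTorch sketch of this objective is given below; the function name, the `temperature` knob, and the choice of KL direction (student relative to each teacher, mirroring the KL term in the bound) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=1.0):
    """Average KL(p_student || p_teacher_k) over the K frozen teachers.

    Both the temperature and the KL direction are assumptions of this sketch.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = 0.0
    for teacher_logits in teacher_logits_list:
        log_p_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
        # F.kl_div(input, target, log_target=True) computes KL(target || input)
        # with both arguments given in log space.
        loss = loss + F.kl_div(log_p_teacher, log_p_student,
                               reduction="batchmean", log_target=True)
    return loss / len(teacher_logits_list)
```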
Flatness via SAM
- To enforce flat minima, the authors embed Sharpness‑Aware Minimization (SAM) into the distillation loop. SAM alternates between a perturbation step that seeks the worst‑case loss in a neighborhood of the current parameters and a descent step that reduces this worst‑case loss.
- The combined loss is:
\[ \mathcal{L}_{\text{SAMerge}} \;=\; \frac{1}{K}\sum_{k=1}^{K}\mathrm{KL}\!\left(p_{\text{student}} \,\middle\|\, p_{\text{teacher}_k}\right) \;+\; \lambda \cdot \mathcal{L}_{\text{sharpness}} \]
- Only a small pool of unlabeled examples (e.g., a few thousand images or sentences) is needed, making the approach data‑efficient.
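A hedged sketch of one SAM‑augmented distillation step is shown below, reusing the `multi_teacher_kd_loss` helper from the previous snippet. The two‑pass ascent/descent structure follows the standard SAM recipe; `rho` and the overall wiring are assumptions for illustration rather than the paper's implementation.

```python
import torch

def sam_distill_step(student, teachers, inputs, optimizer, rho=0.05):
    """One sharpness-aware distillation step on a batch of unlabeled inputs."""
    optimizer.zero_grad()

    # Teachers are frozen; their predictions are fixed targets for this batch.
    with torch.no_grad():
        teacher_logits = [t(inputs) for t in teachers]

    # First pass: gradient of the distillation loss at the current weights.
    loss = multi_teacher_kd_loss(student(inputs), teacher_logits)
    loss.backward()

    # Ascent step: move to the approximate worst point in an L2 ball of radius rho.
    grads = [p.grad for p in student.parameters() if p.grad is not None]
    grad_norm = torch.stack([g.norm() for g in grads]).norm()
    perturbations = []
    with torch.no_grad():
        for p in student.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append(e)

    # Second pass: gradient of the loss at the perturbed (sharpness-aware) point.
    optimizer.zero_grad()
    multi_teacher_kd_loss(student(inputs), teacher_logits).backward()

    # Undo the perturbation, then update the original weights with the SAM gradient.
    with torch.no_grad():
        for p, e in zip(student.parameters(), perturbations):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    return loss.item()
```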
Training pipeline
- Collect a small, task‑agnostic unlabeled dataset.
- Freeze the teacher models (the fine‑tuned checkpoints).
- Initialize the student with one of the teachers or with a simple average of their weights.
- Run the SAM‑augmented multi‑teacher distillation until convergence.
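Wired together, the pipeline could look roughly like the sketch below; `teacher_models`, `unlabeled_loader`, the epoch count, and the optimizer choice are placeholders rather than details taken from the paper or its repository.

```python
import copy
import torch

def samerging(teacher_models, unlabeled_loader, epochs=5, lr=1e-4, rho=0.05):
    # 1. Freeze the fine-tuned teachers.
    for teacher in teacher_models:
        teacher.eval()
        for p in teacher.parameters():
            p.requires_grad_(False)

    # 2. Initialize the student as a simple weight average of the teachers
    #    (starting from a single teacher also works).
    student = copy.deepcopy(teacher_models[0])
    with torch.no_grad():
        for name, p in student.named_parameters():
            p.copy_(torch.stack(
                [dict(t.named_parameters())[name] for t in teacher_models]
            ).mean(dim=0))
    for p in student.parameters():          # re-enable gradients for the student
        p.requires_grad_(True)

    # 3. SAM-augmented multi-teacher distillation on the unlabeled pool.
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs in unlabeled_loader:     # small, task-agnostic, no labels
            sam_distill_step(student, teacher_models, inputs, optimizer, rho=rho)
    return student
```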
Results & Findings
| Benchmark | Prior Merging Method | SAMerging | Gain (pts) |
|---|---|---|---|
| CIFAR‑100 (5 tasks) | 78.2 % | 82.7 % | +4.5 |
| ImageNet‑R (3 tasks) | 71.4 % | 75.9 % | +4.5 |
| GLUE (7 tasks) | 84.1 % avg. | 87.3 % avg. | +3.2 |
| Parameter count | Same as baseline (no extra heads) | Same | — |
- Flatness matters: Ablation removing SAM reduces performance by 2–3 % across all datasets, confirming the theoretical link between flat minima and the bound.
- Robust to scaling: Unlike earlier heuristics that required careful coefficient initialization, SAMerging is stable across random seeds and different teacher weight scales.
- Speed: Merging finishes in 1–2 GPU‑hours, far cheaper than full multi‑task training (which can take days).
Practical Implications
- Deploy‑once, serve‑many: Companies can fine‑tune a base model on several proprietary datasets (e.g., different customer domains) and then merge them into a single model that serves all domains, reducing memory footprint and inference latency.
- Edge and mobile scenarios: Because merging does not need the original training data, it can be performed on‑device with a small unlabeled sample, enabling on‑the‑fly personalization without exposing raw data.
- Model‑registry hygiene: Instead of maintaining a zoo of task‑specific checkpoints, teams can keep a single merged checkpoint, simplifying versioning, CI/CD pipelines, and A/B testing.
- Regulatory compliance: The method respects data‑privacy constraints—teachers are never exposed to each other’s data, and only a tiny, non‑sensitive unlabeled set is required for merging.
- Rapid prototyping: Researchers can experiment with new tasks, fine‑tune a model, and instantly evaluate how it blends with existing capabilities, accelerating multi‑task product development.
Limitations & Future Work
- Dependence on unlabeled data quality: While only a small set is needed, the unlabeled pool must be reasonably representative of the joint task distribution; highly skewed samples can degrade the KL‑distillation signal.
- Scalability to dozens of teachers: The current formulation averages KL divergences linearly; with many teachers the computational cost grows and the bound may become looser. Future work could explore hierarchical distillation or teacher clustering.
- Theoretical tightness: The PAC‑Bayes bound introduces the cross‑task heterogeneity term, but quantifying it in practice remains an open challenge. More empirical studies are needed to relate this term to observable dataset statistics.
- Extension beyond classification: The paper focuses on classification‑style logits. Adapting SAMerging to generative or sequence‑to‑sequence models (e.g., large language models) will require new distillation objectives and possibly different flatness measures.
If you’re interested in trying SAMerging yourself, the authors provide a clean PyTorch implementation and scripts for reproducing the vision and NLP experiments. The approach offers a compelling blend of theory and practicality for anyone looking to consolidate multiple fine‑tuned models into a single, robust service.
Authors
- Seyed Arshan Dalili
- Mehrdad Mahdavi
Paper Information
- arXiv ID: 2512.21288v1
- Categories: cs.LG, cs.AI
- Published: December 24, 2025