[Paper] Clustered Federated Learning with Hierarchical Knowledge Distillation
Source: arXiv - 2512.10443v1
Overview
Clustered Federated Learning (CFL) tackles the classic federated‑learning problem of heterogeneous data by grouping similar edge devices into clusters and training a model per cluster. The new paper Clustered Federated Learning with Hierarchical Knowledge Distillation (CFLHKD) pushes this idea a step further: it introduces a hierarchical training pipeline (edge‑level clusters + cloud‑level global model) and a multi‑teacher knowledge‑distillation mechanism that lets clusters learn from each other without sacrificing their personalization. The result is a more accurate and communication‑efficient solution for large‑scale IoT deployments.
Key Contributions
- Hierarchical CFL framework – bi‑level aggregation that simultaneously produces cluster‑specific models at the edge and a unified global model in the cloud.
- CFLHKD personalization scheme – leverages multi‑teacher knowledge distillation to share “soft” knowledge across clusters while retaining cluster‑level nuances.
- Bi‑directional knowledge flow – cluster models act as teachers for the global model and vice‑versa, closing the gap between local and global learning.
- Extensive empirical validation – experiments on standard federated benchmarks (e.g., FEMNIST, CIFAR‑10/100) show 3.3 %–7.6 % accuracy gains over strong CFL baselines for both cluster‑specific and global models.
- Communication‑efficiency analysis – demonstrates that hierarchical aggregation reduces the number of required uplink rounds compared with naïve per‑cluster training.
Methodology
1. Client Clustering – Devices are first grouped using a similarity metric on their local data distributions (e.g., cosine similarity of their model updates); a clustering sketch follows this list.
2. Edge‑Level Training – Within each cluster, clients run standard FedAvg rounds, producing a cluster model that captures the shared patterns of that group.
3. Hierarchical Aggregation
   - Cluster → Cloud: cluster models are sent to a central server, where they are aggregated into a global model.
   - Cloud → Cluster: the global model is broadcast back to the clusters, where it serves as an additional teacher.
4. Multi‑Teacher Knowledge Distillation – Each cluster model is fine‑tuned with a loss that blends:
   - local cross‑entropy (preserving client‑specific performance),
   - a distillation loss from the global model (global knowledge), and
   - a distillation loss from peer clusters (inter‑cluster knowledge).
   The “soft targets” from the multiple teachers are weighted so that no single source dominates, enabling knowledge sharing without eroding personalization (a loss sketch follows below).
5. Iterative Loop – Steps 2–4 repeat for several communication rounds until convergence.
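To make steps 1–3 concrete, here is a minimal, self‑contained Python sketch. It is not the paper's implementation: the greedy threshold‑based grouping, the `threshold` value of 0.8, and the sample‑count weighting are illustrative assumptions; the paper only specifies that clients are grouped by a similarity metric (such as cosine similarity of model updates) and that cluster models are aggregated into a global model.

```python
# Minimal sketch of steps 1-3 (clustering + bi-level aggregation), assuming each
# client i reports a flattened model-update vector `updates[i]` and a sample
# count `num_samples[i]`. The greedy threshold-based grouping is an illustrative
# stand-in for whatever similarity-based clustering a real deployment uses.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cluster_clients(updates: list[np.ndarray], threshold: float = 0.8) -> list[list[int]]:
    """Greedily group clients whose update vectors are cosine-similar."""
    clusters: list[list[int]] = []
    for i, u in enumerate(updates):
        for members in clusters:
            # Compare against the first member as the cluster's representative.
            if cosine_similarity(u, updates[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found -> start a new one
    return clusters

def weighted_average(models: list[np.ndarray], weights: list[float]) -> np.ndarray:
    return np.average(np.stack(models), axis=0, weights=np.asarray(weights, dtype=np.float64))

def hierarchical_aggregate(client_models, num_samples, clusters):
    """Edge level: FedAvg inside each cluster. Cloud level: average cluster models."""
    cluster_models, cluster_sizes = [], []
    for members in clusters:
        cluster_models.append(
            weighted_average([client_models[i] for i in members],
                             [num_samples[i] for i in members]))
        cluster_sizes.append(sum(num_samples[i] for i in members))
    global_model = weighted_average(cluster_models, cluster_sizes)
    return cluster_models, global_model
```

A more scalable clustering routine could replace the greedy loop without changing the rest of the pipeline; the similarity metric and aggregation weights are deployment‑level design choices.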
The approach stays within the federated learning constraints: raw data never leaves the device, and only model parameters or distilled logits are exchanged.
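Step 4's blended objective can be written as a single loss function. The sketch below assumes PyTorch; the weights `alpha`, `beta`, `gamma` and the temperature `T` are illustrative placeholders rather than the paper's values, and the way CFLHKD actually balances its teachers may differ.

```python
# PyTorch sketch of the blended objective in step 4, under assumed weights
# alpha/beta/gamma and temperature T. `peer_logits` would come from other
# clusters' models (exchanged as distilled logits, never raw data).
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits: torch.Tensor,
                          labels: torch.Tensor,
                          global_logits: torch.Tensor,
                          peer_logits: list[torch.Tensor],
                          alpha: float = 0.5,   # weight on local cross-entropy
                          beta: float = 0.3,    # weight on global-teacher distillation
                          gamma: float = 0.2,   # weight on peer-cluster distillation
                          T: float = 2.0) -> torch.Tensor:
    # 1) Local supervision: preserves client/cluster-specific performance.
    ce = F.cross_entropy(student_logits, labels)

    # Temperature-scaled KL divergence against one teacher's soft targets.
    def kd(teacher_logits: torch.Tensor) -> torch.Tensor:
        return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * (T * T)

    # 2) Soft targets from the global model.
    kd_global = kd(global_logits)

    # 3) Soft targets from peer clusters, averaged so no single teacher dominates.
    kd_peers = (torch.stack([kd(p) for p in peer_logits]).mean()
                if peer_logits else torch.zeros((), device=student_logits.device))

    return alpha * ce + beta * kd_global + gamma * kd_peers
```

In this sketch, setting `gamma` to zero removes the peer‑cluster term, which mirrors one of the ablations discussed below.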
Results & Findings
| Dataset | Baseline (CFL) | CFLHKD (Cluster) | CFLHKD (Global) | Absolute Gain (Cluster) |
|---|---|---|---|---|
| FEMNIST | 78.1 % | 84.3 % | 81.7 % | +6.2 % |
| CIFAR‑10 | 71.4 % | 76.9 % | 74.2 % | +5.5 % |
| CIFAR‑100 | 58.2 % | 63.5 % | 61.0 % | +5.3 % |
- Cluster‑specific models consistently outperformed the best existing CFL baselines by 3.3 %–7.6 % absolute accuracy.
- The global model also improved, confirming that inter‑cluster distillation benefits the overall system, not just individual clusters.
- Communication rounds dropped by ~15 % on average because the hierarchical aggregation reduces redundant transmissions of full model updates.
- Ablation studies showed that removing either the global‑to‑cluster distillation or the peer‑cluster distillation degrades performance, highlighting the importance of both knowledge flows.
Practical Implications
- IoT & Edge AI Deployments – Companies managing fleets of heterogeneous sensors (smart homes, wearables, autonomous drones) can adopt CFLHKD to obtain personalized models for device sub‑groups while still maintaining a global intelligence layer for cross‑device insights.
- Reduced Bandwidth Costs – Hierarchical aggregation means fewer full‑model uploads to the cloud; only cluster‑level aggregates travel upward, which is attractive for bandwidth‑constrained environments.
- Faster Time‑to‑Insight – By sharing distilled knowledge, new clusters can bootstrap their models faster, shortening the cold‑start period after device onboarding.
- Compliance & Privacy – The method respects data locality (no raw data leaves the device) and adds only lightweight logits for distillation, easing regulatory concerns.
- Tooling Integration – CFLHKD can be plugged into existing federated‑learning platforms (TensorFlow Federated, PySyft, Flower) by extending the aggregation hook and adding a distillation step, making adoption relatively low‑effort for developers; a generic sketch of this hook pattern follows this list.
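To illustrate the integration pattern in the Tooling Integration point, here is a framework‑agnostic sketch. Every name in it (`CFLHKDAggregator`, `aggregate`, `distillation_targets`) is hypothetical rather than the API of TensorFlow Federated, PySyft, or Flower; the point is only that two small extension points, one for hierarchical aggregation and one for choosing distillation teachers, suffice to retrofit the scheme onto an existing FedAvg‑style pipeline.

```python
# Hypothetical server-side hook, not any platform's real API: it groups client
# updates by cluster, builds cluster models, averages them into a global model,
# and hands each cluster its distillation teachers for the next round.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CFLHKDAggregator:
    cluster_assignment: dict[int, int]                        # client id -> cluster id
    cluster_models: dict[int, np.ndarray] = field(default_factory=dict)

    def aggregate(self, client_updates: dict[int, np.ndarray]) -> np.ndarray:
        """Hierarchical aggregation: average within clusters, then across clusters."""
        by_cluster: dict[int, list[np.ndarray]] = {}
        for cid, update in client_updates.items():
            by_cluster.setdefault(self.cluster_assignment[cid], []).append(update)
        self.cluster_models = {k: np.mean(np.stack(v), axis=0)
                               for k, v in by_cluster.items()}
        return np.mean(np.stack(list(self.cluster_models.values())), axis=0)

    def distillation_targets(self, global_model: np.ndarray) -> dict[int, list[np.ndarray]]:
        """For each cluster, return its teachers for the next round:
        the global model plus the other clusters' models."""
        return {k: [global_model] + [m for j, m in self.cluster_models.items() if j != k]
                for k in self.cluster_models}
```

A platform‑specific adapter would then call `aggregate` from its server‑side aggregation hook and ship the `distillation_targets` (or their logits) back to the clusters along with the broadcast model.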
Limitations & Future Work
- Clustering Overhead – The initial client clustering step relies on similarity metrics that may be costly for very large populations; adaptive or online clustering strategies are needed.
- Scalability of Distillation – Multi‑teacher distillation introduces extra computation on edge devices (soft‑target generation and loss calculation). Optimizing this for low‑power hardware remains an open challenge.
- Non‑IID Extreme Cases – While CFLHKD improves robustness to heterogeneity, performance gaps still appear when clusters are extremely divergent (e.g., image vs. time‑series data).
- Future Directions suggested by the authors include:
- Dynamic re‑clustering during training to adapt to drift in data distributions.
- Hierarchical knowledge distillation across more than two levels (e.g., edge → regional hub → cloud).
- Exploration of privacy‑preserving distillation (e.g., differentially private logits).
Overall, CFLHKD offers a compelling blend of personalization and global knowledge sharing that aligns well with the practical needs of modern federated‑learning deployments.
Authors
- Sabtain Ahmad
- Meerzhan Kanatbekova
- Ivona Brandic
- Atakan Aral
Paper Information
- arXiv ID: 2512.10443v1
- Categories: cs.DC, cs.AI, cs.LG
- Published: December 11, 2025