[Paper] Clustered Federated Learning with Hierarchical Knowledge Distillation

Published: December 11, 2025 at 04:08 AM EST
4 min read
Source: arXiv - 2512.10443v1

Overview

Clustered Federated Learning (CFL) tackles the classic federated‑learning problem of heterogeneous data by grouping similar edge devices into clusters and training a model per cluster. The new paper Clustered Federated Learning with Hierarchical Knowledge Distillation (CFLHKD) pushes this idea a step further: it introduces a hierarchical training pipeline (edge‑level clusters + cloud‑level global model) and a multi‑teacher knowledge‑distillation mechanism that lets clusters learn from each other without sacrificing their personalization. The result is a more accurate and communication‑efficient solution for large‑scale IoT deployments.

Key Contributions

  • Hierarchical CFL framework – bi‑level aggregation that simultaneously produces cluster‑specific models at the edge and a unified global model in the cloud.
  • CFLHKD personalization scheme – leverages multi‑teacher knowledge distillation to share “soft” knowledge across clusters while retaining cluster‑level nuances.
  • Bi‑directional knowledge flow – cluster models act as teachers for the global model and vice‑versa, closing the gap between local and global learning.
  • Extensive empirical validation – experiments on standard federated benchmarks (e.g., FEMNIST, CIFAR‑10/100) show 3.3 %–7.6 % accuracy gains over strong CFL baselines for both cluster‑specific and global models.
  • Communication‑efficiency analysis – demonstrates that hierarchical aggregation reduces the number of required uplink rounds compared with naïve per‑cluster training.

Methodology

  1. Client Clustering – Devices are first grouped using a similarity metric on their local data distributions (e.g., cosine similarity of model updates); a minimal sketch of this step appears after this list.

  2. Edge‑Level Training – Within each cluster, clients perform standard FedAvg rounds, producing a cluster model that captures the shared patterns of that group.

  3. Hierarchical Aggregation

    • Cluster → Cloud: Cluster models are sent to a central server, where they are aggregated into a global model.
    • Cloud → Cluster: The global model is broadcast back to clusters, serving as an additional teacher.
  4. Multi‑Teacher Knowledge Distillation – Each cluster model is fine‑tuned using a loss that blends:

    • Local cross‑entropy (preserving client‑specific performance)
    • Distillation loss from the global model (global knowledge)
    • Distillation loss from peer clusters (inter‑cluster knowledge)

    The “soft targets” from the multiple teachers are weighted so that no single teacher dominates, enabling knowledge sharing without eroding personalization; a sketch of this combined loss appears at the end of this section.

  5. Iterative Loop – Steps 2‑4 repeat for several communication rounds until convergence.
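
To make the clustering step concrete, here is a minimal, framework‑agnostic Python sketch of similarity‑based grouping over flattened model updates. The greedy threshold rule, the threshold value, and all function names are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch (not the paper's exact algorithm): greedily group
# clients whose flattened model updates have high cosine similarity.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def cluster_clients(updates: dict[str, np.ndarray],
                    threshold: float = 0.8) -> list[set[str]]:
    """Assign each client to the first cluster whose running-mean update is
    similar enough, otherwise start a new cluster."""
    clusters: list[set[str]] = []
    reps: list[np.ndarray] = []  # one representative (mean) update per cluster
    for cid, upd in updates.items():
        for k, rep in enumerate(reps):
            if cosine_similarity(upd, rep) >= threshold:
                clusters[k].add(cid)
                reps[k] = rep + (upd - rep) / len(clusters[k])  # incremental mean
                break
        else:  # no sufficiently similar cluster found
            clusters.append({cid})
            reps.append(upd.copy())
    return clusters
```

In practice the updates would be the parameter deltas collected after a warm‑up round of local training, though other similarity signals are possible.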

The approach stays within the federated learning constraints: raw data never leaves the device, and only model parameters or distilled logits are exchanged.
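
As a rough illustration of the multi‑teacher objective in step 4, the sketch below blends local cross‑entropy with temperature‑softened distillation terms from the global model and from peer clusters. The weighting scheme (alpha, beta), the temperature T, and the equal averaging over peer teachers are assumptions for illustration; the paper's exact loss may differ.

```python
# Hedged sketch of the step-4 objective: local cross-entropy plus softened
# KL-divergence terms against the global model and peer-cluster teachers.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(
    student_logits: torch.Tensor,     # [batch, classes] from the cluster model
    labels: torch.Tensor,             # [batch] ground-truth labels
    global_logits: torch.Tensor,      # [batch, classes] from the global teacher
    peer_logits: list[torch.Tensor],  # logits from peer-cluster teachers
    alpha: float = 0.3,               # weight of the global-teacher term (illustrative)
    beta: float = 0.2,                # total weight of the peer-teacher terms (illustrative)
    T: float = 2.0,                   # distillation temperature (illustrative)
) -> torch.Tensor:
    ce = F.cross_entropy(student_logits, labels)

    def kd(teacher_logits: torch.Tensor) -> torch.Tensor:
        # KL(teacher || student) on temperature-softened distributions
        return F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)

    kd_global = kd(global_logits)
    kd_peers = torch.stack([kd(p) for p in peer_logits]).mean() if peer_logits else 0.0
    return (1.0 - alpha - beta) * ce + alpha * kd_global + beta * kd_peers
```

Temperature scaling with the T² factor follows standard knowledge‑distillation practice; only the soft targets (logits) from the teachers are needed on the device, not their raw data.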

Results & Findings

| Dataset | Baseline (CFL) | CFLHKD (Cluster) | CFLHKD (Global) | Relative Gain |
|---|---|---|---|---|
| FEMNIST | 78.1 % | 84.3 % | 81.7 % | +6.2 % (cluster) |
| CIFAR‑10 | 71.4 % | 76.9 % | 74.2 % | +5.5 % (cluster) |
| CIFAR‑100 | 58.2 % | 63.5 % | 61.0 % | +5.3 % (cluster) |

  • Cluster‑specific models consistently outperformed the best existing CFL baselines by 3.3 %–7.6 % in absolute accuracy.
  • The global model also improved, confirming that inter‑cluster distillation benefits the overall system, not just individual clusters.
  • Communication rounds dropped by ~15 % on average because the hierarchical aggregation reduces redundant transmissions of full model updates.
  • Ablation studies showed that removing either the global‑to‑cluster distillation or the peer‑cluster distillation degrades performance, highlighting the importance of both knowledge flows.

Practical Implications

  • IoT & Edge AI Deployments – Companies managing fleets of heterogeneous sensors (smart homes, wearables, autonomous drones) can adopt CFLHKD to obtain personalized models for device sub‑groups while still maintaining a global intelligence layer for cross‑device insights.
  • Reduced Bandwidth Costs – Hierarchical aggregation means fewer full‑model uploads to the cloud; only cluster‑level aggregates travel upward, which is attractive for bandwidth‑constrained environments.
  • Faster Time‑to‑Insight – By sharing distilled knowledge, new clusters can bootstrap their models faster, shortening the cold‑start period after device onboarding.
  • Compliance & Privacy – The method respects data locality (no raw data leaves the device) and adds only lightweight logits for distillation, easing regulatory concerns.
  • Tooling Integration – CFLHKD can be plugged into existing federated‑learning platforms (TensorFlow Federated, PySyft, Flower) by extending the aggregation hook and adding a distillation step (see the sketch below), making adoption relatively low‑effort for developers.
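
As a rough picture of the server‑side aggregation hook mentioned in the Tooling Integration point, the sketch below performs a sample‑weighted average of cluster models into a global model. The function signature and the assumption that models arrive as lists of per‑layer arrays are illustrative, not an existing platform API.

```python
# Framework-agnostic sketch of the cluster -> cloud aggregation hook.
import numpy as np

def aggregate_clusters(
    cluster_models: list[list[np.ndarray]],  # per-cluster weights (list of layer arrays)
    cluster_sizes: list[int],                # number of samples represented by each cluster
) -> list[np.ndarray]:
    """Sample-weighted average of cluster models into a single global model."""
    total = float(sum(cluster_sizes))
    num_layers = len(cluster_models[0])
    global_model = []
    for layer in range(num_layers):
        weighted = sum(
            (n / total) * model[layer]
            for model, n in zip(cluster_models, cluster_sizes)
        )
        global_model.append(weighted)
    return global_model
```

In a platform such as Flower, logic like this would typically live in a custom server‑side strategy; in TensorFlow Federated, in a custom aggregation process.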

Limitations & Future Work

  • Clustering Overhead – The initial client clustering step relies on similarity metrics that may be costly for very large populations; adaptive or online clustering strategies are needed.
  • Scalability of Distillation – Multi‑teacher distillation introduces extra computation on edge devices (soft‑target generation and loss calculation). Optimizing this for low‑power hardware remains an open challenge.
  • Non‑IID Extreme Cases – While CFLHKD improves robustness to heterogeneity, performance gaps still appear when clusters are extremely divergent (e.g., image vs. time‑series data).
  • Future Directions suggested by the authors include:
    • Dynamic re‑clustering during training to adapt to drift in data distributions.
    • Hierarchical knowledge distillation across more than two levels (e.g., edge → regional hub → cloud).
    • Exploration of privacy‑preserving distillation (e.g., differential‑private logits).

Overall, CFLHKD offers a compelling blend of personalization and global knowledge sharing that aligns well with the practical needs of modern federated‑learning deployments.

Authors

  • Sabtain Ahmad
  • Meerzhan Kanatbekova
  • Ivona Brandic
  • Atakan Aral

Paper Information

  • arXiv ID: 2512.10443v1
  • Categories: cs.DC, cs.AI, cs.LG
  • Published: December 11, 2025