[Paper] Clustered Federated Learning with Hierarchical Knowledge Distillation

Published: December 11, 2025 at 04:08 AM EST
4 min read
Source: arXiv - 2512.10443v1

Overview

Clustered Federated Learning (CFL) tackles the classic federated‑learning problem of heterogeneous data by grouping similar edge devices into clusters and training a model per cluster. The new paper Clustered Federated Learning with Hierarchical Knowledge Distillation (CFLHKD) pushes this idea a step further: it introduces a hierarchical training pipeline (edge‑level clusters + cloud‑level global model) and a multi‑teacher knowledge‑distillation mechanism that lets clusters learn from each other without sacrificing their personalization. The result is a more accurate and communication‑efficient solution for large‑scale IoT deployments.

Key Contributions

  • Hierarchical CFL framework – bi‑level aggregation that simultaneously produces cluster‑specific models at the edge and a unified global model in the cloud.
  • CFLHKD personalization scheme – leverages multi‑teacher knowledge distillation to share “soft” knowledge across clusters while retaining cluster‑level nuances.
  • Bi‑directional knowledge flow – cluster models act as teachers for the global model and vice‑versa, closing the gap between local and global learning.
  • Extensive empirical validation – experiments on standard federated benchmarks (e.g., FEMNIST, CIFAR‑10/100) show 3.3 %–7.6 % accuracy gains over strong CFL baselines for both cluster‑specific and global models.
  • Communication‑efficiency analysis – demonstrates that hierarchical aggregation reduces the number of required uplink rounds compared with naïve per‑cluster training.

Methodology

  1. Client Clustering – Devices are first grouped using a similarity metric on their local data distributions (e.g., cosine similarity of model updates); a minimal sketch of this step appears after this list.

  2. Edge‑Level Training – Within each cluster, clients perform standard FedAvg rounds, producing a cluster model that captures the shared patterns of that group.

  3. Hierarchical Aggregation

    • Cluster → Cloud: Cluster models are sent to a central server, where they are aggregated into a global model.
    • Cloud → Cluster: The global model is broadcast back to clusters, serving as an additional teacher.
  4. Multi‑Teacher Knowledge Distillation – Each cluster model is fine‑tuned using a loss that blends:

    • Local cross‑entropy (preserving client‑specific performance)
    • Distillation loss from the global model (global knowledge)
    • Distillation loss from peer clusters (inter‑cluster knowledge)

    The “soft targets” from the multiple teachers are weighted so that no single teacher dominates, enabling knowledge sharing without eroding personalization; a sketch of this combined loss appears at the end of this section.

  5. Iterative Loop – Steps 2‑4 repeat for several communication rounds until convergence.
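
To make the clustering step concrete, here is a minimal, framework‑agnostic Python sketch of similarity‑based grouping over flattened model updates. The greedy threshold rule, the threshold value, and all function names are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch (not the paper's exact algorithm): greedily group
# clients whose flattened model updates have high cosine similarity.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def cluster_clients(updates: dict[str, np.ndarray],
                    threshold: float = 0.8) -> list[set[str]]:
    """Assign each client to the first cluster whose running-mean update is
    similar enough, otherwise start a new cluster."""
    clusters: list[set[str]] = []
    reps: list[np.ndarray] = []  # one representative (mean) update per cluster
    for cid, upd in updates.items():
        for k, rep in enumerate(reps):
            if cosine_similarity(upd, rep) >= threshold:
                clusters[k].add(cid)
                reps[k] = rep + (upd - rep) / len(clusters[k])  # incremental mean
                break
        else:  # no sufficiently similar cluster found
            clusters.append({cid})
            reps.append(upd.copy())
    return clusters
```

In practice the updates would be the parameter deltas collected after a warm‑up round of local training, though other similarity signals are possible.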

The approach stays within the federated learning constraints: raw data never leaves the device, and only model parameters or distilled logits are exchanged.
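
As a rough illustration of the multi‑teacher objective in step 4, the sketch below blends local cross‑entropy with temperature‑softened distillation terms from the global model and from peer clusters. The weighting scheme (alpha, beta), the temperature T, and the equal averaging over peer teachers are assumptions for illustration; the paper's exact loss may differ.

```python
# Hedged sketch of the step-4 objective: local cross-entropy plus softened
# KL-divergence terms against the global model and peer-cluster teachers.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(
    student_logits: torch.Tensor,     # [batch, classes] from the cluster model
    labels: torch.Tensor,             # [batch] ground-truth labels
    global_logits: torch.Tensor,      # [batch, classes] from the global teacher
    peer_logits: list[torch.Tensor],  # logits from peer-cluster teachers
    alpha: float = 0.3,               # weight of the global-teacher term (illustrative)
    beta: float = 0.2,                # total weight of the peer-teacher terms (illustrative)
    T: float = 2.0,                   # distillation temperature (illustrative)
) -> torch.Tensor:
    ce = F.cross_entropy(student_logits, labels)

    def kd(teacher_logits: torch.Tensor) -> torch.Tensor:
        # KL(teacher || student) on temperature-softened distributions
        return F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)

    kd_global = kd(global_logits)
    kd_peers = torch.stack([kd(p) for p in peer_logits]).mean() if peer_logits else 0.0
    return (1.0 - alpha - beta) * ce + alpha * kd_global + beta * kd_peers
```

Temperature scaling with the T² factor follows standard knowledge‑distillation practice; only the soft targets (logits) from the teachers are needed on the device, not their raw data.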

Results & Findings

| Dataset | Baseline (CFL) | CFLHKD (Cluster) | CFLHKD (Global) | Relative Gain |
|---|---|---|---|---|
| FEMNIST | 78.1 % | 84.3 % | 81.7 % | +6.2 % (cluster) |
| CIFAR‑10 | 71.4 % | 76.9 % | 74.2 % | +5.5 % (cluster) |
| CIFAR‑100 | 58.2 % | 63.5 % | 61.0 % | +5.3 % (cluster) |

  • Cluster‑specific models consistently outperformed the best existing CFL baselines by 3.3 %–7.6 % in absolute accuracy.
  • The global model also improved, confirming that inter‑cluster distillation benefits the overall system, not just individual clusters.
  • Communication rounds dropped by ~15 % on average because the hierarchical aggregation reduces redundant transmissions of full model updates.
  • Ablation studies showed that removing either the global‑to‑cluster distillation or the peer‑cluster distillation degrades performance, highlighting the importance of both knowledge flows.

Practical Implications

  • IoT & Edge AI Deployments – Companies managing fleets of heterogeneous sensors (smart homes, wearables, autonomous drones) can adopt CFLHKD to obtain personalized models for device sub‑groups while still maintaining a global intelligence layer for cross‑device insights.
  • Reduced Bandwidth Costs – Hierarchical aggregation means fewer full‑model uploads to the cloud; only cluster‑level aggregates travel upward, which is attractive for bandwidth‑constrained environments.
  • Faster Time‑to‑Insight – By sharing distilled knowledge, new clusters can bootstrap their models faster, shortening the cold‑start period after device onboarding.
  • Compliance & Privacy – The method respects data locality (no raw data leaves the device) and adds only lightweight logits for distillation, easing regulatory concerns.
  • Tooling Integration – CFLHKD can be plugged into existing federated‑learning platforms (TensorFlow Federated, PySyft, Flower) by extending the aggregation hook and adding a distillation step (see the sketch below), making adoption relatively low‑effort for developers.
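
As a rough picture of the server‑side aggregation hook mentioned in the Tooling Integration point, the sketch below performs a sample‑weighted average of cluster models into a global model. The function signature and the assumption that models arrive as lists of per‑layer arrays are illustrative, not an existing platform API.

```python
# Framework-agnostic sketch of the cluster -> cloud aggregation hook.
import numpy as np

def aggregate_clusters(
    cluster_models: list[list[np.ndarray]],  # per-cluster weights (list of layer arrays)
    cluster_sizes: list[int],                # number of samples represented by each cluster
) -> list[np.ndarray]:
    """Sample-weighted average of cluster models into a single global model."""
    total = float(sum(cluster_sizes))
    num_layers = len(cluster_models[0])
    global_model = []
    for layer in range(num_layers):
        weighted = sum(
            (n / total) * model[layer]
            for model, n in zip(cluster_models, cluster_sizes)
        )
        global_model.append(weighted)
    return global_model
```

In a platform such as Flower, logic like this would typically live in a custom server‑side strategy; in TensorFlow Federated, in a custom aggregation process.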

Limitations & Future Work

  • Clustering Overhead – The initial client clustering step relies on similarity metrics that may be costly for very large populations; adaptive or online clustering strategies are needed.
  • Scalability of Distillation – Multi‑teacher distillation introduces extra computation on edge devices (soft‑target generation and loss calculation). Optimizing this for low‑power hardware remains an open challenge.
  • Non‑IID Extreme Cases – While CFLHKD improves robustness to heterogeneity, performance gaps still appear when clusters are extremely divergent (e.g., image vs. time‑series data).
  • Future Directions suggested by the authors include:
    • Dynamic re‑clustering during training to adapt to drift in data distributions.
    • Hierarchical knowledge distillation across more than two levels (e.g., edge → regional hub → cloud).
    • Exploration of privacy‑preserving distillation (e.g., differential‑private logits).

Overall, CFLHKD offers a compelling blend of personalization and global knowledge sharing that aligns well with the practical needs of modern federated‑learning deployments.

Authors

  • Sabtain Ahmad
  • Meerzhan Kanatbekova
  • Ivona Brandic
  • Atakan Aral

Paper Information

  • arXiv ID: 2512.10443v1
  • Categories: cs.DC, cs.AI, cs.LG
  • Published: December 11, 2025