[Paper] Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure

Published: (April 17, 2026 at 10:21 AM EDT)
5 min read
Source: arXiv - 2604.16090v1

Overview

Federated Learning (FL) lets millions of edge devices collaboratively train models without sharing raw data, but the reality of flaky connections, battery limits, and user mobility makes synchronization a nightmare. The paper Robust Synchronisation for Federated Learning in the Face of Correlated Device Failure proposes Availability‑Weighted PSP (AW‑PSP), an upgrade to the Probabilistic Synchronous Parallel (PSP) protocol that dynamically re‑weights which devices get sampled based on real‑time availability forecasts and failure‑correlation signals. The result is a more balanced training process that captures data from rarely‑online devices while keeping the system fast and scalable.

Key Contributions

  • Availability‑Weighted PSP (AW‑PSP): Extends PSP with a probabilistic sampler that incorporates per‑device availability predictions and correlation metrics.
  • Markov‑based availability predictor: Distinguishes short‑lived (transient) outages from chronic failures, feeding the sampler with up‑to‑date reliability scores.
  • Decentralized metadata store via DHT: Keeps latency, freshness, and utility scores for each node distributed across the network, avoiding a single point of failure.
  • Fairness‑aware sampling: Demonstrates reduced variance in label coverage across device groups, mitigating the “rich‑get‑richer” bias of vanilla PSP.
  • Trace‑driven evaluation: Uses realistic device‑failure logs to show AW‑PSP improves robustness, label coverage, and overall model accuracy compared with standard PSP and naïve FL baselines.

Methodology

  1. Baseline – PSP: In each training round, PSP randomly selects a subset of devices (e.g., 10 % of the fleet) and waits only for those participants, cutting down on straggler delays.

  2. Problem identification: PSP assumes device availability is independent and static. In practice, devices that are frequently offline also tend to hold unique data (e.g., minority language inputs), leading to systematic under‑representation.

  3. Availability modeling:

    • Each device maintains a state machine (online → offline → online …) whose transition probabilities are learned online via a simple Markov chain.
    • The chain outputs an availability score (high for consistently reachable devices, low for chronic drop‑outs).
  4. Correlation detection: Historical logs are examined for patterns where groups of devices fail together (e.g., same network provider, geographic region). A lightweight correlation matrix is updated continuously.

  5. Weighted sampling: The probability of picking a device i for the next round is:

    \[ p_i = \frac{w_i}{\sum_j w_j}, \qquad w_i = \frac{1}{\text{availability}_i} \times \frac{1}{1 + \text{correlation}_i} \]

    This boosts the chance of rarely‑online devices while penalizing clusters that tend to fail together.

  6. Decentralized metadata via DHT: Nodes publish their current latency, freshness, and utility scores to a Distributed Hash Table, enabling any coordinator to retrieve up‑to‑date sampling weights without a central registry.

  7. Evaluation pipeline: The authors replayed real‑world device‑availability traces (derived from a large‑scale mobile app) on a simulated FL environment, comparing AW‑PSP against vanilla PSP and a fully synchronous FL baseline.
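
The availability modeling and weighted sampling steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, and the two-state Markov chain uses a simple Laplace-smoothed count estimate of the transition probabilities.

```python
class MarkovAvailability:
    """Two-state (online/offline) Markov chain, learned online.

    Illustrative sketch of the paper's availability predictor;
    all names here are hypothetical.
    """

    def __init__(self):
        # Transition counts with a Laplace prior of 1 to avoid zero probabilities:
        # counts[prev_state][next_state], states are True (online) / False (offline).
        self.counts = {s: {True: 1, False: 1} for s in (True, False)}
        self.prev = True

    def observe(self, online: bool):
        """Record one heartbeat observation and update transition counts."""
        self.counts[self.prev][online] += 1
        self.prev = online

    def availability(self) -> float:
        """Stationary probability of being online:
        pi_on = p(off->on) / (p(on->off) + p(off->on))."""
        p_on_off = self.counts[True][False] / sum(self.counts[True].values())
        p_off_on = self.counts[False][True] / sum(self.counts[False].values())
        return p_off_on / (p_on_off + p_off_on)


def sampling_weights(availability, correlation):
    """Weighted sampling from the paper's formula:
    w_i = (1 / availability_i) * 1 / (1 + correlation_i),
    normalised so the weights form a probability distribution."""
    w = [(1.0 / a) * (1.0 / (1.0 + c)) for a, c in zip(availability, correlation)]
    total = sum(w)
    return [x / total for x in w]
```

As a quick sanity check, a device with availability 0.3 and no failure correlation receives a higher sampling probability than one with availability 0.9, which is exactly the rebalancing effect the weighting is meant to produce.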

Results & Findings

| Metric | Vanilla PSP | Fully Synchronous FL | AW‑PSP (proposed) |
| --- | --- | --- | --- |
| Overall test accuracy | 84.2 % | 85.0 % | 86.3 % |
| Label coverage (fraction of classes seen per round) | 71 % | 78 % | 84 % |
| Fairness variance (std. dev. of per‑device contribution) | 0.19 | 0.12 | 0.08 |
| Average round latency | 1.8 s | 4.5 s | 2.0 s |
| Accuracy under correlated failures (drop rate ↑ 30 %) | 62 % | 58 % | 71 % |

Takeaway: By re‑balancing the sampling probabilities, AW‑PSP not only improves model quality but does so without sacrificing the low‑latency advantage of PSP. The fairness variance drop indicates a more equitable contribution from all device cohorts, which is crucial for avoiding hidden bias in production models.

Practical Implications

  • Better model generalization: Developers deploying FL for voice assistants, predictive keyboards, or IoT anomaly detection will see fewer blind spots caused by under‑sampled user groups.
  • Scalable edge orchestration: The DHT‑based metadata layer removes the need for a heavyweight central scheduler, fitting naturally into existing peer‑to‑peer or server‑less edge frameworks.
  • Energy‑aware training: By recognizing chronic failures (often due to battery constraints), the system can avoid repeatedly pinging devices that are unlikely to respond, extending device battery life.
  • Regulatory compliance: Fairness‑aware sampling helps meet emerging AI accountability standards that require demonstrable mitigation of demographic bias.
  • Plug‑and‑play upgrade: AW‑PSP can be layered on top of existing FL toolkits (TensorFlow Federated, PySyft) with minimal code changes—primarily a new sampler and a lightweight availability predictor module.
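
To make the DHT-based orchestration point concrete, here is a minimal in-memory stand-in for the metadata layer. A real deployment would use a Kademlia-style DHT; this sketch only shows the publish/lookup pattern the paper describes, and all identifiers are illustrative, not from the paper or any existing library.

```python
import time


class MetadataDHT:
    """In-memory mock of the paper's decentralized metadata store.

    Each device publishes its own latency, freshness, and utility scores;
    any coordinator can pull a snapshot to compute sampling weights,
    with no central registry involved.
    """

    def __init__(self):
        self._store = {}  # a real DHT would shard this across nodes

    def publish(self, device_id, latency_ms, freshness, utility):
        """Called by a device to advertise its current scores."""
        self._store[device_id] = {
            "latency_ms": latency_ms,
            "freshness": freshness,
            "utility": utility,
            "published_at": time.time(),
        }

    def lookup(self, device_id):
        """Fetch one device's record, or None if it never published."""
        return self._store.get(device_id)

    def snapshot(self):
        """Return all current records, e.g. for a coordinator
        computing the next round's sampling weights."""
        return dict(self._store)
```

Swapping this mock for an actual DHT client would not change the calling code, which is what makes the metadata layer a drop-in fit for peer-to-peer edge frameworks.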

Limitations & Future Work

  • Prediction overhead: Maintaining per‑device Markov models and correlation matrices adds modest CPU and storage costs on the coordinator side; the paper suggests off‑loading this to edge nodes as a next step.
  • Assumption of stationary correlation: The current correlation estimator treats patterns as slowly varying; rapid network‑level changes (e.g., a sudden carrier outage) may temporarily degrade sampling quality.
  • Evaluation on synthetic traces: While the authors used real‑world logs, the full end‑to‑end impact on live production FL pipelines (with secure aggregation, differential privacy, etc.) remains to be validated.
  • Security considerations: Exposing availability scores in a DHT could be gamed by malicious actors; future work could explore cryptographic proofs or reputation systems to harden the metadata layer.

Overall, AW‑PSP offers a pragmatic path toward more robust, fair, and efficient federated learning in the wild—an advance that developers building next‑generation edge AI should keep on their radar.

Authors

  • Stefan Behfar
  • Richard Mortier

Paper Information

  • arXiv ID: 2604.16090v1
  • Categories: cs.DC, cs.AI
  • Published: April 17, 2026
  • PDF: Download PDF