[Paper] FedZMG: Efficient Client-Side Optimization in Federated Learning
Source: arXiv - 2602.18384v1
Overview
Federated Learning (FL) lets edge devices train a shared model without sending raw data to a central server. In real‑world deployments the data on each device is often non‑IID (e.g., different users type different words or capture different visual scenes), which causes “client‑drift” and slows convergence. The paper FedZMG: Efficient Client‑Side Optimization in Federated Learning proposes a lightweight, parameter‑free optimizer—Federated Zero‑Mean Gradients (FedZMG)—that mitigates client‑drift by projecting local gradients onto a zero‑mean hyperplane, eliminating systematic bias without extra communication or hyper‑parameter tuning.
Key Contributions
- FedZMG algorithm – a novel client‑side optimizer that centralizes gradients (zero‑mean projection) to reduce variance caused by heterogeneous data.
- Parameter‑free design – no learning‑rate schedules, momentum terms, or extra hyper‑parameters, making it ideal for constrained IoT devices.
- Theoretical guarantees – proof that FedZMG lowers the effective gradient variance and yields tighter convergence bounds than vanilla FedAvg.
- Comprehensive empirical study – experiments on EMNIST, CIFAR‑100, and Shakespeare show faster convergence and higher final accuracy versus FedAvg and FedAdam, especially under severe non‑IID partitions.
- Zero communication overhead – the optimizer operates entirely on the client; the server sees the same aggregated updates as in standard FL, preserving bandwidth.
Methodology
Gradient Centralization in FL
- Traditional FL aggregates raw local gradients (or model deltas). FedZMG first computes the mean of a client’s gradient vector and subtracts it, forcing the gradient to lie on a hyperplane whose coordinates sum to zero.
- Mathematically:
\[ \tilde{g} = g - \frac{1}{d}\left(\mathbf{1}^{\top} g\right)\mathbf{1} \]
where \(g\) is the raw gradient, \(d\) its dimensionality, and \(\mathbf{1}\) a vector of ones.
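In practice the projection amounts to subtracting the coordinate mean. A minimal sketch (not the authors' released code; assumes a flat NumPy gradient vector):

```python
import numpy as np

def zero_mean_project(g: np.ndarray) -> np.ndarray:
    """Project a gradient onto the zero-mean hyperplane:
    g_tilde = g - (1/d) * (1^T g) * 1, so its coordinates sum to zero."""
    return g - g.mean()

# Example: the projected gradient sums to (numerically) zero.
g = np.array([0.5, -1.2, 3.0, 0.7])
g_tilde = zero_mean_project(g)
```

Note that `g.mean()` is exactly \(\frac{1}{d}\mathbf{1}^{\top} g\), so the one-liner matches the formula above.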
Local Update Loop
- Each client performs standard SGD on its local data, but replaces the raw gradient with the zero‑mean version before the weight update. No extra state (e.g., momentum buffers) is stored.
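A hedged sketch of the local loop: plain SGD where the only change from FedAvg is the mean subtraction. The weights, gradients, and learning rate here are illustrative placeholders, not values from the paper.

```python
import numpy as np

def local_update(w: np.ndarray, grads, lr: float = 0.1) -> np.ndarray:
    """FedZMG-style local steps: subtract each gradient's mean before
    the SGD update. No extra state (e.g., momentum buffers) is kept."""
    w = w.copy()
    for g in grads:
        g_tilde = g - g.mean()   # zero-mean projection
        w -= lr * g_tilde        # standard SGD step
    return w

# Toy example with two synthetic gradient vectors.
w0 = np.zeros(4)
grads = [np.array([1.0, 2.0, 3.0, 2.0]), np.array([0.0, -1.0, 1.0, 0.0])]
w1 = local_update(w0, grads)
```

Because every applied update has zero mean, the per-round weight delta also sums to zero, which is what makes the server-side averaging step below unchanged.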
Server‑Side Aggregation
- The server receives the usual model updates (or weight deltas) from clients and averages them exactly as in FedAvg. Because the projection is linear, the aggregated update remains unbiased.
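Because the projection is linear, projecting each client's update and then averaging gives the same result as averaging first and projecting once. A quick sketch verifying this with synthetic deltas and the standard FedAvg mean (illustrative, not the paper's code):

```python
import numpy as np

def fedavg(deltas):
    """Plain FedAvg: uniform average of client weight deltas."""
    return np.mean(deltas, axis=0)

def project(v: np.ndarray) -> np.ndarray:
    """Zero-mean projection of a flat vector."""
    return v - v.mean()

# Linearity check: project-then-average equals average-then-project.
rng = np.random.default_rng(0)
deltas = [rng.normal(size=6) for _ in range(5)]
avg_of_proj = fedavg([project(d) for d in deltas])
proj_of_avg = project(fedavg(deltas))
```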
Theoretical Analysis
- The authors bound the variance of \(\tilde{g}\) relative to \(g\) and show that the expected squared norm shrinks by a factor proportional to the data heterogeneity.
- Using standard FL convergence proofs, they derive a tighter bound on the distance to the optimal model after \(T\) communication rounds.
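One way to see the norm-shrinkage claim concretely: subtracting the coordinate mean removes exactly \(d\,\bar{g}^2\) from the squared norm, so it can never grow. A numeric check of this identity (an illustration of the mechanism, not the paper's formal bound):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(loc=0.3, scale=1.0, size=1000)  # biased gradient coordinates
d = g.size
g_tilde = g - g.mean()

# Identity: ||g_tilde||^2 = ||g||^2 - d * mean(g)^2  (norm never grows).
lhs = np.sum(g_tilde ** 2)
rhs = np.sum(g ** 2) - d * g.mean() ** 2
```

The larger the systematic bias (the mean) of a client's gradient, the more the projection removes, which matches the intuition that gains grow with heterogeneity.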
Experimental Setup
- Non‑IID partitions were created by Dirichlet sampling (α = 0.1, 0.5) to simulate realistic client skew.
- Baselines: FedAvg (plain SGD) and FedAdam (adaptive server‑side optimizer).
- Metrics: training loss, validation accuracy, and number of communication rounds to reach a target accuracy.
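The Dirichlet partitioning step can be sketched as follows. This is a common recipe for simulating label skew; the class, sample, and client counts are illustrative, not the paper's exact script.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet
    proportions; smaller alpha -> more skewed (non-IID) partitions."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw this class's client shares and cut the index list.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_idx, np.split(idx, cuts)):
            client.extend(part.tolist())
    return client_idx

# Toy example: 10 classes, 1000 samples, 5 clients, alpha = 0.1.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, n_clients=5, alpha=0.1)
```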
Results & Findings
| Dataset | Non‑IID level | FedAvg (acc) | FedAdam (acc) | FedZMG (acc) |
|---|---|---|---|---|
| EMNIST | α = 0.1 | 78.3 % | 80.1 % | 82.7 % |
| CIFAR‑100 | α = 0.5 | 45.6 % | 47.2 % | 49.8 % |
| Shakespeare | α = 0.1 | 62.4 % | 64.0 % | 66.5 % |
- Faster convergence – FedZMG reaches 80 % of the final accuracy in ~30 % fewer communication rounds than FedAvg.
- Reduced gradient variance – empirical variance of client updates drops by ~35 % after zero‑mean projection, confirming the theoretical claim.
- Negligible overhead – the extra computation is a single mean subtraction per gradient vector, adding <0.5 ms on a typical ARM Cortex‑M4 microcontroller.
- Robustness to heterogeneity – performance gains increase as data becomes more skewed (lower α), indicating that FedZMG directly addresses client‑drift.
Practical Implications
- Edge‑friendly FL – Since FedZMG requires no extra memory or hyper‑parameter tuning, it can be dropped into existing FL pipelines on smartphones, wearables, or low‑power sensors without code changes on the server.
- Bandwidth savings – Faster convergence translates to fewer global aggregation rounds, cutting down on uplink traffic—a critical factor for cellular or satellite‑connected devices.
- Simplified deployment – Teams can avoid the trial‑and‑error of tuning learning‑rate schedules or momentum for each device class; the optimizer works “out‑of‑the‑box.”
- Security & privacy – The method does not expose additional statistics (e.g., gradient means) to the server, preserving the privacy guarantees of standard FL.
- Compatibility – FedZMG can be combined with other FL enhancements (e.g., compression, secure aggregation) because it only modifies the local gradient before the usual weight‑delta is sent.
Limitations & Future Work
- Assumption of full‑batch gradient centralization – The analysis assumes gradients are computed over the entire local dataset per round; stochastic mini‑batch variants may introduce bias that needs further study.
- Limited to SGD‑style updates – The projection could in principle be applied inside Adam‑style local optimizers, but the paper evaluates only plain SGD and does not explore this direction.
- Evaluation on larger-scale real‑world deployments – Experiments are confined to academic benchmarks; testing on production‑scale FL (e.g., keyboard prediction across millions of users) would validate scalability.
- Potential interaction with differential privacy – Adding noise for privacy may interfere with the zero‑mean property; future work could analyze combined effects.
Overall, FedZMG offers a pragmatic, low‑cost way to boost federated learning performance on heterogeneous, resource‑constrained devices, opening the door for more responsive and privacy‑preserving AI services at the edge.
Authors
- Fotios Zantalis
- Evangelos Zervas
- Grigorios Koulouras
Paper Information
- arXiv ID: 2602.18384v1
- Categories: cs.LG, cs.AI
- Published: February 20, 2026