[Paper] FedZMG: Efficient Client-Side Optimization in Federated Learning
Source: arXiv - 2602.18384v1
Overview
Federated Learning (FL) lets edge devices train a shared model without sending raw data to a central server. In real‑world deployments the data on each device is often non‑IID (e.g., different users type different words or capture different visual scenes), which causes “client‑drift” and slows convergence. The paper FedZMG: Efficient Client‑Side Optimization in Federated Learning proposes a lightweight, parameter‑free optimizer—Federated Zero‑Mean Gradients (FedZMG)—that mitigates client‑drift by projecting local gradients onto a zero‑mean hyperplane, eliminating systematic bias without extra communication or hyper‑parameter tuning.
Key Contributions
- FedZMG algorithm – a novel client‑side optimizer that centralizes gradients (zero‑mean projection) to reduce variance caused by heterogeneous data.
- Parameter‑free design – no learning‑rate schedules, momentum terms, or extra hyper‑parameters, making it ideal for constrained IoT devices.
- Theoretical guarantees – proof that FedZMG lowers the effective gradient variance and yields tighter convergence bounds than vanilla FedAvg.
- Comprehensive empirical study – experiments on EMNIST, CIFAR‑100, and Shakespeare show faster convergence and higher final accuracy versus FedAvg and FedAdam, especially under severe non‑IID partitions.
- Zero communication overhead – the optimizer operates entirely on the client; the server sees the same aggregated updates as in standard FL, preserving bandwidth.
Methodology
Gradient Centralization in FL
- Traditional FL aggregates raw local gradients (or model deltas). FedZMG first computes the mean of a client’s gradient vector and subtracts it, forcing the gradient to lie on a hyperplane whose coordinates sum to zero.
- Mathematically:
\[ \tilde{g} = g - \frac{1}{d}\left(\mathbf{1}^{\top} g\right)\mathbf{1} \]
where \(g\) is the raw gradient, \(d\) its dimensionality, and \(\mathbf{1}\) a vector of ones.
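In practice the projection amounts to subtracting the coordinate mean. A minimal sketch (not the authors' released code; assumes a flat NumPy gradient vector):

```python
import numpy as np

def zero_mean_project(g: np.ndarray) -> np.ndarray:
    """Project a gradient onto the zero-mean hyperplane:
    g_tilde = g - (1/d) * (1^T g) * 1, so its coordinates sum to zero."""
    return g - g.mean()

# Example: the projected gradient sums to (numerically) zero.
g = np.array([0.5, -1.2, 3.0, 0.7])
g_tilde = zero_mean_project(g)
```

Note that `g.mean()` is exactly \(\frac{1}{d}\mathbf{1}^{\top} g\), so the one-liner matches the formula above.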
Local Update Loop
- Each client performs standard SGD on its local data, but replaces the raw gradient with the zero‑mean version before the weight update. No extra state (e.g., momentum buffers) is stored.
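A hedged sketch of the local loop: plain SGD where the only change from FedAvg is the mean subtraction. The weights, gradients, and learning rate here are illustrative placeholders, not values from the paper.

```python
import numpy as np

def local_update(w: np.ndarray, grads, lr: float = 0.1) -> np.ndarray:
    """FedZMG-style local steps: subtract each gradient's mean before
    the SGD update. No extra state (e.g., momentum buffers) is kept."""
    w = w.copy()
    for g in grads:
        g_tilde = g - g.mean()   # zero-mean projection
        w -= lr * g_tilde        # standard SGD step
    return w

# Toy example with two synthetic gradient vectors.
w0 = np.zeros(4)
grads = [np.array([1.0, 2.0, 3.0, 2.0]), np.array([0.0, -1.0, 1.0, 0.0])]
w1 = local_update(w0, grads)
```

Because every applied update has zero mean, the per-round weight delta also sums to zero, which is what makes the server-side averaging step below unchanged.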
Server‑Side Aggregation
- The server receives the usual model updates (or weight deltas) from clients and averages them exactly as in FedAvg. Because the projection is linear, the aggregated update remains unbiased.
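Because the projection is linear, projecting each client's update and then averaging gives the same result as averaging first and projecting once. A quick sketch verifying this with synthetic deltas and the standard FedAvg mean (illustrative, not the paper's code):

```python
import numpy as np

def fedavg(deltas):
    """Plain FedAvg: uniform average of client weight deltas."""
    return np.mean(deltas, axis=0)

def project(v: np.ndarray) -> np.ndarray:
    """Zero-mean projection of a flat vector."""
    return v - v.mean()

# Linearity check: project-then-average equals average-then-project.
rng = np.random.default_rng(0)
deltas = [rng.normal(size=6) for _ in range(5)]
avg_of_proj = fedavg([project(d) for d in deltas])
proj_of_avg = project(fedavg(deltas))
```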
Theoretical Analysis
- The authors bound the variance of \(\tilde{g}\) relative to \(g\) and show that the expected squared norm shrinks by a factor proportional to the data heterogeneity.
- Using standard FL convergence proofs, they derive a tighter bound on the distance to the optimal model after \(T\) communication rounds.
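One way to see the norm-shrinkage claim concretely: subtracting the coordinate mean removes exactly \(d\,\bar{g}^2\) from the squared norm, so it can never grow. A numeric check of this identity (an illustration of the mechanism, not the paper's formal bound):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(loc=0.3, scale=1.0, size=1000)  # biased gradient coordinates
d = g.size
g_tilde = g - g.mean()

# Identity: ||g_tilde||^2 = ||g||^2 - d * mean(g)^2  (norm never grows).
lhs = np.sum(g_tilde ** 2)
rhs = np.sum(g ** 2) - d * g.mean() ** 2
```

The larger the systematic bias (the mean) of a client's gradient, the more the projection removes, which matches the intuition that gains grow with heterogeneity.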
Experimental Setup
- Non‑IID partitions were created by Dirichlet sampling (α = 0.1, 0.5) to simulate realistic client skew.
- Baselines: FedAvg (plain SGD) and FedAdam (adaptive server‑side optimizer).
- Metrics: training loss, validation accuracy, and number of communication rounds to reach a target accuracy.
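The Dirichlet partitioning step can be sketched as follows. This is a common recipe for simulating label skew; the class, sample, and client counts are illustrative, not the paper's exact script.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet
    proportions; smaller alpha -> more skewed (non-IID) partitions."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw this class's client shares and cut the index list.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_idx, np.split(idx, cuts)):
            client.extend(part.tolist())
    return client_idx

# Toy example: 10 classes, 1000 samples, 5 clients, alpha = 0.1.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, n_clients=5, alpha=0.1)
```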
Results & Findings
| Dataset | Non‑IID level | FedAvg (acc) | FedAdam (acc) | FedZMG (acc) |
|---|---|---|---|---|
| EMNIST | α = 0.1 | 78.3 % | 80.1 % | 82.7 % |
| CIFAR‑100 | α = 0.5 | 45.6 % | 47.2 % | 49.8 % |
| Shakespeare | α = 0.1 | 62.4 % | 64.0 % | 66.5 % |
- Faster convergence – FedZMG reaches 80 % of the final accuracy in ~30 % fewer communication rounds than FedAvg.
- Reduced gradient variance – empirical variance of client updates drops by ~35 % after zero‑mean projection, confirming the theoretical claim.
- Negligible overhead – the extra computation is a single mean subtraction per gradient vector, adding <0.5 ms on a typical ARM Cortex‑M4 microcontroller.
- Robustness to heterogeneity – performance gains increase as data becomes more skewed (lower α), indicating that FedZMG directly addresses client‑drift.
Practical Implications
- Edge‑friendly FL – Since FedZMG requires no extra memory or hyper‑parameter tuning, it can be dropped into existing FL pipelines on smartphones, wearables, or low‑power sensors without code changes on the server.
- Bandwidth savings – Faster convergence translates to fewer global aggregation rounds, cutting down on uplink traffic—a critical factor for cellular or satellite‑connected devices.
- Simplified deployment – Teams can avoid the trial‑and‑error of tuning learning‑rate schedules or momentum for each device class; the optimizer works “out‑of‑the‑box.”
- Security & privacy – The method does not expose additional statistics (e.g., gradient means) to the server, preserving the privacy guarantees of standard FL.
- Compatibility – FedZMG can be combined with other FL enhancements (e.g., compression, secure aggregation) because it only modifies the local gradient before the usual weight‑delta is sent.
Limitations & Future Work
- Assumption of full‑batch gradient centralization – The analysis assumes gradients are computed over the entire local dataset per round; stochastic mini‑batch variants may introduce bias that needs further study.
- Limited to SGD‑style updates – The projection could in principle be applied inside Adam‑style local optimizers, but the paper evaluates only plain SGD and does not explore this direction.
- Evaluation on larger-scale real‑world deployments – Experiments are confined to academic benchmarks; testing on production‑scale FL (e.g., keyboard prediction across millions of users) would validate scalability.
- Potential interaction with differential privacy – Adding noise for privacy may interfere with the zero‑mean property; future work could analyze combined effects.
Overall, FedZMG offers a pragmatic, low‑cost way to boost federated learning performance on heterogeneous, resource‑constrained devices, opening the door for more responsive and privacy‑preserving AI services at the edge.
Authors
- Fotios Zantalis
- Evangelos Zervas
- Grigorios Koulouras
Paper Information
- arXiv ID: 2602.18384v1
- Categories: cs.LG, cs.AI
- Published: February 20, 2026