[Paper] Lean Clients, Full Accuracy: Hybrid Zeroth- and First-Order Split Federated Learning
Source: arXiv - 2601.09076v1
Overview
This paper tackles a long‑standing bottleneck in Split Federated Learning (SFL): the heavy memory and compute load that edge devices must bear during back‑propagation. The authors introduce HERON‑SFL, a hybrid training scheme that swaps the client‑side gradient‑based (first‑order) updates for cheap zeroth‑order (ZO) approximations while keeping the server‑side updates first‑order. The result is a system that can train modern deep nets on thin devices with far less memory and compute, without sacrificing model accuracy.
Key Contributions
- Hybrid Optimization Framework – Combines zeroth‑order updates on clients with first‑order updates on the server, preserving overall training fidelity.
- Auxiliary Network‑Assisted ZO Updates – Uses lightweight “assistant” networks to generate perturbed forward passes, enabling gradient‑free updates that avoid activation caching.
- Theoretical Guarantees – Proves convergence rates that are independent of model dimensionality under a low‑effective‑rank assumption, sidestepping the usual curse of dimensionality in ZO methods.
- Empirical Validation – Demonstrates on ResNet image classification and language‑model fine‑tuning that HERON‑SFL attains benchmark accuracy while cutting client peak memory by up to 64 % and per‑step compute by up to 33 %.
- Scalability Blueprint – Shows how the approach can extend SFL to models previously out‑of‑reach for edge devices (e.g., larger CNNs, transformer‑based LMs).
Methodology
- Split Architecture – The model is divided into a client‑side front‑end and a server‑side back‑end. The client processes raw data, forwards the intermediate activation (the “cut‑layer” output) to the server, and receives the server’s response for loss computation.
- Zeroth‑Order Client Update
- Instead of back‑propagating through the client network, each client samples a small set of random perturbations $\boldsymbol{\delta}$ and evaluates the loss using forward passes only, with perturbed client parameters $\mathbf{w} + \boldsymbol{\delta}$ (a generic form of this estimator and a code sketch appear after this list).
- Using a finite‑difference estimator (e.g., two‑point or Gaussian smoothing), the client builds an approximate gradient for its parameters without ever storing activations.
- An auxiliary network, much smaller than the main client model, generates the perturbations and provides a low‑cost way to compute the ZO step.
- First‑Order Server Update
- The server receives the intermediate activations, computes the true gradient for its own (larger) back‑end, and performs a standard SGD/Adam step.
- The server’s response is returned to the clients, completing one global iteration.
- Hybrid Loop – The training loop alternates: clients perform cheap ZO steps locally; the server aggregates and applies FO updates. The process repeats until convergence.
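The summary does not spell out the exact estimator, but the two‑point / Gaussian‑smoothing family referenced in the client update typically takes the following form, shown here only as a reference point (the sample count $q$ and smoothing radius $\mu$ are generic symbols, not notation taken from the paper):

$$
\hat{\nabla}\ell(\mathbf{w}) \;=\; \frac{1}{q}\sum_{i=1}^{q}\frac{\ell(\mathbf{w}+\mu\mathbf{u}_i)-\ell(\mathbf{w}-\mu\mathbf{u}_i)}{2\mu}\,\mathbf{u}_i,\qquad \mathbf{u}_i\sim\mathcal{N}(\mathbf{0},\mathbf{I}).
$$

Only forward evaluations of the loss $\ell$ are needed, which is why no activations have to be cached on the client.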
The low‑effective‑rank assumption (the Jacobian of the loss w.r.t. client parameters lies in a low‑dimensional subspace) lets the authors bound the variance of the ZO estimator, yielding a convergence rate that does not blow up with the number of parameters.
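Below is a minimal sketch of the client‑side ZO step in this loop, assuming PyTorch‑style modules and the two‑point estimator above; `loss_from_server`, the hyper‑parameters, and all names are illustrative, not the authors’ code:

```python
import torch

def zo_client_step(client_net, x, loss_from_server, lr=1e-3, mu=1e-3, q=4):
    """Forward-only update of the client front-end (hypothetical sketch).

    `loss_from_server` is any callable mapping a cut-layer activation to a
    scalar loss, e.g. by sending the activation to the server and receiving
    the loss value back.
    """
    params = list(client_net.parameters())
    grad_est = [torch.zeros_like(p) for p in params]

    with torch.no_grad():                     # no autograd graph, no cached activations
        for _ in range(q):
            u = [torch.randn_like(p) for p in params]

            for p, d in zip(params, u):       # evaluate loss at w + mu*u
                p.add_(mu * d)
            loss_plus = loss_from_server(client_net(x))

            for p, d in zip(params, u):       # evaluate loss at w - mu*u
                p.sub_(2 * mu * d)
            loss_minus = loss_from_server(client_net(x))

            for p, d in zip(params, u):       # restore original weights
                p.add_(mu * d)

            coeff = (loss_plus - loss_minus) / (2 * mu * q)
            for g, d in zip(grad_est, u):     # accumulate two-point estimate
                g.add_(coeff * d)

        for p, g in zip(params, grad_est):    # SGD-style client update
            p.sub_(lr * g)
```

The server‑side half of the iteration is a standard first‑order step: it runs the received activation through the back‑end, back‑propagates through its own parameters, and returns the loss value the client needs.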
Results & Findings
| Task | Model | Baseline (FO‑SFL) | HERON‑SFL | Memory Reduction | Compute Reduction |
|---|---|---|---|---|---|
| Image classification (CIFAR‑10) | ResNet‑18 | 93.2 % acc | 93.0 % acc | ↓ 64 % | ↓ 33 % |
| LM fine‑tuning (GPT‑2 small) | GPT‑2‑124M | 84.5 ppl | 84.3 ppl | ↓ 58 % | ↓ 30 % |
- Accuracy Parity – Across all benchmarks, HERON‑SFL stays within 0.2 points of the full first‑order baseline.
- Memory Footprint – Peak client memory drops dramatically because no activation maps need to be stored for back‑propagation.
- Compute Savings – Each client step requires only forward passes (plus cheap perturbation generation), cutting FLOPs per iteration.
- Scalability Test – Training a ResNet‑50 (≈25 M parameters) on a Raspberry Pi‑class device becomes feasible, whereas vanilla SFL crashes due to OOM.
Ablation studies show that auxiliary‑network size and the number of ZO samples trade off overhead against estimator variance.
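As a heuristic for this trade‑off (a standard property of Gaussian‑smoothing estimators, not a bound quoted from the paper): the estimator’s error shrinks roughly like $1/q$ in the number of ZO samples and grows with the dimension being perturbed, which under the low‑effective‑rank assumption is the effective rank $r$ rather than the full parameter count $d$:

$$
\mathbb{E}\,\big\|\hat{\nabla}\ell(\mathbf{w}) - \nabla\ell(\mathbf{w})\big\|^2 \;\approx\; O\!\Big(\tfrac{r}{q}\Big)\,\|\nabla\ell(\mathbf{w})\|^2 \;+\; \text{smoothing bias},
$$

so more samples buy lower variance at the cost of extra forward passes per step.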
Practical Implications
- Edge‑AI Deployment – Companies can now push more sophisticated models (e.g., vision transformers, medium‑size LMs) to IoT devices, wearables, or smartphones without redesigning the model architecture.
- Reduced Bandwidth Costs – Fewer backward messages mean less server‑to‑client traffic per round, which is crucial for cellular or satellite‑connected devices.
- Energy Efficiency – Forward‑only computation consumes less power, extending battery life for on‑device learning scenarios (personalization, continual learning).
- Simplified SDKs – Developers can integrate HERON‑SFL into existing federated‑learning frameworks (TensorFlow Federated, PySyft) with minimal changes: swap the client optimizer for a ZO wrapper (see the sketch after this list).
- Regulatory & Privacy Benefits – By keeping more computation on‑device and limiting gradient leakage, the approach aligns well with privacy‑by‑design mandates (GDPR, HIPAA).
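As a concrete illustration of the “swap the client optimizer for a ZO wrapper” path above, here is a hedged sketch of an optimizer‑style facade around a single two‑point ZO step; the class name, interface, and defaults are hypothetical, not an API from the paper or from any federated‑learning framework:

```python
import torch

class ZOClientOptimizer:
    """Drop-in stand-in for a first-order optimizer on the client front-end.

    Exposes step(closure), where `closure` runs a forward pass and returns the
    scalar loss; backward() is never called. Names are hypothetical.
    """

    def __init__(self, params, lr=1e-3, mu=1e-3):
        self.params, self.lr, self.mu = list(params), lr, mu

    @torch.no_grad()
    def step(self, closure):
        u = [torch.randn_like(p) for p in self.params]
        for p, d in zip(self.params, u):          # evaluate at w + mu*u
            p.add_(self.mu * d)
        f_plus = closure()
        for p, d in zip(self.params, u):          # evaluate at w - mu*u
            p.sub_(2 * self.mu * d)
        f_minus = closure()
        scale = self.lr * (f_plus - f_minus) / (2 * self.mu)
        for p, d in zip(self.params, u):          # restore weights, then SGD-style step
            p.add_(self.mu * d)
            p.sub_(scale * d)
```

An existing client loop would then replace the usual `loss.backward(); optimizer.step()` pair with `optimizer.step(closure)`, where the closure forwards a batch through the client front‑end and returns the loss obtained from the server.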
Limitations & Future Work
- Zeroth‑Order Variance – Although the low‑rank assumption mitigates it, ZO estimators still introduce extra stochasticity, which may affect convergence speed on highly non‑convex tasks.
- Auxiliary Network Overhead – The helper network adds parameters and inference cost; finding the optimal size for diverse hardware remains an open engineering problem.
- Server Load – The server still performs full back‑propagation for the back‑end, which could become a bottleneck in massive‑scale deployments.
- Theoretical Scope – The convergence proof assumes smooth losses and bounded perturbations; extending it to non‑smooth objectives (e.g., quantized models) is future work.
- Broader Benchmarks – Experiments focus on image classification and LM fine‑tuning; evaluating HERON‑SFL on reinforcement learning, graph neural networks, or multimodal models would strengthen its generality.
Authors
- Zhoubin Kou
- Zihan Chen
- Jing Yang
- Cong Shen
Paper Information
- arXiv ID: 2601.09076v1
- Categories: cs.LG, cs.DC, cs.IT, cs.NI, eess.SP
- Published: January 14, 2026