[Paper] Lean Clients, Full Accuracy: Hybrid Zeroth- and First-Order Split Federated Learning
Source: arXiv - 2601.09076v1
Overview
This paper tackles a long‑standing bottleneck in Split Federated Learning (SFL): the heavy memory and compute load that edge devices must bear during back‑propagation. The authors introduce HERON‑SFL, a hybrid training scheme that swaps the client‑side gradient‑based (first‑order) updates for cheap zeroth‑order (ZO) approximations while keeping the server‑side updates first‑order. The result is a system that can train modern deep nets on thin devices with far less memory and compute, without sacrificing model accuracy.
Key Contributions
- Hybrid Optimization Framework – Combines zeroth‑order updates on clients with first‑order updates on the server, preserving overall training fidelity.
- Auxiliary Network‑Assisted ZO Updates – Uses lightweight “assistant” networks to generate perturbed forward passes, enabling gradient‑free updates that avoid activation caching.
- Theoretical Guarantees – Proves convergence rates that are independent of model dimensionality under a low‑effective‑rank assumption, sidestepping the usual curse of dimensionality in ZO methods.
- Empirical Validation – Demonstrates on ResNet image classification and language‑model fine‑tuning that HERON‑SFL attains benchmark accuracy while cutting client peak memory by up to 64 % and per‑step compute by up to 33 %.
- Scalability Blueprint – Shows how the approach can extend SFL to models previously out‑of‑reach for edge devices (e.g., larger CNNs, transformer‑based LMs).
Methodology
- Split Architecture – The model is divided into a client‑side front‑end and a server‑side back‑end. The client processes raw data, forwards the intermediate activation (the “cut‑layer” output) to the server, and receives the server’s response for loss computation.
- Zeroth‑Order Client Update
- Instead of back‑propagating through the client network, each client samples a small set of random perturbations $\boldsymbol{\delta}$ and evaluates the loss using forward passes only, with perturbed client parameters $\mathbf{w} + \boldsymbol{\delta}$ (a generic form of this estimator and a code sketch appear after this list).
- Using a finite‑difference estimator (e.g., two‑point or Gaussian smoothing), the client builds an approximate gradient for its parameters without ever storing activations.
- An auxiliary network, much smaller than the main client model, generates the perturbations and provides a low‑cost way to compute the ZO step.
- First‑Order Server Update
- The server receives the intermediate activations, computes the true gradient for its own (larger) back‑end, and performs a standard SGD/Adam step.
- The server’s response is returned to the clients, completing one global iteration.
- Hybrid Loop – The training loop alternates: clients perform cheap ZO steps locally; the server aggregates and applies FO updates. The process repeats until convergence.
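The summary does not spell out the exact estimator, but the two‑point / Gaussian‑smoothing family referenced in the client update typically takes the following form, shown here only as a reference point (the sample count $q$ and smoothing radius $\mu$ are generic symbols, not notation taken from the paper):

$$
\hat{\nabla}\ell(\mathbf{w}) \;=\; \frac{1}{q}\sum_{i=1}^{q}\frac{\ell(\mathbf{w}+\mu\mathbf{u}_i)-\ell(\mathbf{w}-\mu\mathbf{u}_i)}{2\mu}\,\mathbf{u}_i,\qquad \mathbf{u}_i\sim\mathcal{N}(\mathbf{0},\mathbf{I}).
$$

Only forward evaluations of the loss $\ell$ are needed, which is why no activations have to be cached on the client.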
The low‑effective‑rank assumption (the Jacobian of the loss w.r.t. client parameters lies in a low‑dimensional subspace) lets the authors bound the variance of the ZO estimator, yielding a convergence rate that does not blow up with the number of parameters.
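Below is a minimal sketch of the client‑side ZO step in this loop, assuming PyTorch‑style modules and the two‑point estimator above; `loss_from_server`, the hyper‑parameters, and all names are illustrative, not the authors’ code:

```python
import torch

def zo_client_step(client_net, x, loss_from_server, lr=1e-3, mu=1e-3, q=4):
    """Forward-only update of the client front-end (hypothetical sketch).

    `loss_from_server` is any callable mapping a cut-layer activation to a
    scalar loss, e.g. by sending the activation to the server and receiving
    the loss value back.
    """
    params = list(client_net.parameters())
    grad_est = [torch.zeros_like(p) for p in params]

    with torch.no_grad():                     # no autograd graph, no cached activations
        for _ in range(q):
            u = [torch.randn_like(p) for p in params]

            for p, d in zip(params, u):       # evaluate loss at w + mu*u
                p.add_(mu * d)
            loss_plus = loss_from_server(client_net(x))

            for p, d in zip(params, u):       # evaluate loss at w - mu*u
                p.sub_(2 * mu * d)
            loss_minus = loss_from_server(client_net(x))

            for p, d in zip(params, u):       # restore original weights
                p.add_(mu * d)

            coeff = (loss_plus - loss_minus) / (2 * mu * q)
            for g, d in zip(grad_est, u):     # accumulate two-point estimate
                g.add_(coeff * d)

        for p, g in zip(params, grad_est):    # SGD-style client update
            p.sub_(lr * g)
```

The server‑side half of the iteration is a standard first‑order step: it runs the received activation through the back‑end, back‑propagates through its own parameters, and returns the loss value the client needs.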
Results & Findings
| Task | Model | Baseline (FO‑SFL) | HERON‑SFL | Memory Reduction | Compute Reduction |
|---|---|---|---|---|---|
| Image classification (CIFAR‑10) | ResNet‑18 | 93.2 % acc | 93.0 % acc | ↓ 64 % | ↓ 33 % |
| LM fine‑tuning (GPT‑2 small) | GPT‑2‑124M | 84.5 ppl | 84.3 ppl | ↓ 58 % | ↓ 30 % |
- Accuracy Parity – Across all benchmarks, HERON‑SFL stays within 0.2 points of the full first‑order baseline.
- Memory Footprint – Peak client memory drops dramatically because no activation maps need to be stored for back‑propagation.
- Compute Savings – Each client step requires only forward passes (plus cheap perturbation generation), cutting FLOPs per iteration.
- Scalability Test – Training a ResNet‑50 (≈25 M parameters) on a Raspberry Pi‑class device becomes feasible, whereas vanilla SFL crashes due to OOM.
Ablation studies show that auxiliary‑network size and the number of ZO samples trade off overhead against estimator variance.
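As a heuristic for this trade‑off (a standard property of Gaussian‑smoothing estimators, not a bound quoted from the paper): the estimator’s error shrinks roughly like $1/q$ in the number of ZO samples and grows with the dimension being perturbed, which under the low‑effective‑rank assumption is the effective rank $r$ rather than the full parameter count $d$:

$$
\mathbb{E}\,\big\|\hat{\nabla}\ell(\mathbf{w}) - \nabla\ell(\mathbf{w})\big\|^2 \;\approx\; O\!\Big(\tfrac{r}{q}\Big)\,\|\nabla\ell(\mathbf{w})\|^2 \;+\; \text{smoothing bias},
$$

so more samples buy lower variance at the cost of extra forward passes per step.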
Practical Implications
- Edge‑AI Deployment – Companies can now push more sophisticated models (e.g., vision transformers, medium‑size LMs) to IoT devices, wearables, or smartphones without redesigning the model architecture.
- Reduced Bandwidth Costs – Fewer backward messages mean less server‑to‑client traffic per round, which is crucial for cellular or satellite‑connected devices.
- Energy Efficiency – Forward‑only computation consumes less power, extending battery life for on‑device learning scenarios (personalization, continual learning).
- Simplified SDKs – Developers can integrate HERON‑SFL into existing federated‑learning frameworks (TensorFlow Federated, PySyft) with minimal changes: swap the client optimizer for a ZO wrapper (see the sketch after this list).
- Regulatory & Privacy Benefits – By keeping more computation on‑device and limiting gradient leakage, the approach aligns well with privacy‑by‑design mandates (GDPR, HIPAA).
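As a concrete illustration of the “swap the client optimizer for a ZO wrapper” path above, here is a hedged sketch of an optimizer‑style facade around a single two‑point ZO step; the class name, interface, and defaults are hypothetical, not an API from the paper or from any federated‑learning framework:

```python
import torch

class ZOClientOptimizer:
    """Drop-in stand-in for a first-order optimizer on the client front-end.

    Exposes step(closure), where `closure` runs a forward pass and returns the
    scalar loss; backward() is never called. Names are hypothetical.
    """

    def __init__(self, params, lr=1e-3, mu=1e-3):
        self.params, self.lr, self.mu = list(params), lr, mu

    @torch.no_grad()
    def step(self, closure):
        u = [torch.randn_like(p) for p in self.params]
        for p, d in zip(self.params, u):          # evaluate at w + mu*u
            p.add_(self.mu * d)
        f_plus = closure()
        for p, d in zip(self.params, u):          # evaluate at w - mu*u
            p.sub_(2 * self.mu * d)
        f_minus = closure()
        scale = self.lr * (f_plus - f_minus) / (2 * self.mu)
        for p, d in zip(self.params, u):          # restore weights, then SGD-style step
            p.add_(self.mu * d)
            p.sub_(scale * d)
```

An existing client loop would then replace the usual `loss.backward(); optimizer.step()` pair with `optimizer.step(closure)`, where the closure forwards a batch through the client front‑end and returns the loss obtained from the server.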
Limitations & Future Work
- Zeroth‑Order Variance – Although the low‑rank assumption mitigates it, ZO estimators still introduce extra stochasticity, which may affect convergence speed on highly non‑convex tasks.
- Auxiliary Network Overhead – The helper network adds parameters and inference cost; finding the optimal size for diverse hardware remains an open engineering problem.
- Server Load – The server still performs full back‑propagation for the back‑end, which could become a bottleneck in massive‑scale deployments.
- Theoretical Scope – The convergence proof assumes smooth losses and bounded perturbations; extending it to non‑smooth objectives (e.g., quantized models) is future work.
- Broader Benchmarks – Experiments focus on image classification and LM fine‑tuning; evaluating HERON‑SFL on reinforcement learning, graph neural networks, or multimodal models would strengthen its generality.
Authors
- Zhoubin Kou
- Zihan Chen
- Jing Yang
- Cong Shen
Paper Information
- arXiv ID: 2601.09076v1
- Categories: cs.LG, cs.DC, cs.IT, cs.NI, eess.SP
- Published: January 14, 2026