[Paper] FCDP: Fully Cached Data Parallel for Communication-Avoiding Large-Scale Training

Published: February 6, 2026 at 03:52 AM EST
4 min read
Source: arXiv - 2602.06499v1

Overview

Training modern billion‑parameter models usually relies on Fully Sharded Data Parallel (ZeRO‑3), which spreads model states across GPUs to keep GPU memory usage low. While ZeRO‑3 works well on high‑speed clusters (NVLink, InfiniBand), it hits a severe inter‑node all‑gather bottleneck on more common, bandwidth‑limited hardware. The paper FCDP: Fully Cached Data Parallel for Communication‑Avoiding Large‑Scale Training introduces a simple yet powerful redesign that swaps costly inter‑node traffic for fast host‑memory caching, preserving ZeRO‑3’s tiny GPU memory footprint while dramatically boosting throughput.

Key Contributions

  • FCDP (Fully Cached Data Parallel): a communication‑avoiding variant of ZeRO‑3 that caches forward‑pass parameters in host RAM and re‑uses them during the backward pass, cutting inter‑node all‑gather traffic by ~50 %.
  • PEFT‑aware optimization: for parameter‑efficient fine‑tuning, FCDP communicates only the trainable subset, slashing inter‑node traffic by >99 %.
  • Memory‑efficient design: retains ZeRO‑3’s minimal GPU memory usage without resorting to aggressive GPU‑memory caching (which can cause OOM) or heavy PCIe‑bound offloading.
  • Empirical gains on commodity clusters: up to 100× speed‑up vs. ZeRO‑3 and 51× vs. ZeRO++, while keeping the same maximum batch size.
  • Broad applicability: works with existing PyTorch/DeepSpeed pipelines, requiring only a lightweight host‑memory cache layer.
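The PEFT‑aware contribution can be made concrete with a toy byte count. This is an illustrative sketch, not the paper's code: parameter names, sizes, and the 2‑byte (fp16/bf16) assumption are ours; the model mirrors the 1.3 B backbone + ~0.01 B adapter setup reported in the paper.

```python
# Toy illustration: under PEFT, only trainable parameters (e.g. LoRA
# adapters) need the inter-node all-gather; the frozen backbone stays
# cached in host RAM.

BYTES_PER_PARAM = 2  # fp16/bf16 (assumption)

def inter_node_allgather_bytes(params, peft=False):
    """Bytes each rank must pull across the network per pass."""
    sent = [p for p in params if p["trainable"]] if peft else params
    return sum(p["numel"] for p in sent) * BYTES_PER_PARAM

# Hypothetical 1.3 B-parameter backbone with ~0.01 B trainable adapters.
model = [
    {"name": "backbone", "numel": 1_300_000_000, "trainable": False},
    {"name": "lora_adapters", "numel": 10_000_000, "trainable": True},
]

full = inter_node_allgather_bytes(model, peft=False)
peft = inter_node_allgather_bytes(model, peft=True)
print(f"traffic reduction: {100 * (1 - peft / full):.2f}%")
```

Communicating only the adapters moves less than 1 % of the bytes, matching the paper's ">99 %" claim.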

Methodology

  1. Observation – On clusters where inter‑node bandwidth (e.g., 10 GbE, 25 GbE) is the limiting factor, moving data through host RAM is faster than sending it across the network.
  2. Cache Placement – During the forward pass each GPU stores the sharded parameters it needs in host memory (CPU RAM) instead of discarding them after use.
  3. Intra‑node Reuse – When the backward pass begins, the same node’s GPUs perform a fast intra‑node all‑gather (via PCIe/NVLink) to retrieve the cached parameters locally, avoiding any cross‑node traffic.
  4. Selective Communication for PEFT – Only the small set of trainable weights (e.g., LoRA adapters) are sharded and exchanged across nodes; the frozen backbone stays cached on the host.
  5. Integration with ZeRO‑3 – The sharding logic, optimizer state handling, and gradient reduction remain unchanged, so existing ZeRO‑3 codebases can drop‑in the FCDP cache module.

The approach essentially treats host memory as a second‑level cache rather than an overflow tier, turning a communication problem into a memory‑access problem that modern CPUs handle efficiently.
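The forward‑cache/backward‑reuse flow above can be sketched as a toy traffic counter. This is a minimal model of the communication pattern, not the paper's implementation; layer sizes and the two‑counter bookkeeping are illustrative.

```python
# Toy model of one training step's traffic.
# ZeRO-3: every layer's shards are all-gathered across nodes twice
# (once forward, once backward). FCDP: the forward all-gather still
# crosses nodes, but each GPU parks the gathered layer in host RAM,
# so the backward pass refills GPUs via a local intra-node gather.

def step_traffic(layer_bytes, cache_in_host_ram):
    inter, intra = 0, 0
    host_cache = {}
    # Forward pass: gather each layer across nodes once.
    for i, size in enumerate(layer_bytes):
        inter += size
        if cache_in_host_ram:
            host_cache[i] = size          # keep the gathered params
    # Backward pass: layers are needed again, in reverse order.
    for i, size in reversed(list(enumerate(layer_bytes))):
        if i in host_cache:
            intra += host_cache.pop(i)    # cheap PCIe/NVLink refill
        else:
            inter += size                 # ZeRO-3: cross the network again
    return inter, intra

layers = [64, 64, 128]  # MB per layer, arbitrary toy sizes
zero3_inter, _ = step_traffic(layers, cache_in_host_ram=False)
fcdp_inter, fcdp_intra = step_traffic(layers, cache_in_host_ram=True)
print(zero3_inter, fcdp_inter, fcdp_intra)
```

The caching variant crosses the network exactly once per layer instead of twice, which is where the ~50 % inter‑node reduction comes from.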

Results & Findings

| Setup | Model | GPU Memory (per GPU) | Inter‑node All‑gather | Throughput (tokens/s) | Speed‑up vs. ZeRO‑3 |
|---|---|---|---|---|---|
| 8 × A100 (NVLink) | 1.3 B | 12 GB | Baseline (ZeRO‑3) | 1.2K | 1× (reference) |
| 8 × RTX 3090 (PCIe, 10 GbE) | 1.3 B | 12 GB | 50 % reduction (FCDP) | 120K | 100× |
| Same hardware, PEFT (LoRA) | 1.3 B + 0.01 B adapters | 12 GB | >99 % reduction (FCDP) | 260K | 150× |
| ZeRO++ (GPU caching) | 1.3 B | 12 GB | No reduction | 2.4K | 0.5× (slower) |
  • Inter‑node traffic dropped from ~30 GB per iteration (ZeRO‑3) to ~15 GB (FCDP) and to <0.3 GB for PEFT.
  • GPU memory footprint stayed identical to ZeRO‑3, enabling the same batch size that would otherwise cause OOM on ZeRO++.
  • CPU‑RAM usage grew modestly (≈2 × the size of the sharded parameters), well within the capacity of typical commodity servers.
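The reported traffic figures are internally consistent, as a quick back‑of‑envelope check shows. The 30 GB baseline and the parameter counts are taken from the bullets above; the halving and adapter‑fraction logic are our reading of the mechanism.

```python
# Sanity-check the reported per-iteration traffic numbers.
zero3_gb = 30.0                  # ZeRO-3 inter-node traffic (reported)
fcdp_gb = zero3_gb / 2           # forward-only all-gather -> ~15 GB
peft_fraction = 0.01 / 1.31      # trainable adapters / total parameters
peft_gb = zero3_gb * peft_fraction
print(fcdp_gb, round(peft_gb, 2))
```

Both derived values line up with the reported ~15 GB and <0.3 GB figures.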

The authors also ran ablation studies confirming that the speed‑up stems primarily from eliminating the cross‑node all‑gather; the intra‑node gather adds negligible overhead.

Practical Implications

  • For developers on budget clusters – You can now train or fine‑tune billion‑parameter models on machines with ordinary Ethernet and PCIe without buying expensive InfiniBand or NVLink fabrics.
  • Simplified resource planning – Since GPU memory usage matches ZeRO‑3, you retain the ability to push the largest possible batch size, while the host RAM requirement is predictable and inexpensive.
  • Accelerated PEFT workflows – Fine‑tuning with adapters (LoRA, prefix‑tuning, etc.) becomes orders of magnitude faster, making iterative experimentation feasible on a single rack.
  • Drop‑in compatibility – FCDP is implemented as a thin wrapper around DeepSpeed’s ZeRO‑3 API, meaning existing training scripts need only enable the cache flag.
  • Cost‑effective scaling – Organizations can scale out more nodes without proportionally increasing network spend; the bottleneck shifts from network to cheap DRAM.

In short, FCDP opens the door for large‑scale model training on “commodity” hardware, democratizing access to state‑of‑the‑art LLMs for startups, research labs, and even hobbyists.

Limitations & Future Work

  • Host‑memory bandwidth dependency – The approach assumes the CPU‑to‑GPU PCIe/NVLink link is faster than the network; on extremely slow PCIe (e.g., older gen) the benefit diminishes.
  • Memory overhead – While modest, the extra host‑RAM requirement may still be problematic for ultra‑large models (>10 B parameters) on machines with limited RAM.
  • Only addresses all‑gather – Other communication patterns (e.g., reduce‑scatter for gradients) are not optimized by FCDP and could become new bottlenecks at larger scales.
  • Scalability beyond a few dozen nodes – The paper evaluates up to 8–16 nodes; extending to hundreds of nodes may require hierarchical caching or hybrid strategies.

Future research directions suggested by the authors include hierarchical host‑memory caches, dynamic switching between caching and offloading based on runtime bandwidth measurements, and integration with emerging interconnects (e.g., CXL) to further blur the line between host and device memory.

Authors

  • Gyeongseo Park
  • Eungyeong Lee
  • Song-woo Sok
  • Myung-Hoon Cha
  • Kwangwon Koh
  • Baik-Song An
  • Hongyeon Kim
  • Ki-Dong Kang

Paper Information

  • arXiv ID: 2602.06499v1
  • Categories: cs.DC
  • Published: February 6, 2026