[Paper] Faster Distributed Inference-Only Recommender Systems via Bounded Lag Synchronous Collectives

Published: December 22, 2025 at 07:36 AM EST
4 min read
Source: arXiv - 2512.19342v1

Overview

Deep learning recommender models (DLRMs) power the personalized feeds you see on platforms like Netflix, TikTok, and Amazon. The paper introduces Bounded Lag Synchronous (BLS) all‑to‑all communication, a drop‑in replacement for the traditional all‑to‑all collective that lets slower nodes fall behind by a configurable amount without stalling the whole inference pipeline, while preserving inference accuracy.

Key Contributions

  • BLS All‑to‑All Primitive: A new collective operation that bounds the lag of slower processes, turning a fully synchronous barrier into a tunable, partially asynchronous step.
  • PyTorch Distributed Backend Integration: Implementation of BLS as a native backend, enabling easy adoption in existing PyTorch DLRM codebases.
  • Empirical Evaluation on Real‑World DLRM Workloads: Demonstrates that BLS yields significant latency and throughput gains for unbalanced embedding lookups (irregular access patterns, heterogeneous hardware, or network jitter).
  • Preservation of Model Accuracy: Shows that, for inference‑only scenarios, the bounded lag does not affect the final recommendation scores.
  • Guidelines for When BLS Helps: Provides a clear decision matrix indicating that BLS shines in workloads with skewed embedding access or variable per‑process delays, but offers little benefit for perfectly balanced runs.
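The decision matrix in the last bullet boils down to a one‑line heuristic. The sketch below is an illustrative reading of that guidance only; the function name and the threshold value are placeholders of my own, not quantities from the paper.

```python
def bls_likely_helps(access_skew: float, delay_variability: float,
                     threshold: float = 0.1) -> bool:
    """Toy reading of the paper's guidance: BLS pays off when embedding
    access is skewed or per-worker delays vary; `threshold` is a
    placeholder, not a value from the paper."""
    return access_skew > threshold or delay_variability > threshold

print(bls_likely_helps(0.3, 0.0))  # skewed lookups -> True
print(bls_likely_helps(0.0, 0.0))  # perfectly balanced run -> False
```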

Methodology

  1. Problem Formalization – The authors model the distributed embedding lookup as an irregular all‑to‑allv operation where each worker sends a variable‑size list of embedding indices to every other worker.
  2. Bounded Lag Design – Instead of a global barrier after each all‑to‑all, each worker proceeds after it has received data from all but a configurable subset of peers. The lag bound (e.g., “allow up to 2 iterations of delay”) is enforced by a lightweight credit‑based token system.
  3. Implementation – The BLS primitive is built on top of NCCL/UCX primitives and exposed through a custom PyTorch torch.distributed backend. The reference DLRM inference code is modified only to replace the standard all_to_all call with bls_all_to_all.
  4. Experimental Setup – Experiments run on a cluster of 8–64 GPU nodes with realistic embedding table sizes (tens of GB) and synthetic/real click‑through datasets that induce varying degrees of access skew. Metrics captured: per‑batch latency, overall throughput (queries/sec), and recommendation accuracy (AUC).
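The bounded‑lag rule in step 2 can be modeled as a toy scheduler: a worker may start iteration t only after every peer has finished iteration t − 1 − lag_bound, so lag_bound = 0 recovers the ordinary synchronous barrier. This is an illustrative simulation of the timing semantics, not the authors' NCCL/UCX‑based implementation.

```python
# Toy model of bounded-lag synchronisation (illustrative; not the paper's
# NCCL/UCX-based implementation). A worker may start iteration t only after
# every peer has finished iteration t - 1 - lag_bound; lag_bound = 0 is the
# ordinary synchronous barrier.

def completion_times(delays, lag_bound):
    """delays[w][t]: time worker w needs for iteration t.
    Returns finish[w][t]: when worker w completes iteration t."""
    n_workers, n_iters = len(delays), len(delays[0])
    finish = [[0.0] * n_iters for _ in range(n_workers)]
    for t in range(n_iters):
        dep = t - 1 - lag_bound  # oldest iteration every peer must have done
        for w in range(n_workers):
            start = finish[w][t - 1] if t > 0 else 0.0
            if dep >= 0:
                start = max(start, max(finish[p][dep] for p in range(n_workers)))
            finish[w][t] = start + delays[w][t]
    return finish

# Jitter that moves between workers: in iteration t, worker t % 3 is 4x slower.
delays = [[4 if t % 3 == w else 1 for t in range(9)] for w in range(3)]
sync = completion_times(delays, lag_bound=0)
bls = completion_times(delays, lag_bound=2)
print(max(r[-1] for r in sync), max(r[-1] for r in bls))  # -> 36.0 18.0
```

With a straggler that moves from worker to worker, faster workers run ahead and absorb the jitter, halving the makespan; with one permanently slow worker, that worker's total work still bounds the makespan. This mirrors the paper's observation that BLS helps under skew and jitter, not under uniform slowness.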

Results & Findings

| Scenario | Baseline (sync all‑to‑all) | BLS all‑to‑all | Speed‑up | Accuracy impact |
| --- | --- | --- | --- | --- |
| Balanced access (uniform embedding hits) | 1.12 ms / batch | 1.09 ms / batch | ~2.7 % | None |
| Skewed access (10 % hot embeddings) | 2.84 ms / batch | 1.71 ms / batch | 1.66× | None |
| Heterogeneous node latency (network jitter up to 30 ms) | 3.45 ms / batch | 1.92 ms / batch | 1.80× | None |
| Large‑scale (64 GPUs, 4 TB table) | 5.23 ms / batch | 2.87 ms / batch | 1.82× | None |
  • Latency drops dramatically when some workers experience delays; the lag bound lets faster workers keep processing, effectively “masking” stragglers.
  • Throughput (queries per second) improves proportionally because the pipeline stays fuller.
  • Model quality (AUC, NDCG) remains identical to the synchronous baseline, confirming that bounded lag does not introduce stale embeddings that would affect inference results.
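The speed‑up column in the results table is simply the ratio of baseline to BLS per‑batch latency; a quick check against the reported numbers:

```python
# Reproduce the speed-up column of the results table (baseline / BLS latency).
rows = {
    "balanced access": (1.12, 1.09),
    "skewed access": (2.84, 1.71),
    "node jitter": (3.45, 1.92),
    "64-GPU large scale": (5.23, 2.87),
}
for name, (baseline_ms, bls_ms) in rows.items():
    print(f"{name}: {baseline_ms / bls_ms:.2f}x")  # e.g. "skewed access: 1.66x"
```

The balanced case works out to 1.03×, i.e., the ~2.7 % gain in the table; the other scenarios match the reported 1.66×, 1.80×, and 1.82×.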

Practical Implications

  • Production Deployments – Companies can drop the BLS backend into existing PyTorch DLRM services without retraining models, gaining up to ~2× inference speed on skewed workloads.
  • Cost Savings – Faster inference translates to lower GPU‑hour consumption and headroom to meet tighter latency SLAs, directly impacting cloud spend and user experience.
  • Robustness to Heterogeneity – In multi‑tenant clusters where some nodes are noisier (e.g., shared networking), BLS prevents a single noisy neighbor from throttling the whole recommendation pipeline.
  • Simplified Scaling – When scaling out to more GPUs, the probability of uneven embedding distribution rises; BLS automatically adapts, reducing the need for manual load‑balancing heuristics.
  • Open‑Source Integration – Because the primitive lives in the PyTorch distributed stack, developers can experiment with different lag bounds via a simple config flag, enabling rapid A/B testing.

Limitations & Future Work

  • Benefit Limited to Unbalanced Scenarios – In perfectly balanced inference runs, BLS offers marginal gains, so the extra implementation complexity may not be justified.
  • Static Lag Bound – The current design uses a fixed lag parameter; adaptive schemes that react to runtime metrics could further improve performance.
  • Training Not Covered – The authors focus on inference‑only pipelines; extending BLS to training (where gradient consistency matters) remains an open challenge.
  • Hardware Specificity – Experiments rely on NCCL/UCX; performance on alternative interconnects (e.g., RoCE, NVLink‑only clusters) needs validation.

Bottom line: Bounded Lag Synchronous all‑to‑all is a pragmatic, low‑effort optimization for large‑scale recommendation inference that can shave off milliseconds per request and double throughput when your embedding accesses are anything but uniform. Developers looking to squeeze more performance out of existing PyTorch DLRM deployments should give it a try.

Authors

  • Kiril Dichev
  • Filip Pawlowski
  • Albert‑Jan Yzelman

Paper Information

  • arXiv ID: 2512.19342v1
  • Categories: cs.DC, cs.LG
  • Published: December 22, 2025