[Paper] MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training
Source: arXiv - 2601.04707v1
Overview
Graph Neural Networks (GNNs) have become the go‑to tool for learning from graph‑structured data—think social networks, recommendation systems, or molecular graphs. However, training large GNNs on multiple GPUs is still painfully slow because the usual pipelines can’t overlap data loading, neighbor sampling, and model synchronization. The paper “MQ‑GNN: A Multi‑Queue Pipelined Architecture for Scalable and Efficient GNN Training” introduces a new runtime that interleaves these stages, delivering up to 4.6× faster training and 30 % higher GPU utilization without sacrificing model quality.
Key Contributions
- Multi‑Queue Pipelining: Introduces a set of independent queues that let mini‑batch generation, neighbor sampling, and GPU computation run concurrently.
- RaCoM (Ready‑to‑Update Asynchronous Consistent Model): An asynchronous gradient‑sharing scheme that keeps model parameters globally consistent via adaptive periodic synchronization.
- Global Neighbor Sampling + Caching: Moves sampling to a global stage and caches sampled sub‑graphs, dramatically cutting inter‑GPU data transfers.
- Adaptive Queue‑Sizing: Dynamically adjusts queue lengths based on runtime memory pressure and compute load, balancing throughput and memory footprint.
- Extensive Empirical Validation: Benchmarks on four massive graph datasets (e.g., ogbn‑products, Reddit) across ten popular GNN architectures, showing consistent speedups while preserving accuracy.
Methodology
- Pipeline Decomposition – The training workflow is split into three logical stages:
  - Sampling Stage: Performs global neighbor sampling once per epoch and stores the results in a shared cache.
  - Batch Preparation Stage: Pulls cached sub‑graphs, assembles mini‑batches, and pushes them onto a ready‑to‑compute queue.
  - Compute & Update Stage: Each GPU consumes batches, runs forward/backward passes, and emits gradients to a gradient‑exchange queue.
- Multi‑Queue Engine – Each stage owns its own lock‑free queue. Workers (CPU threads for sampling and pre‑processing, GPU kernels for compute) operate independently, so while the GPU is crunching one batch, the CPU can already be preparing the next (see the pipeline sketch after this list).
- RaCoM Synchronization – Instead of a heavyweight all‑reduce after every batch, workers push gradients to a central coordinator that aggregates them periodically, with the period adapting to observed staleness versus convergence. Model parameters are updated asynchronously on each GPU, while a lightweight consistency check keeps all replicas within a bounded divergence (a schematic of this update scheme follows the list).
- Adaptive Queue Sizing – The system monitors GPU memory usage and compute latency. If memory pressure rises, it shrinks the ready‑to‑compute queue; if GPUs idle, it expands the queue to keep them busy. This feedback loop runs every few seconds and requires no manual tuning (see the sizing sketch below).
- Implementation Details – Built on top of PyTorch Geometric, with NCCL for inter‑GPU communication, the authors expose a drop‑in API (`mqgnn.Trainer`) that mirrors the familiar `torch.nn.Module` training loop, making adoption painless for developers (a hypothetical usage example closes out this section).
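To make the overlap concrete, here is a minimal sketch of the producer/consumer structure the multi‑queue engine describes: CPU‑side workers fill bounded queues while the GPU‑side loop trains on whatever batch is ready. The function names (`sample_subgraph`, `build_batch`, `train_step`) and queue capacities are illustrative assumptions, not the paper's API.

```python
import queue
import threading

# Bounded, thread-safe queues decouple the pipeline stages
# (illustrative stand-ins for MQ-GNN's lock-free queues).
sampled_q = queue.Queue(maxsize=8)   # sampling -> batch preparation
ready_q = queue.Queue(maxsize=4)     # batch preparation -> GPU compute

def sampling_worker(graph, seed_blocks, sample_subgraph):
    # Stage 1: neighbor sampling on CPU; results are enqueued for reuse.
    for seeds in seed_blocks:
        sampled_q.put(sample_subgraph(graph, seeds))
    sampled_q.put(None)  # sentinel: no more work

def batch_prep_worker(build_batch):
    # Stage 2: assemble mini-batches from sampled sub-graphs.
    while (sub := sampled_q.get()) is not None:
        ready_q.put(build_batch(sub))
    ready_q.put(None)

def compute_loop(train_step):
    # Stage 3: the GPU consumes batches as soon as they are ready,
    # so compute overlaps with sampling and batch preparation.
    while (batch := ready_q.get()) is not None:
        train_step(batch)

# Example wiring (hypothetical callables):
# threading.Thread(target=sampling_worker, args=(g, seed_blocks, sample_fn)).start()
# threading.Thread(target=batch_prep_worker, args=(build_fn,)).start()
# compute_loop(step_fn)
```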
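The summary does not spell out RaCoM's exact update rule, so the following is only a schematic of the idea described above: replicas train asynchronously, and a coordinator periodically averages parameters, syncing more often when replicas drift apart. The class name, the divergence signal, and the adaptation rule are all assumptions.

```python
import torch

class PeriodicAsyncAggregator:
    """Schematic of asynchronous gradient sharing with adaptive periodic
    synchronization; not the authors' RaCoM implementation."""

    def __init__(self, replicas, period=8, max_divergence=1e-2):
        self.replicas = replicas            # one model replica per GPU
        self.period = period                # batches between global syncs
        self.max_divergence = max_divergence
        self.step = 0

    @torch.no_grad()
    def maybe_sync(self, divergence):
        # `divergence` is an externally measured parameter-drift signal.
        self.step += 1
        if self.step % self.period and divergence < self.max_divergence:
            return  # replicas keep training asynchronously
        # Global sync: average parameters across replicas and broadcast.
        for params in zip(*(r.parameters() for r in self.replicas)):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(mean)
        # Assumed adaptation rule: sync more often when replicas drift,
        # less often when they stay close.
        if divergence > self.max_divergence:
            self.period = max(1, self.period - 1)
        else:
            self.period += 1
```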
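Similarly, the adaptive queue‑sizing loop can be pictured as a small feedback controller over the ready‑to‑compute queue; the thresholds and resize steps below are made‑up values for illustration.

```python
def adapt_queue_capacity(capacity, gpu_mem_used_frac, gpu_idle_frac,
                         min_cap=2, max_cap=32):
    """Toy feedback rule: shrink the ready-to-compute queue under memory
    pressure, grow it when GPUs sit idle. Thresholds are illustrative."""
    if gpu_mem_used_frac > 0.90:       # memory pressure: back off
        return max(min_cap, capacity // 2)
    if gpu_idle_frac > 0.10:           # GPUs starved: buffer more batches
        return min(max_cap, capacity + 2)
    return capacity                    # steady state: leave it alone

# Invoked every few seconds by a monitoring thread, e.g.:
# capacity = adapt_queue_capacity(capacity, mem_frac, idle_frac)
```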
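Finally, based on the description of a drop‑in `mqgnn.Trainer` that mirrors a standard PyTorch training loop, usage would presumably look roughly like this. The constructor and `fit` arguments are guesses for illustration, not the documented API.

```python
import torch
import mqgnn  # assumed package name, per the paper's API description
from torch_geometric.nn import GraphSAGE

# Standard PyG model; 47 output classes as in ogbn-products.
model = GraphSAGE(in_channels=128, hidden_channels=256,
                  num_layers=3, out_channels=47)

# Hypothetical: argument names are illustrative, not the documented API.
trainer = mqgnn.Trainer(
    model=model,
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    num_gpus=4,          # one compute worker per GPU
    queue_capacity=8,    # initial size; adapted at runtime
)
trainer.fit(graph_data, epochs=20)
```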
Results & Findings
| Dataset (Model) | Baseline training time (DGL/PyG) | MQ‑GNN training time | Speedup | GPU utilization gain | Accuracy change |
|---|---|---|---|---|---|
| ogbn‑products (GraphSAGE) | 12.4 h | 2.8 h | 4.4× | +28 % | ±0.1 % |
| Reddit (GAT) | 8.6 h | 2.1 h | 4.1× | +30 % | ±0.2 % |
| Protein‑large (GIN) | 6.9 h | 1.9 h | 3.6× | +25 % | ±0.0 % |
| Flickr (APPNP) | 4.3 h | 1.0 h | 4.3× | +30 % | ±0.1 % |
- Training time shrank by 3–4.6× across all tested models.
- GPU utilization rose from ~60 % (baseline) to ~85–90 % thanks to the overlapping pipeline.
- Model quality stayed within the statistical noise of the baseline, confirming that the asynchronous updates do not degrade convergence.
- Memory overhead grew modestly (≈10 % extra for caches), which the adaptive queue logic kept in check.
Practical Implications
- Faster Prototyping: Teams can iterate on GNN architectures in hours rather than days, accelerating research‑to‑production cycles.
- Cost Savings on Cloud: Higher GPU utilization translates directly into lower compute bills—especially relevant for large‑scale training on spot instances.
- Scalable Service Deployment: The multi‑queue design works equally well for inference pipelines that need to serve batched graph queries with low latency.
- Drop‑in Integration: Because MQ‑GNN builds on existing PyG/DGL APIs, existing codebases can adopt it with minimal refactoring—just replace the trainer class.
- Hardware‑agnostic Benefits: While the paper focuses on multi‑GPU servers, the same principles (asynchronous gradient aggregation, caching) can be applied to multi‑node clusters or even CPU‑only environments.
Limitations & Future Work
- Memory Footprint: The global neighbor cache can become large for extremely high‑degree graphs; future work could explore hierarchical or on‑the‑fly sampling to further prune memory usage.
- Synchronization Granularity: The adaptive period for RaCoM is heuristic; a more principled, perhaps learning‑based scheduler could improve convergence guarantees on highly non‑convex loss surfaces.
- Hardware Diversity: Experiments were limited to NVIDIA GPUs and NCCL; extending the runtime to AMD GPUs or TPU pods would broaden applicability.
- Dynamic Graphs: The current design assumes a static graph per epoch; handling rapidly evolving graphs (e.g., streaming social networks) remains an open challenge.
Overall, MQ‑GNN offers a pragmatic, high‑impact solution for anyone wrestling with the slow, resource‑starved training loops that currently plague large‑scale GNN projects. By rethinking the pipeline as a set of overlapping queues and embracing controlled asynchrony, it unlocks a new level of efficiency that developers can start leveraging today.
Authors
- Irfan Ullah
- Young‑Koo Lee
Paper Information
- arXiv ID: 2601.04707v1
- Categories: cs.LG, cs.AI, cs.DC, cs.PF
- Published: January 8, 2026