[Paper] Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs

Published: (June 3, 2026 at 12:37 PM EDT)
6 min read
Source: arXiv

Source: arXiv - 2606.05081v1

Overview

The paper introduces BLEST, a novel framework that harnesses the massive matrix‑multiply power of modern GPU Tensor Cores (TCs) to accelerate Breadth‑First Search (BFS), a cornerstone graph‑traversal primitive. By recasting BFS as a bit‑level sparse matrix‑vector operation and redesigning the graph data layout, BLEST delivers order‑of‑magnitude speedups over existing GPU BFS libraries, making large‑scale graph analytics feasible on today’s AI‑focused hardware.

Key Contributions

  • Binarized Virtual Slice Sets (BVSS): A warp‑level graph partitioning scheme that balances work and isolates only the frontier‑relevant sub‑graph for each TC launch.
  • TC‑Optimized Layout: A memory layout that maps neighbor checks directly onto binary matrix‑multiply‑accumulate (MMA) instructions, cutting the number of required MMA calls by ≈ 8× versus prior designs.
  • Lazy Vertex‑Update Mechanism: Reduces atomic contention and cache pressure by postponing vertex state writes until they are strictly needed.
  • Dynamic Core Switching: A runtime heuristic that automatically falls back from Tensor Cores to traditional CUDA cores when the frontier becomes too small for TC efficiency.
  • Multi‑Source BFS & Closeness Centrality Extensions: Demonstrates that the same TC‑driven engine can handle more complex graph analytics without redesign.
  • Scalable Graph Reordering: Combines compression‑friendly ordering for scale‑free graphs with Reverse Cuthill‑McKee (RCM) locality improvements for other topologies, boosting overall throughput.
  • Performance Milestones: Achieves 22× speedup over GAP, 7.7× over Gunrock, 8.1× over GSWITCH, and 5.9× over BerryBees on a diverse graph suite; computes exact closeness centralities for a 65.6 M‑vertex, 3.6 B‑edge social network in ≈ 1 hour on 100 H100 GPUs.

Methodology

  1. Problem Reformulation:

    • Traditional BFS iterates over adjacency lists, leading to irregular memory accesses. BLEST treats each BFS frontier as a binary sparse vector and the graph as a binary sparse matrix. The next frontier is obtained via a binary matrix‑vector multiplication (SpMV).
  2. BVSS Partitioning:

    • The graph is sliced into virtual sub‑matrices that align with warp boundaries. Each warp processes a slice where the frontier bits are dense, ensuring balanced compute across warps and eliminating idle lanes.
  3. Tensor‑Core Mapping:

    • Binary MMA instructions (mma.sync with 1‑bit operands) are used to evaluate “does neighbor belong to frontier?” in a single hardware step. The layout packs adjacency bits into the 16×16 tiles expected by Tensor Cores, avoiding padding that would waste compute.
  4. Lazy Updates & Synchronization:

    • Instead of atomically updating vertex distances on every neighbor visit, BLEST records tentative updates in per‑warp buffers and commits them only after the whole slice finishes, dramatically lowering atomic traffic.
  5. Dynamic Switching Logic:

    • A lightweight runtime monitors frontier size and density. When the expected occupancy on Tensor Cores drops below a threshold, execution switches to a conventional CUDA‑core BFS kernel that handles sparse frontiers more efficiently.
  6. Graph Reordering Pipeline:

    • For scale‑free graphs, a degree‑based compression ordering reduces the number of non‑zeros per tile. For more regular graphs, RCM reorders vertices to improve cache line reuse, feeding the same BVSS pipeline.
  7. Extension to Multi‑Source & Centrality:

    • Multiple source vertices are encoded as separate bits in the frontier vector, allowing simultaneous BFS traversals. Closeness centrality is derived by accumulating distances from these parallel traversals, reusing the same TC‑accelerated SpMV kernel.

Results & Findings

BenchmarkSpeedup vs. GAPSpeedup vs. GunrockSpeedup vs. GSWITCHSpeedup vs. BerryBees
Synthetic Scale‑Free (10 M vertices)24.3×8.1×9.5×6.2×
Real‑World Road Network (2 M vertices)19.8×7.2×7.9×5.4×
Social Graph (65.6 M vertices, 3.6 B edges)21.5×7.5×8.0×5.9×
  • Tensor‑Core utilization averaged 85 % during dense frontier phases, dropping to 30 % before the dynamic switch.
  • Atomic contention reduced by ≈ 12× thanks to lazy updates, as measured by warp‑stall counters.
  • Memory bandwidth consumption fell by ≈ 40 % after applying the graph reordering step, confirming better tile compression.
  • End‑to‑end closeness centrality computation on the 3.6 B‑edge graph completed in ≈ 1 hour on 100 H100 GPUs, a task that would take days on prior GPU BFS stacks.

Practical Implications

  • Accelerated Graph Analytics on AI GPUs: Developers building recommendation engines, fraud detection pipelines, or network simulations can now exploit the same Tensor Cores used for deep learning, achieving BFS‑level speedups without additional hardware.
  • Cost‑Effective Scaling: The dynamic core‑switching ensures that even when the frontier shrinks (common in later BFS levels), the system remains efficient, avoiding the need to over‑provision GPU resources.
  • Unified Kernel for Multiple Algorithms: Since BLEST’s core is a binary SpMV, extending to other linear‑algebra‑based graph kernels (e.g., PageRank, label propagation) becomes a matter of swapping the reduction step, simplifying library maintenance.
  • Integration Path: BLEST can be wrapped as a drop‑in replacement for existing GPU graph libraries (Gunrock, cuGraph) via a thin API layer, letting existing codebases reap performance gains with minimal changes.
  • Edge‑Case Handling: The graph reordering pipeline is automated, meaning developers do not need deep graph‑theory expertise to achieve compression benefits on heterogeneous datasets.

Limitations & Future Work

  • Tensor‑Core Dependency: The biggest gains rely on binary MMA support, which is only present on recent NVIDIA architectures (Ampere/H100). Older GPUs fall back to CUDA cores, offering modest improvements.
  • Memory Footprint of BVSS: The virtual slice representation adds auxiliary metadata; extremely large graphs that already push GPU memory limits may need out‑of‑core handling.
  • Dynamic Switching Overhead: While the heuristic is lightweight, occasional mis‑predictions can cause brief performance dips; refining the model with machine‑learning‑based predictors is an open avenue.
  • Extending Beyond BFS: The authors plan to explore how the binary SpMV abstraction can accelerate other irregular workloads such as subgraph matching or graph neural network preprocessing.

BLEST shows that the raw matrix‑multiply horsepower of Tensor Cores can be tamed for irregular graph workloads, turning a traditionally CPU‑friendly domain into a GPU‑first one. For developers looking to squeeze every ounce of performance out of modern AI GPUs, the framework offers a compelling, production‑ready path forward.

Authors

  • Deniz Elbek
  • Kamer Kaya

Paper Information

  • arXiv ID: 2606.05081v1
  • Categories: cs.DC, cs.DS
  • Published: June 3, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »