[Paper] BLEST: Blazingly Efficient BFS using Tensor Cores

Published: December 26, 2025 at 05:30 AM EST
4 min read
Source: arXiv - 2512.21967v1

Overview

The paper introduces BLEST, a novel framework that harnesses NVIDIA’s Tensor Cores—hardware originally designed for dense matrix‑multiply‑accumulate (MMA) operations—to accelerate the classic Breadth‑First Search (BFS) graph kernel. By re‑thinking the BFS pipeline and carefully mapping edge‑level work onto the bitwise capabilities of Tensor Cores, the authors achieve up to 5× speed‑up over state‑of‑the‑art GPU BFS libraries on a wide variety of real‑world graphs.

Key Contributions

  • Bitmap‑oriented pull‑based BFS: Reformulates BFS using a compact bitmap frontier, enabling efficient warp‑level parallelism.
  • Binarised Virtual Slice Sets (BVSS): A load‑balancing scheme that assigns work at the warp level, eliminating frontier‑oblivious (i.e., unnecessary) work.
  • Graph reordering heuristics:
    • Compression‑oriented ordering for social‑network‑like graphs (high locality, many small‑degree vertices).
    • Bandwidth‑reducing ordering for heterogeneous, non‑social graphs.
  • Batched SpMSpV on Tensor Cores: Implements a sparse‑matrix‑sparse‑vector multiplication pattern that exploits the bitwise TC tiles, dramatically cutting the number of MMA calls.
  • Kernel fusion + lazy vertex updates: Merges multiple BFS stages into a single kernel and postpones vertex updates until necessary, reducing host‑GPU synchronisation and atomic contention.

Methodology

  1. Pull‑based BFS with bitmap frontier – Instead of the traditional push model (expanding outward from the current frontier), BLEST “pulls”: each unvisited vertex checks whether any of its neighbours is marked in a compact bitmap of active frontier nodes. This representation is naturally suited to bitwise operations (see the first sketch after this list).

  2. Warp‑level work slicing (BVSS) – The adjacency list of each vertex is split into virtual slices; each warp processes a slice that contains roughly the same number of active edges, yielding balanced execution with minimal thread divergence (see the second sketch after this list).

  3. Graph reordering

    • Compression‑oriented: Vertices are reordered to cluster high‑degree nodes together, densifying the frontier bitmap tiles and enabling better compression on the Tensor Cores.
    • Bandwidth‑reducing: A classic reverse‑Cuthill‑McKee‑style ordering reduces the adjacency‑matrix bandwidth (how far apart each vertex's neighbour indices are spread), which improves memory coalescing on irregular graphs (see the third sketch after this list).
  4. Batched SpMSpV on Tensor Cores – The core computation is expressed as a series of bitwise dot‑products between a sparse adjacency‑matrix slice and the frontier bitmap. Each dot‑product maps to a TC tile (e.g., 8×8 bits) and is executed via a single MMA instruction, avoiding the creation of zero‑filled output entries that would otherwise waste cycles (see the fourth sketch after this list).

  5. Kernel fusion & lazy updates – The BFS “visit”, “update distance”, and “frontier generation” steps are fused into one kernel launch. Vertex distance updates are deferred until the end of the kernel, which cuts down on atomic writes and improves L2 cache reuse.
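
The pull step in (1) can be made concrete with a short CUDA sketch. This is not BLEST's kernel, only a minimal illustration assuming a CSR graph, a frontier bitmap packed into 32‑bit words, and illustrative names (`bfs_pull_level`, `dist`, `changed`): during one level, every unvisited vertex probes its neighbours against the frontier bitmap.

```cuda
// Minimal illustrative CUDA kernel for one pull-based BFS level (not BLEST's
// actual implementation). The graph is in CSR form; `frontier` and `next` are
// bitmaps with one bit per vertex; `dist` holds BFS levels (-1 = unvisited).
__global__ void bfs_pull_level(int n, const int *row_ptr, const int *col_idx,
                               const unsigned *frontier, unsigned *next,
                               int *dist, int level, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || dist[v] != -1) return;             // only unvisited vertices pull

    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
        int u = col_idx[e];
        // Test bit u of the frontier bitmap (32 vertices per word).
        if (frontier[u >> 5] & (1u << (u & 31))) {
            dist[v] = level;                          // adopt the current level
            atomicOr(&next[v >> 5], 1u << (v & 31));  // mark v in the next frontier
            *changed = 1;                             // benign race: every writer stores 1
            break;                                    // one active neighbour is enough
        }
    }
}
```

Because a vertex stops at its first active neighbour, per‑vertex work shrinks as the frontier densifies, which is the usual argument for pulling rather than pushing on large frontiers.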
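
The warp‑level load balancing in (2) can likewise be sketched on the host side: the CSR edge array is cut into fixed‑size virtual slices so that every warp receives about the same number of edges regardless of degree skew. The slice size, struct, and `build_slices` helper below are assumptions made for illustration, not the paper's exact BVSS scheme.

```cuda
#include <algorithm>
#include <vector>

// Host-side sketch of virtual work slicing (illustrative, not BLEST's BVSS).
// Slice i covers edges [i*slice_edges, (i+1)*slice_edges) of the CSR edge
// array and remembers the vertex whose adjacency list it starts in, so that
// each warp can later be handed one slice with a roughly equal edge count.
struct VirtualSlice {
    int first_vertex;   // vertex owning the slice's first edge
    int edge_begin;     // global index of the slice's first edge
    int edge_end;       // one past the slice's last edge
};

std::vector<VirtualSlice> build_slices(const std::vector<int> &row_ptr,
                                       int slice_edges) {
    std::vector<VirtualSlice> slices;
    const int n = static_cast<int>(row_ptr.size()) - 1;
    const int m = row_ptr[n];                     // total number of edges
    int v = 0;
    for (int begin = 0; begin < m; begin += slice_edges) {
        while (row_ptr[v + 1] <= begin) ++v;      // advance to the owning vertex
        slices.push_back({v, begin, std::min(begin + slice_edges, m)});
    }
    return slices;
}
```

On the device, each lane of a warp would still have to map its edge index back to a vertex (e.g., by scanning row_ptr forward from first_vertex); that lookup is omitted to keep the sketch short.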
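
For the bandwidth‑reducing ordering in (3), a reverse‑Cuthill‑McKee‑style pass is easy to state: breadth‑first traversal with neighbours visited in order of increasing degree, with the resulting permutation reversed at the end. The sketch below is a simplified host‑side version, not the paper's heuristic; it starts each component from an arbitrary vertex rather than a pseudo‑peripheral one.

```cuda
#include <algorithm>
#include <queue>
#include <vector>

// Host-side sketch of a reverse-Cuthill-McKee-style ordering (illustrative).
// Returns the new visiting order of the vertices of a CSR graph.
std::vector<int> rcm_order(const std::vector<int> &row_ptr,
                           const std::vector<int> &col_idx) {
    const int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> degree(n), order;
    for (int v = 0; v < n; ++v) degree[v] = row_ptr[v + 1] - row_ptr[v];

    std::vector<char> visited(n, 0);
    for (int s = 0; s < n; ++s) {                 // cover disconnected components
        if (visited[s]) continue;
        std::queue<int> q;
        q.push(s);
        visited[s] = 1;
        while (!q.empty()) {
            int v = q.front(); q.pop();
            order.push_back(v);
            std::vector<int> nbrs;                // unvisited neighbours of v
            for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
                int u = col_idx[e];
                if (!visited[u]) { visited[u] = 1; nbrs.push_back(u); }
            }
            std::sort(nbrs.begin(), nbrs.end(),   // low-degree neighbours first
                      [&](int a, int b) { return degree[a] < degree[b]; });
            for (int u : nbrs) q.push(u);
        }
    }
    std::reverse(order.begin(), order.end());     // the "reverse" in RCM
    return order;
}
```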
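
Finally, the arithmetic behind the batched SpMSpV in (4) is a bitwise dot product: AND a packed adjacency row with the packed frontier and count the surviving bits. The device function below emulates that reduction with `__popc` instead of issuing a binary MMA, so it shows what one Tensor Core tile computes rather than the TC code path itself; the names and the word‑count parameter are illustrative.

```cuda
// Emulation of the bitwise dot product that a binary Tensor Core tile would
// evaluate: how many of a vertex's packed neighbour bits are also set in the
// packed frontier bitmap. BLEST batches blocks of such products into MMA
// tiles; here the same AND + population-count reduction is spelled out.
__device__ int bitwise_dot(const unsigned *adj_row_bits,   // packed adjacency row
                           const unsigned *frontier_bits,  // packed frontier bitmap
                           int num_words) {
    int hits = 0;
    for (int w = 0; w < num_words; ++w)
        hits += __popc(adj_row_bits[w] & frontier_bits[w]); // AND, then popcount
    return hits;   // > 0 means the vertex has at least one frontier neighbour
}
```

A binary Tensor Core tile evaluates a whole block of such products with a single MMA instruction (a bit operation followed by a popcount accumulation), which is where the cut in MMA calls described above comes from.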

Results & Findings

| Baseline | Speedup (average) |
| --- | --- |
| BerryBees (GPU BFS library) | 3.58× |
| Gunrock (GPU graph framework) | 4.64× |
| GSWITCH (TC‑enabled BFS) | 4.9× |

  • Across 30+ real‑world graphs (social networks, web crawls, road maps, synthetic RMAT), BLEST consistently outperformed the baselines, with the largest gains (>6×) on graphs that exhibit high degree variance after reordering.
  • Memory footprint dropped by ~30 % thanks to bitmap frontiers and the compression‑oriented ordering.
  • Kernel launch overhead was reduced by ~45 % due to the fusion strategy, which also lowered the number of host‑device synchronisations per BFS level.

Practical Implications

  • Faster graph analytics pipelines – Applications that repeatedly run BFS (e.g., shortest‑path, community detection, reachability queries) can see near‑proportional reductions in end‑to‑end runtime on modern NVIDIA GPUs equipped with Tensor Cores (e.g., A100, H100).
  • Cost‑effective scaling – By extracting more performance per GPU, data‑center operators can handle larger graphs on fewer nodes, translating to lower electricity and hardware costs.
  • Reusable primitives – The batched SpMSpV pattern and BVSS load‑balancing logic are generic enough to be adapted for other sparse linear algebra kernels (e.g., PageRank, label propagation) that also suffer from irregular memory access patterns.
  • Integration path – Since BLEST builds on existing CUDA APIs (warp‑level primitives, Tensor Core MMA intrinsics), it can be incorporated into existing graph libraries (Gunrock, cuGraph) as an optional “TC‑accelerated” backend without a complete rewrite.

Limitations & Future Work

  • Hardware dependency – BLEST’s gains rely on the presence of high‑throughput Tensor Cores; older GPUs or non‑NVIDIA accelerators cannot benefit.
  • Reordering overhead – The graph‑reordering step adds a preprocessing cost that may be non‑trivial for streaming or dynamically changing graphs.
  • Sparse‑matrix density requirement – Extremely sparse graphs with very low average degree still see modest speed‑ups because the bitwise TC tiles are under‑utilized.
  • Future directions suggested by the authors include extending the approach to multi‑GPU clusters, exploring adaptive reordering that updates on the fly for evolving graphs, and generalising the SpMSpV‑TC pipeline to support weighted edges and other semirings.

Authors

  • Deniz Elbek
  • Kamer Kaya

Paper Information

  • arXiv ID: 2512.21967v1
  • Categories: cs.DC, cs.DS
  • Published: December 26, 2025