[Paper] GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
Source: arXiv - 2604.21290v1
Overview
The paper proposes GraphLeap, a method that accelerates Vision Graph Neural Networks (ViGs) by breaking the tight coupling between graph construction and graph convolution. By “looking ahead” one layer when building the k‑nearest‑neighbor (kNN) graph, the authors overlap graph construction with feature updates and map the whole pipeline onto an FPGA, achieving real‑time inference that far outpaces CPU and GPU baselines.
Key Contributions
- Decoupled graph construction – Introduces a one‑layer‑lookahead scheme that builds the graph for layer ℓ + 1 while simultaneously performing message passing on layer ℓ, eliminating the sequential bottleneck.
- FPGA accelerator architecture – Designs a streaming, layer‑pipelined accelerator that tightly couples a kNN engine with a feature‑update engine, exploiting node‑ and channel‑level parallelism without materializing full edge feature tensors.
- Accuracy‑preserving fine‑tuning – Shows that the slight accuracy loss from using stale features can be recovered with a few epochs of lightweight fine‑tuning.
- Comprehensive evaluation – Demonstrates up to 95.7× speedup over a high‑performance CPU and 8.5× over a modern GPU on both isotropic and pyramidal ViG models, using a Xilinx Alveo U280 board.
- First end‑to‑end ViG FPGA solution – Provides the first complete hardware‑software stack for Vision GNN inference, including RTL kernels, host drivers, and a high‑level synthesis (HLS) workflow.
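The kNN graph construction that GraphLeap decouples can be illustrated in a few lines. The following is a minimal NumPy stand‑in for the brute‑force O(N²) neighbor search a ViG layer performs on its patch embeddings; it is a software sketch for intuition, not the paper's streaming hardware kNN engine.

```python
import numpy as np

def knn_graph(x: np.ndarray, k: int) -> np.ndarray:
    """Brute-force kNN over patch embeddings.

    x: (N, C) patch features; returns (N, k) neighbor indices per node.
    Uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b to get
    all pairwise squared distances in one matrix product.
    """
    sq = (x * x).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)          # exclude self-loops
    return np.argsort(d2, axis=1)[:, :k]  # k nearest neighbors per node
```

The N×N distance matrix is what makes construction dominate inference time on CPUs/GPUs; the paper's kNN engine instead streams these distances and emits neighbor indices on‑the‑fly.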
Methodology
- GraphLeap reformulation – In a conventional ViG, each layer ℓ first runs a kNN search on the current patch embeddings to produce a graph, then performs message passing on that graph. GraphLeap flips the order: while layer ℓ processes its messages, the hardware simultaneously runs a kNN search on the previous layer’s embeddings to generate the graph for layer ℓ + 1, creating a pipeline in which graph construction and convolution overlap.
- Hardware pipeline design:
  - kNN Engine: Implements a distance‑computation tree that streams patch features and outputs neighbor indices on‑the‑fly.
  - Message‑Passing Engine: Consumes the neighbor list and performs weighted aggregation (e.g., sum or attention) across channels, using a systolic array to exploit channel parallelism.
  - Layer pipelining: Each ViG layer is instantiated as a separate stage; data flows from one stage to the next without intermediate DRAM writes, keeping latency low.
- Fine‑tuning – After training the original ViG, the authors replace the graph‑construction schedule with GraphLeap and run a short (≤ 5 epochs) fine‑tuning pass on the same dataset to close the small accuracy gap.
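The one‑layer‑lookahead schedule can be sketched as a sequential simulation: the graph used by layer ℓ + 1 is built from the features *entering* layer ℓ (the stale snapshot), which is exactly what lets the kNN search run concurrently with layer ℓ's aggregation on the FPGA. Everything below (`knn_graph`, the mean‑based `aggregate`) is an illustrative stand‑in, not the paper's actual kernels.

```python
import numpy as np

def knn_graph(x: np.ndarray, k: int) -> np.ndarray:
    # Brute-force kNN: pairwise squared distances, self-loops excluded.
    sq = (x * x).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def aggregate(x: np.ndarray, nbrs: np.ndarray) -> np.ndarray:
    # Stand-in for ViG message passing: average each node with its neighbor mean.
    return 0.5 * (x + x[nbrs].mean(axis=1))

def vig_lookahead(x: np.ndarray, num_layers: int, k: int) -> np.ndarray:
    g = knn_graph(x, k)           # graph for layer 0, built from the raw patches (exact)
    for _ in range(num_layers):
        stale = x                 # features entering layer l; seed for the next graph
        x = aggregate(x, g)       # message passing for layer l on graph g
        g = knn_graph(stale, k)   # graph for layer l+1 from the *stale* features;
                                  # on the FPGA this runs concurrently with aggregate()
    return x
```

A conventional ViG would instead call `knn_graph` on the *updated* `x` each iteration, serializing the two steps; the only behavioral difference of the lookahead schedule is the one‑layer staleness that the fine‑tuning pass compensates for.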
Results & Findings
| Platform | GraphLeap FPGA Speedup over Platform | Platform Throughput (frames/s) | Top‑1 Accuracy (Δ) |
|---|---|---|---|
| CPU (Xeon 3.0 GHz) | ≈ 95.7× | 12 fps (ViG‑S) | –0.3 % |
| GPU (RTX 3080) | ≈ 8.5× | 68 fps (ViG‑S) | –0.2 % |
| FPGA (Alveo U280) | — | 85 fps (ViG‑S) | –0.2 % |
- Graph construction time drops from > 90 % of total inference time on CPUs/GPUs to < 10 % on the FPGA thanks to the overlapped pipeline.
- Resource utilization on the U280 stays under 80 % of LUTs and DSPs, leaving headroom for larger ViG variants.
- Energy efficiency improves by roughly 6‑7× compared with the GPU, making the solution attractive for edge or datacenter inference where power budgets matter.
Practical Implications
- Real‑time vision applications (e.g., autonomous drones, smart cameras) can now leverage the adaptive receptive fields of ViGs without sacrificing latency.
- Edge deployment: The FPGA design fits within a single accelerator card, eliminating the need for multiple GPUs or large CPUs, and can be integrated into existing PCIe‑based inference servers.
- Framework integration: Because GraphLeap only changes the schedule of graph construction, existing PyTorch or TensorFlow ViG models can be ported with minimal code changes, followed by a short fine‑tuning step.
- Scalable to larger graphs: The O(N²) kNN cost is mitigated by the streaming architecture; developers can increase patch resolution (more nodes) while still meeting real‑time constraints.
Limitations & Future Work
- Accuracy trade‑off: The look‑ahead approach relies on slightly stale features; while fine‑tuning recovers most of the loss, some highly sensitive tasks may still see a small dip.
- Hardware specificity: The current implementation targets Xilinx U280 (Vitis HLS). Porting to other FPGA families or ASICs will require redesign of the kNN engine’s memory hierarchy.
- Dynamic batch sizes: The pipeline assumes a fixed batch size per frame; handling variable‑size batches or multi‑stream inputs would need additional control logic.
- Extending to other GNN kernels: GraphLeap focuses on kNN‑based ViGs; future work could explore decoupling strategies for attention‑based or spectral GNNs, and integrate quantization or pruning for even higher efficiency.
Authors
- Anvitha Ramachandran
- Dhruv Parikh
- Viktor Prasanna
Paper Information
- arXiv ID: 2604.21290v1
- Categories: cs.CV, cs.DC
- Published: April 23, 2026