[Paper] FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

Published: (May 7, 2026 at 07:41 AM EDT)
4 min read
Source: arXiv

Source: arXiv - 2605.06057v1

Overview

The paper introduces FalconGEMM, a cross‑platform framework that makes “lower‑complexity” matrix‑multiplication algorithms (LCMAs) practical for real‑world deep‑learning workloads. By automatically generating, optimizing, and selecting the best algorithm for a given hardware target, FalconGEMM consistently beats the performance of traditional GEMM libraries and even other LCMA solutions on GPUs and CPUs.

Key Contributions

  • Portable Deployment Module – Generates hardware‑specific code so the same LCMA can run on GPUs (H100, A100), ARM CPUs, and x86 CPUs without manual retuning.
  • Group‑Parallel Optimizations – Novel on‑chip data‑reuse and parallel‑resource scheduling that cuts memory bandwidth pressure and maximizes compute utilization.
  • Lightweight Decision Module – An analytical performance model that predicts the fastest algorithm for any matrix shape and hardware profile, enabling run‑time selection.
  • Comprehensive Evaluation – Demonstrates 7.6 %–17.9 % speedups over state‑of‑the‑art GEMM libraries (cuBLAS, CUTLASS, Intel MKL) and 12.4 %–55.6 % over competing LCMA frameworks such as AlphaTensor on large‑language‑model (LLM) training and inference tasks.

Methodology

  1. Algorithm Catalog – The authors collect a suite of LCMAs (e.g., Strassen‑like, Winograd‑based, and newer tensor‑decomposition methods) that have lower arithmetic complexity than the classic O(n³) GEMM.
  2. Code Generation – A deployment engine parses the target hardware’s instruction set, memory hierarchy, and parallel execution model, then emits optimized kernels (CUDA, HIP, AVX‑512, NEON, etc.).
  3. Group‑Parallel Optimizations – Kernels are organized into “groups” that share intermediate results in on‑chip buffers (shared memory, L1 cache, or registers). This reduces the number of times the same sub‑product is fetched from DRAM.
  4. Analytical Performance Model – The decision module estimates runtime by accounting for compute throughput, memory bandwidth, and the extra overhead of the LCMA’s recursion depth. It then picks the algorithm (and its tiling parameters) that minimizes the predicted time.
  5. Run‑time Dispatch – At inference or training launch, FalconGEMM queries the model with the actual matrix dimensions and hardware stats, selects the best kernel, and launches it without developer intervention.

Results & Findings

PlatformData TypeSpeedup vs. cuBLAS / MKLSpeedup vs. AlphaTensor
NVIDIA H100FP16+15.2 %+32.8 %
NVIDIA A100BF16+12.7 %+28.4 %
ARM NeoverseFP32+9.3 %+18.5 %
Intel Xeon (AVX‑512)FP64+7.6 %+12.4 %
  • Peak‑breaking performance: In several LLM layers (e.g., transformer attention and feed‑forward blocks) FalconGEMM exceeds the theoretical peak FLOPs of the underlying hardware by exploiting algorithmic reductions in arithmetic work.
  • Robustness across shapes: The decision module correctly switches between classic GEMM (for small, square matrices) and LCMAs (for the tall‑skinny or wide‑short matrices typical in token‑wise operations).
  • Low overhead: The analytical model adds < 0.5 % runtime overhead, making the framework suitable for both batch training and latency‑critical inference.

Practical Implications

  • LLM Training Pipelines – Faster matrix multiplications translate directly into reduced GPU hours, lowering cloud costs for large‑scale model pre‑training.
  • Edge Inference – On ARM‑based servers or even mobile SoCs, FalconGEMM can deliver higher throughput for on‑device language models without sacrificing battery life.
  • Framework Integration – Because the deployment module emits standard CUDA/HIP/AVX kernels, existing deep‑learning libraries (PyTorch, TensorFlow, JAX) can drop‑in replace their GEMM calls with FalconGEMM kernels via a thin wrapper.
  • Hardware‑agnostic Optimization – Companies with heterogeneous fleets (GPU + CPU) no longer need separate hand‑tuned kernels; FalconGEMM automatically adapts, simplifying CI/CD pipelines for model deployment.

Limitations & Future Work

  • Numerical Stability – Some LCMAs (e.g., Strassen‑type) introduce extra rounding error; the paper notes a modest loss of precision for FP64, which may be unacceptable for certain scientific workloads.
  • Memory Footprint – Recursive algorithms require extra temporary buffers; on memory‑constrained devices this can limit the size of matrices that can be processed.
  • Model Generalization – The analytical performance model is calibrated on a set of representative GPUs/CPUs; extending it to emerging accelerators (TPUs, custom ASICs) will need additional profiling.
  • Future Directions – The authors plan to incorporate mixed‑precision auto‑tuning, explore adaptive recursion depth based on runtime error metrics, and open‑source the framework to foster community‑driven kernel extensions.

Authors

  • Honglin Zhu
  • Jiaping Cao
  • Jiang Shao
  • Siyuan Feng
  • Qian Qiu
  • Peng Chen
  • Xu Zhang
  • Yixian Zhou
  • Man Lung Yiu
  • Guang Ji
  • Minwen Deng
  • Wenxi Zhu
  • Jintao Meng

Paper Information

  • arXiv ID: 2605.06057v1
  • Categories: cs.DC, cs.MS
  • Published: May 7, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »