[Paper] FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

Published: 4 days ago (May 7, 2026 at 07:41 AM EDT)

4 min read

Source: arXiv

Source: arXiv - 2605.06057v1

Overview

The paper introduces FalconGEMM, a cross‑platform framework that makes “lower‑complexity” matrix‑multiplication algorithms (LCMAs) practical for real‑world deep‑learning workloads. By automatically generating, optimizing, and selecting the best algorithm for a given hardware target, FalconGEMM consistently beats the performance of traditional GEMM libraries and even other LCMA solutions on GPUs and CPUs.

Key Contributions

Portable Deployment Module – Generates hardware‑specific code so the same LCMA can run on GPUs (H100, A100), ARM CPUs, and x86 CPUs without manual retuning.
Group‑Parallel Optimizations – Novel on‑chip data‑reuse and parallel‑resource scheduling that cuts memory bandwidth pressure and maximizes compute utilization.
Lightweight Decision Module – An analytical performance model that predicts the fastest algorithm for any matrix shape and hardware profile, enabling run‑time selection.
Comprehensive Evaluation – Demonstrates 7.6 %–17.9 % speedups over state‑of‑the‑art GEMM libraries (cuBLAS, CUTLASS, Intel MKL) and 12.4 %–55.6 % over competing LCMA frameworks such as AlphaTensor on large‑language‑model (LLM) training and inference tasks.

Methodology

Algorithm Catalog – The authors collect a suite of LCMAs (e.g., Strassen‑like, Winograd‑based, and newer tensor‑decomposition methods) that have lower arithmetic complexity than the classic O(n³) GEMM.
Code Generation – A deployment engine parses the target hardware’s instruction set, memory hierarchy, and parallel execution model, then emits optimized kernels (CUDA, HIP, AVX‑512, NEON, etc.).
Group‑Parallel Optimizations – Kernels are organized into “groups” that share intermediate results in on‑chip buffers (shared memory, L1 cache, or registers). This reduces the number of times the same sub‑product is fetched from DRAM.
Analytical Performance Model – The decision module estimates runtime by accounting for compute throughput, memory bandwidth, and the extra overhead of the LCMA’s recursion depth. It then picks the algorithm (and its tiling parameters) that minimizes the predicted time.
Run‑time Dispatch – At inference or training launch, FalconGEMM queries the model with the actual matrix dimensions and hardware stats, selects the best kernel, and launches it without developer intervention.

Results & Findings

Platform	Data Type	Speedup vs. cuBLAS / MKL	Speedup vs. AlphaTensor
NVIDIA H100	FP16	+15.2 %	+32.8 %
NVIDIA A100	BF16	+12.7 %	+28.4 %
ARM Neoverse	FP32	+9.3 %	+18.5 %
Intel Xeon (AVX‑512)	FP64	+7.6 %	+12.4 %

Peak‑breaking performance: In several LLM layers (e.g., transformer attention and feed‑forward blocks) FalconGEMM exceeds the theoretical peak FLOPs of the underlying hardware by exploiting algorithmic reductions in arithmetic work.
Robustness across shapes: The decision module correctly switches between classic GEMM (for small, square matrices) and LCMAs (for the tall‑skinny or wide‑short matrices typical in token‑wise operations).
Low overhead: The analytical model adds < 0.5 % runtime overhead, making the framework suitable for both batch training and latency‑critical inference.

Practical Implications

LLM Training Pipelines – Faster matrix multiplications translate directly into reduced GPU hours, lowering cloud costs for large‑scale model pre‑training.
Edge Inference – On ARM‑based servers or even mobile SoCs, FalconGEMM can deliver higher throughput for on‑device language models without sacrificing battery life.
Framework Integration – Because the deployment module emits standard CUDA/HIP/AVX kernels, existing deep‑learning libraries (PyTorch, TensorFlow, JAX) can drop‑in replace their GEMM calls with FalconGEMM kernels via a thin wrapper.
Hardware‑agnostic Optimization – Companies with heterogeneous fleets (GPU + CPU) no longer need separate hand‑tuned kernels; FalconGEMM automatically adapts, simplifying CI/CD pipelines for model deployment.

Limitations & Future Work

Numerical Stability – Some LCMAs (e.g., Strassen‑type) introduce extra rounding error; the paper notes a modest loss of precision for FP64, which may be unacceptable for certain scientific workloads.
Memory Footprint – Recursive algorithms require extra temporary buffers; on memory‑constrained devices this can limit the size of matrices that can be processed.
Model Generalization – The analytical performance model is calibrated on a set of representative GPUs/CPUs; extending it to emerging accelerators (TPUs, custom ASICs) will need additional profiling.
Future Directions – The authors plan to incorporate mixed‑precision auto‑tuning, explore adaptive recursion depth based on runtime error metrics, and open‑source the framework to foster community‑driven kernel extensions.

Authors

Honglin Zhu
Jiaping Cao
Jiang Shao
Siyuan Feng
Qian Qiu
Peng Chen
Xu Zhang
Yixian Zhou
Man Lung Yiu
Guang Ji
Minwen Deng
Wenxi Zhu
Jintao Meng

Paper Information

arXiv ID: 2605.06057v1
Categories: cs.DC, cs.MS
Published: May 7, 2026
PDF: Download PDF

[Paper] FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] Stencil Computations on Cerebras Wafer-Scale Engine

[Paper] Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling

[Paper] A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

[Paper] Stencil Computations on Tenstorrent Wormhole