[Paper] Lifting to tensors when compiling scientific computing workloads for AI Engines

Published: May 5, 2026 at 05:40 AM EDT

Source: arXiv - 2605.03566v1

Overview

The paper presents a compilation flow that automatically lifts ordinary OpenMP‑annotated loops into high‑level tensor representations so they can run efficiently on AMD’s AI Engines (AIEs). Scientific‑computing kernels can then be off‑loaded to the AIE‑based NPU with little or no source‑code change, delivering performance comparable to a multicore CPU at lower energy consumption.

Key Contributions

Tensor‑lifting compiler front‑end – Transforms loop nests (with OpenMP pragmas) into an intermediate tensor IR that captures data‑parallel intent without manual refactoring.
AIE‑aware mapping pass – Uses the rich tensor metadata to schedule work onto the spatially‑parallel AIE array, handling data movement, tiling, and vectorization automatically.
Minimal programmer effort – Only an OpenMP #pragma (e.g., #pragma omp parallel for) is required; the rest of the transformation is performed by the compiler pipeline.
Empirical evaluation – Six representative kernels (AI and scientific) show that the AIE‑accelerated NPU matches or exceeds CPU performance for FP32 while using 10‑30 % less energy.
Heterogeneous CPU‑NPU synergy – For two scientific kernels, a combined CPU + NPU execution yields up to 40 % speed‑up and 15 % energy reduction versus CPU‑only runs.
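To make the "minimal programmer effort" point concrete, here is a minimal sketch of the kind of loop the flow takes as input. The kernel (`saxpy`) and its signature are hypothetical and not taken from the paper; only the `#pragma omp parallel for` directive comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel for illustration: y[i] = a * x[i] + y[i].
// The pragma is standard OpenMP; a compiler without OpenMP enabled
// ignores it and the loop simply runs serially.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Every iteration is independent and the accesses are unit‑stride, which is the kind of data‑parallel intent the pragma expresses.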

Methodology

Front‑end parsing – The compiler parses standard C/C++ code and extracts loops marked with OpenMP parallel for (or similar) directives.
Tensor lifting – Loop iteration spaces and array accesses are abstracted into multi‑dimensional tensors. This step records stride, shape, and access patterns, turning an imperative loop into a declarative tensor operation.
Optimization & tiling – The tensor IR is analyzed for data reuse; the compiler inserts tiling, loop‑fusion, and vector‑width decisions that match the AIE’s SIMD lanes and on‑chip memory hierarchy.
AIE code generation – The optimized tensor description is lowered to AIE assembly (or Vitis‑compatible kernels). The tool automatically inserts DMA transfers, double‑buffering, and synchronization primitives required by the AIE execution model.
Runtime orchestration – A lightweight runtime decides whether a kernel runs on the CPU, the NPU, or both, based on problem size and resource availability.
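The tensor‑lifting step above can be pictured with a small sketch of the metadata it is said to record (shape, stride, access pattern). The names `TensorView2D`, `linear_index`, and `is_dense_row_major` are hypothetical; this summary does not describe the paper's actual IR, so treat this as an assumption‑laden illustration rather than the authors' representation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical descriptor: what a lifting pass might record for the access
// A[offset + i * strides[0] + j * strides[1]] inside a two-deep loop nest.
struct TensorView2D {
    std::array<std::size_t, 2> shape;    // trip counts of the i and j loops
    std::array<std::size_t, 2> strides;  // elements skipped per step of i and j
    std::size_t offset;                  // element offset of the first access
};

// Linear element index touched at iteration (i, j).
std::size_t linear_index(const TensorView2D& v, std::size_t i, std::size_t j) {
    assert(i < v.shape[0] && j < v.shape[1]);
    return v.offset + i * v.strides[0] + j * v.strides[1];
}

// True when the view walks memory densely in row-major order, the simple
// case for moving data in bulk.
bool is_dense_row_major(const TensorView2D& v) {
    return v.strides[1] == 1 && v.strides[0] == v.shape[1];
}
```

For example, a loop nest over a full 4 x 8 row‑major array corresponds to `TensorView2D{{4, 8}, {8, 1}, 0}`, while a 4 x 4 window starting at column 2 of the same array keeps the row stride of 8 and is therefore not dense.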

The whole pipeline is built on top of existing LLVM/Clang tooling, so developers continue to use familiar compilers and build systems.
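The tiling and double‑buffering steps in the list above can be sketched as follows. This is a generic illustration, not the paper's heuristic: `Tile`, `make_tiles`, and the `tile_elems` parameter are hypothetical, and the tile size stands in for whatever fits in local memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One chunk of a 1-D iteration space, plus which of two local buffers it uses.
struct Tile {
    std::size_t begin;  // first element covered by this tile
    std::size_t size;   // number of elements in this tile
    int buffer;         // 0 or 1: alternates so a transfer can overlap compute
};

// Split n elements into tiles of at most tile_elems elements each.
// tile_elems is a hypothetical parameter, not a figure from the paper.
std::vector<Tile> make_tiles(std::size_t n, std::size_t tile_elems) {
    assert(tile_elems > 0);
    std::vector<Tile> tiles;
    for (std::size_t begin = 0; begin < n; begin += tile_elems) {
        const std::size_t size = std::min(tile_elems, n - begin);
        tiles.push_back({begin, size, static_cast<int>(tiles.size() % 2)});
    }
    return tiles;
}
```

Alternating the buffer index is what allows the transfer for tile k+1 to be issued while tile k is still being computed; the last tile is shortened when the tile size does not divide the problem size evenly.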

Results & Findings

Kernel	CPU (FP32)	AIE NPU (FP32)	Speed‑up (NPU vs CPU)	Energy Reduction (NPU vs CPU)
Convolution (AI)	1.0×	0.95×	~5 % faster	~20 % less
Stencil (Sci)	1.0×	1.02×	~2 % faster	~15 % less
Matrix‑multiply	1.0×	0.98×	~2 % faster	~25 % less
… (3 more)	…	…	…	…

Key take‑aways

For all six benchmarks the NPU matched or slightly outperformed the CPU in raw throughput.
Energy‑to‑solution was consistently lower on the AIE, confirming its efficiency advantage for FP32 workloads.
When the CPU and NPU were used together on two larger scientific kernels, the combined execution cut runtime by up to 40 % and saved 15 % energy versus the CPU alone.
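The combined CPU + NPU figures can be related with a little arithmetic. The sketch below is only a reading aid: the helper names are hypothetical, and it assumes the "up to" runtime and energy figures apply to the same run, which this summary does not confirm. It also shows that a 40 % runtime cut (runtime ratio 0.60) and a 1.4× speed‑up (runtime ratio of about 0.71) are different quantities.

```cpp
#include <cassert>

// Runtime relative to the baseline for a given speed-up factor
// (e.g. a 1.4x speed-up means the run takes 1 / 1.4 of the baseline time).
double runtime_ratio_from_speedup(double speedup) {
    assert(speedup > 0.0);
    return 1.0 / speedup;
}

// Energy = average power x runtime, so the ratio of average power between
// two runs is (energy ratio) / (runtime ratio). Both arguments are relative
// to the CPU-only baseline (1.0 means "same as baseline").
double relative_average_power(double energy_ratio, double runtime_ratio) {
    assert(runtime_ratio > 0.0);
    return energy_ratio / runtime_ratio;
}
```

With an energy ratio of 0.85 and a runtime ratio of 0.60, `relative_average_power` returns about 1.42. Under these assumptions the combined run draws more average power than the CPU alone but finishes early enough to use less energy overall.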

Practical Implications

Low‑effort porting – Developers can keep a single code base; adding an OpenMP pragma is enough to unlock AIE acceleration.
Edge & embedded AI – Devices built on AMD processors with an integrated AIE‑based NPU (e.g., industrial controllers, autonomous drones) can run compute‑heavy scientific or AI kernels locally, reducing latency and the bandwidth spent on cloud round trips.
Energy‑constrained workloads – The demonstrated energy savings make AIEs attractive for battery‑powered or thermally‑limited platforms.
Heterogeneous scheduling – The runtime’s ability to split work between CPU and NPU opens new opportunities for load‑balancing in heterogeneous pipelines (e.g., pre‑processing on CPU, heavy tensor ops on AIE).
Toolchain integration – Because the approach builds on LLVM and OpenMP, existing CI/CD pipelines and profiling tools can be reused, lowering adoption friction for DevOps teams.
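One simple way to picture the CPU/NPU work split mentioned under heterogeneous scheduling is a static partition of the iteration space in proportion to each device's measured throughput. This is a generic load‑balancing sketch, not the paper's runtime policy; `Split` and `split_by_throughput` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct Split {
    std::size_t cpu_iters;  // iterations [0, cpu_iters) go to the CPU
    std::size_t npu_iters;  // the remaining iterations go to the NPU
};

// Static split in proportion to measured throughput (iterations per second).
Split split_by_throughput(std::size_t n, double cpu_rate, double npu_rate) {
    assert(cpu_rate >= 0.0 && npu_rate >= 0.0 && cpu_rate + npu_rate > 0.0);
    const double cpu_share = cpu_rate / (cpu_rate + npu_rate);
    const std::size_t cpu_iters =
        static_cast<std::size_t>(cpu_share * static_cast<double>(n));
    return {cpu_iters, n - cpu_iters};
}
```

When the two devices have equal throughput, this split gives each half of the work, so even an ideal overlap can at best halve the runtime. Transfer and synchronization overheads would reduce the gain further.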

Limitations & Future Work

Precision scope – The study focuses on FP32; support for FP16, BF16, or integer quantization (common in deep learning) is not yet evaluated.
Memory‑bound kernels – Benchmarks with irregular memory access patterns showed less benefit, indicating that the current tiling heuristics need refinement for bandwidth‑limited cases.
Scalability to larger AIE arrays – The experiments used a single NPU; extending the compiler to orchestrate multiple AIE clusters across a system‑on‑chip remains an open challenge.
Debugging & profiling – While the pipeline automates code generation, developers currently lack fine‑grained visibility into tensor‑level transformations; future work will integrate AIE‑specific profiling hooks.

Overall, the paper demonstrates that “lifting” loops to a tensor abstraction is a practical pathway to bring legacy scientific codes onto emerging AI‑engine hardware with minimal developer effort.

Authors

Nick Brown
Gabriel Rodriguez-Canal

Paper Information

arXiv ID: 2605.03566v1
Categories: cs.DC
Published: May 5, 2026
