[Paper] PASTA: A Modular Program Analysis Tool Framework for Accelerators
Source: arXiv - 2602.22103v1
Overview
Modern software increasingly runs on specialized accelerators such as GPUs, TPUs, and custom AI chips. Analyzing the performance of these workloads is notoriously hard because each vendor exposes its own low‑level profiling APIs, and deep‑learning frameworks (TensorFlow, PyTorch, etc.) have very different execution models. The paper “PASTA: A Modular Program Analysis Tool Framework for Accelerators” introduces a unified, low‑overhead framework that lets developers and researchers quickly build custom analysis tools without wrestling with vendor‑specific details.
Key Contributions
- Unified abstraction layer over heterogeneous profiling interfaces (NVIDIA, AMD, etc.) and popular deep‑learning runtimes.
- Modular architecture that separates data collection, event processing, and user‑defined analyses, enabling rapid prototyping of new tools.
- GPU‑accelerated backend that reduces measurement overhead dramatically (up to 13,000× faster than traditional CPU‑based profilers).
- Two concrete tools built on top of PASTA:
  - A deep‑learning workload characterizer that extracts layer‑wise compute/memory patterns.
  - A Unified Virtual Memory (UVM) optimization assistant that suggests paging‑policy tweaks.
- Extensive evaluation on a suite of mainstream DL models (ResNet, BERT, GPT‑2, etc.) across single‑ and multi‑GPU configurations on both NVIDIA and AMD hardware.
Methodology
PASTA follows a three‑layer pipeline:
- Instrumentation Layer – thin adapters wrap each vendor’s profiling API (e.g., NVIDIA CUPTI, AMD ROCm) and expose a common event stream (kernel launches, memory copies, synchronization points).
- Data‑Processing Layer – a lightweight runtime, written in CUDA/HIP, aggregates raw events on the GPU itself, performing tasks such as timestamp alignment, event correlation, and statistical summarization.
- Analysis Plug‑in Layer – developers implement plug‑ins in C++ or Python that consume the processed event stream via a simple SDK. The SDK supplies utilities for windowed aggregation, histogramming, and visualization.
Because the heavy lifting (event merging, filtering) happens on the accelerator, the framework incurs only microseconds of overhead per kernel, unlike traditional CPU‑side profilers that stall the application to pull data.
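To make the plug‑in layer concrete, here is a minimal sketch of what an analysis plug‑in consuming a processed event stream could look like. All class and field names below (`Event`, `on_event`, etc.) are invented for illustration and are not PASTA's actual SDK:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Event:
    kind: str        # e.g. "kernel", "memcpy", or "sync"
    name: str        # kernel or API symbol
    start_us: float  # aligned timestamp, microseconds
    dur_us: float    # duration, microseconds

class KernelHistogramPlugin:
    """Toy plug-in: aggregate per-kernel invocation counts and total time."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def on_event(self, ev: Event):
        # Only kernel launches contribute to this analysis.
        if ev.kind == "kernel":
            self.totals[ev.name] += ev.dur_us
            self.counts[ev.name] += 1

    def summary(self):
        # name -> (launch count, total time in microseconds)
        return {name: (self.counts[name], self.totals[name])
                for name in self.totals}

# Drive the plug-in with a small synthetic event stream.
plugin = KernelHistogramPlugin()
stream = [
    Event("kernel", "sgemm", 0.0, 120.0),
    Event("memcpy", "HtoD", 130.0, 40.0),
    Event("kernel", "sgemm", 200.0, 118.0),
    Event("kernel", "softmax", 330.0, 15.0),
]
for ev in stream:
    plugin.on_event(ev)
print(plugin.summary())
```

In the real framework the aggregation shown here would already have happened on the GPU in the data‑processing layer; the sketch only illustrates the shape of the event‑consumer interface.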
Results & Findings
- Overhead Reduction: In micro‑benchmarks, PASTA’s GPU‑backed collector added <0.02 % runtime overhead, compared with 10–30 % for tools like Nsight Systems or ROCm‑Profiler.
- Speedup in Data Retrieval: The authors report up to 1.3 × 10⁴× faster extraction of per‑kernel metrics because data never leaves the GPU memory until the analysis phase.
- Accuracy: Event timestamps matched hardware counters within ±0.5 µs, confirming that the abstraction does not sacrifice precision.
- Tool Demonstrations:
  - The workload characterizer identified that BERT’s attention layers are memory‑bound on AMD GPUs, prompting a 12 % speed‑up after kernel fusion.
  - The UVM optimizer reduced page‑fault stalls by 27 % on a multi‑GPU training run of GPT‑2, translating to a 5 % overall training‑time reduction.
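The kind of heuristic a UVM assistant applies can be sketched simply: flag execution windows where page‑fault stall time dominates, since those are candidates for explicit prefetching or a different paging policy. The function name and threshold below are assumptions for illustration, not the paper's algorithm:

```python
def flag_fault_heavy_windows(windows, stall_frac=0.25):
    """windows: list of (window_id, fault_stall_us, window_us) tuples.
    Returns ids of windows where page-fault stalls consume more than
    stall_frac of the window's wall time."""
    return [wid for wid, stall, total in windows
            if total > 0 and stall / total > stall_frac]

# Three 100 us windows; only window 1 spends >25% of its time stalled.
windows = [(0, 10.0, 100.0), (1, 40.0, 100.0), (2, 5.0, 100.0)]
print(flag_fault_heavy_windows(windows))  # -> [1]
```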
Practical Implications
- Faster Iteration for Performance Engineers: Teams can prototype custom analyses (e.g., detecting kernel launch imbalance across GPUs) in hours rather than days, thanks to the plug‑in SDK.
- Lower Cost of Profiling at Scale: Because PASTA’s overhead is negligible, it can be left enabled in production‑grade training pipelines, providing continuous performance telemetry without hurting throughput.
- Cross‑Vendor Portability: A single codebase can profile both NVIDIA and AMD hardware, simplifying CI pipelines for heterogeneous clusters.
- Enabling New Optimizations: The GPU‑resident data processing opens the door to real‑time feedback loops—e.g., an auto‑tuner that adjusts batch size or kernel launch parameters on‑the‑fly based on live metrics.
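The kernel‑launch‑imbalance example mentioned above can be prototyped in a few lines once per‑GPU launch counts are available from the event stream. This is a hypothetical sketch (function name and 10 % tolerance are invented), not PASTA code:

```python
from statistics import mean

def launch_imbalance(launch_counts, tol=0.10):
    """launch_counts: {gpu_id: kernel launches observed in a window}.
    Returns GPU ids whose count deviates from the mean by more than tol."""
    avg = mean(launch_counts.values())
    return sorted(gpu for gpu, n in launch_counts.items()
                  if abs(n - avg) / avg > tol)

# GPU 2 launched ~25% fewer kernels than average: a likely straggler.
counts = {0: 1000, 1: 1005, 2: 710, 3: 998}
print(launch_imbalance(counts))  # -> [2]
```

A real‑time feedback loop would run such a check over a sliding window of live metrics and feed the result to an auto‑tuner, which is exactly the kind of tool the low per‑kernel overhead makes feasible.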
Limitations & Future Work
- Scope Limited to GPUs: While the design is extensible, current adapters only cover NVIDIA (CUPTI) and AMD (ROCm). Extending to TPUs, FPGAs, or emerging AI accelerators will require additional low‑level bindings.
- Learning Curve for Plug‑ins: Although the SDK is lightweight, developers still need familiarity with GPU programming (CUDA/HIP) to write high‑performance plug‑ins.
- Static Analysis Not Covered: PASTA focuses on runtime profiling; integrating static code analysis (e.g., kernel source inspection) could provide a more holistic optimization pipeline.
- Future Directions: The authors plan to open‑source the framework, add support for Intel oneAPI and emerging RISC‑V AI cores, and explore AI‑driven anomaly detection on the collected event streams.
Authors
- Mao Lin
- Hyeran Jeon
- Keren Zhou
Paper Information
- arXiv ID: 2602.22103v1
- Categories: cs.DC, cs.PF
- Published: February 25, 2026