[Paper] PASTA: A Modular Program Analysis Tool Framework for Accelerators
Source: arXiv - 2602.22103v1
Overview
Modern software increasingly runs on specialized accelerators such as GPUs, TPUs, and custom AI chips. Analyzing the performance of these workloads is notoriously hard because each vendor exposes its own low‑level profiling APIs, and deep‑learning frameworks (TensorFlow, PyTorch, etc.) have very different execution models. The paper “PASTA: A Modular Program Analysis Tool Framework for Accelerators” introduces a unified, low‑overhead framework that lets developers and researchers quickly build custom analysis tools without wrestling with vendor‑specific details.
Key Contributions
- Unified abstraction layer over heterogeneous profiling interfaces (NVIDIA, AMD, etc.) and popular deep‑learning runtimes.
- Modular architecture that separates data collection, event processing, and user‑defined analyses, enabling rapid prototyping of new tools.
- GPU‑accelerated backend that reduces measurement overhead dramatically (up to 13,000× faster than traditional CPU‑based profilers).
- Two concrete tools built on top of PASTA:
  - A deep‑learning workload characterizer that extracts layer‑wise compute/memory patterns.
  - A Unified Virtual Memory (UVM) optimization assistant that suggests paging‑policy tweaks.
- Extensive evaluation on a suite of mainstream DL models (ResNet, BERT, GPT‑2, etc.) across single‑ and multi‑GPU configurations on both NVIDIA and AMD hardware.
Methodology
PASTA follows a three‑layer pipeline:
- Instrumentation Layer – thin adapters wrap each vendor’s profiling API (e.g., NVIDIA CUPTI, AMD ROCm) and expose a common event stream (kernel launches, memory copies, synchronization points).
- Data‑Processing Layer – a lightweight runtime, written in CUDA/HIP, aggregates raw events on the GPU itself, performing tasks such as timestamp alignment, event correlation, and statistical summarization.
- Analysis Plug‑in Layer – developers implement plug‑ins in C++ or Python that consume the processed event stream via a simple SDK. The SDK supplies utilities for windowed aggregation, histogramming, and visualization.
Because the heavy lifting (event merging, filtering) happens on the accelerator, the framework incurs only microseconds of overhead per kernel, unlike traditional CPU‑side profilers that stall the application to pull data.
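To make the plug‑in layer concrete, here is a minimal sketch of what an analysis plug‑in consuming a processed event stream could look like. All class and field names below (`Event`, `on_event`, etc.) are invented for illustration and are not PASTA's actual SDK:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Event:
    kind: str        # e.g. "kernel", "memcpy", or "sync"
    name: str        # kernel or API symbol
    start_us: float  # aligned timestamp, microseconds
    dur_us: float    # duration, microseconds

class KernelHistogramPlugin:
    """Toy plug-in: aggregate per-kernel invocation counts and total time."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def on_event(self, ev: Event):
        # Only kernel launches contribute to this analysis.
        if ev.kind == "kernel":
            self.totals[ev.name] += ev.dur_us
            self.counts[ev.name] += 1

    def summary(self):
        # name -> (launch count, total time in microseconds)
        return {name: (self.counts[name], self.totals[name])
                for name in self.totals}

# Drive the plug-in with a small synthetic event stream.
plugin = KernelHistogramPlugin()
stream = [
    Event("kernel", "sgemm", 0.0, 120.0),
    Event("memcpy", "HtoD", 130.0, 40.0),
    Event("kernel", "sgemm", 200.0, 118.0),
    Event("kernel", "softmax", 330.0, 15.0),
]
for ev in stream:
    plugin.on_event(ev)
print(plugin.summary())
```

In the real framework the aggregation shown here would already have happened on the GPU in the data‑processing layer; the sketch only illustrates the shape of the event‑consumer interface.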
Results & Findings
- Overhead Reduction: In micro‑benchmarks, PASTA’s GPU‑backed collector added <0.02 % runtime overhead, compared with 10–30 % for tools like Nsight Systems or ROCm‑Profiler.
- Speedup in Data Retrieval: The authors report up to 1.3 × 10⁴× faster extraction of per‑kernel metrics because data never leaves the GPU memory until the analysis phase.
- Accuracy: Event timestamps matched hardware counters within ±0.5 µs, confirming that the abstraction does not sacrifice precision.
- Tool Demonstrations:
  - The workload characterizer identified that BERT’s attention layers are memory‑bound on AMD GPUs, prompting a 12 % speed‑up after kernel fusion.
  - The UVM optimizer reduced page‑fault stalls by 27 % on a multi‑GPU training run of GPT‑2, translating to a 5 % overall training‑time reduction.
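The kind of heuristic a UVM assistant applies can be sketched simply: flag execution windows where page‑fault stall time dominates, since those are candidates for explicit prefetching or a different paging policy. The function name and threshold below are assumptions for illustration, not the paper's algorithm:

```python
def flag_fault_heavy_windows(windows, stall_frac=0.25):
    """windows: list of (window_id, fault_stall_us, window_us) tuples.
    Returns ids of windows where page-fault stalls consume more than
    stall_frac of the window's wall time."""
    return [wid for wid, stall, total in windows
            if total > 0 and stall / total > stall_frac]

# Three 100 us windows; only window 1 spends >25% of its time stalled.
windows = [(0, 10.0, 100.0), (1, 40.0, 100.0), (2, 5.0, 100.0)]
print(flag_fault_heavy_windows(windows))  # -> [1]
```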
Practical Implications
- Faster Iteration for Performance Engineers: Teams can prototype custom analyses (e.g., detecting kernel launch imbalance across GPUs) in hours rather than days, thanks to the plug‑in SDK.
- Lower Cost of Profiling at Scale: Because PASTA’s overhead is negligible, it can be left enabled in production‑grade training pipelines, providing continuous performance telemetry without hurting throughput.
- Cross‑Vendor Portability: A single codebase can profile both NVIDIA and AMD hardware, simplifying CI pipelines for heterogeneous clusters.
- Enabling New Optimizations: The GPU‑resident data processing opens the door to real‑time feedback loops—e.g., an auto‑tuner that adjusts batch size or kernel launch parameters on‑the‑fly based on live metrics.
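The kernel‑launch‑imbalance example mentioned above can be prototyped in a few lines once per‑GPU launch counts are available from the event stream. This is a hypothetical sketch (function name and 10 % tolerance are invented), not PASTA code:

```python
from statistics import mean

def launch_imbalance(launch_counts, tol=0.10):
    """launch_counts: {gpu_id: kernel launches observed in a window}.
    Returns GPU ids whose count deviates from the mean by more than tol."""
    avg = mean(launch_counts.values())
    return sorted(gpu for gpu, n in launch_counts.items()
                  if abs(n - avg) / avg > tol)

# GPU 2 launched ~25% fewer kernels than average: a likely straggler.
counts = {0: 1000, 1: 1005, 2: 710, 3: 998}
print(launch_imbalance(counts))  # -> [2]
```

A real‑time feedback loop would run such a check over a sliding window of live metrics and feed the result to an auto‑tuner, which is exactly the kind of tool the low per‑kernel overhead makes feasible.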
Limitations & Future Work
- Scope Limited to GPUs: While the design is extensible, current adapters only cover NVIDIA (CUPTI) and AMD (ROCm). Extending to TPUs, FPGAs, or emerging AI accelerators will require additional low‑level bindings.
- Learning Curve for Plug‑ins: Although the SDK is lightweight, developers still need familiarity with GPU programming (CUDA/HIP) to write high‑performance plug‑ins.
- Static Analysis Not Covered: PASTA focuses on runtime profiling; integrating static code analysis (e.g., kernel source inspection) could provide a more holistic optimization pipeline.
- Future Directions: The authors plan to open‑source the framework, add support for Intel oneAPI and emerging RISC‑V AI cores, and explore AI‑driven anomaly detection on the collected event streams.
Authors
- Mao Lin
- Hyeran Jeon
- Keren Zhou
Paper Information
- arXiv ID: 2602.22103v1
- Categories: cs.DC, cs.PF
- Published: February 25, 2026