[Paper] Understanding Accelerator Compilers via Performance Profiling

Published: November 24, 2025 at 05:40 PM EST

Source: arXiv - 2511.19764v1

Overview

Accelerator design languages (ADLs) let engineers describe custom hardware in a high‑level, software‑like syntax, but the compilers that turn those descriptions into hardware are notoriously opaque. The paper "Understanding Accelerator Compilers via Performance Profiling" introduces Petal, a profiling framework for the Calyx intermediate language that maps low‑level simulation events back to the original high‑level constructs, giving developers concrete visibility into why a generated accelerator runs fast or slow.

Key Contributions

  • Petal profiling tool: Instruments Calyx code with lightweight probes, collects cycle‑accurate traces from RTL simulation, and correlates them with high‑level control structures.
  • Mapping algorithm: A systematic method for translating register‑transfer‑level (RTL) events to the originating Calyx statements, preserving the programmer’s mental model.
  • Case‑study validation: Demonstrates on three real accelerator designs how Petal uncovers hidden performance bottlenecks that the compiler’s heuristics missed.
  • Optimization guidance: Shows that insights from Petal enable manual refactorings that cut total execution cycles by up to 46.9 % for a benchmark application.
  • Open‑source prototype: The authors release the Petal implementation, making it reusable for other Calyx‑based projects and extensible to similar ADL ecosystems.

Methodology

  1. Instrumentation – The Calyx front‑end inserts probe statements around each control construct (loops, conditionals, state updates). These probes emit a tiny signal every time the construct becomes active during simulation.
  2. RTL Simulation – The instrumented Calyx program is compiled to Verilog and run through a standard cycle‑accurate simulator (e.g., Verilator). The simulator produces a time‑stamped trace of all probe activations.
  3. Trace Analysis – Petal parses the trace, groups consecutive activations belonging to the same high‑level construct, and computes per‑construct cycle counts, latency, and overlap statistics.
  4. Visualization & Reporting – The tool outputs human‑readable reports and optional flame‑graph‑style visualizations that highlight hot spots in the original source code.
  5. Iterative Refinement – Developers use the reports to restructure the Calyx code (e.g., unroll loops, reorder pipelines) and re‑run the profiling loop until performance goals are met.

The profiling steps (1–4) are fully automated and require only a single command after the initial Calyx compilation; the refinement in step 5 is left to the developer.
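
To make the trace-analysis step more concrete, here is a minimal sketch of grouping consecutive probe activations and totalling cycles per construct. The trace format (one `(cycle, construct_name)` pair per active cycle), the construct names, and the `summarize` helper are all assumptions made for illustration; they are not Petal's actual data format or API. The sketch also assumes at most one construct is active per cycle, so it does not model the overlap statistics mentioned above.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical trace: (cycle, construct_name), one entry per cycle in which
# that construct's probe was active. Real Petal traces may look different.
trace = [
    (0, "init"),
    (1, "loop_body"),
    (2, "loop_body"),
    (3, "if_check"),
    (4, "loop_body"),
    (5, "loop_body"),
    (6, "if_check"),
]

def summarize(trace):
    """Group consecutive activations of a construct and total their cycles."""
    stats = defaultdict(lambda: {"cycles": 0, "activations": 0})
    ordered = sorted(trace)  # order events by cycle number
    # groupby only merges adjacent events, so each group is one activation.
    for name, run in groupby(ordered, key=lambda event: event[1]):
        run_length = sum(1 for _ in run)
        stats[name]["cycles"] += run_length
        stats[name]["activations"] += 1
    return dict(stats)

summary = summarize(trace)
total_cycles = sum(entry["cycles"] for entry in summary.values())

# Print constructs from hottest to coldest.
for name, entry in sorted(summary.items(), key=lambda kv: -kv[1]["cycles"]):
    share = 100.0 * entry["cycles"] / total_cycles
    print(f"{name:10s} cycles={entry['cycles']} "
          f"activations={entry['activations']} share={share:.1f}%")
```

Sorting by cycle count puts the hottest construct first, which is the ordering a hot-spot report would most likely want.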

Results & Findings

  • Profiling accuracy – Petal’s cycle counts matched the simulator’s ground truth within a 0.1 % margin across all benchmarks, confirming that the mapping algorithm preserves timing fidelity.
  • Hidden bottlenecks – In a matrix‑multiply accelerator, Petal revealed that a seemingly innocuous if statement caused a pipeline stall in 23 % of cycles, a fact the compiler’s static analysis never reported.
  • Manual optimizations – By refactoring the control flow to eliminate the stall, the authors reduced total execution cycles from 1.84 M to 0.98 M (a reduction of roughly 46.9 %).
  • Compiler limits – The study confirmed the authors’ hypothesis: even state‑of‑the‑art ADL compilers cannot guarantee optimal performance for all patterns, especially when high‑level constructs map to complex control paths.
  • Developer productivity – Teams using Petal reported a 30 % reduction in time spent debugging performance issues compared to ad‑hoc waveform inspection.
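
The cycle-reduction figure can be checked with a short calculation. The sketch below uses only the rounded counts quoted in the list above; `percent_reduction` is a helper written for this illustration. With rounded inputs the result comes out slightly below the quoted 46.9 %, which would be consistent with the quoted figure having been computed from unrounded cycle counts (an assumption, since the exact counts are not given here).

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

# Rounded cycle counts as quoted in the summary (1.84 M and 0.98 M).
before_cycles = 1.84e6
after_cycles = 0.98e6

reduction = percent_reduction(before_cycles, after_cycles)
speedup = before_cycles / after_cycles

print(f"reduction = {reduction:.1f}%")  # 46.7% with the rounded inputs
print(f"speedup   = {speedup:.2f}x")    # 1.88x
```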

Practical Implications

  • Faster hardware iteration – Engineers can now pinpoint the exact source of latency without diving into massive RTL dumps, shortening the design‑to‑silicon cycle.
  • Better resource budgeting – By exposing per‑construct cycle usage, architects can make informed trade‑offs between area, power, and performance early in the design flow.
  • Toolchain integration – Petal can be hooked into continuous‑integration pipelines for hardware projects, automatically flagging regressions in cycle counts after each code change.
  • Cross‑ADL applicability – The profiling concept (instrument at the high level, then map RTL events back to it) is transferable to other high‑level hardware languages (e.g., Chisel, HLS languages), offering a roadmap for broader ecosystem support.
  • Educational value – Newcomers to accelerator design can see how high‑level constructs translate to hardware cycles, accelerating learning and reducing reliance on guesswork.
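
As an illustration of the continuous-integration idea, a regression check might compare per-construct cycle counts between a stored baseline and the current run. Everything in this sketch is hypothetical: the dictionary-based report format, the construct names, the 2 % tolerance, and the `find_regressions` helper are assumptions, not part of Petal.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return constructs whose cycle count grew by more than `tolerance`.

    Both arguments map construct name -> cycle count. The result maps
    construct name -> fractional growth relative to the baseline.
    """
    regressions = {}
    for name, new_cycles in current.items():
        old_cycles = baseline.get(name)
        if not old_cycles:
            continue  # new construct or zero baseline: nothing to compare
        growth = (new_cycles - old_cycles) / old_cycles
        if growth > tolerance:
            regressions[name] = growth
    return regressions

# Hypothetical per-construct cycle counts from two profiling runs.
baseline = {"load": 1000, "compute": 5000, "store": 800}
current = {"load": 1005, "compute": 5600, "store": 790, "flush": 40}

regressions = find_regressions(baseline, current)
for name, growth in regressions.items():
    print(f"REGRESSION {name}: +{growth:.1%}")

# A CI job could exit with a non-zero status when `regressions` is non-empty.
```

A relative tolerance rather than an exact match keeps small, expected fluctuations (such as the 0.5 % change in `load` here) from failing the build.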

Limitations & Future Work

  • Calyx‑centric – Petal currently works only with the Calyx intermediate language; extending it to other ADLs will require custom instrumentation hooks.
  • Simulation‑only – The approach depends on cycle‑accurate RTL simulation, which can be slow for very large designs; the authors suggest exploring hardware‑accelerated simulation or trace‑compression techniques.
  • Heuristic guidance – While Petal surfaces bottlenecks, it does not automatically suggest concrete refactorings; future work could integrate a recommendation engine that proposes transformations based on common patterns.
  • Scalability of visualizations – For designs with thousands of probes, the current flame‑graph view becomes cluttered; more hierarchical or filterable visualizations are planned.
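
One way a more hierarchical view could work is to roll per-construct cycle counts up to a chosen nesting depth, so a report starts coarse and expands on demand. The sketch below is only an illustration of that idea, not the authors' planned design: the slash-separated construct paths, the `roll_up` helper, and the numbers are hypothetical, and the counts are assumed to be exclusive (each cycle attributed to exactly one leaf) so that sums do not double count.

```python
from collections import Counter

def roll_up(cycles_by_path, depth, sep="/"):
    """Aggregate cycle counts by the first `depth` components of each path."""
    rolled = Counter()
    for path, cycles in cycles_by_path.items():
        prefix = sep.join(path.split(sep)[:depth])
        rolled[prefix] += cycles
    return dict(rolled)

# Hypothetical exclusive cycle counts keyed by nested construct path.
cycles_by_path = {
    "main/init": 10,
    "main/loop/load": 120,
    "main/loop/compute/mul": 300,
    "main/loop/compute/add": 150,
    "main/loop/store": 90,
    "main/done": 5,
}

coarse = roll_up(cycles_by_path, depth=2)  # one entry per top-level child
finer = roll_up(cycles_by_path, depth=3)   # expand one more level

for prefix, cycles in sorted(coarse.items(), key=lambda kv: -kv[1]):
    print(f"{prefix:12s} {cycles}")
```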

Overall, the paper makes a strong case that understanding the compiler is a pragmatic complement to improving it, and Petal provides a concrete, developer‑friendly bridge between high‑level accelerator code and the low‑level performance realities of the generated hardware.

Authors

  • Ayaka Yorihiro
  • Griffin Berlstein
  • Pedro Pontes García
  • Kevin Laeufer
  • Adrian Sampson

Paper Information

  • arXiv ID: 2511.19764v1
  • Categories: cs.PL, cs.AR, cs.SE
  • Published: November 24, 2025