[Paper] A High-level Synthesis Toolchain for the Julia Language

Published: December 17, 2025 at 01:32 PM EST
4 min read
Source: arXiv - 2512.15679v1

Overview

The paper presents a first‑of‑its‑kind compiler toolchain that turns Julia code directly into synthesizable SystemVerilog, enabling developers to target FPGAs without learning a hardware description language or sprinkling pragmas throughout their source. Built on the MLIR infrastructure, the flow bridges the “two‑language problem” that has long hampered rapid FPGA prototyping, letting algorithm designers stay in the high‑level, expressive Julia ecosystem while still reaping the performance and energy benefits of custom hardware.

Key Contributions

  • MLIR‑based Julia‑to‑RTL compiler: A fully automated pipeline that lowers high‑level Julia kernels to vendor‑agnostic SystemVerilog, requiring no extra directives or language extensions (see the sketch after this list).
  • Support for dynamic and static scheduling: The toolchain can emit hardware that either fixes the schedule at compile time (static) or resolves it at run time (dynamic), covering a wide range of algorithmic patterns.
  • AXI4‑Stream integration out‑of‑the‑box: Generated RTL includes ready‑to‑use AXI4‑Stream interfaces, simplifying connection to on‑chip memories, DMA engines, or other IP blocks.
  • Competitive performance: Benchmarks run at ~100 MHz on real FPGA boards, achieving 60 %–83 % of the throughput of hand‑tuned C/C++‑based HLS flows.
  • Vendor‑agnostic RTL output: The generated SystemVerilog can be fed to any major FPGA vendor’s synthesis tools (Xilinx, Intel, Lattice, etc.).
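
To make the pragma‑free claim concrete, here is a minimal sketch of the kind of plain Julia kernel such a flow is designed to accept. The function is illustrative, not taken from the paper, and its name and signature are hypothetical; note the absence of any hardware directives.

    # Hypothetical example: a plain Julia FIR filter with no pragmas or
    # annotations. A Julia-to-RTL flow as described would infer pipelining
    # and memory layout from code like this directly.
    function fir_filter(x::Vector{Float32}, h::Vector{Float32})
        n, k = length(x), length(h)
        y = zeros(Float32, n - k + 1)
        for i in 1:(n - k + 1)       # fixed trip count: statically schedulable
            acc = 0.0f0
            for j in 1:k             # inner multiply-accumulate loop
                acc += x[i + j - 1] * h[j]
            end
            y[i] = acc
        end
        return y
    end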

Methodology

  1. Front‑end parsing – Julia source is parsed using the existing Julia compiler front‑end, preserving high‑level type information and multiple dispatch semantics.
  2. MLIR lowering – The parsed Julia IR is translated into a series of MLIR dialects (standard, affine, and a custom “Julia‑HLS” dialect) that capture loops, memory accesses, and data‑flow.
  3. Scheduling & optimization – MLIR passes perform loop transformations (tiling, unrolling) and data‑dependency analysis, and decide whether a kernel should be statically scheduled (fixed pipeline) or dynamically scheduled (runtime control logic); the sketch after this list illustrates the distinction.
  4. Hardware emission – Optimized MLIR is lowered to a SystemVerilog dialect, automatically inserting AXI4‑Stream handshaking logic and generating RTL modules for arithmetic units, buffers, and control FSMs.
  5. Verification & synthesis – The produced SystemVerilog is compiled with vendor tools (e.g., Xilinx Vivado) to obtain timing‑closed designs; functional correctness is validated against software reference models.

The entire flow is scripted, so a Julia developer can run a single command (julia --project=fpga my_kernel.jl) and obtain a synthesizable hardware design.
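
The static/dynamic split in step 3 roughly tracks whether loop bounds and memory accesses are known at compile time. The hypothetical pair below illustrates the distinction; neither function comes from the paper.

    # Fixed bounds, affine accesses: amenable to a static schedule where
    # the pipeline is fixed at compile time.
    function saxpy!(y::Vector{Float32}, a::Float32, x::Vector{Float32})
        for i in eachindex(x)
            y[i] = a * x[i] + y[i]
        end
        return y
    end

    # Data-dependent trip count: the loop exit depends on runtime values,
    # so the generated hardware needs dynamic (handshake-based) scheduling
    # to stall and resume correctly.
    function find_first_above(x::Vector{Float32}, threshold::Float32)
        i = 1
        while i <= length(x) && x[i] <= threshold
            i += 1
        end
        return i  # length(x) + 1 if no element exceeds the threshold
    end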

Results & Findings

Benchmark                Frequency (MHz)   Throughput vs. C/C++ HLS*
FIR filter               100               82.6 %
FFT (radix‑2)            98                78.4 %
Matrix‑vector multiply   101               71.2 %
Polynomial evaluation    99                59.7 %

*Throughput measured as operations per second on the same FPGA device, using a state‑of‑the‑art C/C++ HLS toolchain (Xilinx Vitis HLS).

Key observations

  • No manual pragmas were needed; the compiler inferred pipeline depths and memory partitioning automatically.
  • The generated RTL met timing closure on mid‑range devices (e.g., Xilinx Artix‑7) with modest resource usage (≈30 % of LUTs for the larger kernels).
  • Dynamic scheduling kernels incurred a modest overhead (≈10 % lower throughput) but offered flexibility for data‑dependent workloads.

Practical Implications

  • Rapid prototyping: Data scientists and algorithm engineers can iterate on kernel logic in Julia, quickly see hardware performance estimates, and push the design to an FPGA without a separate HDL team.
  • Unified code base: Projects that already use Julia for simulation, testing, or GPU acceleration can now add an FPGA target without duplicating code or maintaining separate C/C++ kernels (see the sketch after this list).
  • Lower barrier to FPGA adoption: Small‑to‑medium companies, startups, and research labs can explore custom accelerators without hiring specialized RTL engineers, accelerating time‑to‑market for AI inference, signal processing, and scientific computing workloads.
  • Interoperability: Since the output is plain SystemVerilog with AXI4‑Stream interfaces, the kernels can be integrated into existing FPGA designs, mixed with vendor IP, or composed into larger SoC pipelines.
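
As a sketch of the unified‑code‑base point, the Julia kernel itself can serve as the software golden model from step 5 of the methodology. Everything here is illustrative: run_hw_sim is a hypothetical stand‑in for an RTL co‑simulation harness, and fir_filter refers to the earlier sketch, not to anything shipped with the toolchain.

    using Test

    # Hypothetical placeholder: a real harness would drive the generated
    # SystemVerilog through an RTL simulator and collect the AXI4-Stream
    # output. Here it just calls the software model so the test runs.
    run_hw_sim(x, h) = fir_filter(x, h)

    x = rand(Float32, 256)
    h = Float32[0.25, 0.5, 0.25]
    @test run_hw_sim(x, h) ≈ fir_filter(x, h)   # hardware matches golden model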

Limitations & Future Work

  • Performance gap: While 60 %–83 % of C/C++ HLS throughput is a strong result for a fully automatic flow, certain latency‑critical kernels still lag behind hand‑optimized designs.
  • Resource overhead: The generic hardware generation sometimes over‑allocates buffers or uses conservative pipeline depths, leading to higher LUT/BRAM usage than a manually tuned implementation.
  • Toolchain maturity: The current prototype supports a subset of Julia’s language features (no metaprogramming, limited support for complex data structures). Extending coverage to the full language will be necessary for broader adoption.
  • Dynamic scheduling cost: Runtime control logic adds latency; future work will explore hybrid static/dynamic scheduling heuristics to reduce this overhead.
  • Verification ecosystem: Automated formal verification of the generated RTL against the original Julia semantics is still an open research direction.

The authors plan to broaden the benchmark suite, integrate more aggressive MLIR optimizations, and open‑source the toolchain to foster community contributions.

Authors

  • Benedict Short
  • Ian McInerney
  • John Wickerson

Paper Information

  • arXiv ID: 2512.15679v1
  • Categories: cs.SE, cs.AR, cs.PL
  • Published: December 17, 2025