[Paper] AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation
Source: arXiv - 2601.22760v1
Overview
Deep learning performance hinges on highly tuned kernels that run on specialized hardware. While large language models (LLMs) have shown promise in auto‑generating GPU kernels, doing the same for Huawei’s Ascend NPU has remained a hard problem because Ascend’s programming model (AscendC) is opaque and poorly documented. AscendCraft tackles this gap by introducing a lightweight domain‑specific language (DSL) that captures the essential semantics of Ascend kernels, then using LLMs to “transcompile” DSL code into fully functional AscendC kernels.
Key Contributions
- DSL abstraction for AscendC – a concise, human‑readable language that hides low‑level boilerplate while exposing the execution semantics unique to Ascend NPUs.
- Two‑stage generation pipeline – (1) LLM produces DSL code from high‑level operator descriptions; (2) a constraint‑driven LLM lowering pass translates DSL into optimized AscendC.
- High success rates – 98.1% of generated kernels compile, and 90.4% pass functional correctness tests on a diverse benchmark suite (MultiKernelBench).
- Competitive performance – 46.2% of the generated kernels meet or beat the runtime of PyTorch’s eager execution on the same hardware.
- Demonstrated extensibility – the system successfully generated kernels for a brand‑new “mHC” architecture, outperforming existing PyTorch implementations.
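The two‑stage pipeline above can be sketched in ordinary Python. Everything here is a stand‑in: `generate_dsl`, `lower_to_ascendc`, the toy DSL string, and the constraint check are hypothetical placeholders for the paper's LLM prompts and grammar, which are not reproduced in this summary.

```python
# Sketch of a two-stage DSL-to-AscendC generation loop (hypothetical API).
# Stage 1: an LLM drafts DSL code from an operator description.
# Stage 2: a constraint-checked lowering pass emits AscendC, retrying
# when a constraint (e.g., tiling or memory-copy legality) is violated.

def generate_dsl(op_description: str) -> str:
    # Placeholder for the few-shot LLM call that emits DSL code.
    return f"tile_loop(load(A), load(B), vec_add, store(C))  # from: {op_description}"

def check_constraints(ascendc_src: str) -> bool:
    # Placeholder; the real pass validates register usage, vector width,
    # and memory-hierarchy rules. Here we just require a DataCopy stage.
    return "DataCopy" in ascendc_src

def lower_to_ascendc(dsl_src: str) -> str:
    # Placeholder for the constraint-driven lowering LLM pass.
    return ('extern "C" __global__ __aicore__ void add_kernel() {\n'
            "    // DataCopy GM -> UB, vectorized Add, DataCopy UB -> GM\n"
            "    // lowered from: " + dsl_src.split("#")[0].strip() + "\n"
            "}")

def transcompile(op_description: str, max_retries: int = 3) -> str:
    dsl = generate_dsl(op_description)
    for _ in range(max_retries):
        kernel = lower_to_ascendc(dsl)
        if check_constraints(kernel):
            return kernel
    raise RuntimeError("lowering failed constraint checks")

kernel_src = transcompile("elementwise add of two fp16 tensors")
```

The retry loop is the key design point: lowering is not a single free‑form generation but a checked, repairable step, which is what makes the reported 98.1% compilation rate plausible.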
Methodology
- Designing the DSL – The authors distilled AscendC’s core concepts (tensor tiling, memory hierarchy, vector instructions) into a small set of high‑level primitives. The DSL is deliberately “lightweight” so that LLMs can learn it from a modest set of expert examples.
- Prompt engineering & example selection – For each operator category (e.g., convolution, matrix‑multiply, element‑wise), a curated collection of DSL snippets is fed to the LLM as few‑shot demonstrations. This guides the model to produce syntactically correct DSL code that respects Ascend’s execution model.
- Structured transcompilation – A second LLM pass receives the DSL output together with a set of formal constraints (e.g., register limits, vector width). It incrementally lowers the DSL into AscendC, inserting the necessary pragmas, memory copies, and loop nests while checking each step against the constraints.
- Automated verification – Generated AscendC kernels are compiled with the Ascend toolchain, executed on a real Ascend device, and compared against reference outputs to assess functional correctness. Performance is measured against PyTorch eager execution on identical inputs.
Results & Findings
| Metric | Value |
|---|---|
| Compilation success | 98.1% |
| Functional correctness (passes reference tests) | 90.4% |
| Kernels ≥ PyTorch eager performance | 46.2% |
| Operators covered (7 categories) | Convolution, GEMM, Pooling, Element‑wise, Softmax, Normalization, Reduction |
| New architecture (mHC) kernels generated | 2 correct kernels, both substantially faster than PyTorch eager |
What this means:
- The DSL effectively bridges the “semantic gap” between high‑level operator intent and Ascend’s low‑level programming model.
- LLMs, when guided by a well‑structured DSL and constraint checks, can produce not only syntactically correct code but also performant kernels for a hardware platform that previously resisted automated generation.
- The approach scales across operator families and even adapts to novel hardware extensions without hand‑written kernels.
Practical Implications
- Accelerated kernel development cycles – Teams can prototype custom operators for Ascend NPUs in hours rather than weeks, freeing expert compiler engineers for higher‑level optimizations.
- Lower barrier to entry – Start‑ups and research labs lacking deep AscendC expertise can still tap the performance benefits of Ascend hardware by leveraging AscendCraft’s DSL + LLM workflow.
- Integration into CI pipelines – The high compilation success rate makes it feasible to automatically generate and test kernels as part of continuous integration, ensuring that new model variants always have an optimized implementation ready.
- Portability to other NPUs – The DSL‑first philosophy is hardware‑agnostic; with modest adjustments to the DSL’s primitives, the same pipeline could target other proprietary accelerators (e.g., Cambricon, Graphcore).
- Cost‑effective performance tuning – Since the generated kernels already beat eager PyTorch in many cases, developers can achieve near‑optimal inference speed without manual hand‑tuning, reducing cloud compute expenses.
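The CI idea above can be made concrete with a small gate function. The generator and checker below are toy stand‑ins; a real gate would call the AscendCraft pipeline and then the Ascend toolchain plus on‑device correctness tests.

```python
def ci_kernel_gate(op_names, generate_kernel, check_kernel):
    """Hypothetical CI step: generate a kernel for every operator a model
    needs and collect those that fail compilation/correctness checks;
    a non-empty list fails the build."""
    return [op for op in op_names if not check_kernel(generate_kernel(op))]

# Toy stand-ins for the generation pipeline and its checker.
gen = lambda op: f"// generated AscendC kernel for {op}"
chk = lambda src: src.startswith("// generated")
failures = ci_kernel_gate(["add", "gemm", "softmax"], gen, chk)
```

With compilation succeeding 98.1% of the time, such a gate would rarely block merges spuriously, and failures would pinpoint exactly which operators need expert attention.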
Limitations & Future Work
- Coverage gaps – Although seven operator categories were tested, more exotic kernels (e.g., custom attention mechanisms) remain unverified.
- Performance ceiling – While 46.2% of kernels match or exceed PyTorch eager, they still fall short of hand‑optimized kernels that exploit every micro‑architectural nuance.
- Reliance on prompt quality – The system’s success hinges on well‑crafted few‑shot examples; scaling to entirely new domains may require additional prompt engineering effort.
- Hardware‑specific constraints – The current constraint set is tailored to Ascend; extending to other NPUs will need new constraint models and possibly richer DSL constructs.
- Future directions include expanding the DSL to cover control‑flow constructs, integrating reinforcement‑learning‑based kernel tuning, and open‑sourcing the pipeline to foster community‑driven extensions.
Authors
- Zhongzhen Wen
- Shudi Shao
- Zhong Li
- Yu Ge
- Tongtong Xu
- Yuanyi Lin
- Tian Zhang
Paper Information
- arXiv ID: 2601.22760v1
- Categories: cs.DC, cs.LG, cs.PF, cs.SE
- Published: January 30, 2026