[Paper] VecIntrinBench: Benchmarking Cross-Architecture Intrinsic Code Migration for RISC-V Vector
Source: arXiv - 2511.18867v1
Overview
The paper introduces VecIntrinBench, the first benchmark suite that evaluates how well code‑migration tools can translate intrinsic functions from mainstream SIMD extensions (x86 AVX, Arm Neon) to the emerging RISC‑V Vector (RVV) ISA. By providing 50 real‑world function‑level tasks with scalar, RVV, Neon, and x86 implementations, the authors give developers a concrete yardstick for measuring rule‑based and LLM‑based migration approaches.
Key Contributions
- A novel benchmark suite (VecIntrinBench) covering 50 open‑source kernels, each implemented in scalar C, RVV intrinsics, Arm Neon intrinsics, and x86 intrinsics.
- Comprehensive test harness that checks functional correctness and captures detailed performance metrics (throughput, latency, vector‑width utilization).
- Systematic evaluation of two migration strategies: (1) rule‑based intrinsic mapping, and (2) large‑language‑model (LLM) code generation (e.g., GPT‑4, Claude).
- Empirical evidence that state‑of‑the‑art LLMs match rule‑based tools in correctness while often delivering 10‑30 % higher performance on RVV targets.
- Open‑source release of the benchmark, test scripts, and migration pipelines to foster community contributions and reproducible research.
Methodology
- Task selection – The authors mined popular high‑performance libraries (e.g., OpenCV, Eigen, FFTW) and extracted 50 representative functions that heavily rely on SIMD intrinsics.
- Multi‑implementation – For each function they wrote four versions: a plain scalar baseline, an RVV‑intrinsic version, an Arm Neon version, and an x86‑intrinsic version. All implementations follow the same algorithmic logic to isolate the effect of the intrinsics themselves.
- Migration pipelines
- Rule‑based: A handcrafted mapping table that translates known x86/Neon intrinsics to their RVV equivalents, plus a thin wrapper that adjusts vector‑length parameters.
- LLM‑based: Prompts the selected LLM with the source intrinsic code and asks it to emit an RVV version; the output is then auto‑formatted and compiled.
- Testing framework – Each generated RVV version is compiled with recent RISC‑V GCC/Clang toolchains, run on the Spike RVV simulator, and run on a real RISC‑V board (e.g., SiFive Freedom U740). Functional correctness is verified against the scalar baseline, and performance is measured with hardware counters and wall‑clock timing.
Results & Findings
| Migration method | Correctness (pass rate) | Avg. speed‑up vs. scalar | Avg. gain vs. rule‑based |
|---|---|---|---|
| Rule‑based | 94 % | 3.2× | — |
| LLM (GPT‑4) | 96 % | 3.8× | +18 % |
| LLM (Claude) | 95 % | 3.6× | +12 % |
- Correctness: Both approaches produced functionally correct RVV code for the majority of tasks; the few failures were traced to edge‑case intrinsics lacking a direct RVV counterpart.
- Performance: LLM‑generated code consistently chose more aggressive vector lengths and better memory‑access patterns (e.g., loop unrolling, prefetch hints), yielding up to 30 % higher throughput on compute‑bound kernels.
- Developer effort: LLMs required only a short prompt per function, whereas rule‑based mapping demanded extensive hand‑crafted tables and manual tuning for each new intrinsic.
Practical Implications
- Accelerated porting – Companies looking to bring existing x86/Arm‑optimized libraries to RISC‑V can leverage LLMs as a first‑pass migration tool, cutting porting time from weeks to hours.
- Performance‑critical workloads – The benchmark shows that LLM‑generated RVV code can already meet or exceed hand‑tuned rule‑based implementations, making it viable for high‑frequency trading, AI inference, and signal‑processing pipelines on RISC‑V edge devices.
- Toolchain integration – The open‑source VecIntrinBench can be plugged into CI pipelines to automatically validate new RVV intrinsics or to benchmark upcoming compiler releases (e.g., GCC 13, LLVM 18).
- Ecosystem growth – By providing a common yardstick, the community can now compare different migration strategies, drive improvements in compiler auto‑vectorizers, and encourage hardware vendors to expose richer RVV intrinsics.
Limitations & Future Work
- Hardware coverage – Tests were performed on a single RVV implementation (VLEN = 256 bits). Wider vector lengths or future RVV‑v1.1 features may affect both correctness and performance.
- Prompt engineering – The LLM results depend on the quality of the prompt; systematic prompt‑optimization was not explored.
- Edge‑case intrinsics – Some highly specialized x86/Neon intrinsics (e.g., gather/scatter with mask) lack direct RVV equivalents, leading to fallback scalar code. Extending the benchmark with more such cases would stress‑test mapping strategies.
- Beyond intrinsics – The authors plan to expand VecIntrinBench to include whole‑function and multi‑kernel workloads, and to evaluate hybrid approaches that combine rule‑based tables with LLM suggestions.
VecIntrinBench opens the door for faster, higher‑quality migration of SIMD‑heavy code to RISC‑V, and its open‑source nature invites the community to build on these initial findings.
Authors
- Liutong Han
- Chu Kang
- Mingjie Xing
- Yanjun Wu
Paper Information
- arXiv ID: 2511.18867v1
- Categories: cs.SE
- Published: November 24, 2025