[Paper] AVX / NEON Intrinsic Functions: When Should They Be Used?
Source: arXiv - 2601.04922v1
Overview
The paper “AVX / NEON Intrinsic Functions: When Should They Be Used?” investigates the real‑world trade‑offs of hand‑written SIMD intrinsics versus relying on modern compilers’ auto‑vectorisation. By running a cross‑platform benchmark suite on a variety of operating systems, CPU architectures (x86‑64 with AVX/AVX2 and ARM with NEON) and compilers, the authors aim to give developers concrete guidance on when the extra effort of using intrinsics actually pays off.
Key Contributions
- Comprehensive benchmark suite covering a range of typical compute kernels (loops, reductions, conditional branches) across Windows, Linux, macOS, Android, and iOS.
- Quantitative comparison of three code paths: plain scalar C/C++, compiler‑auto‑vectorised C/C++, and hand‑written AVX/NEON intrinsics.
- Decision matrix that maps OS + architecture + compiler combinations to the expected benefit of intrinsics.
- Insightful case studies showing that intrinsics can shrink execution time to ~5 % of the scalar baseline for branch‑heavy kernels, while offering negligible gains for straight‑line arithmetic that compilers already vectorise efficiently.
- Practical recommendations for developers on when to invest in intrinsics and when to trust the compiler.
Methodology
- Kernel selection – The authors chose a representative set of micro‑benchmarks (e.g., element‑wise addition, dot product, image convolution, conditional filtering) that are common in graphics, signal processing, and machine‑learning workloads.
- Platform matrix – Tests were run on:
- x86‑64 CPUs supporting AVX/AVX2 (Intel i7‑7700K, AMD Ryzen 7 3700X)
- ARM Cortex‑A53/A57 cores with NEON support (Raspberry Pi 3, Qualcomm Snapdragon 845)
- OSes: Windows 10, Ubuntu 22.04, macOS 13, Android 13, iOS 16.
- Compiler configurations – GCC 12, Clang 15, MSVC 19, and the Android NDK's Clang were used with aggressive optimisation flags (`-O3 -march=native`, etc.).
- Three implementations per kernel (a minimal sketch of the scalar and intrinsic paths follows this list):
- Scalar – plain C/C++ loops.
- Auto‑vectorised – same source compiled with optimisation flags, letting the compiler generate SIMD instructions.
- Intrinsic – hand‑written AVX (for x86) or NEON (for ARM) intrinsics.
- Metrics – Execution time (median of 30 runs), instruction‑count via hardware performance counters, and binary size impact.
- Statistical analysis – Paired t‑tests to confirm significance of observed speed‑ups.
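The paper does not reproduce its kernel sources, so the following is only a minimal sketch, assuming a dot-product kernel, of what the scalar and hand-written AVX2 paths might look like (function names are illustrative; the auto-vectorised path compiles the same scalar source with `-O3 -march=native`):

```cpp
#include <immintrin.h>
#include <cstddef>

// Scalar baseline; the auto-vectorised variant compiles this same source
// with aggressive optimisation flags and lets the compiler emit SIMD.
float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Hand-written AVX2/FMA path: eight floats per iteration.
// Assumes n is a multiple of 8; a real kernel would add a scalar tail loop.
float dot_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);      // acc += va * vb
    }
    // Horizontal reduction of the eight partial sums.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```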
Results & Findings
| Kernel type | Typical speed‑up (intrinsic vs. scalar) | Intrinsic vs. auto‑vectorised |
|---|---|---|
| Straight‑line arithmetic (e.g., vector add) | 1.2 × – 1.5 × | ≤ 1.05 × (often identical) |
| Reductions (dot product) | 2 × – 3 × | 1.1 × – 1.3 × |
| Conditional branching (e.g., mask‑based filter) | ≈ 20 × (intrinsic ≈ 5 % of scalar) | 1.2 × – 1.6 × |
| Memory‑bound kernels (image convolution) | 1.5 × – 2 × | 1.0 × – 1.2 × |
- Auto‑vectorisation is surprisingly strong in modern compilers for pure arithmetic loops; hand‑written intrinsics rarely beat the auto‑vectorised code by more than 5 %.
- Branch‑heavy kernels benefit dramatically from intrinsics because compilers struggle to generate efficient predicated SIMD code (see the NEON sketch after these bullets).
- Binary size grows modestly (≈ 5‑10 % increase) when intrinsics are added, but the impact is negligible for most applications.
- Cross‑platform consistency – The decision matrix shows that on ARM/NEON, the same patterns hold, though the absolute speed‑ups are slightly lower due to narrower vector width (128‑bit vs. 256‑bit AVX).
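To make the branch-heavy result concrete, here is a minimal sketch, not taken from the paper, of a mask-based filter: on NEON the per-element branch becomes a lane-wise compare and select (function names are ours):

```cpp
#include <arm_neon.h>
#include <cstddef>

// Scalar version: one data-dependent branch per element.
void threshold_scalar(float* data, std::size_t n, float cutoff) {
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] < cutoff)
            data[i] = 0.0f;
}

// NEON version: the branch becomes a per-lane mask plus a select, so four
// elements are processed per iteration with no branch mispredictions.
// Assumes n is a multiple of 4 for brevity.
void threshold_neon(float* data, std::size_t n, float cutoff) {
    float32x4_t vcut  = vdupq_n_f32(cutoff);
    float32x4_t vzero = vdupq_n_f32(0.0f);
    for (std::size_t i = 0; i < n; i += 4) {
        float32x4_t v    = vld1q_f32(data + i);
        uint32x4_t  keep = vcgeq_f32(v, vcut);        // lane-wise v >= cutoff
        float32x4_t out  = vbslq_f32(keep, v, vzero); // keep v or write 0
        vst1q_f32(data + i, out);
    }
}
```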
Practical Implications
- When to write intrinsics:
- If your algorithm contains data‑dependent conditionals inside tight loops (e.g., per‑pixel masks, early‑exit checks).
- When profiling reveals that the compiler’s auto‑vectoriser falls back to scalar code for a hotspot.
- When to avoid intrinsics:
- Straightforward linear algebra, signal‑processing filters, or any compute‑bound loop that the compiler already vectorises.
- Projects that need to maintain a single codebase across x86 and ARM – the added maintenance cost of two intrinsic code paths may outweigh modest gains.
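As an illustration of that maintenance cost (a sketch of ours, not code from the paper), the same hotspot typically ends up with one intrinsic path per architecture plus a scalar fallback, all of which have to be kept in sync:

```cpp
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// One portable entry point, three bodies to maintain: this is the cost the
// paper weighs against the (often modest) gains over auto-vectorisation.
void add_arrays(float* out, const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i,
            _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4)
        vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
#endif
    for (; i < n; ++i)          // scalar tail / fallback
        out[i] = a[i] + b[i];
}
```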
- Tooling tip: use compiler flags such as `-ftree-vectorize -fopt-info-vec` (GCC), `-Rpass=loop-vectorize` (Clang), or `/Qvec-report:2` (MSVC) to see quickly which loops are auto‑vectorised. If a critical loop is not, consider an intrinsic rewrite or a higher‑level abstraction (e.g., OpenMP SIMD, ISPC).
- Performance‑first workflow:
- Write clean scalar code.
- Enable aggressive optimisation and verify auto‑vectorisation.
- Benchmark; if the hotspot is branch‑heavy and still scalar, prototype an intrinsic version.
- Measure; keep the intrinsic version only if the speed‑up exceeds ~1.5‑2× (or meets a latency budget).
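A minimal measurement sketch for the benchmark-and-measure steps, mirroring the paper's median-of-30-runs methodology (the harness and the kernel names in the usage comment are illustrative, not the authors' tooling):

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Run a kernel repeatedly and report the median wall-clock time in
// microseconds; 30 runs matches the paper's setup.
template <typename Kernel>
double median_runtime_us(Kernel&& kernel, int runs = 30) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        kernel();
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    // Median via partial sort.
    std::nth_element(samples.begin(),
                     samples.begin() + samples.size() / 2, samples.end());
    return samples[samples.size() / 2];
}

// Usage: compare the scalar and intrinsic versions of a hotspot and keep the
// intrinsic one only if the ratio clears your threshold (~1.5-2x), e.g.:
//   double t_scalar    = median_runtime_us([&] { threshold_scalar(buf, n, c); });
//   double t_intrinsic = median_runtime_us([&] { threshold_neon(buf, n, c); });
```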
Limitations & Future Work
- Benchmark scope – The study focuses on micro‑kernels; larger real‑world applications (e.g., full‑frame video codecs) may exhibit different scaling behaviours.
- Compiler versions – Only the latest stable releases were tested; older toolchains could change the balance between auto‑vectorisation and intrinsics.
- Hardware diversity – The ARM side is limited to a few mid‑range cores; high‑end ARM CPUs with SVE (Scalable Vector Extension) were not evaluated.
- Future directions suggested by the authors include extending the suite to SVE and AVX‑512, exploring auto‑vectorisation hints (pragma directives such as `#pragma omp simd`), and building a lightweight decision‑support tool that integrates with CI pipelines to flag when intrinsics might be beneficial.
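For reference, such a pragma hint looks like the following minimal sketch; with GCC and Clang, `-fopenmp-simd` honours the pragma without linking the OpenMP runtime:

```cpp
#include <cstddef>

// Ask the compiler to vectorise this loop even if its cost model would
// otherwise decline, without rewriting the body with intrinsics.
void scale_add(float* out, const float* in, float k, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] += k * in[i];
}
```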
Authors
- Théo Boivin
- Joeffrey Legaux
Paper Information
- arXiv ID: 2601.04922v1
- Categories: cs.SE
- Published: January 8, 2026