Faster C software with Dynamic Feature Detection

Published: March 4, 2026
6 min read

Source: Hacker News


I’ve been building some software recently whose performance is very sensitive to the capabilities of the CPU on which it’s running. A portable version of the code does not perform all that well, but we cannot guarantee the presence of the optional Instruction Set Architecture (ISA) extensions that could speed it up. What to do? That’s what we’ll be looking at today, mostly for the wildly popular x86‑64 family of processors (but the general techniques apply anywhere).

Make it the compiler’s problem


Compilers are very good at optimizing for a particular target CPU micro‑architecture—e.g. -march=native or -march=znver3.
They know the ISA capabilities of the target CPU and will quietly take advantage of them, at the cost of portability.

So the first way to speed up C software is to build for a newer architecture where the compiler has the tools to make the code faster for you. This won’t work for every problem or scenario, but if it’s an option it’s very easy.

Why it works well on x86‑64

x86‑64 is now a very mature architecture, but there is a wide span of capabilities between the original x86‑64 CPUs and the chips you can buy today. To make this more digestible, AMD, Intel, Red Hat, and SUSE jointly defined micro‑architecture levels in the x86‑64 psABI; each later level includes all the features of its predecessors.

| Level | Contains (e.g.) | Intel (year) | AMD (year) |
| --- | --- | --- | --- |
| x86‑64‑v1 (base) | all 64‑bit CPUs | 2003 (first x86‑64) | 2003 (first x86‑64) |
| x86‑64‑v2 | POPCNT, SSE4.2 | 2008 (Nehalem/Westmere) | 2011 (Bulldozer) |
| x86‑64‑v3 | AVX2, BMI2 | 2013 (Haswell/Broadwell) | 2015 (Excavator) |
| x86‑64‑v4 | AVX‑512¹ | 2017 (Skylake) | 2022 (Zen 4) |

¹ AVX‑512 is not a single extension but a family of them; v4 includes the most useful parts.

Gotchas

  • Some instructions have slow implementations on older silicon (e.g. PEXT/PDEP in BMI2 on AMD before Zen 3).
  • Intel’s market segmentation is aggressive: consumer‑grade AVX‑512 chips are practically nonexistent, and lower‑cost CPUs often lack newer features.

In general, the micro‑architecture levels give you a good baseline for optimization. There are two common ways to use them:

  1. Build for the lowest common denominator in your fleet (today that’s usually v3 or v4).
  2. Build multiple binaries – one for newer processors and one for older ones.

The second approach is less ideal if you don’t control all the hardware you’ll run on. Fortunately, popular compilers provide a solution: indirect functions (IFUNCs).

Using IFUNCs with GCC/Clang

IFUNCs let the dynamic linker choose the best implementation at load time. With recent GCC and Clang you can let the compiler generate the resolver automatically:

[[gnu::target_clones("avx2,default")]]   // C23 attribute syntax (GCC/Clang)
void *my_func(void *data) {
    /* ... */
}

The equivalent pre‑C23 syntax is __attribute__((target_clones("avx2,default"))).

What happens:

  • Two versions of my_func are emitted – one compiled with -mavx2 and one with the default flags.
  • The compiler also generates a resolver function that the dynamic linker calls at program start‑up.
  • Calls to my_func are then bound to the version best suited to the current CPU.

If you’re lucky, this alone gives you a noticeable speed‑up. If not, you may need to coax the compiler into autovectorisation (e.g., by adding alignment annotations or small code tweaks). That process can be finicky and is beyond the scope of this short guide.

Manual optimisation with Intrinsics

Sometimes you need to write multiple versions of an algorithm to get the best performance.
Either you can’t rely on the compiler’s autovectorisation (e.g., for SIMD) or you need to use specific intrinsics (as I do for this project).

To take advantage of intrinsics directly, we provide two versions of an algorithm:

  1. a portable implementation, and
  2. an implementation that uses the intrinsics.

Statically selecting AVX2

#ifdef __AVX2__          // defined by the compiler when AVX2 is supported
  #include <immintrin.h> // header with AVX2 intrinsics

  void *my_func(void *data) { /* AVX2 version */ }
#else
  void *my_func(void *data) { /* portable version */ }
#endif

With this technique we can still build for a specific target while gaining direct access to the intrinsics that make things faster. However, we would like to avoid compiling separate binaries for each target.

Enabling intrinsics per‑function (gcc/clang)

There is no fully portable way to do this, but gcc and clang provide useful extensions.

/* Ask the compiler to enable AVX2 for the following code */
#pragma GCC push_options
#pragma GCC target ("avx2")
#pragma clang attribute push \
  (__attribute__((target("avx2"))), apply_to = function)

/* Include the header with AVX2 enabled */
#include <immintrin.h>

/* Undo the option change so the rest of the translation unit stays portable */
#pragma GCC pop_options
#pragma clang attribute pop

/* ------------------------------------------------------------------ */
/* Functions compiled with AVX2 */
[[gnu::target("avx2")]]
void *my_func_avx2(void *data) { /* AVX2 implementation */ }

/* Portable fallback */
void *my_func_portable(void *data) { /* generic implementation */ }

Runtime dispatch

Because we are limiting ourselves to gcc/clang on x86‑64, we can use the compiler‑provided runtime‑CPU detection to choose the appropriate implementation:

void *my_func(void *data) {
    return __builtin_cpu_supports("avx2") ?
           my_func_avx2(data) :
           my_func_portable(data);
}

Using IFUNC (indirect functions)

An alternative is to let the dynamic linker resolve the best version via an IFUNC resolver. This requires a small amount of extra boilerplate:

/* Resolver called by the dynamic linker */
static void *(*resolve_my_func(void))(void *) {
    __builtin_cpu_init();               // required here: IFUNC resolvers can run
                                        // before the startup code that would
                                        // otherwise call this for us
    return __builtin_cpu_supports("avx2") ?
           my_func_avx2 :
           my_func_portable;
}

/* The public entry point – the linker will replace it with the resolver’s result */
void *my_func(void *data) __attribute__((ifunc("resolve_my_func")));

The resolver can contain any logic you like, allowing you to support many different variants (e.g., AMD BMI2 before Zen 3, Intel AVX‑512 on Ice Lake, etc.). At program start‑up, the best implementation is selected automatically.


Feel free to adapt the patterns above to the specific intrinsics and CPUs you target.

Notes

  • musl libc does not (yet) support IFUNCs; it’s not a simple feature to implement.
  • I haven’t said a word about Windows support. I do not have a Windows machine to test on, and in any case the project I’m doing this for is written in C23, while the compiler of choice for Windows (outside of WSL), MSVC, supports most of C11.
  • You’d be forgiven for thinking Microsoft doesn’t actually want people to port C software to Windows!