The 128-arm pattern dispatch problem and how we got from 11ns to 1.1ns
Source: Dev.to
The Problem
Pattern matching on a large set of literal values looks clean in code but hits a wall at runtime. Every on() call constructs case objects for every arm.
With 128 arms, that is 128 object constructions per match call. At ~11 ns per call, this is fine for one‑off use. Inside a hot loop, it is a disaster.
```cpp
// Clean syntax, 128 case objects constructed per call
return match(x) | on(
    lit(0) >> handle_0,
    lit(1) >> handle_1,
    // … 126 more arms
    _ >> default_handler
);
```
The cases are compile‑time constants. The values do not change. But the library has to construct them every time because the on() call receives them as function arguments, and C++ evaluates arguments at the call site before any optimization pass can intervene.
This is not a bug in the library design. It is a structural cost of the pipeline syntax match(x) | on(...). The syntax gives you readable, composable matching. The cost is that the case list is a runtime object.
Or it was, until we found a way to make it not.
First Attempt: Static Caching with is_constant_evaluated()
The idea was simple. If the match expression appears in a constexpr context, the compiler already knows the subject value and all the case values. We can unfold the entire match at compile time and bypass the runtime machinery entirely.
C++20 gave us std::is_constant_evaluated(). When this returns true, the compiler is evaluating this code in a constexpr context. We can take the static path:
```cpp
template <typename Subject, typename CasesTuple>
constexpr auto dispatch(Subject& subject, CasesTuple&& cases) {
    if (std::is_constant_evaluated()) {
        // Compile-time path: unfold cases statically
        return static_dispatch(subject, std::forward<CasesTuple>(cases));
    }
    // Runtime path: use the normal evaluation engine
    return runtime_dispatch(subject, std::forward<CasesTuple>(cases));
}
```
Nice in theory. In practice, this only helps when the entire call site is constexpr. For runtime matches with compile‑time‑known case values, the cases are still constructed. The constexpr branch does not fire.
We needed something more aggressive: detect at compile time that every case in the list is a static literal, then build a dense dispatch table that the runtime evaluator can use as an O(1) lookup.
The Trap: [[no_unique_address]] and std::is_empty
Static literal patterns look like empty types:
```cpp
template <auto V, typename Cmp>
struct static_literal_pattern {
    static constexpr auto value = V;
    [[no_unique_address]] Cmp cmp{};
};
```
std::equal_to<> is an empty type. [[no_unique_address]] should allow the compiler to optimize cmp to zero size, making the whole pattern trivially small.
We used std::is_empty as the gate for static‑caching eligibility:
```cpp
// Naïve check: if the pattern is empty, we can cache it
if constexpr (std::is_empty_v<T>) {
    // … static cache path
}
```
This broke on GCC. The reason: [[no_unique_address]] is a permission, not a requirement, and std::is_empty does not look through it. On GCC, a class that declares a non-static data member is reported as non-empty even when that member's type is empty and the attribute lets it occupy no distinct storage. The Cmp member, despite being an empty type, prevented the pattern from satisfying std::is_empty.
The static‑cache path was silently skipped. The dense dispatch table was never built. The 128‑case match stayed at ~11 ns.
The Fix: A sizeof(T)-Based Trait
We switched to a size‑based trait that works regardless of [[no_unique_address]] handling:
```cpp
template <typename T>
constexpr bool is_cacheable_pattern_v = sizeof(T) == 1;
```
(Actual implementation details are omitted for brevity.)
With this change the library correctly identifies static literal patterns and builds the dense table.
The Numbers
Benchmarked on an Intel i7‑13700K, GCC 13.2, -O3 -march=native. Subject type: int, 128 literal arms, uniformly random input.
| Approach | Latency | Relative to switch |
|---|---|---|
| Original `on()` (constructs all cases) | 10.79 ns | 9.9× slower |
| `PTN_ON` macro (manual static, best practice) | 1.16 ns | 1.06× slower |
| Auto-cache (this fix) | 1.13 ns | 1.04× slower |
| Hand-written `switch` (theoretical lower bound) | 1.09 ns | baseline |
The auto‑cache approach matches the previous best practice (PTN_ON macro) and comes within 0.04 ns of a hand‑written switch. That gap comes from a branch‑predictor mispredict on the bounds check, which is not under library control.
For context: the original implementation was 9.9× slower than a hand-written switch, and the fix brings it to 1.04×. A nearly 10× improvement came from changing a single type-trait condition.
What This Says About Library Design
- Small implementation details matter. A single std::is_empty check caused a 10× slowdown on a major compiler.
- Relying on [[no_unique_address]] layout behavior can be fragile. The attribute is a permission, and compilers apply it differently; a portable fallback is essential.
- Static analysis of patterns can pay off. Detecting that all cases are compile-time literals lets us replace a generic runtime loop with a dense O(1) table.
- Performance-critical libraries should provide a "static-cache" path that works even when the surrounding code is not constexpr.
By adjusting the cache‑eligibility test to sizeof(T), we turned a hidden performance bug into a measurable win.
Pattern: In performance work, the clean theoretical property is often not the one that matters. The dirty empirical threshold is.
Try it
The implementation is open source at github.com/sentomk/patternia.
- Dispatch planner: include/ptn/core/common/optimize.hpp
- Static literal cache logic: include/ptn/core/common/eval.hpp
Header‑only, C++17, no dependencies beyond the standard library.