The 128-arm pattern dispatch problem and how we got from 11ns to 1.1ns
Source: Dev.to
The Problem
Pattern matching on a large set of literal values looks clean in code but hits a wall at runtime. Every on() call constructs case objects for every arm.
With 128 arms, that is 128 object constructions per match call. At ~11 ns per call, this is fine for one‑off use. Inside a hot loop, it is a disaster.
```cpp
// Clean syntax, 128 case objects constructed per call
return match(x) | on(
    lit(0) >> handle_0,
    lit(1) >> handle_1,
    // … 126 more arms
    _ >> default_handler
);
```
The cases are compile‑time constants. The values do not change. But the library has to construct them every time because the on() call receives them as function arguments, and C++ evaluates arguments at the call site before any optimization pass can intervene.
This is not a bug in the library design. It is a structural cost of the pipeline syntax match(x) | on(...). The syntax gives you readable, composable matching. The cost is that the case list is a runtime object.
Or it was, until we found a way to make it not.
First Attempt: Static Caching with is_constant_evaluated()
The idea was simple. If the match expression appears in a constexpr context, the compiler already knows the subject value and all the case values. We can unfold the entire match at compile time and bypass the runtime machinery entirely.
C++20 gave us std::is_constant_evaluated(). When this returns true, the compiler is evaluating this code in a constexpr context. We can take the static path:
```cpp
template <typename Subject, typename CasesTuple>
constexpr auto dispatch(Subject& subject, CasesTuple&& cases) {
    if (std::is_constant_evaluated()) {
        // Compile-time path: unfold cases statically
        return static_dispatch(subject, std::forward<CasesTuple>(cases));
    }
    // Runtime path: use the normal evaluation engine
    return runtime_dispatch(subject, std::forward<CasesTuple>(cases));
}
```
Nice in theory. In practice, this only helps when the entire call site is constexpr. For runtime matches with compile‑time‑known case values, the cases are still constructed. The constexpr branch does not fire.
We needed something more aggressive: detect at compile time that every case in the list is a static literal, then build a dense dispatch table that the runtime evaluator can use as an O(1) lookup.
The Trap: [[no_unique_address]] and std::is_empty
Static literal patterns look like empty types:
```cpp
template <auto V, typename Cmp>
struct static_literal_pattern {
    static constexpr auto value = V;
    [[no_unique_address]] Cmp cmp{};
};
```
std::equal_to<> is an empty type. [[no_unique_address]] should allow the compiler to optimize cmp to zero size, making the whole pattern trivially small.
We used std::is_empty as the gate for static‑caching eligibility:
```cpp
// Naïve check: if the pattern is empty, we can cache it
if constexpr (std::is_empty_v<T>) {
    // … static cache path
}
```
This broke on GCC. The reason: [[no_unique_address]] is a permission, not a requirement, and std::is_empty does not look through it. On GCC, a class that declares a non-static data member is reported as non-empty even when that member's type is empty and the attribute lets it occupy no distinct storage. The Cmp member, despite being an empty type, prevented the pattern from satisfying std::is_empty.
The static‑cache path was silently skipped. The dense dispatch table was never built. The 128‑case match stayed at ~11 ns.
The Fix: A sizeof(T)-Based Trait
We switched to a size‑based trait that works regardless of [[no_unique_address]] handling:
```cpp
template <typename T>
constexpr bool is_cacheable_pattern_v = sizeof(T) == 1;
```
(Actual implementation details are omitted for brevity.)
With this change the library correctly identifies static literal patterns and builds the dense table.
The Numbers
Benchmarked on an Intel i7‑13700K, GCC 13.2, -O3 -march=native. Subject type: int, 128 literal arms, uniformly random input.
| Approach | Latency | Relative to switch |
|---|---|---|
| Original `on()` (constructs all cases) | 10.79 ns | 9.9× slower |
| `PTN_ON` macro (manual static, best practice) | 1.16 ns | 1.06× slower |
| Auto-cache (this fix) | 1.13 ns | 1.04× slower |
| Hand-written `switch` (theoretical lower bound) | 1.09 ns | baseline |
The auto‑cache approach matches the previous best practice (PTN_ON macro) and comes within 0.04 ns of a hand‑written switch. That gap comes from a branch‑predictor mispredict on the bounds check, which is not under library control.
For context: the original implementation was 9.9× slower than a hand-written switch, and the fix brings it to 1.04×. A nearly 10× improvement came from changing a single type-trait condition.
What This Says About Library Design
- Small implementation details matter. A single std::is_empty check caused a 10× slowdown on a major compiler.
- Relying on [[no_unique_address]] layout behavior can be fragile. The attribute is a permission, and compilers apply it differently; a portable fallback is essential.
- Static analysis of patterns can pay off. Detecting that all cases are compile-time literals lets us replace a generic runtime loop with a dense O(1) table.
- Performance-critical libraries should provide a "static-cache" path that works even when the surrounding code is not constexpr.
By adjusting the cache‑eligibility test to sizeof(T), we turned a hidden performance bug into a measurable win.
Pattern: In performance work, the clean theoretical property is often not the one that matters. The dirty empirical threshold is.
Try it
The implementation is open source at github.com/sentomk/patternia.
- Dispatch planner: include/ptn/core/common/optimize.hpp
- Static literal cache logic: include/ptn/core/common/eval.hpp
Header‑only, C++17, no dependencies beyond the standard library.