The 185-Microsecond Type Hint

Published: March 2, 2026 at 05:04 PM EST
6 min read

Source: Hacker News

How a “trivial” change yielded a 13× throughput increase

We recently released an open‑source Clojure implementation of Roughtime, a protocol for secure time synchronization with cryptographic proof.

What Roughtime does

  1. Client request – The client sends a random nonce to the server.
  2. Server response – The server replies with a signed certificate that includes:
    • the original nonce
    • a timestamp proving the response occurred after the request.
  3. Chaining – Responses can be chained together, providing provable ordering.
  4. Reliability check – If any server’s timestamps conflict with the established ordering, that server is cryptographically “outed” as unreliable.
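As a minimal sketch of step 1, a client request is essentially a cryptographically random nonce that the server must echo back; `rand-nonce` and `echoes-nonce?` below are hypothetical helpers for illustration, not part of our library's API:

```clojure
(import '(java.security SecureRandom))

;; Hypothetical helper: a Roughtime nonce is just cryptographically
;; random bytes (64 in Google's protocol, 32 in the IETF drafts).
(defn rand-nonce ^bytes [^long n]
  (let [buf (byte-array n)]
    (.nextBytes (SecureRandom.) buf)
    buf))

;; A valid response must echo the exact nonce it was sent.
(defn echoes-nonce? [^bytes sent ^bytes echoed]
  (java.util.Arrays/equals sent echoed))
```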

The rest of this post describes the specific optimisation that turned a modest implementation into a 13× faster one.

The Heavy Lifting

A single request to our server triggers a surprising amount of work:

1. Queueing

  1. Validation – The incoming request is validated.
  2. Received queue – After validation it enters a received queue.
  3. Batcher – The batcher pulls items from the received queue and creates batches that are dispatched to one of four worker queues.
  4. Worker processing – Each worker queue:
    • Decodes every request in the batch.
    • Groups requests into sub‑batches by protocol version.
    • Generates a response for each sub‑batch.
  5. Sender queue – Responses are placed in a sender queue, which un‑batches them and sends each reply back to the requesting client.
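The batcher stage above can be sketched with plain java.util.concurrent queues; the real implementation's channel types and batching policy may well differ:

```clojure
(import '(java.util.concurrent ArrayBlockingQueue TimeUnit))

;; Validated requests land here (sketch of the "received queue").
(def received-q (ArrayBlockingQueue. 1024))

;; Sketch of the batcher: block briefly for the first item, then drain
;; up to (max-batch - 1) more without waiting, forming one batch.
(defn next-batch [^ArrayBlockingQueue q max-batch]
  (when-let [first-req (.poll q 10 TimeUnit/MILLISECONDS)]
    (let [rest-reqs (java.util.ArrayList.)]
      (.drainTo q rest-reqs (dec max-batch))
      (into [first-req] rest-reqs))))
```

The validator would call `(.put received-q req)`, and each worker would repeatedly take `(next-batch received-q 64)` off the front.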

2. Protocol Compatibility

We support the entire evolution of the protocol—from Google’s original specification through all fifteen IETF drafts, i.e., sixteen versions. Consequently, the codebase contains version‑specific logic scattered throughout:

Aspect          | Variation by version
Version tags    | Different identifiers
Padding schemes | Multiple algorithms
Tag labels      | Varying formats
Hash sizes      | 128‑, 256‑, 512‑bit options
Packet layouts  | Distinct field orders

In many places, maintaining compatibility took precedence over elegance or raw performance.
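One way to keep that version-specific logic manageable is to dispatch on the protocol version. The multimethod sketch below uses made-up version keywords and sizes to show the pattern; it is not the library's actual dispatch table:

```clojure
;; Illustrative only: the keywords and sizes are invented to show the
;; dispatch pattern; the real code covers all sixteen versions.
(defmulti nonce-size
  "Nonce length in bytes for a given protocol version."
  identity)

(defmethod nonce-size :google        [_] 64)
(defmethod nonce-size :ietf-draft-08 [_] 32)
(defmethod nonce-size :default       [_] 32)
```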

3. Recursive Merkle Trees

Each batch is folded into a Merkle tree built with SHA‑512. This requires recursive hashing from the leaves up to the root, making it a purely CPU‑bound operation.
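A minimal version of that fold looks like the following; it assumes a power-of-two batch and uses the 0x00/0x01 leaf/node domain-separation prefixes the Roughtime spec calls for, but it is a sketch rather than our production code:

```clojure
(import '(java.security MessageDigest))

(defn sha-512 ^bytes [& chunks]
  (let [md (MessageDigest/getInstance "SHA-512")]
    (doseq [^bytes c chunks]
      (.update md c))
    (.digest md)))

;; Recursive fold from leaves to root. Assumes the batch size is a
;; power of two (ours is 64). Leaves are prefixed with 0x00 and
;; interior nodes with 0x01 for domain separation.
(defn merkle-root ^bytes [leaves]
  (loop [level (mapv #(sha-512 (byte-array [0x00]) %) leaves)]
    (if (= 1 (count level))
      (first level)
      (recur (mapv (fn [[l r]] (sha-512 (byte-array [0x01]) l r))
                   (partition 2 level))))))
```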

4. Ed25519 Signatures

Finally, every response is signed with Ed25519. Public‑key signatures are computationally expensive and typically dominate the overall latency of the system.
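Since JDK 15, this step needs no external crypto library; a sketch of the primitive (our server's actual signing code wraps more context around it):

```clojure
(import '(java.security KeyPairGenerator Signature))

;; Ed25519 ships with the JDK since Java 15 under the "Ed25519" name.
(def key-pair
  (.generateKeyPair (KeyPairGenerator/getInstance "Ed25519")))

(defn sign ^bytes [^bytes msg]
  (let [sig (Signature/getInstance "Ed25519")]
    (.initSign sig (.getPrivate key-pair))
    (.update sig msg)
    (.sign sig)))

(defn verified? [^bytes msg ^bytes signature]
  (let [sig (Signature/getInstance "Ed25519")]
    (.initVerify sig (.getPublic key-pair))
    (.update sig msg)
    (.verify sig signature)))
```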

The “Sluggish” Server

Given all that complexity, along with the fact that I’m using a high‑level dynamic programming language, I wasn’t surprised when my initial benchmarks showed the server responding in 200 µs.

I ran a profiler expecting to see SHA‑512 or Ed25519 dominating. Instead, nearly 90 % of the runtime was attributed to the most mundane line in the entire library:

(defn encode-rt-message [msg-map]
  (let [sorted-entries (sort-tags msg-map)
        tag-bytes      (mapv #(tag/tag->bytes (key %)) sorted-entries)
        val-bytes      (mapv #(tag/pad4 (val %)) sorted-entries)

        ;; THE BOTTLENECK:
        val-lens       (mapv alength val-bytes)

        ...])

This line is arguably the most trivial part of the codebase. It iterates over roughly five byte arrays and asks, “How long are you?”—and that’s it.

Yet this single line accounted for almost the entire request time.

The Fix

I wrapped alength in an anonymous function so I could add a type hint:

;; BEFORE (~31 µs)
(mapv alength val-bytes)

;; AFTER (~4 µs)
(mapv (fn [^bytes v] (alength v)) val-bytes)

Profiling the code with and without the type hint shows that the encoding time dropped from 31 µs to 4 µs.

Why Was (mapv alength …) So Slow?

Clojure emitted no reflection warnings when I ran my tests; the code is perfectly legal.
But mapv is a higher‑order function. It receives alength as a generic IFn object and calls invoke() on it for every item. This means that:

  • The compiler cannot inline the operation because the function is passed as a value.
  • alength itself must perform a runtime check (RT.alength) to ensure the argument is an array.
  • Finally, it calls java.lang.reflect.Array.getLength.

The overhead of dynamic dispatch, runtime type checking, and reflection adds up in a tight loop!
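You can see the effect in isolation at a REPL; the exact timings depend on your machine, but the hinted version is consistently several times faster while producing identical results:

```clojure
(let [arrs (vec (repeatedly 5 #(byte-array 100)))]
  ;; Generic path: alength is passed as a value and resolved at runtime.
  (time (dotimes [_ 100000] (mapv alength arrs)))
  ;; Hinted path: ^bytes lets the compiler emit an arraylength instruction.
  (time (dotimes [_ 100000] (mapv (fn [^bytes v] (alength v)) arrs))))
```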

What Changed?

(fn [^bytes v] (alength v))

When the function is written this way, the compiler has enough static information to emit a single arraylength bytecode instruction. We replaced a complex chain of method calls with one CPU instruction.

End‑to‑End Benchmark

To verify the impact, I ran a full end‑to‑end benchmark using Criterium’s quick-benchmark.

Test conditions

  • CPU: Apple M2
  • Workers: 4 parallel workers
  • Merkle batch size: 64
  • Crypto: Full (SHA‑512 + Ed25519)

Results

                  | Responses / sec | µs / response / core
Without type hint | 19 959          | 200.4
With type hint    | 264 316         | 15.1

That’s a 13 × throughput increase from a single type hint.

Visual comparison

Server throughput before and after the fix. The x‑axis shows the batch size on a logarithmic scale; the y‑axis shows the response rate.

Why Did the Speedup Get Larger?

In isolated tests the improvement was ~8×, and by Amdahl’s law we would expect a substantially smaller improvement in the real system. Surprisingly, the speedup grew to ~13× under load.

Working hypothesis

When multiple workers invoke the same reflective, non‑inlinable call site, the JVM cannot optimize it effectively. Removing that reflective barrier lets the JIT inline the call, allowing the code to parallelize cleanly.

Result

  • Better scaling under load
  • Higher overall speedup than predicted by Amdahl’s law

Further investigation is needed to confirm the hypothesis, but the observed behavior aligns with reduced contention in the reflective call path.

The Lesson

I learned that, when optimizing Clojure code, “no reflection warnings” is not always the end of the story. When you pass low‑level primitives through higher‑order interfaces, you may accidentally force the runtime back onto generic (and slower) paths. The compiler needs enough information to emit primitive bytecode.

In this case, the code I thought was complex—the crypto, Merkle trees, and protocol gymnastics—was fine. It was the “trivial” line that killed performance.

Without a profiler, I would never, ever have suspected it.

