The 185-Microsecond Type Hint
Source: Hacker News
How a “trivial” change yielded a 13× throughput increase
We recently released an open‑source Clojure implementation of Roughtime, a protocol for secure time synchronization with cryptographic proof.
What Roughtime does
- Client request – The client sends a random nonce to the server.
- Server response – The server replies with a signed certificate that includes:
  - the original nonce
  - a timestamp proving the response occurred after the request.
- Chaining – Responses can be chained together, providing provable ordering.
- Reliability check – If any server’s timestamps conflict with the established ordering, that server is cryptographically “outed” as unreliable.
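The chaining idea can be sketched in a few lines. This is an illustrative sketch, not the exact wire format: the names `sha-512` and `next-nonce` are my own, and the real protocol adds details (blinding, truncation in later drafts) that are omitted here. The key point is that each new nonce commits to the previous response, so a signed timestamp that contradicts the chain is provable misbehavior.

```clojure
(import '(java.security MessageDigest))

;; Hash two byte arrays together with SHA-512.
(defn sha-512 ^bytes [^bytes a ^bytes b]
  (let [md (MessageDigest/getInstance "SHA-512")]
    (.update md a)
    (.update md b)
    (.digest md)))

(defn next-nonce
  "Nonce for the next request, derived from a fresh random blind and
   the previous server's signed response (illustrative, simplified)."
  ^bytes [^bytes blind ^bytes prev-response]
  (sha-512 blind prev-response))
```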
The rest of this post describes the specific optimization that turned a modest implementation into one 13× faster.
The Heavy Lifting
A single request to our server triggers a surprising amount of work:
1. Queueing
- Validation – The incoming request is validated.
- Received queue – After validation it enters a received queue.
- Batcher – The batcher pulls items from the received queue and creates batches that are dispatched to one of four worker queues.
- Worker processing – Each worker queue:
  - Decodes every request in the batch.
  - Groups requests into sub‑batches by protocol version.
  - Generates a response for each sub‑batch.
- Sender queue – Responses are placed in a sender queue, which un‑batches them and sends each reply back to the requesting client.
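The worker stage of that pipeline can be sketched with core.async channels (assuming core.async is on the classpath). Channel names, buffer sizes, and the `decode`/`respond` functions are illustrative, not the library's actual API:

```clojure
(require '[clojure.core.async :as async])

(def received-q (async/chan 1024))                      ; validated requests
(def worker-qs  (vec (repeatedly 4 #(async/chan 256)))) ; batches for 4 workers
(def sender-q   (async/chan 1024))                      ; finished responses

(defn start-worker!
  "Decode each request in a batch, sub-batch by protocol version,
   and emit one response per sub-batch onto the sender queue."
  [worker-q decode respond]
  (async/go-loop []
    (when-let [batch (async/<! worker-q)]
      (doseq [[version sub-batch] (group-by :version (map decode batch))]
        (async/>! sender-q (respond version sub-batch)))
      (recur))))
```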
2. Protocol Compatibility
We support the entire evolution of the protocol—from Google’s original specification through all fifteen IETF drafts, i.e., sixteen versions. Consequently, the codebase contains version‑specific logic scattered throughout:
| Aspect | Variation by Version |
|---|---|
| Version tags | Different identifiers |
| Padding schemes | Multiple algorithms |
| Tag labels | Varying formats |
| Hash sizes | 128‑, 256‑, 512‑bit options |
| Packet layouts | Distinct field orders |
In many places, maintaining compatibility took precedence over elegance or raw performance.
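One common way to keep that kind of version-specific logic contained is a multimethod keyed on the protocol version. This is a sketch; the version keywords and digest sizes below are illustrative, not the library's actual mapping:

```clojure
(defmulti hash-size
  "Digest length in bytes for a given protocol version (illustrative)."
  identity)

(defmethod hash-size :google  [_] 64)  ; e.g. a full SHA-512 digest
(defmethod hash-size :draft-5 [_] 32)  ; e.g. a truncated digest
(defmethod hash-size :default [_] 32)
```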
3. Recursive Merkle Trees
Each batch is folded into a Merkle tree built with SHA‑512. This requires recursive hashing from the leaves up to the root, making it a purely CPU‑bound operation.
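A minimal sketch of that recursive fold, assuming the leaf hashes are already computed and the batch size is a power of two. The domain-separation prefix bytes Roughtime uses to distinguish leaves from interior nodes are omitted for brevity:

```clojure
(import '(java.security MessageDigest))

;; Hash an adjacent pair of nodes into their parent with SHA-512.
(defn hash-pair ^bytes [^bytes left ^bytes right]
  (let [md (MessageDigest/getInstance "SHA-512")]
    (.update md left)
    (.update md right)
    (.digest md)))

;; Fold one level at a time until a single root remains.
(defn merkle-root ^bytes [leaves]
  (if (= 1 (count leaves))
    (first leaves)
    (recur (mapv (fn [[l r]] (hash-pair l r))
                 (partition 2 leaves)))))
```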
4. Ed25519 Signatures
Finally, every response is signed with Ed25519. Public‑key signatures are computationally expensive and typically dominate the overall latency of the system.
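For reference, here is a minimal Ed25519 sign/verify round trip using the JDK's built-in provider (available since JDK 15). The library itself may well use a different crypto provider; the function names here are my own:

```clojure
(import '(java.security KeyPairGenerator Signature))

(defn ed25519-keypair []
  (.generateKeyPair (KeyPairGenerator/getInstance "Ed25519")))

(defn ed25519-sign ^bytes [key-pair ^bytes msg]
  (let [sig (Signature/getInstance "Ed25519")]
    (.initSign sig (.getPrivate ^java.security.KeyPair key-pair))
    (.update sig msg)
    (.sign sig)))

(defn ed25519-verify? [key-pair ^bytes msg ^bytes signature]
  (let [sig (Signature/getInstance "Ed25519")]
    (.initVerify sig (.getPublic ^java.security.KeyPair key-pair))
    (.update sig msg)
    (.verify sig signature)))
```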
The “Sluggish” Server
Given all that complexity, along with the fact that I’m using a high‑level dynamic programming language, I wasn’t surprised when my initial benchmarks showed the server responding in 200 µs.
I ran a profiler expecting to see SHA‑512 or Ed25519 dominating. Instead, nearly 90 % of the runtime was attributed to the most mundane line in the entire library:
```clojure
(defn encode-rt-message [msg-map]
  (let [sorted-entries (sort-tags msg-map)
        tag-bytes      (mapv #(tag/tag->bytes (key %)) sorted-entries)
        val-bytes      (mapv #(tag/pad4 (val %)) sorted-entries)
        ;; THE BOTTLENECK:
        val-lens       (mapv alength val-bytes)
        ...]))
```
This line is arguably the most trivial part of the codebase. It iterates over roughly five byte arrays and asks, “How long are you?”—and that’s it.
Yet this single line accounted for almost the entire request time.
The Fix
I wrapped alength in an anonymous function so I could add a type hint:
```clojure
;; BEFORE (~31 µs)
(mapv alength val-bytes)

;; AFTER (~4 µs)
(mapv (fn [^bytes v] (alength v)) val-bytes)
```
Profiling the code with and without the type hint shows that the encoding time dropped from 31 µs to 4 µs.
Why Was (mapv alength …) So Slow?
Clojure emitted no reflection warnings when I ran my tests; the code is perfectly legal.
But mapv is a higher‑order function. It receives alength as a generic IFn object and calls invoke() on it for every item. This means that:
- The compiler cannot inline the operation because the function is passed as a value.
- alength itself must perform a runtime check (RT.alength) to ensure the argument is an array.
- Finally, it calls java.lang.reflect.Array.getLength.
The overhead of dynamic dispatch, runtime type checking, and reflection adds up in a tight loop!
What Changed?
```clojure
(fn [^bytes v] (alength v))
```
When the function is written this way, the compiler has enough static information to emit a single arraylength bytecode instruction. We replaced a complex chain of method calls with one CPU instruction.
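You can see the two paths side by side at the REPL. This is a crude `time`-based check (use Criterium for anything rigorous); both expressions produce identical results, only the dispatch path differs:

```clojure
(let [arrs (vec (repeatedly 5 #(byte-array 64)))]
  ;; generic path: alength passed as a value, dispatched through RT.alength
  (time (dotimes [_ 1000000] (mapv alength arrs)))
  ;; hinted path: the compiler emits the arraylength instruction directly
  (time (dotimes [_ 1000000] (mapv (fn [^bytes v] (alength v)) arrs))))
```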
End‑to‑End Benchmark
To verify the impact, I ran a full end‑to‑end benchmark using Criterium’s quick-benchmark.
Test conditions
- CPU: Apple M2
- Workers: 4 parallel workers
- Merkle batch size: 64
- Crypto: Full (SHA‑512 + Ed25519)
Results
| | Responses / sec | µs / response / core |
|---|---|---|
| Without type hint | 19 959 | 200.4 |
| With type hint | 264 316 | 15.1 |
That’s a 13 × throughput increase from a single type hint.
Visual comparison
*(Figure: bar chart comparing throughput with and without the type hint.)*
Why Did the Speedup Get Larger?
In isolated tests the improvement was ~8×.
According to Amdahl’s law, we would expect a substantially lower improvement in the real system.
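As a back-of-the-envelope check: the profiler attributed roughly 90 % of the runtime to the encoding line, and the isolated fix was ~8×, so Amdahl's law predicts only about a 4.7× end-to-end speedup:

```clojure
;; Amdahl's law: overall speedup when a fraction p of the runtime is
;; accelerated by a factor s (p ≈ 0.9 and s ≈ 8 from the profiles above).
(defn amdahl [p s]
  (/ 1.0 (+ (- 1.0 p) (/ p s))))

(amdahl 0.9 8) ;=> ≈ 4.7
```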
Surprisingly, the speedup grew to ~13× under load.
Working hypothesis
When multiple workers invoke the same reflective, non‑inlinable call site, the JVM cannot optimize it effectively. Removing that reflective barrier lets the JIT inline the call, allowing the code to parallelize cleanly.
Result
- Better scaling under load
- Higher overall speedup than predicted by Amdahl’s law
Further investigation is needed to confirm the hypothesis, but the observed behavior aligns with reduced contention in the reflective call path.
The Lesson
I learned that, when optimizing Clojure code, “no reflection warnings” is not always the end of the story. When you pass low‑level primitives through higher‑order interfaces, you may accidentally force the runtime back onto generic (and slower) paths. The compiler needs enough information to emit primitive bytecode.
In this case, the code I thought was complex—the crypto, Merkle trees, and protocol gymnastics—was fine. It was the “trivial” line that killed performance.
Without a profiler, I would never, ever have suspected it.