Speculative decoding shifted our output distribution and evals missed it

Published: (June 18, 2026 at 02:31 AM EDT)
5 min read
Source: Dev.to

TL;DR: We turned on speculative decoding in vLLM to cut latency on a fine-tuned 8B. Got a 1.9x throughput win. Three weeks later a customer flagged that the agent’s tool-call arguments had subtly changed. Greedy decoding with a draft model is not bit-identical to greedy decoding without one, and our offline evals never caught the drift because they ran on a different serving path.

I lead the eval team at Nexus Labs. We do enterprise agent automation, Series B, about 14 people in engineering. The model we fine-tune is a Llama-3.1-8B variant that drives tool calls. Latency matters because each agent turn can chain 4 or 5 calls.

So we enabled speculative decoding. Draft model was a distilled 1B. Target was our 8B. The pitch is simple: the draft proposes tokens, the target verifies them in one forward pass, you accept the longest matching prefix. When acceptance is high you get tokens nearly for free.

The throughput number was real. 1.9x at our batch sizes. The problem was everything we assumed about correctness. The vLLM docs say speculative decoding is lossless for greedy. That is true in exact arithmetic. It is not true in float16 on a GPU.

Here is the thing nobody tells you. The verification step recomputes logits for the drafted tokens in a batched forward pass. The target model alone computes them token-by-token. Different batch shapes, different kernel paths, different reduction order. The argmax usually agrees. Usually. When the top two logits are within a few thousandths of each other, the batched path and the sequential path can pick different tokens. For most text that is invisible. For structured tool-call output where one token flips "limit": 50 to "limit": 500, it is not invisible at all.

We measured it. Ran the same 2,000 prompts through both paths, greedy, temperature 0.

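| Serving path          | Exact-match outputs | Tool-arg mismatch | Tokens/sec |
|-----------------------|---------------------|-------------------|------------|
| Target only (no spec) | baseline            | 0%                | 41         |
| Spec decode, 1B draft | 98.8%               | 1.2%              | 78         |
| Spec decode, 3B draft | 99.4%               | 0.6%              | 64         |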
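The reduction-order effect behind these mismatches can be reproduced without a GPU. The sketch below is illustrative only: it emulates float16 accumulation in pure Python using the half-precision format in `struct`, and the logit terms and the two candidate tokens are made-up values, not numbers from our model or from vLLM's kernels. It shows the same terms summed in two different orders landing on opposite sides of a competing logit, which flips the argmax.

```python
import struct

def f16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (float16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_f16(terms):
    """Accumulate terms in the given order, rounding to float16 after every add."""
    acc = 0.0
    for t in terms:
        acc = f16(acc + f16(t))
    return acc

# Hypothetical partial products for the logit of token "500":
# one large term followed by six small ones.
small = 0.75 / 256  # exactly representable in float16
terms_500 = [8.0] + [small] * 6

# Hypothetical competing logit for token "50", one float16 step above 8.0.
logit_50 = f16(8.0078125)

# Path A: large term first, so each small term is rounded away.
logit_500_a = sum_f16(terms_500)
# Path B: small terms first, so they accumulate before meeting the large term.
logit_500_b = sum_f16(list(reversed(terms_500)))

def argmax_token(logit_500: float) -> str:
    """Pick the winning token between the two hypothetical candidates."""
    return "500" if logit_500 > logit_50 else "50"

print(logit_500_a, argmax_token(logit_500_a))  # 8.0 50
print(logit_500_b, argmax_token(logit_500_b))  # 8.015625 500
```

The two sums differ by about 0.016 while the competing logit sits between them, so the margin that decides the token is smaller than the rounding error introduced by the accumulation order.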

1.2% of outputs differed. On agent traffic that chains calls, a 1.2% per-call divergence compounds. Over a 5-call session that’s roughly a 6% chance at least one call drifts.

This is the part I’m actually annoyed about. Our offline eval suite hit the model directly through the HF generate() API. No speculative decoding. No batched verification. Our production serving stack ran vLLM with spec decode on. We were evaluating one numerical path and shipping another. The eval harness was honest about the model it tested. It just wasn’t testing the model we served.

The fix was boring and correct: evaluate against the exact serving endpoint. We route all eval traffic through the same gateway the app uses, so the eval client and the production client are indistinguishable to the backend. We use Bifrost in front of our vLLM and external providers, which gave us one OpenAI-compatible endpoint to point both at. The point isn’t the tool. The point is your eval requests must traverse the identical decode path, kernels included.

Here’s the config flag that matters in vLLM:

    # vllm serving config
    model: /models/nexus-8b-toolcall
    speculative_config:
      model: /models/nexus-1b-draft
      num_speculative_tokens: 5
    # this is the one we missed:
    # disable_logprobs_during_spec_decoding defaults vary by version.
    # pin it and assert it in CI.
    speculative_disable_logprobs: false
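One way to "pin it and assert it in CI" is a small check that fails when the loaded serving config is missing a pinned key or disagrees with the pinned value. This is a sketch under assumptions: `PINNED`, `lookup`, and `check_pinned` are hypothetical names, the config is assumed to be already parsed into a dict, and the key names simply mirror the snippet above rather than any particular vLLM version.

```python
# Hypothetical pinned values, mirroring the config snippet above.
PINNED = {
    "speculative_config.model": "/models/nexus-1b-draft",
    "speculative_config.num_speculative_tokens": 5,
    "speculative_disable_logprobs": False,
}

_MISSING = object()

def lookup(config: dict, dotted_key: str):
    """Walk a nested dict by a dotted path; return _MISSING if any part is absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node

def check_pinned(config: dict, pinned: dict = PINNED) -> list:
    """Return human-readable violations; an empty list means the config matches."""
    problems = []
    for key, expected in pinned.items():
        actual = lookup(config, key)
        if actual is _MISSING:
            problems.append(f"{key}: not set (default may vary by version)")
        elif actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems
```

In a CI job this would be called as `assert not check_pinned(loaded_config)`. Treating an unset key as a failure is the point: a value that falls back to a version-dependent default is exactly what went unnoticed.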

And the eval-side assertion we added so this never ships silently again:

    # fail CI if eval path != serving path
    resp = client.chat.completions.create(
        model="nexus-8b-toolcall",
        messages=msgs,
        temperature=0,
        extra_body={"spec_decode": True},  # must match prod
    )
    assert resp.system_fingerprint == EXPECTED_FINGERPRINT, (
        f"decode path drift: {resp.system_fingerprint}"
    )
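A value like `EXPECTED_FINGERPRINT` has to come from somewhere. Here is a minimal sketch, assuming the fingerprint is a truncated SHA-256 over a canonical JSON encoding of the serving-config fields that define the decode path. The function name, field names, and example values are hypothetical, and how the server surfaces the value in `system_fingerprint` is deployment-specific and not shown.

```python
import hashlib
import json

def decode_path_fingerprint(draft_model_sha: str,
                            num_speculative_tokens: int,
                            runtime_version: str) -> str:
    """Hash the fields that define the decode path into a short, stable ID."""
    payload = {
        "draft_model_sha": draft_model_sha,
        "num_speculative_tokens": num_speculative_tokens,
        "runtime_version": runtime_version,
    }
    # sort_keys and fixed separators keep the encoding byte-for-byte stable.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Placeholder inputs, for illustration only.
fp_a = decode_path_fingerprint("draft-sha-placeholder", 5, "runtime-x.y.z")
fp_b = decode_path_fingerprint("draft-sha-placeholder", 4, "runtime-x.y.z")
print(fp_a != fp_b)  # True: changing num_speculative_tokens changes the fingerprint
```

Anything left out of `payload` is invisible to the assertion, so the hash is only as strong as the list of fields fed into it.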

We compute a fingerprint from the serving config (draft model hash, num_speculative_tokens, kernel version) and assert it. If someone bumps vLLM or swaps the draft, CI goes red before the eval numbers are trusted.

We kept speculative decoding. The latency win was worth more than 1.2% drift for most of our endpoints. But we did three things.

First, we raised the bar on tool-call endpoints specifically. For the two customers running financial workflows, we run target-only, no draft. Slower, exact. They opted in to the cost.

Second, we started running a nightly divergence canary that replays 500 prompts through both serving paths and alerts if mismatch exceeds 1.5%. This caught a vLLM upgrade that shifted draft acceptance logic and pushed mismatch to 2.1%.

Third, all eval traffic now routes through the production endpoint. No more generate() in the harness. If the serving path changes, the eval changes with it.

This costs you reproducibility. Pinning evals to the serving path means a kernel update can move your eval scores even when the weights are frozen. That is correct, but it means “the model regressed” and “the runtime changed” now look the same on the dashboard. You need the fingerprint to tell them apart.

The fingerprint approach is only as good as what you hash. We hash config, not the actual CUDA kernel binary. A driver update that changes reduction order without changing our config would slip through. The nightly canary is the backstop for that, not the assertion.

Target-only serving for the exact endpoints roughly halved throughput for those customers. We ate that. Bigger draft models shrink the gap but cost more memory and throughput (64 vs. 78 tokens/sec in the table above), so 3B was not a free win either.

And 1.2% is our number, on our model, at our logit margins. A model with sharper output distributions will diverge less. One with flatter logits will diverge more. Measure your own.
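The nightly canary and the per-session compounding estimate both reduce to a few lines of arithmetic. The sketch below assumes you already have paired outputs for the same prompts from the two serving paths as plain strings. `mismatch_rate`, `session_drift_probability`, and `canary_should_alert` are hypothetical helper names, and treating calls in a session as independent is a simplifying assumption.

```python
def mismatch_rate(target_only: list, spec_decode: list) -> float:
    """Fraction of prompts whose outputs differ between the two serving paths."""
    if len(target_only) != len(spec_decode):
        raise ValueError("both paths must be run on the same prompts")
    if not target_only:
        raise ValueError("need at least one prompt")
    differing = sum(a != b for a, b in zip(target_only, spec_decode))
    return differing / len(target_only)

def session_drift_probability(per_call_rate: float, calls: int) -> float:
    """Chance that at least one of `calls` independent calls diverges."""
    return 1.0 - (1.0 - per_call_rate) ** calls

def canary_should_alert(rate: float, threshold: float = 0.015) -> bool:
    """Alert when the measured mismatch rate exceeds the threshold (1.5% here)."""
    return rate > threshold

# Toy data: 3 of 200 replayed outputs differ, i.e. a 1.5% mismatch rate.
baseline = ["ok"] * 200
candidate = ["ok"] * 197 + ["drift"] * 3
rate = mismatch_rate(baseline, candidate)
print(rate, canary_should_alert(rate))
print(round(session_drift_probability(0.012, 5), 4))
```

With a 1.2% per-call rate over 5 calls the estimate comes out just under 6%, and a measured 2.1% mismatch would trip the 1.5% threshold while 1.2% would not.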
vLLM speculative decoding docs
Leviathan et al., “Fast Inference from Transformers via Speculative Decoding”
vLLM GitHub issues on greedy determinism
Bifrost AI gateway
PyTorch numerical reproducibility notes
