Speculative Decoding Changed Our Output Distribution and Our Evals Missed It

Published: 10 hours ago (June 18, 2026, 3:31 PM GMT+9)

6 min read

Source: Dev.to

TL;DR: We turned on speculative decoding in vLLM to cut latency for our fine-tuned 8B model and got a 1.9x throughput improvement. Three weeks later, a customer noticed that their agent's tool-call arguments had subtly changed. Greedy decoding with a draft model is not bit-identical to greedy decoding without one, and our offline evals ran on a different serving path, so the drift went undetected.

I lead the eval team at Nexus Labs. We do enterprise agent automation, Series B, about 14 people in engineering. The model we fine-tune is a Llama-3.1-8B variant that drives tool calls. Latency matters because each agent turn can chain 4 or 5 calls.

So we enabled speculative decoding. Draft model was a distilled 1B. Target was our 8B. The pitch is simple: the draft proposes tokens, the target verifies them in one forward pass, you accept the longest matching prefix. When acceptance is high you get tokens nearly for free.
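A minimal sketch of that propose/verify loop for greedy decoding. The "models" are toy functions from a token prefix to a next token (hypothetical stand-ins, not vLLM's implementation), and the verification is written as a plain loop where a real engine would use one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def plain_greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference path: the target model decodes one token at a time."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 5) -> List[int]:
    """Greedy speculative decoding: accept the longest prefix the target agrees with."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each drafted position.
        accepted: List[int] = []
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target adds one bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:len(prompt) + max_new]


# Toy models (assumptions for illustration): the draft disagrees on some prefixes.
def toy_target(prefix: List[int]) -> int:
    return (31 * sum(prefix) + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    t = toy_target(prefix)
    return t if len(prefix) % 4 else (t + 1) % 50
```

With exact, deterministic models like these toys, `speculative_greedy` returns the same tokens as `plain_greedy`. That is the "lossless for greedy" property, and it depends on the target producing the same logits on both paths.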

The throughput number was real. 1.9x at our batch sizes. The problem was everything we assumed about correctness.

The vLLM docs say speculative decoding is lossless for greedy. That is true in exact arithmetic. It is not true in float16 on a GPU.

Here is the thing nobody tells you. The verification step recomputes logits for the drafted tokens in a batched forward pass. The target model alone computes them token‑by‑token. Different batch shapes, different kernel paths, different reduction order. The argmax usually agrees. Usually.

When the top two logits are within a few thousandths of each other, the batched path and the sequential path can pick different tokens. For most text that is invisible. For structured tool‑call output where one token flips "limit": 50 to "limit": 500, it is not invisible at all.
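The effect can be reproduced on a CPU with a toy simulation. The sketch below rounds to half precision after every add (using the `struct` module's `e` format) and sums the same terms in two orders. The term values and the competing logit are made up for illustration, and the two orders are simplistic stand-ins for different kernel reduction paths.

```python
import struct


def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sum(terms) -> float:
    """Accumulate left to right, rounding to float16 after every add."""
    acc = 0.0
    for t in terms:
        acc = fp16(acc + t)
    return acc


def argmax(values) -> int:
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical contributions to the logit of token A.
terms_a = [8.0, 2**-8, 2**-8, 2**-8, 2**-8]
# Competing token B, one float16 step above 8.0.
logit_b = 8.0078125

logit_a_forward = fp16_sum(terms_a)            # large term first
logit_a_reverse = fp16_sum(reversed(terms_a))  # small terms first

pick_forward = argmax([logit_a_forward, logit_b])
pick_reverse = argmax([logit_a_reverse, logit_b])
print(logit_a_forward, logit_a_reverse, pick_forward, pick_reverse)
```

In the forward order each small term is rounded away against 8.0, so token B wins. In the reverse order the small terms accumulate first and token A wins. Same inputs, same math on paper, different argmax.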

We measured it. Ran the same 2,000 prompts through both paths, greedy, temperature 0.

| Serving path                     | Exact-match outputs | Tool-arg mismatch | Tokens/sec |
|----------------------------------|---------------------|-------------------|------------|
| Target only (no spec)            | baseline            | 0%                | 41         |
| Spec decode, 1B draft            | 98.8%               | 1.2%              | 78         |
| Spec decode, 3B draft            | 99.4%               | 0.6%              | 64         |
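A comparison like this one can be sketched as below. It assumes each output is a JSON string with an `arguments` field, which is a hypothetical schema for illustration, and it separates byte-level exact match from a semantic comparison of the parsed arguments.

```python
import json


def compare_paths(baseline_outputs, candidate_outputs):
    """Return (exact_match_rate, tool_arg_mismatch_rate) for paired outputs.

    Assumes each output looks like {"name": ..., "arguments": {...}};
    that schema is a placeholder, not a real tool-call format.
    """
    if len(baseline_outputs) != len(candidate_outputs) or not baseline_outputs:
        raise ValueError("need two non-empty lists of equal length")
    n = len(baseline_outputs)
    exact = 0
    arg_mismatch = 0
    for a, b in zip(baseline_outputs, candidate_outputs):
        if a == b:
            exact += 1
        try:
            if json.loads(a)["arguments"] != json.loads(b)["arguments"]:
                arg_mismatch += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            arg_mismatch += 1  # unparseable output counts as a mismatch
    return exact / n, arg_mismatch / n
```

Comparing parsed arguments rather than raw strings keeps whitespace-only differences from being counted as tool-call drift.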

1.2% of outputs differed. On agent traffic that chains calls, a 1.2% per‑call divergence compounds. Over a 5‑call session that’s roughly a 6% chance at least one call drifts.
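The arithmetic behind that estimate, assuming each call diverges independently at the measured per-call rate (independence is a simplifying assumption):

```python
def session_drift_probability(per_call_rate: float, calls: int) -> float:
    """P(at least one call diverges), assuming calls diverge independently."""
    return 1.0 - (1.0 - per_call_rate) ** calls


print(f"{session_drift_probability(0.012, 5):.3f}")  # 0.059
```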

This is the part I’m actually annoyed about. Our offline eval suite hit the model directly through the HF generate() API. No speculative decoding. No batched verification. Our production serving stack ran vLLM with spec decode on.

We were evaluating one numerical path and shipping another. The eval harness was honest about the model it tested. It just wasn’t testing the model we served.

The fix was boring and correct: evaluate against the exact serving endpoint. We route all eval traffic through the same gateway the app uses, so the eval client and the production client are indistinguishable to the backend. We use Bifrost in front of our vLLM and external providers, which gave us one OpenAI‑compatible endpoint to point both at. The point isn’t the tool. The point is your eval requests must traverse the identical decode path, kernels included.

Here’s the config flag that matters in vLLM:

# vllm serving config
model: /models/nexus-8b-toolcall
speculative_config:
  model: /models/nexus-1b-draft
  num_speculative_tokens: 5
# This is what we missed:
# disable_logprobs_during_spec_decoding defaults vary by version.
# pin it and assert it in CI.
speculative_disable_logprobs: false

And the eval‑side assertion we added so this never ships silently again:

# Fail CI if the eval path and the serving path differ
resp = client.chat.completions.create(
    model="nexus-8b-toolcall",
    messages=msgs,
    temperature=0,
    extra_body={"spec_decode": True},   # must match prod
)
assert resp.system_fingerprint == EXPECTED_FINGERPRINT, f"decode path drift: {resp.system_fingerprint}"

We compute a fingerprint from the serving config (draft model hash, num_speculative_tokens, kernel version) and assert it. If someone bumps vLLM or swaps the draft, CI goes red before the eval numbers are trusted.
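One way to build such a fingerprint is to hash a canonical JSON dump of the config. This is a sketch, not our exact implementation: the field names and placeholder values below are hypothetical.

```python
import hashlib
import json


def serving_fingerprint(config: dict) -> str:
    """Hash a serving config into a short fingerprint that ignores key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical field names and placeholder values; use what your deployment exposes.
prod_config = {
    "draft_model_hash": "sha256:<draft weights digest>",
    "num_speculative_tokens": 5,
    "kernel_version": "<runtime version string>",
}
EXPECTED_FINGERPRINT = serving_fingerprint(prod_config)
```

Sorting keys before hashing means two configs with the same contents always produce the same fingerprint, while any changed value produces a new one.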

We kept speculative decoding. The latency win was worth more than 1.2% drift for most of our endpoints. But we did three things.

First, we raised the bar on tool‑call endpoints specifically. For the two customers running financial workflows, we run target‑only, no draft. Slower, exact. They opted in to the cost.

Second, we started running a nightly divergence canary that replays 500 prompts through both serving paths and alerts if mismatch exceeds 1.5%. This caught a vLLM upgrade that shifted draft acceptance logic and pushed mismatch to 2.1%.
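The core of a canary like that is small. In this sketch `baseline` and `candidate` are hypothetical callables that send a prompt down each serving path and return the output text. Scheduling and alert delivery are left out.

```python
from typing import Callable, Sequence


def divergence_canary(prompts: Sequence[str],
                      baseline: Callable[[str], str],
                      candidate: Callable[[str], str],
                      threshold: float = 0.015) -> float:
    """Replay prompts through both paths; raise if the mismatch rate exceeds threshold."""
    if not prompts:
        raise ValueError("canary needs at least one prompt")
    mismatches = sum(baseline(p) != candidate(p) for p in prompts)
    rate = mismatches / len(prompts)
    if rate > threshold:
        raise RuntimeError(
            f"decode path divergence {rate:.1%} exceeds {threshold:.1%}"
        )
    return rate
```

Raising an exception keeps the check easy to wire into whatever job runner or alerting hook is already in place.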

Third, all eval traffic now routes through the production endpoint. No more generate() in the harness. If the serving path changes, the eval changes with it.

This costs you reproducibility. Pinning evals to the serving path means a kernel update can move your eval scores even when the weights are frozen. That is correct, but it means “the model regressed” and “the runtime changed” now look the same on the dashboard. You need the fingerprint to tell them apart.

The fingerprint approach is only as good as what you hash. We hash config, not the actual CUDA kernel binary. A driver update that changes reduction order without changing our config would slip through. The nightly canary is the backstop for that, not the assertion.

Target‑only serving for the exact endpoints roughly halved throughput for those customers. We ate that. Bigger draft models shrink the gap but cost more memory and throughput (64 vs. 78 tokens/sec in the table above), so 3B was not a free win either.

And 1.2% is our number, on our model, at our logit margins. A model with sharper output distributions will diverge less. One with flatter logits will diverge more. Measure your own.

vLLM speculative decoding docs
Leviathan et al., “Fast Inference from Transformers via Speculative Decoding”
vLLM GitHub issues on greedy determinism
Bifrost AI gateway
PyTorch numerical reproducibility notes
