[Paper] DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

Published: February 6, 2026 at 03:53 AM EST

Source: arXiv - 2602.06502v1

Overview

Large Language Model (LLM) serving systems rely on a key‑value (KV) cache that stores intermediate token representations, allowing subsequent requests that share a prompt prefix to skip recomputation. While “cache‑affinity” scheduling (sending similar prompts to the same node) maximizes cache reuse, it often creates hot spots that hurt overall load balance. The paper DualMap introduces a unified scheduling strategy that simultaneously preserves cache affinity and keeps the cluster evenly loaded, delivering up to a 2.25× boost in effective request capacity under strict time‑to‑first‑token (TTFT) Service‑Level Objectives (SLOs).

Key Contributions

  • Dual‑mapping scheduler: each request is hashed to two candidate instances using independent hash functions, then the system picks the better candidate based on real‑time load and cache‑affinity signals.
  • Power‑of‑two‑choices adaptation: the dual‑hash design naturally spreads distinct prompts across the cluster while still giving a high probability that similar prompts land on the same node.
  • SLO‑aware routing: the scheduler prefers cache‑affinity but falls back to load‑aware placement when a request’s TTFT threatens to breach its SLO.
  • Hotspot‑aware rebalancing: a lightweight migration mechanism moves overloaded requests to under‑utilized nodes, mitigating hot spots without full cluster reshuffling.
  • Dual‑hash‑ring scaling: supports rapid addition or removal of instances with only local updates, avoiding expensive global remapping.
  • Empirical validation: experiments on production‑grade LLM workloads show up to 2.25× higher request throughput at the same TTFT SLO compared with the best prior schedulers.
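To make the dual-mapping idea concrete, the following is a minimal sketch (not the paper's implementation) of how a prompt's prefix could be hashed with two independent salts to yield two candidate instances. `N_INSTANCES`, `PREFIX_LEN`, and the salt strings are illustrative assumptions:

```python
import hashlib

N_INSTANCES = 8    # hypothetical cluster size
PREFIX_LEN = 32    # illustrative prefix length used as the cache-affinity key

def candidates(prompt: str) -> tuple[int, int]:
    """Map a prompt's prefix to two candidate instances via independent hashes."""
    prefix = prompt[:PREFIX_LEN].encode()
    h1 = int.from_bytes(hashlib.sha256(b"salt-1:" + prefix).digest()[:8], "big")
    h2 = int.from_bytes(hashlib.sha256(b"salt-2:" + prefix).digest()[:8], "big")
    return h1 % N_INSTANCES, h2 % N_INSTANCES
```

Because only the shared prefix is hashed, requests with the same prefix always receive the same two candidates, so whichever one is chosen, the KV cache entry is likely to be found on one of them.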

Methodology

  1. Dual Hashing – For every incoming request, the system computes two hashes of the prompt (e.g., hash1(prompt) and hash2(prompt)). Each hash maps to a distinct serving instance in the cluster.
  2. Candidate Selection – The scheduler inspects the current load (CPU/GPU utilization, queue length) and the cache state of the two candidates, then picks the instance that offers the best trade‑off:
    • If both have the same prompt prefix cached → choose the less loaded one.
    • If only one has the cache entry → prefer that one unless its load exceeds a threshold.
  3. SLO‑aware Fallback – When the estimated TTFT for the preferred candidate exceeds the request’s SLO, the algorithm switches to the other candidate (or a load‑balanced third choice) to keep latency guarantees.
  4. Hotspot Detection & Rebalancing – Periodically, the system monitors per‑node request rates. If a node’s load surpasses a configurable hotspot threshold, a subset of its “non‑affine” requests are migrated to under‑loaded nodes using a lightweight state transfer protocol.
  5. Scaling via Dual‑Hash‑Ring – Nodes are arranged on two logical hash rings (one per hash function). Adding or removing a node only requires updating the ring pointers for the affected range, keeping the remapping cost O(1) per request.
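Steps 2 and 3 can be sketched as a small selection routine. The load model, `load_threshold`, and the toy TTFT estimator below are placeholders for illustration, not the paper's actual policy:

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    load: float                          # fraction of capacity in use, 0.0-1.0
    cached_prefixes: set = field(default_factory=set)  # prefixes in the KV cache

def estimate_ttft(inst: Instance, cache_hit: bool) -> float:
    # Toy model: a cache hit skips prefill; queueing delay grows with load.
    base = 50.0 if cache_hit else 150.0  # ms, illustrative
    return base * (1.0 + 4.0 * inst.load)

def pick(a: Instance, b: Instance, prefix: str,
         slo_ms: float = 200.0, load_threshold: float = 0.8) -> Instance:
    hit_a = prefix in a.cached_prefixes
    hit_b = prefix in b.cached_prefixes
    if hit_a and hit_b:        # both cached: take the less loaded node
        preferred = a if a.load <= b.load else b
    elif hit_a or hit_b:       # one cached: prefer it unless it is overloaded
        cached, other = (a, b) if hit_a else (b, a)
        preferred = cached if cached.load <= load_threshold else other
    else:                      # no cache help: pure load balancing
        preferred = a if a.load <= b.load else b
    # SLO-aware fallback: switch candidates if the pick would breach TTFT.
    cur = estimate_ttft(preferred, prefix in preferred.cached_prefixes)
    if cur > slo_ms:
        alt = b if preferred is a else a
        if estimate_ttft(alt, prefix in alt.cached_prefixes) < cur:
            preferred = alt
    return preferred
```

For example, an instance holding the cache entry but running at 95 % load is skipped in favor of an idle instance, even though the latter must recompute the prefix.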

All components are implemented as thin extensions to existing LLM serving stacks (e.g., vLLM, TGI), requiring only a plug‑in scheduler and minimal changes to the cache‑lookup logic.
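One plausible realization of the dual‑hash‑ring from step 5 is a pair of sorted consistent‑hash rings, so adding a node only claims one arc per ring and leaves all other mappings untouched. The sketch below (virtual nodes omitted for brevity) is an assumption about the mechanism, not code from the paper:

```python
import bisect
import hashlib

def _h(key: str, salt: str) -> int:
    """64-bit position on a ring, derived from a salted SHA-256 hash."""
    return int.from_bytes(hashlib.sha256(f"{salt}:{key}".encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, salt: str):
        self.salt = salt
        self.points: list[int] = []        # sorted node positions on the ring
        self.owner: dict[int, str] = {}    # position -> node id

    def add(self, node: str) -> None:
        p = _h(node, self.salt)
        bisect.insort(self.points, p)
        self.owner[p] = node

    def remove(self, node: str) -> None:
        p = _h(node, self.salt)
        self.points.remove(p)
        del self.owner[p]

    def lookup(self, key: str) -> str:
        # First node clockwise from the key's position (wrapping around).
        i = bisect.bisect(self.points, _h(key, self.salt)) % len(self.points)
        return self.owner[self.points[i]]

class DualRing:
    """Two independent rings give each request two candidate nodes."""
    def __init__(self, nodes):
        self.r1, self.r2 = HashRing("ring-1"), HashRing("ring-2")
        for n in nodes:
            self.add(n)

    def add(self, node: str) -> None:
        self.r1.add(node)
        self.r2.add(node)

    def candidates(self, prompt_prefix: str) -> tuple[str, str]:
        return self.r1.lookup(prompt_prefix), self.r2.lookup(prompt_prefix)
```

Adding `node-d` to a running `DualRing` reassigns only the keys falling in the arcs it claims on each ring; every other prefix keeps both of its previous candidates.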

Results & Findings

| Metric | Baseline (Cache‑Affinity Only) | Baseline (Load‑Balanced Only) | DualMap |
| --- | --- | --- | --- |
| Effective request capacity (req/s under 200 ms TTFT SLO) | 1.0× | 0.78× | 2.25× |
| Cache hit ratio | 68 % | 45 % | 62 % |
| 99th‑percentile TTFT | 210 ms (SLO breach) | 190 ms (no cache reuse) | 195 ms (within SLO) |
| Hotspot frequency (nodes > 80 % utilization) | 27 % | 12 % | 4 % |
| Scaling overhead (adding 4 nodes) | 15 % latency spike | 12 % | <2 % |

  • Cache reuse stays high because most requests with shared prefixes still land on the same node thanks to the dual‑hash “collision” probability.
  • Load balance improves dramatically; the “power of two choices” principle ensures that even with skewed prompt distributions, the system avoids persistent hot spots.
  • SLO compliance is maintained: the SLO‑aware fallback prevents latency spikes while still leveraging cache when possible.

Practical Implications

  • Cost savings – By reusing KV caches more effectively, GPU memory is used more efficiently, reducing the number of required instances for a given throughput.
  • Predictable latency – Developers can set strict TTFT SLOs (e.g., 200 ms) and rely on DualMap to honor them without sacrificing the performance gains of cache reuse.
  • Simplified ops – The dual‑hash‑ring scaling mechanism means adding capacity (autoscaling) or performing rolling upgrades incurs negligible disruption, a boon for cloud‑native deployments.
  • Compatibility – DualMap works as a drop‑in scheduler for popular LLM serving frameworks, so existing inference pipelines can adopt it without rewriting model code.
  • Edge use‑cases – For multi‑tenant edge clusters where hardware is scarce, DualMap’s ability to keep hot spots low while still exploiting prompt similarity can enable real‑time LLM services on limited devices.

Limitations & Future Work

  • Prompt similarity granularity – DualMap treats prompts as identical only when they share the exact same prefix; more nuanced similarity metrics (e.g., fuzzy matching) could further improve cache reuse.
  • State migration cost – While lightweight, moving requests between nodes still incurs a brief pause for KV cache transfer; optimizing this for ultra‑low‑latency scenarios remains an open challenge.
  • Evaluation scope – Experiments focus on single‑model serving (e.g., LLaMA‑2 13B). Extending the approach to heterogeneous model ensembles or multi‑modal workloads warrants further study.
  • Security & isolation – In multi‑tenant environments, sharing cache across tenants may raise data‑leak concerns; integrating tenant‑aware cache partitioning is a promising direction.

Overall, DualMap demonstrates that a modest algorithmic tweak—dual hashing with intelligent candidate selection—can reconcile two historically opposing goals in LLM serving, delivering tangible performance and cost benefits for developers building large‑scale AI services.

Authors

  • Ying Yuan
  • Pengfei Zuo
  • Bo Wang
  • Zhangyu Chen
  • Zhipeng Tan
  • Zhou Yu

Paper Information

  • arXiv ID: 2602.06502v1
  • Categories: cs.DC
  • Published: February 6, 2026