[Paper] GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing
Source: arXiv - 2602.11688v1
Overview
The paper presents GORGO, a cross‑region load‑balancing strategy for large language model (LLM) inference that simultaneously maximizes reuse of the KV‑cache (the “prefix cache” that stores attention keys and values) and minimizes the network latency incurred when routing requests between data‑center regions. By treating total serving cost as a blend of compute availability, network round‑trip time, and cache‑hit probability, GORGO achieves noticeably lower Time‑to‑First‑Token (TTFT), especially for the high‑percentile (P99) tail that matters most in production APIs.
Key Contributions
- Cost‑aware routing formulation – Defines a unified objective that balances three factors: remaining compute capacity, inter‑region network latency, and KV‑cache overlap.
- GORGO algorithm – A lightweight, centralized router that selects the optimal replica for each incoming request based on the above cost model.
- Extensive profiling suite – Real‑world measurements on a custom multi‑region LLM serving stack that isolate latency contributions from prefill, decode, network, and cache lookup.
- Comprehensive evaluation – Benchmarks against three baselines (least‑load, prefix‑similarity, and a centralized proxy implementing prior policies) on both median and P99 TTFT.
- Demonstrated 2.5× median TTFT improvement – Shows that a network‑aware, centralized router can beat decentralized cache‑first heuristics while avoiding synchronization overhead.
Methodology
- System Model – The authors model each region as a set of identical LLM replicas, each maintaining a KV-cache of recent token prefixes. A request arrives with a prompt; if the prompt's prefix already exists in a replica's cache, that replica can skip the expensive "prefill" phase.
- Cost Function – For a given request r and candidate replica i, the total cost is:

  \[ C_{i}(r) = \alpha \cdot \text{Latency}_{\text{net}}(i) + \beta \cdot \frac{1}{\text{CacheHitProb}_{i}(r)} + \gamma \cdot \frac{1}{\text{ComputeAvail}_{i}} \]

  where α, β, γ are tunable weights reflecting service-level objectives.
- Routing Decision – The centralized GORGO router maintains lightweight statistics (current load, recent cache-prefix histograms, and measured inter-region RTTs). For each incoming request it computes Cᵢ(r) for every replica and forwards the request to the one with the lowest cost.
- Implementation – GORGO runs as a stateless HTTP reverse proxy that intercepts each request, queries a shared in-memory store for the latest metrics, and rewrites the upstream target. The proxy updates the metrics after each request finishes, keeping the cost model fresh without heavy synchronization.
- Baselines –
  - Least-load: routes each request to the replica with the lowest current CPU utilization, ignoring cache state.
  - Prefix-similarity: picks the replica with the highest cached-prefix overlap, ignoring network latency.
  - Centralized proxy with prior policy: replicates earlier work that optimizes only cache reuse while still using a central router.
- Evaluation Setup – Experiments run on a three-region deployment (US-East, EU-West, AP-South) with a 13-billion-parameter decoder-only LLM. The authors generate synthetic workloads that vary prefix overlap and request burstiness to stress both the cache and network dimensions.
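The cost model and routing decision above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prefix-hash cache representation, the weight values, and the hit-probability floor are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    rtt_ms: float          # measured inter-region round-trip time
    compute_avail: float   # fraction of free compute capacity, in (0, 1]
    cached_prefixes: set   # hashes of token prefixes currently in the KV-cache

def cache_hit_prob(replica: Replica, prompt: tuple) -> float:
    """Estimate hit probability as the longest cached prefix fraction."""
    for k in range(len(prompt), 0, -1):
        if hash(prompt[:k]) in replica.cached_prefixes:
            return k / len(prompt)
    return 1e-3  # small floor avoids division by zero on a cold cache

def cost(replica: Replica, prompt: tuple,
         alpha: float = 1.0, beta: float = 50.0, gamma: float = 20.0) -> float:
    """C_i(r) = alpha*Latency_net(i) + beta/CacheHitProb_i(r) + gamma/ComputeAvail_i"""
    return (alpha * replica.rtt_ms
            + beta / cache_hit_prob(replica, prompt)
            + gamma / replica.compute_avail)

def route(replicas: list, prompt: tuple) -> Replica:
    """Forward the request to the replica with the lowest total cost."""
    return min(replicas, key=lambda rep: cost(rep, prompt))
```

Note how the reciprocal terms penalize low cache overlap and scarce compute: a nearby replica with a warm cache wins even against a less-loaded but distant one, matching the paper's observation that network awareness can outweigh a marginally higher hit rate.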
Results & Findings
| Metric | Least‑load | Prefix‑similarity | Prior‑policy proxy | GORGO |
|---|---|---|---|---|
| Median TTFT (ms) | 210 | 185 | 160 | 115 |
| P99 TTFT (ms) | 720 | 650 | 590 | 420 |
| Avg. cache‑hit rate | 38 % | 62 % | 58 % | 55 % |
| Network‑induced stalls (count) | 12 k | 9 k | 7 k | 3 k |
- Latency reduction: GORGO cuts median TTFT by ~45 % and P99 TTFT by ~40 % compared to the naïve least‑load baseline.
- Network awareness matters: Even though GORGO’s cache‑hit rate is slightly lower than the pure prefix‑similarity approach, the overall TTFT improves because it avoids costly cross‑region hops for low‑overlap requests.
- Centralized router efficiency: The GORGO proxy adds < 2 ms overhead per request, far less than the synchronization cost observed in prior centralized designs.
- Robustness to workload spikes: Under bursty traffic, GORGO gracefully shifts load to under‑utilized regions while still respecting cache locality, preventing the “pathological” forwarding loops that hurt the other methods.
Practical Implications
- LLM API providers can integrate a GORGO‑style router to shave off tens of milliseconds per token, directly translating into better SLAs and lower per‑token cost (since compute time is reduced).
- Edge‑aware deployments (e.g., content‑generation services that serve globally) benefit from the ability to keep hot prefixes close to users without sacrificing latency when the cache isn’t helpful.
- Cost optimization – By factoring compute availability into the routing decision, operators can better utilize spare capacity in cheaper regions, potentially lowering cloud spend.
- Simplified architecture – The centralized proxy design means you don’t need a full‑mesh of cache‑synchronization protocols; a lightweight metric store suffices, easing operational complexity.
- Extensibility – The cost function is modular; teams can plug in additional factors such as GPU memory pressure, spot‑instance pricing, or regulatory constraints (e.g., data residency) without redesigning the whole system.
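The modularity of the cost function can be illustrated as a sum of pluggable terms. The sketch below is hypothetical (the paper does not prescribe an interface); it shows how a data-residency constraint could be added as one more term without touching the routing logic.

```python
from typing import Callable, Dict, List

# A cost term maps (replica, request) to a partial cost.
CostTerm = Callable[[Dict, Dict], float]

def network_term(replica: Dict, request: Dict) -> float:
    """Standard network-latency component of the cost."""
    return replica["rtt_ms"]

def residency_term(replica: Dict, request: Dict) -> float:
    """Hard constraint: infinite cost bars regions outside the request's
    allowed residency set (e.g., EU-only data)."""
    allowed = request.get("allowed_regions")
    if allowed and replica["region"] not in allowed:
        return float("inf")
    return 0.0

def route(replicas: List[Dict], request: Dict, terms: List[CostTerm]) -> Dict:
    """Pick the replica minimizing the sum of all active cost terms."""
    return min(replicas, key=lambda rep: sum(t(rep, request) for t in terms))
```

With this shape, GPU memory pressure or spot pricing would just be additional `CostTerm` functions appended to the list.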
Limitations & Future Work
- Static weight tuning – The α, β, γ coefficients are manually set; adaptive learning (e.g., reinforcement learning) could automatically balance latency vs. cache reuse as traffic patterns evolve.
- Cache granularity – The study assumes prefix granularity at the token level; more sophisticated caching (e.g., sub‑token or semantic embeddings) might further boost hit rates.
- Scalability of the metric store – While the prototype works for a handful of regions, a production‑scale deployment with dozens of replicas may need a more robust distributed state layer.
- Security & privacy – Routing decisions based on request content raise concerns about data leakage across jurisdictions; future work should explore privacy‑preserving metrics.
- Generalization to other model families – The experiments focus on a decoder‑only LLM; applying GORGO to encoder‑decoder or multimodal models may require adjustments to the cost model.
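On the static-weight limitation, one could imagine a crude feedback rule as a stopgap before the learned tuning the authors propose. The sketch below is purely illustrative and not from the paper: it nudges the network weight α up when observed tail latency exceeds the SLO target, and relaxes it otherwise.

```python
def adapt_weights(weights: tuple, observed_p99_ms: float,
                  target_p99_ms: float, lr: float = 0.05) -> tuple:
    """Multiplicative feedback on alpha: tail latency above the SLO target
    raises the network-latency weight so routing avoids slow cross-region
    hops; headroom below the target lowers it, favoring cache reuse.
    A placeholder for the adaptive learning left to future work."""
    alpha, beta, gamma = weights
    err = (observed_p99_ms - target_p99_ms) / target_p99_ms
    alpha = max(alpha * (1 + lr * err), 1e-3)  # keep alpha positive
    return (alpha, beta, gamma)
```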
Overall, GORGO demonstrates that a network‑aware, cost‑driven routing layer can unlock substantial latency gains for cross‑region LLM serving, offering a pragmatic path for cloud providers and SaaS platforms to deliver faster, cheaper generative AI experiences.
Authors
- Alessio Ricci Toniolo
- Abinaya Dinesh
- Rome Thorstenson
Paper Information
- arXiv ID: 2602.11688v1
- Categories: cs.NI, cs.DC
- Published: February 12, 2026