[Paper] CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control

Published: January 30, 2026 at 03:27 AM EST
4 min read
Source: arXiv - 2601.22705v1

Overview

The paper introduces CONCUR, a lightweight control layer that dramatically boosts the throughput of large‑language‑model (LLM) inference when serving agentic workloads (e.g., autonomous agents, tool‑using bots). By treating the GPU key‑value (KV) cache as a shared, congestion‑prone resource, CONCUR dynamically throttles the number of active agents to avoid “middle‑phase thrashing,” a cache‑efficiency collapse that typically drags performance down long before the GPU runs out of memory.

Key Contributions

  • Identification of middle‑phase thrashing – a previously undocumented phenomenon where long‑running agents gradually fill the KV cache, causing a steep drop in cache hit rates and throughput.
  • Agent‑level admission control – a shift from reactive per‑request cache eviction to proactive regulation of how many agents may run concurrently.
  • CONCUR control algorithm – a simple, feedback‑driven loop that monitors cache pressure (e.g., hit‑rate, occupancy) and adjusts the active‑agent count in real time.
  • Compatibility with existing serving stacks – CONCUR sits on top of popular LLM serving frameworks (e.g., vLLM, TensorRT‑LLM) without requiring model changes or heavyweight kernel modifications.
  • Empirical gains on real‑world models – up to 4.09× throughput improvement on Qwen3‑32B and 1.9× on DeepSeek‑V3 across diverse agent workloads.
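The congestion signal at the heart of these contributions can be illustrated with a small sketch. The function below blends KV-cache occupancy with the recent hit-rate into a single scalar pressure value; the specific blend, the weight `alpha`, and all names are illustrative assumptions, not details from the paper.

```python
# Hypothetical cache-pressure signal in the spirit of CONCUR.
# The weighting scheme and parameter names are assumptions for illustration.
def cache_pressure(occupied_blocks: int, total_blocks: int,
                   recent_hit_rate: float, alpha: float = 0.5) -> float:
    """Blend KV-cache occupancy with the recent miss rate into one
    scalar in [0, 1]; higher values indicate more cache congestion."""
    occupancy = occupied_blocks / total_blocks
    miss_rate = 1.0 - recent_hit_rate
    return alpha * occupancy + (1.0 - alpha) * miss_rate
```

A controller can then compare this scalar against a setpoint instead of reasoning about occupancy and hit-rate separately.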

Methodology

  1. Workload Characterization

    • Collected traces from several open‑source and commercial agentic applications (code‑generation bots, web‑search agents, multi‑step planners).
    • Measured KV‑cache occupancy, hit‑rate, and per‑token latency over the lifetime of each agent.
  2. Middle‑Phase Thrashing Diagnosis

    • Observed that after an initial “warm‑up” phase, the cache hit‑rate drops sharply as agents accumulate long KV histories, even though the GPU still has free memory.
    • Named this degradation “middle‑phase thrashing.”
  3. Control‑Theoretic Design

    • Modeled the KV cache as a shared resource similar to a network link.
    • Designed a congestion‑control‑style feedback loop:
      • Signal – cache pressure metric (e.g., ratio of new KV entries to total capacity, or recent hit‑rate).
      • Controller – a proportional‑integral (PI) regulator that computes a target agent budget (max concurrent agents).
      • Actuator – admission gate that stalls or launches agents to keep the active count near the target.
  4. Implementation

    • Integrated CONCUR into the request scheduler of a standard LLM serving system.
    • Added lightweight instrumentation to expose cache metrics to the controller.
    • No changes to model weights, tokenizers, or the underlying CUDA kernels.
  5. Evaluation

    • Benchmarked on two 80‑GB A100 GPUs, running Qwen3‑32B and DeepSeek‑V3 in 32‑bit and quantized 16‑bit precision.
    • Compared three baselines: (i) naïve batch inference, (ii) static max‑batch size, (iii) reactive cache eviction.
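The control loop in steps 3–4 can be sketched as a small PI regulator plus an admission gate. Everything below (the velocity-form PI update, the gains, the setpoint of 0.7, and the starting budget) is an illustrative assumption about how such a controller might look, not the paper's actual implementation.

```python
# Sketch of a PI-style agent-admission controller in the spirit of CONCUR.
# Gains, setpoint, bounds, and the velocity-form update are assumptions.
class AdmissionController:
    def __init__(self, target_pressure: float = 0.7, kp: float = 8.0,
                 ki: float = 1.0, min_agents: int = 1, max_agents: int = 256):
        self.target = target_pressure        # desired cache-pressure setpoint
        self.kp, self.ki = kp, ki            # proportional / integral gains
        self.prev_error = 0.0
        self.min_agents, self.max_agents = min_agents, max_agents
        self.budget = 64.0                   # current agent budget (float internally)

    def update(self, pressure: float) -> int:
        """Return a new concurrent-agent budget given measured cache pressure."""
        error = self.target - pressure       # positive => cache has headroom
        # Velocity-form PI: adjust the budget incrementally, which avoids
        # integral wind-up when the budget saturates at its bounds.
        delta = self.kp * (error - self.prev_error) + self.ki * error
        self.prev_error = error
        self.budget = min(self.max_agents,
                          max(self.min_agents, self.budget + delta))
        return round(self.budget)
```

The scheduler would call `update()` once per scheduling tick and simply refuse to launch new agents while the active count exceeds the returned budget; stalled agents resume when pressure subsides.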

Results & Findings

| Model | Baseline Throughput (tokens/s) | CONCUR Throughput (tokens/s) | Speed‑up |
|---|---|---|---|
| Qwen3‑32B | 12.3 | 50.4 | 4.09× |
| DeepSeek‑V3 | 8.7 | 16.5 | 1.9× |
  • Cache hit‑rate stability: With CONCUR, hit‑rates stayed above 85 % throughout long runs, whereas the baseline fell below 40 % after ~30 seconds of continuous inference.
  • Latency tail reduction: 99th‑percentile per‑token latency dropped from 180 ms to 45 ms on Qwen3‑32B.
  • Memory usage: Peak KV memory remained within 70 % of GPU capacity, confirming that throughput gains stem from better cache reuse, not from simply fitting more data.
  • Scalability: The control loop adds <0.5 ms overhead per scheduling decision, negligible compared with token generation time.

Practical Implications

  • Higher ROI on existing GPU fleets – Companies can squeeze up to 4× more inference throughput from the same hardware, delaying costly upgrades.
  • More responsive agents – Lower tail latency means multi‑step agents (e.g., planning‑then‑acting loops) can finish tasks faster, improving user experience in chat assistants, code‑completion tools, and autonomous agents.
  • Simplified ops – Since CONCUR works as a plug‑in to existing serving stacks, DevOps teams can adopt it without retraining models or rewriting inference pipelines.
  • Cost‑effective scaling in the cloud – Cloud providers can offer higher‑throughput LLM endpoints at the same price tier, or charge premium for “high‑throughput agentic” instances.
  • Enables richer agentic behaviors – Developers can safely increase the number of parallel agents (e.g., per‑user bots) without fearing cache thrashing, opening the door to large‑scale multi‑agent simulations and collaborative AI systems.

Limitations & Future Work

  • Cache‑metric selection – The current controller relies on a single aggregated pressure signal; more nuanced metrics (e.g., per‑agent KV growth patterns) could improve precision.
  • Workload diversity – Experiments focused on two 32‑B models; scaling to even larger models (e.g., 70‑B+) or mixed‑precision pipelines may expose new bottlenecks.
  • Distributed inference – CONCUR is designed for a single GPU’s KV cache; extending the control logic across multi‑GPU or multi‑node deployments remains an open challenge.
  • Theoretical guarantees – While the PI controller works well empirically, formal stability analysis under highly bursty request arrivals is left for future research.

Overall, CONCUR demonstrates that borrowing ideas from congestion control can unlock substantial performance gains for modern LLM agents, offering a pragmatic path for developers to deliver faster, more scalable AI services.

Authors

  • Qiaoling Chen
  • Zhisheng Ye
  • Tian Tang
  • Peng Sun
  • Boyu Tian
  • Guoteng Wang
  • Shenggui Li
  • Yonggang Wen
  • Zhenhua Han
  • Tianwei Zhang

Paper Information

  • arXiv ID: 2601.22705v1
  • Categories: cs.DC
  • Published: January 30, 2026