[Paper] Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving
Source: arXiv - 2606.01839v1
Overview
The paper introduces ConServe, a scheduling strategy for large‑language‑model (LLM) agents that treats an entire multi‑turn conversation as the scheduling unit instead of handling each turn separately. By observing concrete, immediately available metrics (first‑turn input size and KV‑cache occupancy), the system decides where to place work without predicting future decode costs, which cuts latency and energy use for real‑world LLM‑driven assistants.
Key Contributions
- Conversation‑level scheduling: Shifts the granularity from per‑turn to per‑conversation, exposing a stable two‑phase execution pattern (prefill → long memory‑bound tail).
- Observation‑only placement: Eliminates the need for learned models that predict decode‑side cost; decisions are based on directly observable quantities.
- ConServe architecture: Implements a high‑throughput “prefiller” for the first turn and pins the remainder of the dialogue to a single decoder, transferring the KV cache only once.
- Empirical gains: Achieves a 51.08 % reduction in 95th‑percentile time‑to‑first‑effective‑token (TTFET) and a 7.51 % improvement in energy efficiency over the per‑turn prediction baseline, with an additional 22.75 % energy gain when phases are mapped to heterogeneous GPU tiers.
- Preserves service guarantees: Maintains last‑turn time‑between‑tokens (TBT) and service‑level objectives (SLOs) despite the aggressive latency optimizations.
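To make the "transfer the KV cache only once" contribution concrete, the sketch below counts how many tokens' worth of KV state cross the prefiller-to-decoder link under two policies. It is an illustration, not the paper's measurement: the function names and turn sizes are hypothetical, and the per-turn variant pessimistically assumes the full accumulated context is moved on every turn.

```python
from itertools import accumulate


def kv_tokens_moved_per_turn(new_tokens_per_turn):
    """Hypothetical per-turn policy: after every turn's prefill, the KV state
    for the whole context so far is shipped to a decoder (pessimistic assumption)."""
    context_sizes = accumulate(new_tokens_per_turn)  # running context length
    return sum(context_sizes)


def kv_tokens_moved_per_conversation(new_tokens_per_turn):
    """Conversation-level policy: only the first turn's KV state is shipped, once.
    Later turns run on the decoder that already holds the cache."""
    return new_tokens_per_turn[0]


# Illustrative conversation: a large first prompt followed by short tool-call turns.
turns = [4000, 300, 250, 500, 200]

per_turn = kv_tokens_moved_per_turn(turns)
per_conversation = kv_tokens_moved_per_conversation(turns)
print(per_turn, per_conversation)  # 23150 4000
```

Under these assumptions the per-turn policy moves almost six times as much KV state, and the gap widens as conversations get longer.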
Methodology
- Workload Characterization – The authors analyze LLM‑based agents that iteratively call tools and generate text across many turns. They observe that each conversation typically consists of:
  - Turn 1 (prefill) – a compute‑heavy step where the model processes the user’s initial prompt.
  - Subsequent turns (tail) – a long, memory‑bound phase where the model repeatedly decodes from an already‑filled KV cache.
- Scheduling Unit Redesign – Promoting the conversation to the scheduling unit smooths out the irregularities of individual turns and reveals the two‑phase structure.
- Observable Metrics – The scheduler only needs:
  - Input length of turn 1 (readable from the incoming request).
  - Per‑decoder KV occupancy after prefill (directly measurable).
- ConServe Implementation
  - Prefiller: A lightweight, high‑throughput GPU instance handles the first‑turn prefill for many concurrent conversations.
  - Decoder: Each conversation’s KV cache is transferred once to a dedicated decoder that processes the entire tail, avoiding repeated cache migrations.
  - Heterogeneous Tiering (optional): Prefiller runs on cost‑effective GPUs, while decoders run on higher‑performance GPUs for the memory‑bound tail.
- Baseline Comparison – The authors compare against a state‑of‑the‑art per‑turn scheduler that predicts decode length, tool latency, and KV growth using a learned model.
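The placement flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Decoder` and `ConversationScheduler` classes, the token-based capacity accounting, and the "least-occupied decoder that still fits" rule are all hypothetical stand-ins for whatever policy ConServe actually uses. The only inputs are the two observable quantities named in the summary: first-turn input length and per-decoder KV occupancy.

```python
from dataclasses import dataclass, field


@dataclass
class Decoder:
    """Hypothetical decoder instance, tracked by KV-cache usage in tokens."""
    name: str
    kv_capacity_tokens: int
    kv_used_tokens: int = 0
    conversations: set = field(default_factory=set)

    @property
    def occupancy(self) -> float:
        return self.kv_used_tokens / self.kv_capacity_tokens


class ConversationScheduler:
    """Places each conversation once, then routes every later turn to the same decoder."""

    def __init__(self, decoders):
        self.decoders = list(decoders)
        self.pinned = {}  # conversation id -> Decoder

    def place(self, conv_id: str, first_turn_tokens: int) -> Decoder:
        """Called once, after the first-turn prefill, to choose the KV handoff target."""
        if conv_id in self.pinned:
            return self.pinned[conv_id]
        candidates = [
            d for d in self.decoders
            if d.kv_used_tokens + first_turn_tokens <= d.kv_capacity_tokens
        ]
        if not candidates:
            raise RuntimeError("no decoder has room for this conversation's KV cache")
        # Assumed rule: pick the decoder with the lowest observed KV occupancy.
        target = min(candidates, key=lambda d: d.occupancy)
        # A real system would read occupancy from the decoder; here we track it ourselves.
        target.kv_used_tokens += first_turn_tokens
        target.conversations.add(conv_id)
        self.pinned[conv_id] = target
        return target

    def route_turn(self, conv_id: str) -> Decoder:
        """Later turns never move: they go to the pinned decoder."""
        return self.pinned[conv_id]


decoders = [Decoder("dec-0", 100_000, 60_000), Decoder("dec-1", 100_000, 20_000)]
scheduler = ConversationScheduler(decoders)
first = scheduler.place("conv-a", first_turn_tokens=8_000)
later = scheduler.route_turn("conv-a")
print(first.name, later.name)  # dec-1 dec-1
```

Note that nothing in `place` estimates decode length, tool latency, or KV growth, which is the point of the observation-only design.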
Results & Findings
| Metric | ConServe relative to per‑turn prediction baseline |
|---|---|
| 95th‑percentile TTFET (time to first effective token) | -51.08 % |
| Energy efficiency (overall) | +7.51 % |
| Additional energy gain with heterogeneous GPU mapping | +22.75 % |
| Last‑turn TBT and SLO compliance | Preserved |
Interpretation:
- By eliminating costly predictions and reducing cache shuffling, ConServe delivers substantially faster first‑token responses, which is critical for interactive agents.
- Energy savings stem from both fewer cache transfers and better utilization of GPU tiers (high‑throughput prefiller vs. memory‑bound decoder).
- The approach does not degrade the performance of later turns, confirming that the two‑phase abstraction holds across realistic workloads.
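Relative percentages like these are easy to misread, so the short calculation below converts them into more intuitive terms. It is only arithmetic on the reported figures: the 10-second baseline is a hypothetical value chosen for illustration, and treating "energy efficiency" as work per unit of energy is an assumption, since this summary does not give the paper's exact definition.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """A latency cut of X % corresponds to a speedup factor of 1 / (1 - X/100)."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


def energy_saved_from_efficiency_gain(gain_pct: float) -> float:
    """If efficiency (work per joule, an assumed definition) rises by X %,
    energy per unit of work falls by 100 * (1 - 1 / (1 + X/100)) percent."""
    return 100.0 * (1.0 - 1.0 / (1.0 + gain_pct / 100.0))


ttfet_speedup = speedup_from_reduction(51.08)
energy_saved_pct = energy_saved_from_efficiency_gain(7.51)

baseline_p95_s = 10.0  # hypothetical baseline p95 TTFET, for illustration only
conserve_p95_s = baseline_p95_s * (1.0 - 51.08 / 100.0)

print(f"p95 TTFET speedup: {ttfet_speedup:.2f}x")
print(f"p95 TTFET: {baseline_p95_s:.2f} s -> {conserve_p95_s:.2f} s")
print(f"energy per unit of work: {energy_saved_pct:.2f} % lower")
```

A 51.08 % reduction therefore means the tail wait is roughly halved, or about a 2x speedup at the 95th percentile.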
Practical Implications
- Faster user experiences – Applications like AI assistants, code‑generation bots, or multi‑modal agents can roughly halve the 95th‑percentile wait for the first effective token, improving perceived responsiveness.
- Cost reduction for cloud providers – Lower energy consumption and better GPU utilization translate into cheaper inference services, especially at scale where thousands of concurrent conversations run.
- Simplified infrastructure – Operators no longer need to maintain complex predictive models for scheduling; observable metrics suffice, reducing engineering overhead and model‑drift concerns.
- Heterogeneous hardware orchestration – ConServe’s clear separation of compute‑bound and memory‑bound phases makes it easier to map workloads onto mixed‑generation GPU fleets, extending the lifespan of older hardware.
- Framework integration – The design can be incorporated into existing LLM serving stacks (e.g., vLLM, TensorRT‑LLM) with modest changes: add a prefiller service and a KV‑cache handoff mechanism.
Limitations & Future Work
- Assumption of a clear two‑phase pattern – Workloads with highly dynamic turn lengths or frequent tool‑call interruptions may not fit the prefill‑then‑tail model cleanly.
- KV‑cache transfer overhead – While transferred only once, the handoff still incurs latency; optimizing this path (e.g., zero‑copy or unified memory) is an open area.
- Scalability of dedicated decoders – Pinning each conversation to a single decoder could limit concurrency on the decoder tier; future work may explore multiplexing strategies without re‑introducing prediction.
- Generalization to other model families – The study focuses on decoder‑only LLMs; extending the approach to encoder‑decoder or retrieval‑augmented models warrants investigation.
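To get a feel for the one-time KV‑cache handoff cost mentioned above, the following back-of-the-envelope estimate uses the standard size formula for a decoder-only transformer's KV cache (keys plus values, per layer, per KV head). The model shape, context length, and link bandwidth are illustrative assumptions and are not taken from the paper.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size: 2 tensors (K and V) x layers x KV heads x head dim x tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def transfer_seconds(num_bytes: int, link_bytes_per_second: float) -> float:
    """Idealized transfer time: ignores setup latency, protocol overhead, and contention."""
    return num_bytes / link_bytes_per_second


# Assumed shape: 32 layers, 8 KV heads, head dim 128, fp16 values, 8192-token first turn.
size = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)
seconds = transfer_seconds(size, link_bytes_per_second=50e9)  # assumed 50 GB/s link

print(f"{size / 2**30:.2f} GiB, {seconds * 1e3:.1f} ms")  # 1.00 GiB, 21.5 ms
```

Even under these idealized assumptions, the handoff adds tens of milliseconds per conversation, which is why paying it once rather than on every turn matters.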
Overall, ConServe demonstrates that rethinking the granularity of scheduling—from turn to conversation—can unlock significant performance and efficiency gains without the brittleness of predictive models, offering a pragmatic path forward for production LLM‑based agents.
Authors
- Jianru Ding
- Ryien Hosseini
- Pouya Mahdi Gholami
- Mingyuan Xiang
- Henry Hoffmann
Paper Information
- arXiv ID: 2606.01839v1
- Categories: cs.DC, cs.AR, cs.LG
- Published: June 1, 2026