[Paper] Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving

Published: June 1, 2026 at 03:51 AM EDT

Source: arXiv - 2606.01839v1

Overview

The paper introduces ConServe, a scheduling strategy for large-language-model (LLM) agents that treats an entire multi-turn conversation as the scheduling unit instead of handling each turn separately. The scheduler relies on two quantities that can be observed immediately (first-turn input size and KV-cache occupancy), so it can decide how to allocate compute resources without predicting future decode costs. The reported result is lower latency and better energy efficiency for LLM-driven agents.

Key Contributions

Conversation‑level scheduling: Shifts the granularity from per‑turn to per‑conversation, exposing a stable two‑phase execution pattern (prefill → long memory‑bound tail).
Observation‑only placement: Eliminates the need for learned models that predict decode‑side cost; decisions are based on directly observable quantities.
ConServe architecture: Implements a high‑throughput “prefiller” for the first turn and pins the remainder of the dialogue to a single decoder, transferring the KV cache only once.
Empirical gains: Achieves a 51.08 % reduction in 95th-percentile time-to-first-effective-token (TTFET) and a 7.51 % improvement in energy efficiency, with an additional 22.75 % energy gain when phases are mapped to heterogeneous GPU tiers.
Preserves service guarantees: Maintains last-turn time-between-tokens (TBT) and service-level objectives (SLOs) despite the aggressive latency optimizations.

Methodology

Workload Characterization – The authors analyze LLM‑based agents that iteratively call tools and generate text across many turns. They observe that each conversation typically consists of:
- Turn 1 (prefill) – a compute‑heavy step where the model processes the user’s initial prompt.
- Subsequent turns (tail) – a long, memory‑bound phase where the model repeatedly decodes from an already‑filled KV cache.
Scheduling Unit Redesign – Promoting the conversation to the scheduling unit smooths over the irregularities of individual turns and reveals the two-phase structure.
Observable Metrics – The scheduler only needs:
- Input length of turn 1 (readable from the incoming request).
- Per‑decoder KV occupancy after prefill (directly measurable).
ConServe Implementation –
- Prefiller: A lightweight, high‑throughput GPU instance handles the first‑turn prefill for many concurrent conversations.
- Decoder: Each conversation’s KV cache is transferred once to a dedicated decoder that processes the entire tail, avoiding repeated cache migrations.
- Heterogeneous Tiering (optional): Prefiller runs on cost‑effective GPUs, while decoders run on higher‑performance GPUs for the memory‑bound tail.
Baseline Comparison – The authors compare against a state‑of‑the‑art per‑turn scheduler that predicts decode length, tool latency, and KV growth using a learned model.
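The observation-only placement described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Decoder` and `ConversationScheduler` names, the token-based capacity accounting, and the "least-occupied decoder that fits" policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Decoder:
    """Hypothetical decoder instance; fields are illustrative, not from the paper."""
    name: str
    kv_capacity_tokens: int      # assumed capacity, counted in KV-cache tokens
    kv_used_tokens: int = 0      # observed occupancy after prefills land here

    @property
    def occupancy(self) -> float:
        return self.kv_used_tokens / self.kv_capacity_tokens


class ConversationScheduler:
    """Places whole conversations using only observable quantities."""

    def __init__(self, decoders: list[Decoder]) -> None:
        self.decoders = decoders
        self.pinned: dict[str, Decoder] = {}   # conversation id -> decoder

    def place(self, conv_id: str, first_turn_tokens: int) -> Decoder:
        # Later turns reuse the pinned decoder, so the KV cache moves only once.
        if conv_id in self.pinned:
            return self.pinned[conv_id]
        # Observation 1: first-turn input length, read from the request.
        # Observation 2: each decoder's current KV occupancy.
        fits = [d for d in self.decoders
                if d.kv_used_tokens + first_turn_tokens <= d.kv_capacity_tokens]
        if not fits:
            raise RuntimeError("no decoder has room; queue the conversation")
        target = min(fits, key=lambda d: d.occupancy)  # assumed policy
        target.kv_used_tokens += first_turn_tokens
        self.pinned[conv_id] = target
        return target


decoders = [Decoder("d0", 1000, 600), Decoder("d1", 1000, 200)]
scheduler = ConversationScheduler(decoders)
print(scheduler.place("conv-a", 300).name)   # d1 (least occupied)
print(scheduler.place("conv-a", 50).name)    # d1 again: conversation stays pinned
```

Note that no decode-length or tool-latency estimate appears anywhere in `place`. The sketch also ignores KV growth during later turns and the release of capacity when a conversation ends, both of which a real scheduler would have to track.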

Results & Findings

Metric	ConServe (change vs. baseline)	Per-turn prediction baseline
95th-percentile TTFET (time to first effective token)	-51.08 %	reference
Energy efficiency (overall)	+7.51 %	reference
Additional energy gain with heterogeneous GPU mapping	+22.75 %	reference
Last-turn time between tokens (TBT) & SLO compliance	Preserved	reference

Interpretation:

By eliminating costly predictions and reducing cache shuffling, ConServe delivers substantially faster first‑token responses, which is critical for interactive agents.
Energy savings stem from both fewer cache transfers and better utilization of GPU tiers (high‑throughput prefiller vs. memory‑bound decoder).
The approach does not degrade the performance of later turns, confirming that the two‑phase abstraction holds across realistic workloads.
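To make the headline number concrete, a relative reduction can be converted into a speedup factor with a line of arithmetic. The helper names below are hypothetical, and the only figure taken from the summary is the 51.08 % reduction in 95th-percentile TTFET.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional latency reduction."""
    return 1.0 / (1.0 - reduction)


# 51.08 % lower P95 TTFET corresponds to roughly a 2x speedup at the tail.
print(round(speedup_from_reduction(0.5108), 2))   # 2.04
```

In other words, a tail request that previously waited some time T for its first effective token would wait about 0.49 T under the reported reduction.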

Practical Implications

Faster user experiences – Applications like AI assistants, code-generation bots, or multi-modal agents can show their first useful output in roughly half the time at the tail (a 51 % lower 95th-percentile TTFET), improving perceived responsiveness.
Cost reduction for cloud providers – Lower energy consumption and better GPU utilization translate into cheaper inference services, especially at scale where thousands of concurrent conversations run.
Simplified infrastructure – Operators no longer need to maintain complex predictive models for scheduling; observable metrics suffice, reducing engineering overhead and model‑drift concerns.
Heterogeneous hardware orchestration – ConServe’s clear separation of compute‑bound and memory‑bound phases makes it easier to map workloads onto mixed‑generation GPU fleets, extending the lifespan of older hardware.
Framework integration – The design can be incorporated into existing LLM serving stacks (e.g., vLLM, TensorRT‑LLM) with modest changes: add a prefiller service and a KV‑cache handoff mechanism.

Limitations & Future Work

Assumption of a clear two-phase pattern – Workloads with highly dynamic turn lengths or frequent tool-call interruptions may not fit the prefill-then-tail model cleanly.
KV‑cache transfer overhead – While transferred only once, the handoff still incurs latency; optimizing this path (e.g., zero‑copy or unified memory) is an open area.
Scalability of dedicated decoders – Pinning each conversation to a single decoder could limit concurrency on the decoder tier; future work may explore multiplexing strategies without re‑introducing prediction.
Generalization to other model families – The study focuses on decoder‑only LLMs; extending the approach to encoder‑decoder or retrieval‑augmented models warrants investigation.
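The KV-cache handoff cost noted above can be estimated with a back-of-the-envelope calculation. The sketch below uses the standard size formula for a transformer KV cache (one key and one value vector per layer per token). The model shape, sequence length, and link bandwidth are illustrative assumptions, not values from the paper.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes: int, link_bytes_per_second: float) -> float:
    """Lower bound on handoff time; ignores setup latency and contention."""
    return num_bytes / link_bytes_per_second


# Illustrative values only: 32 layers, 8 KV heads, head dimension 128, fp16,
# an 8192-token first turn, and a 50 GB/s interconnect.
size = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)
print(size / 2**30)                                   # 1.0 (GiB)
print(round(transfer_seconds(size, 50e9) * 1e3, 1))   # 21.5 (ms)
```

Under these assumptions a single handoff costs on the order of tens of milliseconds, which is why paying it once per conversation rather than once per turn matters.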

Overall, ConServe demonstrates that rethinking the granularity of scheduling, from turn to conversation, can unlock significant performance and efficiency gains without the brittleness of predictive models, offering a pragmatic path forward for production LLM-based agents.

Authors

Jianru Ding
Ryien Hosseini
Pouya Mahdi Gholami
Mingyuan Xiang
Henry Hoffmann

Paper Information

arXiv ID: 2606.01839v1
Categories: cs.DC, cs.AR, cs.LG
Published: June 1, 2026
