[Paper] VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
Source: arXiv - 2605.06068v1
Overview
The paper introduces VibeServe, an AI‑driven system that automatically builds custom LLM serving stacks instead of relying on a one‑size‑fits‑all infrastructure. By treating the design of a serving pipeline as a search problem solved by cooperating agents, VibeServe can generate, verify, and benchmark bespoke deployments that match the quirks of a given model, workload, or hardware platform.
Key Contributions
- Agentic design loop – A two‑level loop (planner + executor agents) that synthesizes full serving stacks from scratch, including code, configuration, and deployment scripts.
- Correctness‑first verification – Generated components are automatically unit‑tested and performance‑profiled before being accepted.
- Competitive baseline performance – In standard, highly‑optimized scenarios VibeServe matches the state‑of‑the‑art vLLM runtime.
- Specialized gains in non‑standard settings – Demonstrates up to 2× speed‑up or memory savings on six workloads involving exotic model architectures, workload‑aware batching, and hardware‑specific kernels.
- Open‑source implementation – Full codebase and reproducible benchmarks released on GitHub.
Methodology
- Problem framing – Treat LLM serving as a combinatorial design space (choice of tokenizer, inference engine, batching strategy, GPU/CPU placement, etc.).
- Outer planning loop – A high‑level LLM (the “architect”) proposes a candidate stack description, maintains a task graph, and tracks explored designs.
- Inner implementation loop – For each proposal, a second LLM (the “builder”) writes the necessary code/configuration, runs automated unit tests, and executes a micro‑benchmark on the target hardware.
- Feedback & pruning – Results (correctness, latency, throughput, memory) are fed back to the planner, which discards failing designs and iteratively refines the search.
- Evaluation – The system is benchmarked against vLLM on a standard deployment (single‑GPU, GPT‑2‑like model) and on six “non‑standard” cases (e.g., mixture‑of‑experts models, quantized weights, multi‑node inference, custom tokenizers).
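To make the "combinatorial design space" framing concrete, the following minimal sketch enumerates candidate stacks as the Cartesian product of a few design axes. The axis names and option values are hypothetical placeholders chosen for illustration; they are not taken from the paper or its codebase.

```python
from itertools import product

# Hypothetical design axes (illustrative names only, not from the paper).
DESIGN_SPACE = {
    "tokenizer": ["default", "custom"],
    "engine": ["pytorch_eager", "triton_kernels"],
    "batching": ["static", "continuous"],
    "placement": ["gpu", "cpu_offload"],
}


def enumerate_designs(space):
    """Yield every combination of options as a dict (one candidate stack)."""
    axes = list(space)
    for combo in product(*(space[axis] for axis in axes)):
        yield dict(zip(axes, combo))


designs = list(enumerate_designs(DESIGN_SPACE))
print(len(designs))  # 2 * 2 * 2 * 2 = 16
print(designs[0])
```

Even four binary choices give 16 candidates, and each added axis multiplies the count, which is why the search is guided by a planner rather than exhaustive enumeration.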
The approach is deliberately lightweight: the agents are prompted with concise design templates and rely on existing open‑source libraries (PyTorch, Triton, FastAPI) rather than building a new runtime from scratch.
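The planner/builder interaction described in the methodology can be sketched roughly as follows. This is a toy under stated assumptions, not the authors' implementation: `propose` stands in for the architect LLM, `build_and_evaluate` stands in for code generation plus unit tests plus a micro-benchmark, and the throughput numbers are simulated with a seeded random generator. All names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Result:
    design: dict
    passed: bool        # did all functional tests pass?
    throughput: float   # tokens/s from the micro-benchmark (simulated here)


def propose(explored, rng):
    """Planner stub: pick a design that has not been explored yet."""
    options = [
        {"batching": b, "precision": p}
        for b in ("static", "continuous")
        for p in ("fp16", "int8")
    ]
    remaining = [d for d in options if d not in explored]
    return rng.choice(remaining) if remaining else None


def build_and_evaluate(design, rng):
    """Builder stub: replaces code generation, testing, and benchmarking."""
    # Arbitrary rule so that one design fails its tests (illustration only).
    passed = not (design["batching"] == "static" and design["precision"] == "int8")
    throughput = rng.uniform(50, 150) if passed else 0.0
    return Result(design, passed, throughput)


def search(budget, seed=0):
    """Outer loop: propose, build, evaluate, prune, and keep the best design."""
    rng = random.Random(seed)
    explored, best = [], None
    for _ in range(budget):
        design = propose(explored, rng)
        if design is None:
            break  # design space exhausted
        explored.append(design)
        result = build_and_evaluate(design, rng)
        if not result.passed:
            continue  # correctness first: failing designs are discarded
        if best is None or result.throughput > best.throughput:
            best = result
    return best, explored


best, explored = search(budget=10)
print(len(explored), best.design)
```

The design choice worth noting is the ordering of the checks: a candidate is only compared on throughput after it has passed its tests, which mirrors the correctness-first acceptance rule described above.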
Results & Findings
| Scenario | Baseline (vLLM) | VibeServe | Throughput change |
|---|---|---|---|
| Standard single‑GPU GPT‑2 | 120 tokens/s | 118 tokens/s | –2 % |
| MoE model with expert routing | 45 tokens/s | 78 tokens/s | +73 % |
| 8‑bit quantized LLaMA | 90 tokens/s | 112 tokens/s | +24 % |
| Multi‑node inference (2 GPUs) | 210 tokens/s | 260 tokens/s | +24 % |
| Custom tokenizer + streaming API | 55 tokens/s | 92 tokens/s | +67 % |
| GPU‑specific kernel (TensorRT) | 130 tokens/s | 165 tokens/s | +27 % |
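The last column follows from the two throughput columns as a relative change, (VibeServe − baseline) / baseline. The short sketch below recomputes it from the throughput figures in the table; the row labels are abbreviated, and the helper name `relative_change` is just an illustrative choice.

```python
# Throughput pairs (baseline vLLM, VibeServe) in tokens/s, copied from the table.
rows = {
    "standard GPT-2": (120, 118),
    "MoE routing": (45, 78),
    "8-bit LLaMA": (90, 112),
    "multi-node, 2 GPUs": (210, 260),
    "custom tokenizer": (55, 92),
    "TensorRT kernel": (130, 165),
}


def relative_change(baseline, candidate):
    """Return the change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


for name, (base, cand) in rows.items():
    print(f"{name}: {relative_change(base, cand):+.1f} %")
# standard GPT-2: -1.7 %
# MoE routing: +73.3 %
# 8-bit LLaMA: +24.4 %
# multi-node, 2 GPUs: +23.8 %
# custom tokenizer: +67.3 %
# TensorRT kernel: +26.9 %
```

The table reports these values as whole percentages.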
Key takeaways
- No meaningful regression on well‑tuned standard workloads (118 vs. 120 tokens/s, a difference of about 2 %).
- Significant gains when the workload deviates from the assumptions baked into generic stacks (e.g., non‑uniform batch sizes, mixed‑precision, or hardware‑specific kernels).
- The generated stacks remain correct (all functional tests passed) and portable (deployable with Docker or Kubernetes).
Practical Implications
- Rapid prototyping – Teams can spin up a production‑grade serving pipeline for a new model in minutes to hours, depending on how long the search runs, without hand‑tuning low‑level kernels.
- Cost optimization – By automatically selecting the most efficient batching and quantization strategy for the target hardware, cloud spend can be reduced, especially for edge or multi‑tenant deployments.
- Hardware‑aware innovation – Companies building custom ASICs or leveraging emerging GPUs can let VibeServe discover the best way to map their models, accelerating time‑to‑market.
- Reduced ops burden – The agentic loop abstracts away the “infrastructure plumbing,” allowing ML engineers to focus on model improvements rather than serving engineering.
- Extensible ecosystem – Because VibeServe produces standard artifacts (Dockerfiles, config files, Python modules), existing CI/CD pipelines can ingest them unchanged.
Limitations & Future Work
- Search overhead – The generation and benchmarking phase can take several minutes to hours, which may be prohibitive for ultra‑fast iteration cycles.
- Reliance on LLM correctness – Mis‑generated code can slip through if tests are insufficiently comprehensive; stronger formal verification is needed.
- Scope of supported components – Currently limited to PyTorch‑based backends and a handful of hardware accelerators; extending to JAX, ONNX Runtime, or FPGA toolchains remains future work.
- Scalability to massive clusters – The paper evaluates up to two GPUs; handling large‑scale multi‑node clusters will require more sophisticated planning heuristics.
The authors plan to integrate reinforcement learning‑based reward shaping for faster convergence, broaden the library of hardware backends, and explore “continuous serving” where the system adapts the stack on‑the‑fly as workload patterns evolve.
Authors
- Keisuke Kamahori
- Shihang Li
- Simon Peter
- Baris Kasikci
Paper Information
- arXiv ID: 2605.06068v1
- Categories: cs.AI, cs.DC
- Published: May 7, 2026