[Paper] Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM

Published: November 26, 2025 at 09:06 AM EST
4 min read

Source: arXiv - 2511.21413v1

Overview

The paper presents a prototype that lets large‑language‑model (LLM) inference run on a traditional high‑performance‑computing (HPC) cluster by combining three widely used components: vLLM (a fast LLM serving engine), Slurm (the de facto batch scheduler for supercomputers), and Kubernetes (the container‑orchestration platform most developers already know). The authors demonstrate that this hybrid stack can handle hundreds to a thousand simultaneous user requests with only ~500 ms of added latency, showing that existing HPC resources can be repurposed for low‑latency, user‑facing AI services.

Key Contributions

  • Hybrid orchestration layer – A novel integration of Slurm and Kubernetes that lets HPC nodes be managed like a cloud‑native pool while still respecting the batch‑scheduler’s allocation policies.
  • vLLM‑centric serving – Embedding the high‑throughput vLLM inference engine inside the HPC containers, enabling tensor parallelism and model sharding across multiple nodes (a minimal serving sketch follows this list).
  • Dynamic scaling logic – An automated controller that spins up or tears down vLLM pods in response to real‑time request volume, achieving near‑linear scaling for 100, 500, and 1 000 concurrent queries.
  • End‑to‑end benchmark – A quantitative evaluation on the RAMSES supercomputer that measures latency, throughput, and overhead, showing only ~0.5 s extra latency compared with a pure‑cloud deployment.
  • Open‑source reference implementation – The authors release scripts and Helm charts that let other institutions reproduce the setup on their own Slurm‑backed clusters.
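
As an illustration of the vLLM‑centric serving contribution, the sketch below loads a model with vLLM's Python API and shards it across GPUs via tensor parallelism. The model name and parallelism degree are illustrative assumptions, not values taken from the paper; in the deployed pods the OpenAI‑compatible HTTP server entrypoint (`python -m vllm.entrypoints.openai.api_server`) would more typically be used than the in‑process API.

```python
# Minimal serving sketch using vLLM's Python API inside a container.
# Model name and tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

# Shard a 7B model across two GPUs on the node (assumed hardware layout).
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what Slurm does in one sentence."], params)
print(outputs[0].outputs[0].text)
```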

Methodology

  1. Environment Setup – The RAMSES HPC cluster (GPU‑enabled nodes) runs Slurm for resource allocation. A lightweight Kubernetes control plane is deployed on a head node, with each compute node running a Kubelet that registers as a Kubernetes worker.
  2. Containerization – The vLLM server, its dependencies, and the LLM model files are packaged into a Docker image. The image is stored in a private registry accessible from the cluster.
  3. Scheduling Bridge – A custom Slurm‑to‑Kubernetes bridge watches the Slurm job queue. When a user submits an inference request (via a REST endpoint), the bridge creates a Slurm job that reserves the required GPUs and then launches a corresponding Kubernetes pod running vLLM.
  4. Dynamic Autoscaling – A controller monitors request latency and queue depth. If the queue grows beyond a threshold, it requests additional Slurm allocations, which in turn spin up more vLLM pods. When demand drops, the pods are gracefully terminated and the GPUs are released back to Slurm (a controller sketch follows this list).
  5. Benchmarking – The authors generate synthetic workloads of 100, 500, and 1 000 concurrent HTTP requests against a 7B‑parameter LLM. They record end‑to‑end latency (client → API gateway → vLLM → GPU) and compare it to a baseline where vLLM runs on a dedicated Kubernetes cluster without Slurm (a benchmark client sketch also follows this list).
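
The controller behind steps 3 and 4 can be pictured as a loop that watches queue depth at the API gateway and converts scaling decisions into Slurm allocations. The sketch below is a minimal illustration under stated assumptions: the metrics endpoint, the threshold values, and the `vllm_worker.sbatch` batch script (which would launch one vLLM pod per allocation) are hypothetical names, not artifacts from the paper.

```python
# Sketch of the dynamic scaling loop (steps 3 and 4). The metrics endpoint,
# thresholds, and vllm_worker.sbatch are assumptions for illustration only.
import subprocess
import time

import requests

GATEWAY_METRICS_URL = "http://gateway.internal:8080/metrics"  # assumed endpoint
SCALE_UP_QUEUE_DEPTH = 50    # pending requests that trigger a new replica
SCALE_DOWN_QUEUE_DEPTH = 5   # pending requests below which we shrink
MAX_REPLICAS = 8

active_jobs: list[str] = []  # Slurm job IDs backing the running vLLM pods


def queue_depth() -> int:
    """Read the current request queue depth from the API gateway."""
    return int(requests.get(GATEWAY_METRICS_URL, timeout=5).json()["queue_depth"])


def scale_up() -> None:
    """Reserve a GPU via Slurm; the batch script then starts a vLLM pod."""
    out = subprocess.run(
        ["sbatch", "--parsable", "--gres=gpu:1", "vllm_worker.sbatch"],
        check=True, capture_output=True, text=True,
    )
    active_jobs.append(out.stdout.strip())


def scale_down() -> None:
    """Cancel the newest allocation; the pod exits and the GPU returns to Slurm."""
    if active_jobs:
        subprocess.run(["scancel", active_jobs.pop()], check=True)


while True:
    depth = queue_depth()
    if depth > SCALE_UP_QUEUE_DEPTH and len(active_jobs) < MAX_REPLICAS:
        scale_up()
    elif depth < SCALE_DOWN_QUEUE_DEPTH and len(active_jobs) > 1:
        scale_down()
    time.sleep(10)  # poll interval; the paper does not specify one
```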
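
The style of measurement in step 5 can be reproduced with a small concurrent client. In the sketch below the endpoint URL, model name, and concurrency level are assumptions; the request body follows vLLM's OpenAI‑compatible completions API.

```python
# Concurrent latency benchmark sketch in the spirit of step 5.
# Endpoint, model name, and concurrency level are illustrative assumptions.
import asyncio
import time

import aiohttp

API_URL = "http://gateway.internal:8000/v1/completions"  # assumed vLLM endpoint
CONCURRENCY = 100

PAYLOAD = {
    "model": "meta-llama/Llama-2-7b-hf",
    "prompt": "Summarize the benefits of HPC for AI inference.",
    "max_tokens": 64,
}


async def one_request(session: aiohttp.ClientSession) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    async with session.post(API_URL, json=PAYLOAD) as resp:
        await resp.json()
    return time.perf_counter() - start


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(
            *(one_request(session) for _ in range(CONCURRENCY))
        )
    print(f"average end-to-end latency: {sum(latencies) / len(latencies):.2f} s")


asyncio.run(main())
```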

Results & Findings

Concurrent Requests | Avg. End‑to‑End Latency | Overhead vs. Pure‑K8s
--- | --- | ---
100 | ~1.2 s | +0.45 s
500 | ~1.4 s | +0.48 s
1 000 | ~1.6 s | +0.52 s
  • Scalability – Latency grows sub‑linearly; the system can sustain a thousand simultaneous queries without saturating the network or GPU memory.
  • Overhead – The extra ~500 ms comes mainly from the Slurm‑Kubernetes handoff and container startup; once pods are warm, the per‑request cost is comparable to a native cloud deployment.
  • Resource Utilization – The hybrid scheduler achieves >85 % GPU utilization during peak load, far better than static allocation strategies often used on HPC clusters.

Practical Implications

  • Leverage Existing HPC Investments – Universities and research labs can expose their GPU‑rich supercomputers to AI‑driven web services without buying dedicated inference clusters.
  • Cost‑Effective AI SaaS – By reusing batch‑scheduled resources, organizations can offer low‑latency LLM APIs to students, researchers, or internal tools while paying only for actual usage (a minimal client sketch follows this list).
  • Developer‑Friendly Ops – The Kubernetes façade means developers can use familiar kubectl, Helm charts, and CI/CD pipelines, while HPC admins retain control through Slurm policies (fair‑share, quotas, accounting).
  • Hybrid Cloud Edge Cases – The pattern can be extended to burst AI workloads to the cloud when the on‑prem HPC queue is full, creating a seamless on‑prem/off‑prem inference continuum.
  • Standardization Path – The bridge code and Helm charts could become a reference implementation for other HPC centers looking to expose AI services, accelerating the adoption of “HPC‑as‑a‑service” for AI.
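
From the consumer side, such a service would look like any other OpenAI‑compatible endpoint. The minimal client sketch below uses a placeholder base URL and model name; only the response shape (standard OpenAI chat‑completions JSON, which vLLM's server emits) is assumed.

```python
# Minimal client sketch against an OpenAI-compatible endpoint exposed by the service.
# Base URL and model name are placeholders, not values from the paper.
import requests

resp = requests.post(
    "http://hpc-inference.example.org/v1/chat/completions",  # placeholder URL
    json={
        "model": "meta-llama/Llama-2-7b-hf",
        "messages": [{"role": "user", "content": "What is Slurm?"}],
        "max_tokens": 64,
    },
    timeout=30,
)
print(resp.json()["choices"][0]["message"]["content"])
```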

Limitations & Future Work

  • Model Size Boundaries – The evaluation used a 7B‑parameter model; scaling to 70B‑plus models may hit memory limits on current GPU nodes and require more sophisticated tensor‑parallel strategies.
  • Cold‑Start Penalty – The ~500 ms overhead is dominated by pod startup and model loading; caching warmed pods or using a “warm pool” could reduce this further.
  • Security & Multi‑Tenant Isolation – The current prototype assumes a trusted internal network; extending the design to multi‑tenant public APIs would need stronger isolation (e.g., pod security policies, network sandboxing).
  • Scheduler Complexity – Maintaining two schedulers (Slurm + Kubernetes) adds operational overhead; future work could explore tighter integration or a unified API layer.
  • Broader Benchmarks – Real‑world workloads (e.g., retrieval‑augmented generation, multi‑modal inference) and heterogeneous hardware (TPUs, newer GPUs) remain to be tested.