[Paper] LLM-Driven Intent-Based Privacy-Aware Orchestration Across the Cloud-Edge Continuum
Source: arXiv - 2602.16100v1
Overview
The paper presents a system that lets large language model (LLM) inference services adapt on‑the‑fly to shifting workloads and the heterogeneous GPUs that power modern cloud‑edge infrastructures. By enabling “pipeline reconfiguration” with only tens of milliseconds of downtime, the authors show it is possible to keep LLM services responsive even when resources are scarce or workloads shift dramatically.
Key Contributions
- Dynamic pipeline reconfiguration that can swap in new GPU‑specific deployment configurations while an LLM service is running.
- State‑preserving migration technique that moves the massive model parameters and inference state with ≤ 50 ms of service interruption.
- Serverless‑friendly orchestration that integrates with existing function‑as‑a‑service (FaaS) platforms, allowing elastic scaling without manual tuning.
- Empirical evaluation on a heterogeneous GPU fleet (NVIDIA A100 & L40) demonstrating < 10 % overhead on both time‑to‑first‑token (TTFT) and time‑per‑output‑token (TPOT).
Methodology
- Workload Characterization – The system continuously monitors request patterns (e.g., token length, concurrency) and GPU utilization.
- Configuration Catalog – A set of pre‑computed pipeline layouts (batch size, tensor parallelism, quantization level) is maintained for each GPU type.
- Decision Engine – An LLM‑driven policy model predicts the optimal configuration given the current workload and hardware state.
- Live Migration Protocol
  - Checkpointing: the current inference state (the KV cache of attention keys and values) is snapshotted in GPU memory.
  - Parameter Streaming: model weights are streamed to the target GPU over high‑speed PCIe/NVLink links, with compression to reduce bandwidth.
  - Warm‑Start: the checkpoint is restored on the new pipeline, and pending requests resume with minimal latency.
- Serverless Integration – The whole flow is wrapped as a serverless function that can be triggered automatically by the orchestration layer, keeping the developer experience familiar.
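The control loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the catalog entries, the burstiness threshold, and all function names (`choose_config`, `reconfigure`, `CATALOG`) are hypothetical stand-ins for the paper's configuration catalog, decision engine, and three-step migration protocol.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    gpu_type: str        # e.g. "A100" or "L40"
    batch_size: int
    tensor_parallel: int
    quant_bits: int      # quantization level (16 = full precision)

# Hypothetical configuration catalog: pre-computed pipeline layouts per GPU type.
CATALOG = {
    "A100": [PipelineConfig("A100", 32, 4, 16), PipelineConfig("A100", 64, 4, 8)],
    "L40":  [PipelineConfig("L40", 16, 2, 16), PipelineConfig("L40", 32, 2, 8)],
}

def choose_config(gpu_type: str, avg_tokens: float, concurrency: int) -> PipelineConfig:
    """Toy stand-in for the decision engine: under bursty load, prefer a
    higher-throughput layout (larger batch, lower precision)."""
    bursty = concurrency * avg_tokens > 4096   # illustrative threshold
    options = CATALOG[gpu_type]
    return options[1] if bursty else options[0]

def reconfigure(current: PipelineConfig, target: PipelineConfig) -> list[str]:
    """Return the ordered migration steps, or nothing if no change is needed."""
    if current == target:
        return []
    return [
        "checkpoint_kv_cache",   # snapshot inference state in GPU memory
        "stream_weights",        # compressed transfer over PCIe/NVLink
        "warm_start",            # restore checkpoint, resume pending requests
    ]
```

In the paper's design the policy model is itself LLM-driven rather than a fixed threshold; the sketch only captures the shape of the loop (observe workload, look up catalog, migrate if the chosen layout differs).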
Results & Findings
| Metric | Baseline (static) | Dynamic Reconfig. | Overhead |
|---|---|---|---|
| Service downtime (migration) | – | 48 ms (avg) | < 50 ms |
| TTFT | 120 ms | 128 ms | +6.7 % |
| TPOT | 15 ms/token | 16.3 ms/token | +8.7 % |
| GPU utilization (heterogeneous mix) | 68 % | 84 % | +16 pp |
- The migration cost stays well below the typical human‑perceived latency threshold (≈ 100 ms).
- Even under bursty request spikes, the system picks a higher‑throughput configuration (e.g., larger batch, lower precision) and reverts when load eases, keeping overall latency stable.
- Heterogeneous hardware is exploited: workloads that fit better on an A100 are automatically shifted there, while lighter jobs stay on cost‑effective L40s.
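The overhead figures in the table follow directly from the raw latencies. As a quick sanity check (the formula is standard relative overhead, not taken from the paper):

```python
def overhead_pct(baseline: float, dynamic: float) -> float:
    """Relative overhead of the dynamic system versus the static baseline."""
    return (dynamic - baseline) / baseline * 100

ttft = overhead_pct(120, 128)    # time-to-first-token, ms
tpot = overhead_pct(15, 16.3)    # time-per-output-token, ms/token
print(f"TTFT overhead: {ttft:.1f}%")  # prints "TTFT overhead: 6.7%"
print(f"TPOT overhead: {tpot:.1f}%")  # prints "TPOT overhead: 8.7%"
```

Both values match the table and sit under the paper's claimed < 10 % overhead bound.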
Practical Implications
- Serverless LLM APIs can now auto‑scale across a mixed GPU pool without developers manually provisioning or re‑configuring containers.
- Cost optimization: By moving low‑priority inference to cheaper GPUs and only promoting to premium A100s when needed, cloud providers can offer tiered pricing with better utilization.
- Edge deployments: The same technique works on edge devices equipped with modest GPUs, enabling on‑device inference that can seamlessly fall back to the cloud when the edge is overloaded.
- Continuous deployment: New model versions or quantization schemes can be rolled out without taking the service offline, reducing downtime for SaaS products that rely on LLMs (e.g., chatbots, code assistants).
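The tiered-pricing and edge-fallback ideas above amount to a routing policy over the GPU pool. A minimal sketch, assuming a hypothetical `route_request` policy and a 90 % saturation threshold (neither is specified in the paper):

```python
def route_request(priority: str, a100_util: float, l40_util: float) -> str:
    """Hypothetical tiered routing: keep light or low-priority jobs on
    cost-effective L40s and promote to premium A100s only when needed,
    or when the L40 (edge) pool is saturated."""
    if priority == "high" and a100_util < 0.9:
        return "A100"
    if l40_util < 0.9:
        return "L40"
    # Edge/L40 pool overloaded: spill over to the A100 (cloud) tier.
    return "A100"
```

A real deployment would fold in the workload characterization signals (token length, concurrency) rather than a single utilization number, but the spill-over structure is the same.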
Limitations & Future Work
- The approach assumes high‑speed inter‑GPU links (PCIe 4.0/5.0, NVLink); migration latency could increase on slower networks.
- The catalog of pipeline configurations is static; generating optimal configurations on‑the‑fly for unseen hardware remains an open challenge.
- Security and privacy of the streamed model parameters were not the focus—future work could integrate encrypted transfer and attestation.
- Extending the framework to multi‑node, multi‑region orchestration (beyond a single data‑center) is left for subsequent research.
Authors
- Zijie Su
- Muhammed Tawfiqul Islam
- Mohammad Goudarzi
- Adel N. Toosi
Paper Information
- arXiv ID: 2602.16100v1
- Categories: cs.DC
- Published: February 18, 2026