[Paper] LLM-Driven Intent-Based Privacy-Aware Orchestration Across the Cloud-Edge Continuum
Source: arXiv - 2602.16100v1
Overview
The paper presents a system that lets large language model (LLM) inference services adapt on‑the‑fly to shifting workloads and the heterogeneous GPUs that power modern cloud‑edge infrastructures. By enabling “pipeline reconfiguration” with only tens of milliseconds of downtime, the authors show it is possible to keep LLM services responsive even when resources are scarce or workloads shift dramatically.
Key Contributions
- Dynamic pipeline reconfiguration that can swap in new GPU‑specific deployment configurations while an LLM service is running.
- State‑preserving migration technique that moves the massive model parameters and inference state with ≤ 50 ms of service interruption.
- Serverless‑friendly orchestration that integrates with existing function‑as‑a‑service (FaaS) platforms, allowing elastic scaling without manual tuning.
- Empirical evaluation on a heterogeneous GPU fleet (NVIDIA A100 & L40) demonstrating < 10 % overhead on both time‑to‑first‑token (TTFT) and time‑per‑output‑token (TPOT).
Methodology
- Workload Characterization – The system continuously monitors request patterns (e.g., token length, concurrency) and GPU utilization.
- Configuration Catalog – A set of pre‑computed pipeline layouts (batch size, tensor parallelism, quantization level) is maintained for each GPU type.
- Decision Engine – An LLM‑driven policy model predicts the optimal configuration given the current workload and hardware state.
- Live Migration Protocol
  - Checkpointing: the current inference state (the KV cache of attention keys and values) is snapshotted in GPU memory.
  - Parameter Streaming: model weights are streamed to the target GPU over high‑speed PCIe/NVLink links, with compression to reduce bandwidth.
  - Warm‑Start: the checkpoint is restored on the new pipeline, and pending requests resume with minimal latency.
- Serverless Integration – The whole flow is wrapped as a serverless function that can be triggered automatically by the orchestration layer, keeping the developer experience familiar.
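The control loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the catalog entries, the burstiness threshold, and all function names (`choose_config`, `reconfigure`, `CATALOG`) are hypothetical stand-ins for the paper's configuration catalog, decision engine, and three-step migration protocol.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    gpu_type: str        # e.g. "A100" or "L40"
    batch_size: int
    tensor_parallel: int
    quant_bits: int      # quantization level (16 = full precision)

# Hypothetical configuration catalog: pre-computed pipeline layouts per GPU type.
CATALOG = {
    "A100": [PipelineConfig("A100", 32, 4, 16), PipelineConfig("A100", 64, 4, 8)],
    "L40":  [PipelineConfig("L40", 16, 2, 16), PipelineConfig("L40", 32, 2, 8)],
}

def choose_config(gpu_type: str, avg_tokens: float, concurrency: int) -> PipelineConfig:
    """Toy stand-in for the decision engine: under bursty load, prefer a
    higher-throughput layout (larger batch, lower precision)."""
    bursty = concurrency * avg_tokens > 4096   # illustrative threshold
    options = CATALOG[gpu_type]
    return options[1] if bursty else options[0]

def reconfigure(current: PipelineConfig, target: PipelineConfig) -> list[str]:
    """Return the ordered migration steps, or nothing if no change is needed."""
    if current == target:
        return []
    return [
        "checkpoint_kv_cache",   # snapshot inference state in GPU memory
        "stream_weights",        # compressed transfer over PCIe/NVLink
        "warm_start",            # restore checkpoint, resume pending requests
    ]
```

In the paper's design the policy model is itself LLM-driven rather than a fixed threshold; the sketch only captures the shape of the loop (observe workload, look up catalog, migrate if the chosen layout differs).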
Results & Findings
| Metric | Baseline (static) | Dynamic Reconfig. | Overhead |
|---|---|---|---|
| Service downtime (migration) | – | 48 ms (avg) | < 50 ms |
| TTFT | 120 ms | 128 ms | +6.7 % |
| TPOT | 15 ms/token | 16.3 ms/token | +8.7 % |
| GPU utilization (heterogeneous mix) | 68 % | 84 % | +16 pp |
- The migration cost stays well below the typical human‑perceived latency threshold (≈ 100 ms).
- Even under bursty request spikes, the system picks a higher‑throughput configuration (e.g., larger batch, lower precision) and reverts when load eases, keeping overall latency stable.
- Heterogeneous hardware is exploited: workloads that fit better on an A100 are automatically shifted there, while lighter jobs stay on cost‑effective L40s.
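The overhead figures in the table follow directly from the raw latencies. As a quick sanity check (the formula is standard relative overhead, not taken from the paper):

```python
def overhead_pct(baseline: float, dynamic: float) -> float:
    """Relative overhead of the dynamic system versus the static baseline."""
    return (dynamic - baseline) / baseline * 100

ttft = overhead_pct(120, 128)    # time-to-first-token, ms
tpot = overhead_pct(15, 16.3)    # time-per-output-token, ms/token
print(f"TTFT overhead: {ttft:.1f}%")  # prints "TTFT overhead: 6.7%"
print(f"TPOT overhead: {tpot:.1f}%")  # prints "TPOT overhead: 8.7%"
```

Both values match the table and sit under the paper's claimed < 10 % overhead bound.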
Practical Implications
- Serverless LLM APIs can now auto‑scale across a mixed GPU pool without developers manually provisioning or re‑configuring containers.
- Cost optimization: By moving low‑priority inference to cheaper GPUs and only promoting to premium A100s when needed, cloud providers can offer tiered pricing with better utilization.
- Edge deployments: The same technique works on edge devices equipped with modest GPUs, enabling on‑device inference that can seamlessly fall back to the cloud when the edge is overloaded.
- Continuous deployment: New model versions or quantization schemes can be rolled out without taking the service offline, reducing downtime for SaaS products that rely on LLMs (e.g., chatbots, code assistants).
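The tiered-pricing and edge-fallback ideas above amount to a routing policy over the GPU pool. A minimal sketch, assuming a hypothetical `route_request` policy and a 90 % saturation threshold (neither is specified in the paper):

```python
def route_request(priority: str, a100_util: float, l40_util: float) -> str:
    """Hypothetical tiered routing: keep light or low-priority jobs on
    cost-effective L40s and promote to premium A100s only when needed,
    or when the L40 (edge) pool is saturated."""
    if priority == "high" and a100_util < 0.9:
        return "A100"
    if l40_util < 0.9:
        return "L40"
    # Edge/L40 pool overloaded: spill over to the A100 (cloud) tier.
    return "A100"
```

A real deployment would fold in the workload characterization signals (token length, concurrency) rather than a single utilization number, but the spill-over structure is the same.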
Limitations & Future Work
- The approach assumes high‑speed inter‑GPU links (PCIe 4.0/5.0, NVLink); migration latency could increase on slower networks.
- The catalog of pipeline configurations is static; generating optimal configurations on‑the‑fly for unseen hardware remains an open challenge.
- Security and privacy of the streamed model parameters were not the focus—future work could integrate encrypted transfer and attestation.
- Extending the framework to multi‑node, multi‑region orchestration (beyond a single data‑center) is left for subsequent research.
Authors
- Zijie Su
- Muhammed Tawfiqul Islam
- Mohammad Goudarzi
- Adel N. Toosi
Paper Information
- arXiv ID: 2602.16100v1
- Categories: cs.DC
- Published: February 18, 2026