[Paper] FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

Published: June 2, 2026 at 11:49 PM EDT

Source: arXiv - 2606.04415v1

Overview

FlexNPU introduces a lightweight, user‑space virtualization layer for Huawei Ascend NPUs that lets AI services dynamically balance the heavy “prefill” and the latency‑critical “decode” phases of large language model (LLM) serving. By interposing on the AscendCL API, FlexNPU can schedule work across multiple physical NPUs without any changes to model code, frameworks, or drivers, delivering near‑zero overhead while boosting throughput and reducing first‑token latency.

Key Contributions

Transparent NPU virtualization – a user‑space daemon that intercepts AscendCL calls and presents virtual NPU objects to applications, eliminating the need for code or driver modifications.
Phase‑aware scheduling – a runtime that distinguishes prefill (compute‑bound) and decode (memory‑bandwidth/KV‑cache bound) phases and dynamically co‑locates them on the same or different NPUs to exploit complementary resource usage.
Dynamic PD (prefill‑decode) co‑location – replaces static disaggregation with a flexible policy that adapts to workload characteristics in real time.
Zero‑overhead inference – empirical results show no measurable slowdown compared with direct NPU passthrough, and modest throughput gains in many scenarios.
Scalable evaluation – demonstrated on a 384‑card Ascend 910C cluster with real‑world LLMs (DeepSeek‑R1, Qwen2.5‑7B), achieving up to 26 % higher throughput and >92 % reduction in time‑to‑first‑token (TTFT).
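
To make the idea of "virtual NPU objects" more concrete, here is a minimal Python sketch of a handle table that gives the application opaque virtual handles and binds each one to a physical resource only on first use. It is an illustration under stated assumptions, not FlexNPU's implementation: the class `VirtualHandleTable`, the helper `make_fake_allocator`, and the string-valued "physical handles" are all hypothetical, and no real AscendCL call is made.

```python
from itertools import count
from typing import Callable, Dict


def make_fake_allocator() -> Callable[[int], str]:
    """Hypothetical stand-in for a real device allocation call."""
    counter = count()
    return lambda device_id: f"npu{device_id}/stream{next(counter)}"


class VirtualHandleTable:
    """Maps opaque virtual handles to physical resources, binding lazily."""

    def __init__(self, allocate_physical: Callable[[int], str]) -> None:
        self._allocate_physical = allocate_physical
        self._ids = count(1)
        self._bindings: Dict[int, str] = {}

    def create(self) -> int:
        """Return a new virtual handle without touching any hardware."""
        return next(self._ids)

    def resolve(self, vhandle: int, device_id: int) -> str:
        """Bind on first use; later calls reuse the existing binding."""
        if vhandle not in self._bindings:
            self._bindings[vhandle] = self._allocate_physical(device_id)
        return self._bindings[vhandle]


table = VirtualHandleTable(make_fake_allocator())
stream_a = table.create()
stream_b = table.create()
print(table.resolve(stream_a, device_id=3))  # npu3/stream0
print(table.resolve(stream_a, device_id=5))  # npu3/stream0 (already bound)
print(table.resolve(stream_b, device_id=5))  # npu5/stream1
```

Because the application only ever sees the integer handles, the layer underneath is free to decide which physical device backs each one at the moment it is first needed.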

Methodology

API Interposition – FlexNPU injects a thin library that wraps every AscendCL function. Calls are forwarded to a per‑device daemon that owns the actual hardware resources.
Virtual Object Management – The daemon creates virtual handles for contexts, streams, and memory buffers, mapping them to physical resources on demand.
Operator Dispatch Engine – Before an operator is launched, FlexNPU inspects the current LLM phase (prefill vs. decode) and selects an appropriate NPU (or set of NPUs) based on a lightweight resource model (compute vs. memory bandwidth).
Dynamic Scheduling Policy – A simple heuristic monitors queue lengths and resource utilization, automatically migrating decode work onto NPUs that are under‑utilized after a prefill burst, and vice‑versa.
Evaluation Setup – Experiments were run on a 384‑card Ascend 910C cluster using the official Huawei Ascend AI framework. Baselines included (a) direct passthrough (no virtualization) and (b) static PD disaggregation (prefill and decode permanently bound to separate NPUs). Metrics captured were throughput (tokens / s), time‑to‑first‑token (TTFT), and time per output token (TPOT).
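
The phase‑aware dispatch and scheduling steps above can be sketched as a small cost function. This is a hedged illustration only: the summary does not give the paper's actual resource model or heuristic, so the names `Phase`, `NpuState`, `pick_npu`, and the `queue_weight` parameter are assumptions made for this example. The idea shown is that prefill work is scored against compute utilization, decode work against memory‑bandwidth utilization, and queue length breaks ties.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Phase(Enum):
    PREFILL = "prefill"  # compute-bound
    DECODE = "decode"    # memory-bandwidth / KV-cache bound


@dataclass
class NpuState:
    device_id: int
    compute_util: float    # fraction in [0, 1]
    bandwidth_util: float  # fraction in [0, 1]
    queue_len: int         # operators waiting on this device


def pick_npu(phase: Phase, npus: List[NpuState], queue_weight: float = 0.05) -> int:
    """Return the device id with the lowest estimated cost for this phase."""

    def cost(npu: NpuState) -> float:
        # Score against the resource this phase is bottlenecked on.
        if phase is Phase.PREFILL:
            bottleneck = npu.compute_util
        else:
            bottleneck = npu.bandwidth_util
        return bottleneck + queue_weight * npu.queue_len

    return min(npus, key=cost).device_id


npus = [
    NpuState(device_id=0, compute_util=0.9, bandwidth_util=0.2, queue_len=2),
    NpuState(device_id=1, compute_util=0.3, bandwidth_util=0.8, queue_len=1),
]
print(pick_npu(Phase.PREFILL, npus))  # 1
print(pick_npu(Phase.DECODE, npus))   # 0
```

In this toy example, decode work lands on device 0 even though its compute units are busy, because its memory bandwidth is mostly idle. That is the complementary resource usage that co‑location is meant to exploit.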

Results & Findings

Model	Baseline	FlexNPU (vs. static PD)	FlexNPU (vs. passthrough)
DeepSeek‑R1 (384‑card)	–	+5.15 % throughput (PD) / +26.33 % (co‑location)	No measurable overhead; slight throughput bump in some configs
Qwen2.5‑7B	Static PD co‑location	≈ same throughput	TTFT ↓ > 92 % while TPOT stays flat

Near‑zero overhead: The virtualization layer adds < 0.5 % latency, which is within measurement noise.
Throughput gains: By allowing prefill and decode to share under‑utilized compute units, FlexNPU squeezes extra tokens per second out of a fixed hardware pool.
Latency improvement: Decoding can start almost immediately after prefill finishes, cutting first‑token latency dramatically. TTFT is a critical metric for interactive AI services.
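
For readers unfamiliar with the two latency metrics, the sketch below computes them from per‑token timestamps using their common definitions: TTFT is the delay from request arrival to the first output token, and TPOT is the average gap between subsequent output tokens. The function name `ttft_and_tpot` and the sample timestamps are made up for illustration; this summary does not describe how the paper instruments its measurements.

```python
from typing import List, Tuple


def ttft_and_tpot(request_time: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in the same time unit as the inputs."""
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_time
    if len(token_times) == 1:
        return ttft, 0.0  # no inter-token gaps to average
    # Mean gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Request arrives at t=1.0 s; four tokens are emitted afterwards.
print(ttft_and_tpot(1.0, [2.0, 2.25, 2.5, 2.75]))  # (1.0, 0.25)
```

Under these definitions, a scheduler change can cut TTFT sharply (less waiting before prefill completes) while leaving TPOT unchanged, which matches the shape of the Qwen2.5‑7B result reported above.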

Practical Implications

Simplified Deployment – Operators can run LLM services on existing Ascend clusters without rewriting code or re‑compiling models; FlexNPU works as a drop‑in library.
Higher Utilization – Data‑center operators can pack more inference jobs onto the same hardware, reducing capital expenditure (CAPEX) and operational cost (OPEX).
Responsive AI Applications – Chatbots, code assistants, and search‑augmented generation benefit from the massive TTFT reduction, delivering a smoother user experience.
Future‑Proofing – As newer NPUs arrive, the same virtualization approach can be extended, protecting investments in software stacks while enabling more sophisticated scheduling (e.g., multi‑tenant isolation, QoS guarantees).
Potential for Cloud Services – Cloud providers offering NPU‑accelerated inference can expose virtual NPU endpoints to tenants, allowing fine‑grained billing based on actual resource consumption rather than static device allocation.

Limitations & Future Work

Hardware Specificity – FlexNPU is built for Huawei's AscendCL; porting to other accelerator ecosystems (e.g., NVIDIA GPUs, Intel Gaudi) will require new interposition layers.
Scheduling Heuristics – The current policy is rule‑based; more advanced models (reinforcement learning, predictive analytics) could further improve phase balancing under highly variable workloads.
Security Isolation – While virtualization abstracts devices, stronger isolation mechanisms (e.g., sandboxed memory spaces) are needed for multi‑tenant public cloud scenarios.
Scalability Tests – Experiments were limited to a 384‑card cluster; evaluating on larger federated deployments and mixed‑precision workloads remains an open avenue.

FlexNPU demonstrates that transparent NPU virtualization is not just a research curiosity; it is a practical tool that can make LLM serving faster, cheaper, and easier to manage.

Authors

Jiongjiong Gu
Jianfeng Wang
Zidong Han
Yongqiao Wang
Pengfei Xia
Mingjie Zhang
Hong Liu
Yuanyi Xia
Jiajia Chu
Yifeng Tang
Hui Zang
Xin Yao
Qijie Qiu
Yuzhao Wang
Chuanfei Xu
Lin Zhang
Zhuonan Lai
Hongming Huang
Jiawei Qiu
Gong Zhang
Zhong Ming
Weipeng Cao

Paper Information

arXiv ID: 2606.04415v1
Categories: cs.DC
Published: June 3, 2026
