[Paper] FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

Published: June 2, 2026 at 11:49 PM EDT
Source: arXiv - 2606.04415v1

Overview

FlexNPU introduces a lightweight, user‑space virtualization layer for Huawei Ascend NPUs that lets AI services dynamically balance the heavy “prefill” and the latency‑critical “decode” phases of large language model (LLM) serving. By interposing on the AscendCL API, FlexNPU can schedule work across multiple physical NPUs without any changes to model code, frameworks, or drivers, delivering near‑zero overhead while boosting throughput and reducing first‑token latency.

Key Contributions

  • Transparent NPU virtualization – a user‑space daemon that intercepts AscendCL calls and presents virtual NPU objects to applications, eliminating the need for code or driver modifications.
  • Phase‑aware scheduling – a runtime that distinguishes prefill (compute‑bound) and decode (memory‑bandwidth/KV‑cache bound) phases and dynamically co‑locates them on the same or different NPUs to exploit complementary resource usage.
  • Dynamic PD (prefill‑decode) co‑location – replaces static disaggregation with a flexible policy that adapts to workload characteristics in real time.
  • Zero‑overhead inference – empirical results show no measurable slowdown compared with direct NPU passthrough, and modest throughput gains in many scenarios.
  • Scalable evaluation – demonstrated on a 384‑card Ascend 910C cluster with real‑world LLMs (DeepSeek‑R1, Qwen2.5‑7B), achieving up to 26 % higher throughput and >92 % reduction in time‑to‑first‑token (TTFT).
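The complementary-resource idea behind phase-aware co-location can be illustrated with a toy placement model. This is a minimal sketch under stated assumptions, not FlexNPU's actual scheduler: `Npu`, `DEMAND`, and `pick_npu` are hypothetical names, and the demand percentages are invented purely to make prefill compute-heavy and decode bandwidth-heavy.

```python
from dataclasses import dataclass

# Hypothetical per-phase demand, in percent of one NPU's capacity.
# Illustrative values only; they are not measurements from the paper.
DEMAND = {
    "prefill": {"compute": 60, "bandwidth": 10},  # compute-bound
    "decode": {"compute": 10, "bandwidth": 50},   # bandwidth-bound
}


@dataclass
class Npu:
    name: str
    compute: int = 0    # percent of compute in use
    bandwidth: int = 0  # percent of memory bandwidth in use

    def usage_after(self, phase):
        """Return (compute, bandwidth) usage if `phase` were placed here."""
        d = DEMAND[phase]
        return self.compute + d["compute"], self.bandwidth + d["bandwidth"]


def pick_npu(npus, phase):
    """Place `phase` on the NPU whose busiest resource stays lowest.

    Returns the chosen Npu, or None if no NPU has room.
    """
    best = None
    for npu in npus:
        usage = npu.usage_after(phase)
        if max(usage) > 100:
            continue  # would oversubscribe compute or bandwidth
        if best is None or max(usage) < max(best.usage_after(phase)):
            best = npu
    if best is not None:
        best.compute, best.bandwidth = best.usage_after(phase)
    return best


npus = [Npu("npu0"), Npu("npu1")]
placements = [pick_npu(npus, p).name
              for p in ("prefill", "decode", "prefill", "decode")]
print(placements)
```

With these made-up numbers, each NPU ends up hosting one prefill job and one decode job, while two prefill jobs would not fit on the same device. A static disaggregation policy would instead pin each phase to its own NPU and leave the other resource idle.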

Methodology

  1. API Interposition – FlexNPU injects a thin library that wraps every AscendCL function. Calls are forwarded to a per‑device daemon that owns the actual hardware resources.
  2. Virtual Object Management – The daemon creates virtual handles for contexts, streams, and memory buffers, mapping them to physical resources on demand.
  3. Operator Dispatch Engine – Before an operator is launched, FlexNPU inspects the current LLM phase (prefill vs. decode) and selects an appropriate NPU (or set of NPUs) based on a lightweight resource model (compute vs. memory bandwidth).
  4. Dynamic Scheduling Policy – A simple heuristic monitors queue lengths and resource utilization, automatically migrating decode work onto NPUs that are under‑utilized after a prefill burst, and vice versa.
  5. Evaluation Setup – Experiments were run on a 384‑card Ascend 910C cluster using the official Huawei Ascend AI framework. Baselines included (a) direct passthrough (no virtualization) and (b) static PD disaggregation (prefill and decode permanently bound to separate NPUs). Metrics captured were throughput (tokens / s), TTFT, and time per output token (TPOT).
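Steps 1 and 2 (interposition and virtual handles) can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: `FakeBackend` stands in for the device runtime and is not the AscendCL API, the `Daemon` is an ordinary in-process object instead of a separate process reached over IPC, and all class and method names are hypothetical.

```python
import itertools


class FakeBackend:
    """Stand-in for a real device runtime; NOT the AscendCL API."""

    def __init__(self):
        self._next = itertools.count(1000)
        self.calls = []  # log of calls that reached the "hardware"

    def create_stream(self, device_id):
        handle = next(self._next)
        self.calls.append(("create_stream", device_id, handle))
        return handle

    def launch(self, stream, op_name):
        self.calls.append(("launch", stream, op_name))
        return f"{op_name}@{stream}"


class Daemon:
    """Owns physical handles and binds virtual handles to them on demand."""

    def __init__(self, backend, device_id=0):
        self.backend = backend
        self.device_id = device_id
        self._next_virtual = itertools.count(1)
        self._physical = {}  # virtual handle -> physical handle (or None)

    def create_virtual_stream(self):
        vh = next(self._next_virtual)
        self._physical[vh] = None  # bound lazily on first use
        return vh

    def resolve(self, vh):
        if self._physical[vh] is None:
            self._physical[vh] = self.backend.create_stream(self.device_id)
        return self._physical[vh]

    def launch(self, vh, op_name):
        return self.backend.launch(self.resolve(vh), op_name)


class Shim:
    """Offers the backend's call names but forwards everything to the daemon."""

    def __init__(self, daemon):
        self._daemon = daemon

    def create_stream(self, device_id=0):
        # device_id is accepted for compatibility; the daemon decides placement.
        return self._daemon.create_virtual_stream()

    def launch(self, stream, op_name):
        return self._daemon.launch(stream, op_name)


backend = FakeBackend()
shim = Shim(Daemon(backend))
stream = shim.create_stream()          # virtual handle, no hardware touched yet
result = shim.launch(stream, "matmul")  # physical stream is created here
print(stream, result, backend.calls)
```

The application only ever sees the virtual handle, so the daemon is free to decide when, and on which device, the physical stream is created. That indirection is what would let a scheduler re-target work without any change to application code.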

Results & Findings

Model | Baseline | FlexNPU (vs. static PD) | FlexNPU (vs. passthrough)
DeepSeek‑R1 (384‑card) | (not stated) | +5.15 % throughput (PD) / +26.33 % (co‑location) | No measurable overhead; slight throughput bump in some configs
Qwen2.5‑7B | Static PD co‑location | ≈ same throughput | TTFT ↓ > 92 % while TPOT stays flat

  • Near‑zero overhead: The virtualization layer adds < 0.5 % latency, which is within measurement noise.
  • Throughput gains: By allowing prefill and decode to share under‑utilized compute units, FlexNPU squeezes extra tokens per second out of a fixed hardware pool.
  • Latency improvement: Decoding can start almost immediately after prefill finishes, cutting first‑token latency dramatically—a critical metric for interactive AI services.
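For readers less familiar with the metrics, the sketch below shows one common way to derive TTFT, TPOT, and throughput from per-token timestamps, plus a helper for relative change. The function names and the timestamps are hypothetical, and the paper's exact measurement method (for example, cluster-wide rather than per-request throughput) may differ.

```python
def serving_metrics(request_start, token_times):
    """Compute TTFT, TPOT, and throughput for one request.

    request_start: time the request arrived (seconds)
    token_times:   completion time of each output token (seconds, ascending)
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    n = len(token_times)
    # TPOT averages the gaps between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    throughput = n / (token_times[-1] - request_start)
    return {"ttft": ttft, "tpot": tpot, "throughput": throughput}


def relative_change(baseline, new):
    """Signed relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical timestamps, not data from the paper.
slow = serving_metrics(0.0, [2.0, 3.0, 4.0, 5.0])
fast = serving_metrics(0.0, [0.5, 1.5, 2.5, 3.5])
print(relative_change(slow["ttft"], fast["ttft"]))  # TTFT change in percent
print(relative_change(slow["tpot"], fast["tpot"]))  # TPOT change in percent
```

In this invented example the first token arrives earlier while the gap between subsequent tokens is unchanged, which is the same shape as the Qwen2.5‑7B result: a large TTFT reduction with flat TPOT.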

Practical Implications

  • Simplified Deployment – Operators can run LLM services on existing Ascend clusters without rewriting code or re‑compiling models; FlexNPU works as a drop‑in library.
  • Higher Utilization – Data‑center operators can pack more inference jobs onto the same hardware, reducing capital expenditure (CAPEX) and operational cost (OPEX).
  • Responsive AI Applications – Chatbots, code assistants, and search‑augmented generation benefit from the massive TTFT reduction, delivering a smoother user experience.
  • Future‑Proofing – As newer NPUs arrive, the same virtualization approach can be extended, protecting investments in software stacks while enabling more sophisticated scheduling (e.g., multi‑tenant isolation, QoS guarantees).
  • Potential for Cloud Services – Cloud providers offering NPU‑accelerated inference can expose virtual NPU endpoints to tenants, allowing fine‑grained billing based on actual resource consumption rather than static device allocation.

Limitations & Future Work

  • Hardware Specificity – FlexNPU is built for Huawei's AscendCL; porting to other accelerator ecosystems (e.g., NVIDIA GPUs, Intel Gaudi) will require new interposition layers.
  • Scheduling Heuristics – The current policy is rule‑based; more advanced models (reinforcement learning, predictive analytics) could further improve phase balancing under highly variable workloads.
  • Security Isolation – While virtualization abstracts devices, stronger isolation mechanisms (e.g., sandboxed memory spaces) are needed for multi‑tenant public cloud scenarios.
  • Scalability Tests – Experiments were limited to a 384‑card cluster; evaluating on larger federated deployments and mixed‑precision workloads remains an open avenue.

FlexNPU demonstrates that transparent NPU virtualization is not just a research curiosity—it’s a practical tool that can make LLM serving faster, cheaper, and easier to manage.

Authors

  • Jiongjiong Gu
  • Jianfeng Wang
  • Zidong Han
  • Yongqiao Wang
  • Pengfei Xia
  • Mingjie Zhang
  • Hong Liu
  • Yuanyi Xia
  • Jiajia Chu
  • Yifeng Tang
  • Hui Zang
  • Xin Yao
  • Qijie Qiu
  • Yuzhao Wang
  • Chuanfei Xu
  • Lin Zhang
  • Zhuonan Lai
  • Hongming Huang
  • Jiawei Qiu
  • Gong Zhang
  • Zhong Ming
  • Weipeng Cao

Paper Information

  • arXiv ID: 2606.04415v1
  • Categories: cs.DC
  • Published: June 3, 2026