[Paper] FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations

Published: February 17, 2026 at 01:23 AM EST
4 min read
Source: arXiv - 2602.15379v1

Overview

FlashMem tackles a growing pain point for mobile developers: running today’s massive deep‑neural‑network (DNN) models on phones and tablets that have limited GPU memory. By rethinking how model weights are loaded—streaming them on‑demand instead of pre‑loading everything—FlashMem slashes memory use and cuts inference latency, making on‑device AI more practical for real‑world apps.

Key Contributions

  • Memory‑streaming runtime that schedules weight loads statically and streams them dynamically during execution.
  • Exploitation of 2.5D texture memory on mobile GPUs to avoid costly data format conversions and to keep the data path tight.
  • Comprehensive evaluation on 11 state‑of‑the‑art DNNs, showing 2.0×–8.4× memory savings and up to 75× speedup versus existing mobile‑GPU frameworks.
  • Support for multi‑DNN workloads, enabling sequential or concurrent inference of several models (e.g., vision + speech) on the same device.

Methodology

  1. Static Load‑Schedule Generation – Before runtime, FlashMem analyzes the computational graph of a model (or a pipeline of models) and decides the exact order and granularity at which weight tensors will be needed.
  2. On‑Demand Streaming – At inference time, only the currently required weight tiles are streamed from main memory into the GPU’s 2.5D texture cache. The rest stay resident in system RAM, freeing up precious GPU VRAM.
  3. Texture‑Memory‑Centric Execution – By storing streamed weights directly as texture objects, FlashMem sidesteps the usual copy‑to‑buffer step that most frameworks perform, reducing both latency and bandwidth consumption.
  4. Runtime Scheduler – A lightweight controller monitors kernel launches and pre‑fetches the next weight tiles just‑in‑time, overlapping data movement with computation to keep the GPU busy.
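The interaction between steps 2 and 4 can be sketched in a few lines. This is an illustrative Python sketch under assumed interfaces, not the paper's implementation: `TileStreamer`, `load_tile`, and `execute` are invented names standing in for the real RAM-to-texture copy and kernel launch. A precomputed schedule drives a prefetch thread that loads upcoming weight tiles while the main thread runs the current layer, overlapping data movement with computation.

```python
import threading
import queue

class TileStreamer:
    """Prefetch weight tiles in the order fixed by a static schedule."""

    def __init__(self, schedule, load_tile, depth=2):
        self.schedule = schedule          # ordered tile ids, decided ahead of time
        self.load_tile = load_tile        # stand-in for a RAM -> GPU-texture copy
        self.ready = queue.Queue(maxsize=depth)  # bounded look-ahead window
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        # Producer: stream tiles just-in-time; blocks when the window is full,
        # so only `depth` tiles ever occupy GPU-side memory at once.
        for tile_id in self.schedule:
            self.ready.put((tile_id, self.load_tile(tile_id)))

    def next_tile(self):
        # Consumer: blocks only if prefetch has fallen behind compute.
        return self.ready.get()

def run_inference(layers, streamer, execute):
    """Run layers in order, consuming one streamed weight tile per layer."""
    outputs = []
    for layer in layers:
        tile_id, weights = streamer.next_tile()
        outputs.append(execute(layer, weights))
    return outputs
```

Because the queue is bounded, peak weight residency is capped at `depth` tiles regardless of total model size, which is the essence of the memory savings the paper reports.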

The approach is deliberately hardware‑aware but abstracted enough that developers can integrate it via existing mobile‑GPU APIs (e.g., Vulkan, OpenGL ES) without rewriting their model code.

Results & Findings

| Metric | Prior Mobile‑GPU Frameworks | FlashMem |
| --- | --- | --- |
| Peak GPU memory | 300 MB (average) | 36 MB – 150 MB (2.0×–8.4× reduction) |
| Single‑model latency | 120 ms (ResNet‑101) | 68 ms (1.7× faster) |
| Multi‑model pipeline latency | 350 ms (vision + speech) | 5 ms – 20 ms (up to 75× faster) |
| Energy per inference | ~1.2 J | ~0.4 J (to ≈30% of baseline) |

The numbers come from running 11 representative models (including ResNet, MobileNetV3, BERT, and YOLO variants) on a flagship Android device equipped with a Mali‑G78 GPU. FlashMem consistently kept the GPU occupied while the CPU handled the streaming, delivering both memory efficiency and speed gains.

Practical Implications

  • On‑device AI becomes feasible for larger models (e.g., transformer‑based NLP, high‑resolution vision) without resorting to cloud inference, preserving privacy and reducing latency.
  • Multi‑task apps (augmented reality + voice assistants, real‑time translation + object detection) can run several DNNs back‑to‑back on a single GPU, opening new product experiences.
  • Battery life improves because the GPU idles less and the system avoids swapping large weight blobs in and out of VRAM.
  • Developers can adopt FlashMem via a thin library that plugs into existing TensorFlow Lite or PyTorch Mobile pipelines, requiring only a change in the model loading phase.
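The "change only the loading phase" idea from the last bullet might look like the following. All names here (`FlashMemLoader`, `plan`, `stream`) are hypothetical, invented for illustration; the paper does not publish the library's API.

```python
class FlashMemLoader:
    """Hypothetical sketch: stream weight tiles instead of eager-loading.

    Invented API for illustration only, not FlashMem's actual interface.
    """

    def __init__(self, model_path, tile_bytes=1 << 20):
        self.model_path = model_path   # e.g. a .tflite model file
        self.tile_bytes = tile_bytes   # streaming granularity

    def plan(self, layer_order):
        # Static schedule: weights are fetched in topological layer order,
        # mirroring the paper's ahead-of-time load-schedule generation.
        return [f"{layer}/weights" for layer in layer_order]

    def stream(self, schedule):
        # On-demand streaming: yield one schedule entry at a time instead
        # of materializing every tensor in GPU memory up front.
        for entry in schedule:
            yield entry  # stand-in for a RAM -> texture-tile copy
```

An app would replace its eager, load-everything step with `loader.plan(...)` followed by iterating `loader.stream(...)` during inference; the model-execution code itself stays unchanged, which is what makes the integration "thin".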

Limitations & Future Work

  • FlashMem’s static scheduling assumes the model graph is known ahead of time; dynamic architectures (e.g., conditional execution) may need runtime re‑planning.
  • The current implementation targets Android GPUs with 2.5D texture support; extending to iOS Metal or newer heterogeneous accelerators will require additional engineering.
  • Streaming overhead grows for models with extremely fine‑grained weight accesses; future work could explore adaptive tile sizing or compression‑aware streaming to further reduce bandwidth.
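As a thought experiment on the adaptive-tile-sizing direction mentioned above, a runtime could pick the streaming granularity per layer so that per-transfer setup cost stays amortized. The heuristic and constants below are assumptions for illustration, not taken from the paper.

```python
def pick_tile_bytes(layer_bytes, min_tile=64 * 1024, max_tile=4 * 1024 * 1024):
    """Pick a per-layer streaming tile size (illustrative heuristic).

    Aims for roughly 8 transfers per layer so fixed per-transfer overhead
    is amortized, clamped to sane bounds for very small or large layers.
    """
    target = layer_bytes // 8 or min_tile
    return max(min_tile, min(max_tile, target))
```

Small layers collapse to a single minimum-size transfer (avoiding the fine-grained-access overhead the authors flag), while very large layers are capped so the look-ahead window still fits in GPU memory.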

Overall, FlashMem demonstrates that clever memory hierarchy exploitation can bridge the gap between ever‑growing DNN sizes and the tight resource budgets of mobile GPUs, paving the way for richer on‑device AI experiences.

Authors

  • Zhihao Shu
  • Md Musfiqur Rahman Sanim
  • Hangyu Zheng
  • Kunxiong Zhu
  • Miao Yin
  • Gagan Agrawal
  • Wei Niu

Paper Information

  • arXiv ID: 2602.15379v1
  • Categories: cs.DC, cs.LG
  • Published: February 17, 2026