[Paper] FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations

Published: February 17, 2026 at 01:23 AM EST
4 min read
Source: arXiv - 2602.15379v1

Overview

FlashMem tackles a growing pain point for mobile developers: running today’s massive deep‑neural‑network (DNN) models on phones and tablets that have limited GPU memory. By rethinking how model weights are loaded—streaming them on‑demand instead of pre‑loading everything—FlashMem slashes memory use and cuts inference latency, making on‑device AI more practical for real‑world apps.

Key Contributions

  • Memory‑streaming runtime that schedules weight loads statically and streams them dynamically during execution.
  • Exploitation of 2.5D texture memory on mobile GPUs to avoid costly data format conversions and to keep the data path tight.
  • Comprehensive evaluation on 11 state‑of‑the‑art DNNs, showing 2.0×–8.4× memory savings and up to 75× speedup versus existing mobile‑GPU frameworks.
  • Support for multi‑DNN workloads, enabling sequential or concurrent inference of several models (e.g., vision + speech) on the same device.

Methodology

  1. Static Load‑Schedule Generation – Before runtime, FlashMem analyzes the computational graph of a model (or a pipeline of models) and decides the exact order and granularity at which weight tensors will be needed.
  2. On‑Demand Streaming – At inference time, only the currently required weight tiles are streamed from main memory into the GPU’s 2.5D texture cache. The rest stay resident in system RAM, freeing up precious GPU VRAM.
  3. Texture‑Memory‑Centric Execution – By storing streamed weights directly as texture objects, FlashMem sidesteps the usual copy‑to‑buffer step that most frameworks perform, reducing both latency and bandwidth consumption.
  4. Runtime Scheduler – A lightweight controller monitors kernel launches and pre‑fetches the next weight tiles just‑in‑time, overlapping data movement with computation to keep the GPU busy.
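The interaction between steps 2 and 4 can be sketched in a few lines. This is an illustrative Python sketch under assumed interfaces, not the paper's implementation: `TileStreamer`, `load_tile`, and `execute` are invented names standing in for the real RAM-to-texture copy and kernel launch. A precomputed schedule drives a prefetch thread that loads upcoming weight tiles while the main thread runs the current layer, overlapping data movement with computation.

```python
import threading
import queue

class TileStreamer:
    """Prefetch weight tiles in the order fixed by a static schedule."""

    def __init__(self, schedule, load_tile, depth=2):
        self.schedule = schedule          # ordered tile ids, decided ahead of time
        self.load_tile = load_tile        # stand-in for a RAM -> GPU-texture copy
        self.ready = queue.Queue(maxsize=depth)  # bounded look-ahead window
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        # Producer: stream tiles just-in-time; blocks when the window is full,
        # so only `depth` tiles ever occupy GPU-side memory at once.
        for tile_id in self.schedule:
            self.ready.put((tile_id, self.load_tile(tile_id)))

    def next_tile(self):
        # Consumer: blocks only if prefetch has fallen behind compute.
        return self.ready.get()

def run_inference(layers, streamer, execute):
    """Run layers in order, consuming one streamed weight tile per layer."""
    outputs = []
    for layer in layers:
        tile_id, weights = streamer.next_tile()
        outputs.append(execute(layer, weights))
    return outputs
```

Because the queue is bounded, peak weight residency is capped at `depth` tiles regardless of total model size, which is the essence of the memory savings the paper reports.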

The approach is deliberately hardware‑aware but abstracted enough that developers can integrate it via existing mobile‑GPU APIs (e.g., Vulkan, OpenGL ES) without rewriting their model code.

Results & Findings

| Metric | Prior Mobile‑GPU Frameworks | FlashMem |
| --- | --- | --- |
| Peak GPU memory | 300 MB (average) | 36 MB – 150 MB (2.0×–8.4× reduction) |
| Single‑model latency | 120 ms (ResNet‑101) | 68 ms (1.7× faster) |
| Multi‑model pipeline latency | 350 ms (vision + speech) | 5 ms – 20 ms (up to 75× faster) |
| Energy per inference | ~1.2 J | ~0.4 J (to ≈30% of baseline) |

The numbers come from running 11 representative models (including ResNet, MobileNetV3, BERT, and YOLO variants) on a flagship Android device equipped with a Mali‑G78 GPU. FlashMem consistently kept the GPU occupied while the CPU handled the streaming, delivering both memory efficiency and speed gains.

Practical Implications

  • On‑device AI becomes feasible for larger models (e.g., transformer‑based NLP, high‑resolution vision) without resorting to cloud inference, preserving privacy and reducing latency.
  • Multi‑task apps (augmented reality + voice assistants, real‑time translation + object detection) can run several DNNs back‑to‑back on a single GPU, opening new product experiences.
  • Battery life improves because the GPU idles less and the system avoids swapping large weight blobs in and out of VRAM.
  • Developers can adopt FlashMem via a thin library that plugs into existing TensorFlow Lite or PyTorch Mobile pipelines, requiring only a change in the model loading phase.
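The "change only the loading phase" idea from the last bullet might look like the following. All names here (`FlashMemLoader`, `plan`, `stream`) are hypothetical, invented for illustration; the paper does not publish the library's API.

```python
class FlashMemLoader:
    """Hypothetical sketch: stream weight tiles instead of eager-loading.

    Invented API for illustration only, not FlashMem's actual interface.
    """

    def __init__(self, model_path, tile_bytes=1 << 20):
        self.model_path = model_path   # e.g. a .tflite model file
        self.tile_bytes = tile_bytes   # streaming granularity

    def plan(self, layer_order):
        # Static schedule: weights are fetched in topological layer order,
        # mirroring the paper's ahead-of-time load-schedule generation.
        return [f"{layer}/weights" for layer in layer_order]

    def stream(self, schedule):
        # On-demand streaming: yield one schedule entry at a time instead
        # of materializing every tensor in GPU memory up front.
        for entry in schedule:
            yield entry  # stand-in for a RAM -> texture-tile copy
```

An app would replace its eager, load-everything step with `loader.plan(...)` followed by iterating `loader.stream(...)` during inference; the model-execution code itself stays unchanged, which is what makes the integration "thin".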

Limitations & Future Work

  • FlashMem’s static scheduling assumes the model graph is known ahead of time; dynamic architectures (e.g., conditional execution) may need runtime re‑planning.
  • The current implementation targets Android GPUs with 2.5D texture support; extending to iOS Metal or newer heterogeneous accelerators will require additional engineering.
  • Streaming overhead grows for models with extremely fine‑grained weight accesses; future work could explore adaptive tile sizing or compression‑aware streaming to further reduce bandwidth.
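As a thought experiment on the adaptive-tile-sizing direction mentioned above, a runtime could pick the streaming granularity per layer so that per-transfer setup cost stays amortized. The heuristic and constants below are assumptions for illustration, not taken from the paper.

```python
def pick_tile_bytes(layer_bytes, min_tile=64 * 1024, max_tile=4 * 1024 * 1024):
    """Pick a per-layer streaming tile size (illustrative heuristic).

    Aims for roughly 8 transfers per layer so fixed per-transfer overhead
    is amortized, clamped to sane bounds for very small or large layers.
    """
    target = layer_bytes // 8 or min_tile
    return max(min_tile, min(max_tile, target))
```

Small layers collapse to a single minimum-size transfer (avoiding the fine-grained-access overhead the authors flag), while very large layers are capped so the look-ahead window still fits in GPU memory.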

Overall, FlashMem demonstrates that clever memory hierarchy exploitation can bridge the gap between ever‑growing DNN sizes and the tight resource budgets of mobile GPUs, paving the way for richer on‑device AI experiences.

Authors

  • Zhihao Shu
  • Md Musfiqur Rahman Sanim
  • Hangyu Zheng
  • Kunxiong Zhu
  • Miao Yin
  • Gagan Agrawal
  • Wei Niu

Paper Information

  • arXiv ID: 2602.15379v1
  • Categories: cs.DC, cs.LG
  • Published: February 17, 2026