[Paper] Multi-DNN Inference of Sparse Models on Edge SoCs

Published: March 10, 2026 at 09:16 AM EDT
4 min read
Source: arXiv (2603.09642v1)

Overview

Edge devices are now expected to run several deep‑neural‑network (DNN) models simultaneously—think vision, audio, and sensor‑fusion pipelines all on the same chip. The paper “Multi‑DNN Inference of Sparse Models on Edge SoCs” tackles a practical bottleneck: existing runtimes can pick only one (or a few) sparse variants of each model, which forces sub‑optimal placement on heterogeneous accelerators and leads to missed Service‑Level Objectives (SLOs). The authors propose model stitching, a way to recombine sub‑graphs from existing sparse models on the fly, and they demonstrate it with a prototype called SparseLoom that runs on real edge System‑on‑Chips (SoCs).

Key Contributions

  • Model Stitching Concept – Introduces a lightweight, training‑free technique to create new sparse model variants by re‑using sub‑graphs from a pool of pre‑pruned models.
  • SparseLoom Runtime – An end‑to‑end system that integrates model stitching with a scheduler aware of heterogeneous compute units (CPU, GPU, DSP, NPU).
  • SLO‑Driven Allocation – Extends multi‑DNN scheduling to consider per‑task latency budgets, dramatically reducing deadline misses.
  • Comprehensive Evaluation – Shows up to 74 % reduction in SLO violations, 2.31× throughput boost, and 28 % average memory savings versus the best‑available multi‑DNN inference frameworks.
  • Open‑Source Artefacts – The authors release code and a benchmark suite, enabling reproducibility and rapid adoption by the community.
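To make the SLO‑driven allocation idea concrete, here is a minimal, generic sketch of deadline‑aware ordering of inference requests. This is an earliest‑deadline‑first illustration, not the paper's actual scheduling policy, and all task names and numbers are hypothetical:

```python
def edf_order(tasks):
    """tasks: (name, arrival_ms, slo_ms) triples.
    Each request's absolute deadline is arrival + its per-task SLO budget;
    running the tightest deadline first minimizes avoidable SLO misses."""
    return [name for _, name in sorted(
        (arrival_ms + slo_ms, name) for name, arrival_ms, slo_ms in tasks)]

# A pose request arriving later but with a 16 ms budget jumps ahead of
# detection (33 ms budget) and speech recognition (100 ms budget).
order = edf_order([("detect", 0, 33), ("asr", 0, 100), ("pose", 10, 16)])
```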

Methodology

  1. Sparse Model Pool – The authors start with a collection of sparsified versions of each DNN (e.g., 70 % and 90 % weight pruning).
  2. Graph Partitioning – Each model is broken into logical sub‑graphs (layers or blocks) that can be independently executed.
  3. Stitching Engine – At runtime, SparseLoom selects compatible sub‑graphs from different sparsity levels to assemble a stitched model that meets a target memory/latency budget. No additional training or fine‑tuning is required because the sub‑graphs share the same architecture and weight layout.
  4. Heterogeneous Scheduler – The stitched model is then mapped onto the SoC’s heterogeneous compute units using a cost model that accounts for accelerator‑specific sparsity support, memory bandwidth, and per‑task SLOs.
  5. Evaluation Platform – Experiments run on popular edge SoCs (e.g., Qualcomm Snapdragon, NVIDIA Jetson) using realistic multi‑DNN workloads (object detection + speech recognition + pose estimation). Baselines include TVM‑based multi‑DNN runtimes and hand‑crafted static model selections.

Results & Findings

| Metric | SparseLoom vs. Baseline |
| --- | --- |
| SLO violation rate | ↓ 74 % (max) |
| Throughput (inferences/s) | ↑ 2.31× |
| Memory footprint | ↓ 28 % on average |
| Latency per task | Meets 95 % of SLOs vs. 68 % for baseline |
| Scheduler overhead | < 5 ms per scheduling decision (negligible) |

The gains stem mainly from two factors: (1) the ability to pick a just‑right sparsity level for each sub‑graph, avoiding the “one‑size‑fits‑all” penalty of static models, and (2) better accelerator utilisation because the scheduler can place denser sub‑graphs on faster units while keeping ultra‑sparse parts on memory‑constrained cores.
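Factor (2) can be illustrated with a toy cost model in which a compute unit's effective latency depends on how much of a sub‑graph's sparsity it can actually exploit. The unit names, speed figures, and speedup factors below are invented for illustration, not taken from the paper:

```python
# Hypothetical per-unit profile: (dense ms per GFLOP, fraction of
# pruned weights the unit can actually skip).
UNITS = {
    "npu": (0.5, 0.9),  # fast, good structured-sparsity support
    "gpu": (0.8, 0.3),  # fast, limited sparsity support
    "cpu": (4.0, 1.0),  # slow, but skips every pruned weight
}

def est_latency(gflops, sparsity, unit):
    """Estimated latency: dense cost scaled by the work left over
    after the unit skips the pruned weights it can exploit."""
    dense_ms, exploited = UNITS[unit]
    return gflops * dense_ms * (1.0 - sparsity * exploited)

def place(subgraphs):
    """Greedy placement: send each (gflops, sparsity) sub-graph to the
    unit with the lowest estimated latency."""
    return [min(UNITS, key=lambda u: est_latency(g, s, u))
            for g, s in subgraphs]
```

Under this model a dense sub‑graph lands on the fast NPU, while an ultra‑sparse one ends up on the CPU, whose perfect weight‑skipping outweighs its slow dense speed—mirroring the paper's observation about placing denser sub‑graphs on faster units.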

Practical Implications

  • Dynamic Edge Pipelines – Developers can now build modular inference pipelines (e.g., add a new sensor model) without manually re‑pruning or re‑training each variant.
  • Reduced Firmware Footprint – Since stitched models are assembled from existing binaries, firmware size stays low—critical for OTA updates on constrained devices.
  • Improved QoE for Real‑Time Apps – Lower SLO violations translate directly into smoother AR/VR experiences, more reliable voice assistants, and safer autonomous‑driving perception stacks.
  • Cost‑Effective Hardware Utilisation – Manufacturers can ship a single SoC SKU and still meet diverse workload demands by leveraging SparseLoom’s scheduler, postponing the need for higher‑end accelerators.
  • Easier Portability – The open‑source runtime abstracts away vendor‑specific SDKs, making it simpler to move a multi‑DNN workload from a Snapdragon to an Edge‑TPU or Jetson platform.

Limitations & Future Work

  • Granularity of Stitching – The current implementation stitches at the block level; finer‑grained layer‑wise stitching could unlock additional savings but would require more sophisticated dependency tracking.
  • Sparsity Compatibility – Not all sparsity patterns (e.g., unstructured vs. structured) are equally supported across accelerators; the scheduler may fall back to denser sub‑graphs when hardware support is lacking.
  • Static Model Pool – The pool of pre‑pruned models must be curated ahead of time; automated generation of this pool (e.g., via neural architecture search) is left for future research.
  • Energy Measurements – While throughput and memory were measured, a detailed power‑efficiency analysis on battery‑operated devices is still pending.

The authors plan to explore adaptive pruning that can generate new sparsity levels on‑device, and to extend SparseLoom to handle transformer‑based models, which are increasingly common in edge AI.

Authors

  • Jiawei Luo
  • Di Wu
  • Simon Dobson
  • Blesson Varghese

Paper Information

  • arXiv ID: 2603.09642v1
  • Categories: cs.DC, cs.LG, cs.PF
  • Published: March 10, 2026