[Paper] Where to Split? A Pareto-Front Analysis of DNN Partitioning for Edge Inference

Published: January 12, 2026 at 04:57 PM EST
4 min read
Source: arXiv - 2601.08025v1

Overview

Deploying deep neural networks (DNNs) on edge devices such as Raspberry Pis or low-power GPUs is often blocked by limited compute, memory, and network bandwidth. This paper reframes DNN partitioning not as a single-objective "make it fast" problem but as a multi-objective trade-off between latency and throughput, especially under realistic, fluctuating network conditions. The authors introduce ParetoPipe, an open-source framework that systematically discovers the best split points on the Pareto front, giving engineers a practical way to balance latency and throughput for edge inference.

Key Contributions

  • Pareto‑front based partitioning: Treats latency and throughput as simultaneous objectives and uses Pareto analysis to locate optimal split points.
  • Comprehensive benchmark suite: Evaluates pipeline‑partitioned inference on a heterogeneous testbed (multiple Raspberry Pis + a GPU‑enabled edge server) across diverse network scenarios.
  • Open‑source toolchain (ParetoPipe): Provides dual communication back‑ends (PyTorch RPC and a lightweight custom protocol), flexible APIs for model slicing (see the slicing sketch after this list), and scripts for automated Pareto‑front generation.
  • Empirical insights: Quantifies how network variability reshapes the latency‑throughput trade‑off, revealing non‑intuitive partition choices that outperform naïve “split‑at‑layer‑X” heuristics.
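To make the slicing idea concrete, here is a plain-PyTorch sketch of cutting a sequential model at an index k, running the head on one device and the tail on another; the toy model, the `split_model` helper, and the split index are illustrative assumptions, not ParetoPipe's actual API.

```python
import torch
import torch.nn as nn

def split_model(model: nn.Sequential, k: int):
    """Split a sequential model into a head (layers 0..k) and a tail (layers k+1..)."""
    layers = list(model.children())
    return nn.Sequential(*layers[: k + 1]), nn.Sequential(*layers[k + 1 :])

# Toy CNN standing in for a real DNN; the split index k is the knob the
# Pareto analysis ultimately chooses.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
head, tail = split_model(model, k=3)

x = torch.randn(1, 3, 224, 224)
activation = head(x)       # would run on the first device (e.g., a Raspberry Pi)
logits = tail(activation)  # would run on the second device after the activation is transferred
print(activation.shape, logits.shape)
```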

Methodology

  1. Model & Device Profiling – The authors profile each DNN layer’s compute time and memory footprint on every device in the testbed, as well as the data size of intermediate activations.
  2. Search Space Construction – All possible contiguous partition points (e.g., “run layers 0‑k on device A, rest on device B”) are enumerated. For each candidate split, the end‑to‑end latency and achievable throughput are estimated using the profiled numbers plus a network model that can be tuned to represent different bandwidth/latency conditions.
  3. Pareto Front Extraction – The candidate splits are plotted in the latency‑throughput space; those that are not dominated (i.e., no other split is both faster and higher‑throughput) form the Pareto front. A minimal estimate‑and‑filter sketch follows this list.
  4. Implementation & Validation – ParetoPipe materializes the selected splits on the physical testbed, executing real inference pipelines via either PyTorch RPC or a custom lightweight socket layer to verify the analytical predictions.
  5. Scenario Sweeps – Experiments sweep across network conditions (Wi‑Fi, Ethernet, throttled links) and batch sizes to observe how the front shifts.
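Steps 2–3 amount to an estimate-and-filter loop over candidate splits. The sketch below illustrates that loop under simplifying assumptions (pipelined execution bottlenecked by the slowest stage; a single link modeled as bandwidth plus propagation delay); the per-layer profile numbers are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class NetworkModel:
    bandwidth_mbps: float  # link bandwidth between the two devices
    rtt_ms: float          # round-trip latency of the link

    def transfer_ms(self, num_bytes: int) -> float:
        # Transmission time of the activation plus one-way propagation delay.
        return num_bytes * 8 / (self.bandwidth_mbps * 1e6) * 1e3 + self.rtt_ms / 2

def estimate_split(k, dev_a_ms, dev_b_ms, act_bytes, net):
    """Latency/throughput when layers 0..k run on device A and layers k+1.. on device B."""
    stage_a = sum(dev_a_ms[: k + 1])
    stage_b = sum(dev_b_ms[k + 1 :])
    comm = net.transfer_ms(act_bytes[k])
    latency_ms = stage_a + comm + stage_b
    # Under pipelining, steady-state throughput is bounded by the slowest stage.
    throughput_fps = 1e3 / max(stage_a, comm, stage_b)
    return latency_ms, throughput_fps

def pareto_front(candidates):
    """Keep splits that no other split dominates (lower latency AND higher throughput)."""
    front = []
    for k, (lat, thr) in candidates.items():
        dominated = any(
            other != k and l <= lat and t >= thr and (l < lat or t > thr)
            for other, (l, t) in candidates.items()
        )
        if not dominated:
            front.append((k, lat, thr))
    return sorted(front, key=lambda p: p[1])

# Invented per-layer profiles: Pi compute (ms), GPU compute (ms), activation sizes (bytes).
pi_ms  = [5.0, 7.0, 8.0, 25.0, 10.0]
gpu_ms = [1.0, 1.5, 2.0, 5.0, 2.0]
act_b  = [500_000, 300_000, 150_000, 10_000, 4_000]
net    = NetworkModel(bandwidth_mbps=30, rtt_ms=5)  # roughly the "moderate Wi-Fi" scenario

candidates = {k: estimate_split(k, pi_ms, gpu_ms, act_b, net) for k in range(len(pi_ms) - 1)}
print(pareto_front(candidates))  # non-dominated (split, latency_ms, throughput_fps) points
```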

Results & Findings

| Scenario | Best‑Latency Split | Best‑Throughput Split | Pareto‑Front Shape |
| --- | --- | --- | --- |
| High‑bandwidth LAN (1 Gbps) | All layers on edge GPU (≈ 3 ms latency) | Split after early conv layers (≈ 150 fps) | Narrow front – latency and throughput improve together |
| Moderate Wi‑Fi (30 Mbps) | Early split: first few layers on Pi, rest on GPU (≈ 7 ms) | Later split: more work on Pi to reduce traffic (≈ 80 fps) | Wider front – clear trade‑off |
| Low‑bandwidth (5 Mbps) | Heavy off‑loading to Pi (≈ 12 ms) | Maximize local compute on Pi (≈ 30 fps) | Very wide front – latency gains come at steep throughput loss |

Key takeaways

  • Network bandwidth is a first‑order factor; the optimal split can move dramatically when bandwidth drops.
  • The Pareto‑optimal points often lie in the middle of the layer chain, contradicting the common “edge‑only” or “cloud‑only” extremes.
  • Using the lightweight custom RPC reduces communication overhead by ~15 % compared with vanilla PyTorch RPC, tightening the Pareto front; a sketch of what such a lightweight transport can look like follows this list.
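The paper's custom protocol is not reproduced in this summary, but a lightweight transport of this kind typically boils down to a bare TCP socket carrying length-prefixed serialized tensors, skipping the heavier RPC framework machinery. The helpers below are a hypothetical sketch, not ParetoPipe's implementation.

```python
import io
import socket
import struct
import torch

def send_tensor(sock: socket.socket, t: torch.Tensor) -> None:
    """Length-prefixed send: 4-byte big-endian payload size, then the serialized tensor."""
    buf = io.BytesIO()
    torch.save(t, buf)
    payload = buf.getvalue()
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket, raising if the peer disconnects early."""
    chunks = []
    while n:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def recv_tensor(sock: socket.socket) -> torch.Tensor:
    """Read the 4-byte length header, then exactly that many payload bytes."""
    (size,) = struct.unpack("!I", _recv_exact(sock, 4))
    return torch.load(io.BytesIO(_recv_exact(sock, size)))
```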

Practical Implications

  • Dynamic Edge Orchestration – Developers can embed ParetoPipe’s decision engine into runtime managers that re‑evaluate the split on‑the‑fly as network conditions change (e.g., mobile edge, IoT gateways); a hypothetical policy sketch follows this list.
  • Resource‑Aware Model Deployment – Instead of manually tuning batch sizes or pruning models, engineers can let Pareto analysis pick the split that meets a service‑level objective (SLO) for latency while maximizing throughput.
  • Cost‑Effective Scaling – Small edge clusters (Raspberry Pis, Jetson Nano, etc.) can collectively achieve GPU‑level throughput without buying expensive hardware, simply by exploiting optimal pipeline partitioning.
  • Framework Integration – Because ParetoPipe ships with both PyTorch RPC and a minimal custom protocol, it can be dropped into existing PyTorch pipelines or used in non‑Python environments with minimal glue code.
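As a hypothetical illustration of the orchestration idea, the policy below re-selects a split from Pareto fronts precomputed per bandwidth regime, taking the highest-throughput split that still meets a latency SLO. The bucket boundaries, front entries, and function names are invented for this sketch and are not part of ParetoPipe.

```python
# Pareto fronts precomputed offline per bandwidth regime (illustrative numbers);
# each entry is (split_index, latency_ms, throughput_fps).
FRONTS = {
    "high":   [(0, 3.0, 140.0), (2, 4.5, 150.0)],
    "medium": [(1, 7.0, 60.0), (3, 9.5, 80.0)],
    "low":    [(3, 12.0, 25.0), (4, 15.0, 30.0)],
}

def bucket(bandwidth_mbps: float) -> str:
    """Map a measured bandwidth to the nearest precomputed regime."""
    if bandwidth_mbps >= 100:
        return "high"
    return "medium" if bandwidth_mbps >= 10 else "low"

def choose_split(bandwidth_mbps: float, latency_slo_ms: float):
    """Pick the highest-throughput split that meets the latency SLO;
    fall back to the lowest-latency split if none qualifies."""
    front = FRONTS[bucket(bandwidth_mbps)]
    feasible = [p for p in front if p[1] <= latency_slo_ms]
    if feasible:
        return max(feasible, key=lambda p: p[2])
    return min(front, key=lambda p: p[1])

print(choose_split(bandwidth_mbps=30, latency_slo_ms=8))  # -> (1, 7.0, 60.0)
print(choose_split(bandwidth_mbps=5, latency_slo_ms=10))  # -> (3, 12.0, 25.0)
```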

Limitations & Future Work

  • Static Layer Granularity – The current search only considers whole‑layer splits; finer‑grained tensor partitioning could unlock additional Pareto points.
  • Energy Consumption Not Modeled – While latency and throughput are critical, edge deployments often care about power; extending the framework to include energy as a third objective is left for later work.
  • Scalability to Larger Clusters – Experiments were limited to a 4‑node Raspberry Pi cluster plus one GPU server; scaling the analysis to dozens of heterogeneous nodes may require heuristic pruning of the search space.
  • Network Model Simplifications – Real‑world wireless networks exhibit bursty loss and jitter; incorporating stochastic network models could make the Pareto front more robust.

ParetoPipe opens the door for developers to treat edge inference as a balanced optimization problem rather than a single‑goal hack. By exposing the full latency‑throughput frontier, it empowers smarter, adaptive deployments that can keep up with the ever‑changing edge landscape.

Authors

  • Adiba Masud
  • Nicholas Foley
  • Pragathi Durga Rajarajan
  • Palden Lama

Paper Information

  • arXiv ID: 2601.08025v1
  • Categories: cs.DC
  • Published: January 12, 2026