[Paper] Joint Partitioning and Placement of Foundation Models for Real-Time Edge AI

Published: November 30, 2025 at 02:16 PM EST
4 min read
Source: arXiv - 2512.01039v1

Overview

The paper tackles a pressing problem for developers building AI‑powered services at the edge: how to run massive foundation models (e.g., large language or vision models) on a fleet of heterogeneous, bandwidth‑constrained devices whose compute and network conditions constantly change. Instead of fixing the model’s layer partitioning once at deployment time, the authors propose a runtime‑aware orchestration framework that jointly decides where each layer should run and how the model should be split, adapting on the fly to latency, utilization, and privacy constraints.

Key Contributions

  • Dynamic joint partition‑and‑placement formulation: Casts the problem as a constrained optimization that simultaneously selects layer assignments and physical locations, reacting to real‑time resource fluctuations. (One illustrative way to write this objective is sketched after this list.)
  • Model‑aware capacity profiling: Introduces a lightweight profiling layer that continuously measures per‑device compute, memory, network bandwidth, and privacy‑related metrics.
  • Reactive graph re‑partitioning algorithm: A fast, near‑optimal heuristic that re‑splits the model graph when conditions change, avoiding costly full re‑optimizations.
  • Prototype implementation for 6G Multi‑Access Edge Computing (MEC): Demonstrates end‑to‑end integration with a realistic edge stack (container runtime, SD‑WAN, and secure enclaves).
  • Empirical evaluation on a suite of foundation models (BERT‑large, ViT‑B/16, Whisper‑base): Shows up to 3.2× latency reduction and 45 % lower bandwidth usage compared with static partitioning baselines.
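
To make the joint formulation concrete, here is one illustrative way to write it down. The notation (binary assignment variables $x_{l,n}$, trusted zone $T$) is ours, not necessarily the paper's exact formulation:

$$
\min_{x}\;\sum_{l=1}^{L}\sum_{n} x_{l,n}\,c_{l,n}\;+\;\sum_{l=1}^{L-1}\sum_{n\neq m} x_{l,n}\,x_{l+1,m}\,\frac{d_l}{b_{nm}(t)}
$$

subject to $\sum_{n} x_{l,n}=1$ for every layer $l$ (each layer runs somewhere), $\sum_{l} x_{l,n}\,m_l \le M_n$ for every node $n$ (memory caps), $x_{l,n}=0$ whenever layer $l$ is privacy‑tagged and $n \notin T$, and $x_{l,n}\in\{0,1\}$. Here $c_{l,n}$ is layer $l$'s compute time on node $n$, $m_l$ its memory footprint, $d_l$ its output size, and $b_{nm}(t)$ the time‑varying bandwidth of link $(n,m)$; these are exactly the quantities the profiling component measures.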

Methodology

  1. System Model – The edge environment is modeled as a directed graph where nodes represent compute resources (e.g., a smartphone, an edge server, a 6G base‑station) and edges capture network links with time‑varying latency and bandwidth.
  2. Layer‑wise Cost Model – Each model layer is annotated with its compute demand, memory footprint, and data‑output size. These metrics are obtained via the profiling component during a short warm‑up run.
  3. Optimization Objective – Minimize end‑to‑end inference latency while respecting constraints: (a) per‑node resource caps, (b) network bandwidth caps, and (c) privacy policies that forbid certain data from leaving trusted zones.
  4. Solver Architecture – The problem is NP‑hard, so the authors design a two‑phase heuristic:
    • Initial placement using a greedy “most‑constrained‑first” rule.
    • Continuous re‑partitioning triggered by a change‑detection module (e.g., a 20 % rise in link latency). The re‑partitioner runs a lightweight graph‑cut algorithm that swaps only a few layers, keeping the overall solution stable. A toy version of both phases is sketched after this list.
  5. Implementation Stack – Built on top of Kubernetes‑based edge orchestration, with custom CRDs (Custom Resource Definitions) for “ModelSlice” objects. Communication between slices uses gRPC with optional encryption for privacy‑sensitive hops.
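
The two‑phase heuristic is easiest to see in code. The sketch below is ours, not the authors' implementation: the Node/Layer fields, the "fastest feasible node" tie‑break for the greedy rule, and the 20 % trigger threshold are assumptions drawn from the description above.

```python
# Illustrative sketch of the two-phase heuristic described above -- not the
# authors' code. Field names, the "fastest feasible node" tie-break, and the
# 20 % trigger are our assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    mem_cap: float        # GB of memory available
    compute: float        # relative throughput (higher = faster)
    trusted: bool = True  # inside the privacy-trusted zone?
    used_mem: float = 0.0

@dataclass
class Layer:
    name: str
    mem: float            # GB footprint
    private: bool = False # must stay in the trusted zone

def feasible(layer: Layer, node: Node) -> bool:
    fits = node.used_mem + layer.mem <= node.mem_cap
    return fits and (node.trusted or not layer.private)

def initial_placement(layers, nodes):
    """Phase 1 -- greedy most-constrained-first: handle the layers with the
    fewest feasible nodes first, assigning each to its fastest feasible node."""
    order = sorted(layers, key=lambda l: sum(feasible(l, n) for n in nodes))
    plan = {}
    for layer in order:
        candidates = [n for n in nodes if feasible(layer, n)]
        if not candidates:
            raise RuntimeError(f"no feasible node for layer {layer.name}")
        best = max(candidates, key=lambda n: n.compute)
        best.used_mem += layer.mem
        plan[layer.name] = best.name
    return plan

def needs_repartition(baseline_ms: float, current_ms: float,
                      threshold: float = 0.20) -> bool:
    """Phase 2 trigger -- fire the lightweight re-partitioner when a link's
    latency rises by more than `threshold` (20 % in the paper's example)."""
    return current_ms > baseline_ms * (1.0 + threshold)

nodes = [Node("phone", mem_cap=2, compute=1.0),
         Node("edge-server", mem_cap=16, compute=8.0, trusted=False)]
layers = [Layer("embed", mem=0.5, private=True),
          Layer("encoder", mem=3.0),
          Layer("head", mem=0.2)]
print(initial_placement(layers, nodes))
# -> {'embed': 'phone', 'encoder': 'edge-server', 'head': 'edge-server'}
print(needs_repartition(baseline_ms=10, current_ms=13))  # True: >20 % rise
```

Note how the private "embed" layer is pinned to the trusted phone while heavier layers drift to the faster edge server; the change‑detection trigger then decides when to revisit only a few of these assignments rather than re‑solving from scratch.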

Results & Findings

| Model | Baseline (static) | Dynamic Joint (this work) | Latency Reduction | Bandwidth Savings |
| --- | --- | --- | --- | --- |
| BERT‑large (text) | 210 ms | 68 ms | 3.1× | 48 % |
| ViT‑B/16 (vision) | 340 ms | 115 ms | 2.9× | 42 % |
| Whisper‑base (audio) | 480 ms | 150 ms | 3.2× | 45 % |

  • Adaptivity: When a mobile device’s CPU load spiked (e.g., due to a background app), the framework automatically migrated the most compute‑heavy layers to a nearby edge server, keeping latency within the SLA.
  • Privacy compliance: In scenarios where raw video frames must stay on‑device, the system kept early convolutional layers local and only off‑loaded abstract feature maps, satisfying the privacy constraint without a noticeable latency penalty. (A toy version of this split follows this list.)
  • Overhead: The re‑partitioning decision loop runs in < 15 ms on a modest edge controller, making it suitable for real‑time workloads.
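
The privacy behavior is easy to picture with a toy split. The layer tags, cut‑point rule, and placeholder compute below are our illustration, not the paper's pipeline:

```python
# Toy illustration of a privacy-aware split: layers tagged private stay
# on-device, and only the resulting feature map crosses the trust boundary.
# Layer tags and the tanh placeholder compute are invented for this sketch.
import numpy as np

LAYERS = [("conv1", True), ("conv2", True), ("block3", False), ("head", False)]

def first_offloadable(layers):
    """Index of the first layer allowed to leave the device."""
    for i, (_, private) in enumerate(layers):
        if not private:
            return i
    return len(layers)

def run_local(x, layers):
    for _name, _private in layers:
        x = np.tanh(x)  # stand-in for real layer compute
    return x

cut = first_offloadable(LAYERS)
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)  # raw video frame
features = run_local(frame, LAYERS[:cut])  # runs entirely on-device
# Only `features` (an abstract map), never `frame`, would be sent onward.
print(f"cut after layer {cut - 1}; payload = {features.nbytes / 1e6:.2f} MB")
```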

Practical Implications

  • Edge AI developers can now ship large foundation models without over‑provisioning hardware – the framework dynamically balances load across the whole edge continuum, reducing the need for expensive on‑device accelerators.
  • Service operators gain a unified control plane that respects privacy policies and SLA targets, simplifying compliance for regulated industries (healthcare, finance).
  • Network operators can leverage the approach to smooth traffic peaks: by shifting heavy layers to under‑utilized edge nodes, the system reduces backhaul usage, which is especially valuable in bandwidth‑constrained 5G/6G deployments.
  • Tooling integration: The authors released a Python SDK that plugs into existing model serving stacks (TensorRT, ONNX Runtime), meaning teams can adopt the technique with minimal code changes. A purely hypothetical usage sketch follows this list.
  • Potential for “model‑as‑a‑service” marketplaces where providers expose sliced models that automatically adapt to each consumer’s edge topology, opening new business models.
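
The summary does not show the SDK's actual interface, so the snippet below is purely hypothetical: every `modelslice` name and constraint key is invented to illustrate how such a plug‑in might sit next to an existing ONNX Runtime serving path.

```python
# Hypothetical sketch only -- the real SDK's API is not shown in the post,
# and every `modelslice` name below is invented for illustration.
import onnxruntime as ort

def serve_local_slice(path: str) -> ort.InferenceSession:
    """The on-device slice stays an ordinary ONNX Runtime session; the
    orchestrator would decide which layers end up in this file."""
    return ort.InferenceSession(path)

# A deployment declaration might plausibly look something like:
#
#   import modelslice                      # hypothetical package name
#   plan = modelslice.deploy(
#       model="bert-large.onnx",
#       constraints={"sla_ms": 100,              # latency target
#                    "private": ["embeddings"]}, # layers pinned on-device
#       topology="mec-cluster.yaml",             # device/edge-node inventory
#   )
```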

Limitations & Future Work

  • Scalability to thousands of nodes: The current prototype was evaluated on clusters of up to 20 edge nodes; the authors acknowledge that heuristic tuning will be needed for city‑scale deployments.
  • Model granularity: The approach works best when layers are relatively independent; highly inter‑dependent architectures (e.g., tightly coupled attention heads) may incur extra synchronization overhead.
  • Security assumptions: While data can be encrypted during transport, the framework does not yet incorporate secure multi‑party computation or homomorphic encryption for truly confidential inference.
  • Future directions include extending the optimizer to handle model updates (e.g., continual learning), integrating reinforcement‑learning‑based placement policies, and open‑sourcing the full orchestration stack for community benchmarking.

Authors

  • Aladin Djuhera
  • Fernando Koch
  • Alecio Binotto

Paper Information

  • arXiv ID: 2512.01039v1
  • Categories: cs.DC, cs.LG, cs.NI
  • Published: November 30, 2025
  • PDF: https://arxiv.org/pdf/2512.01039v1
