[Paper] Joint Partitioning and Placement of Foundation Models for Real-Time Edge AI
Source: arXiv - 2512.01039v1
Overview
The paper tackles a pressing problem for developers building AI-powered services at the edge: running massive foundation models (e.g., large language or vision models) across a fleet of heterogeneous, bandwidth-constrained devices whose compute and network conditions change constantly. Instead of fixing the model's layer partitioning once at deployment time, the authors propose a runtime-aware orchestration framework that jointly decides where each layer should run and how the model should be split, adapting on the fly to latency, utilization, and privacy constraints.
Key Contributions
- Dynamic joint partition‑and‑placement formulation: Casts the problem as a constrained optimization that simultaneously selects layer assignments and physical locations, reacting to real‑time resource fluctuations.
- Model-aware capacity profiling: Introduces a lightweight profiling layer that continuously measures per-device compute, memory, network bandwidth, and privacy-related metrics (a minimal profiler sketch follows this list).
- Reactive graph re‑partitioning algorithm: A fast, near‑optimal heuristic that re‑splits the model graph when conditions change, avoiding costly full re‑optimizations.
- Prototype implementation for 6G Multi‑Access Edge Computing (MEC): Demonstrates end‑to‑end integration with a realistic edge stack (container runtime, SD‑WAN, and secure enclaves).
- Empirical evaluation on a suite of foundation models (BERT-large, ViT-B/16, Whisper-base): Shows up to a 3.2× latency reduction and roughly 45 % lower bandwidth usage compared with static partitioning baselines.
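The summary does not spell out the profiler's interface, so the snippet below is only a minimal sketch of what a per-device capacity snapshot might look like. It assumes psutil for CPU and memory sampling; the `uplink_mbps` and `trust_zone` fields are hypothetical placeholders that a real deployment would fill from network probes and its privacy policy.

```python
# Minimal capacity-snapshot sketch (not the authors' API).
import time
from dataclasses import dataclass

import psutil


@dataclass
class DeviceSnapshot:
    node_id: str
    cpu_free_frac: float   # fraction of CPU currently idle
    mem_free_bytes: int    # available RAM in bytes
    uplink_mbps: float     # uplink bandwidth (supplied by a network probe)
    trust_zone: str        # e.g. "on-device", "edge", "cloud" (for privacy policies)
    timestamp: float


def profile_device(node_id: str, uplink_mbps: float, trust_zone: str) -> DeviceSnapshot:
    """Take one lightweight capacity snapshot of the local device."""
    cpu_busy = psutil.cpu_percent(interval=0.1)  # short sampling window
    mem = psutil.virtual_memory()
    return DeviceSnapshot(
        node_id=node_id,
        cpu_free_frac=1.0 - cpu_busy / 100.0,
        mem_free_bytes=mem.available,
        uplink_mbps=uplink_mbps,
        trust_zone=trust_zone,
        timestamp=time.monotonic(),
    )
```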
Methodology
- System Model – The edge environment is modeled as a directed graph where nodes represent compute resources (e.g., a smartphone, an edge server, a 6G base‑station) and edges capture network links with time‑varying latency and bandwidth.
- Layer‑wise Cost Model – Each model layer is annotated with its compute demand, memory footprint, and data‑output size. These metrics are obtained via the profiling component during a short warm‑up run.
- Optimization Objective – Minimize end-to-end inference latency while respecting constraints: (a) per-node resource caps, (b) network bandwidth caps, and (c) privacy policies that forbid certain data from leaving trusted zones (a notational sketch of this formulation follows the list).
- Solver Architecture – The problem is NP-hard, so the authors design a two-phase heuristic (sketched in code after this list):
  - Initial placement using a greedy "most-constrained-first" rule.
  - Continuous re-partitioning triggered by a change-detection module (e.g., a 20 % rise in link latency). The re-partitioner runs a lightweight graph-cut algorithm that swaps only a few layers, keeping the overall solution stable.
- Implementation Stack – Built on top of Kubernetes‑based edge orchestration, with custom CRDs (Custom Resource Definitions) for “ModelSlice” objects. Communication between slices uses gRPC with optional encryption for privacy‑sensitive hops.
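The summary states the optimization objective and constraints only in prose. One plausible way to write the joint partition-and-placement problem down is sketched below; all symbols (assignment variables, per-layer costs, per-node capacities) are introduced here and are not taken from the paper.

```latex
% Notational sketch of the joint partition-and-placement problem (our notation).
% Layers \ell \in \mathcal{L}, nodes v \in \mathcal{V}, binary assignment x_{\ell v}.
\begin{aligned}
\min_{x}\quad
  & \sum_{\ell \in \mathcal{L}} \sum_{v \in \mathcal{V}} x_{\ell v}\,\frac{c_\ell}{\kappa_v(t)}
    \;+\; \sum_{(\ell,\ell') \in E_M} \sum_{v \neq v'} x_{\ell v}\,x_{\ell' v'}\,\frac{d_\ell}{b_{vv'}(t)} \\
\text{s.t.}\quad
  & \textstyle\sum_{v} x_{\ell v} = 1 \quad \forall \ell
    && \text{(each layer placed exactly once)} \\
  & \textstyle\sum_{\ell} x_{\ell v}\, m_\ell \le M_v(t) \quad \forall v
    && \text{(per-node memory cap)} \\
  & x_{\ell v} = 0 \quad \forall (\ell, v) \notin \mathcal{P}
    && \text{(privacy-admissible placements only)} \\
  & x_{\ell v} \in \{0,1\}
\end{aligned}
```

Here $c_\ell$, $m_\ell$, and $d_\ell$ are a layer's compute demand, memory footprint, and output size from the cost model; $\kappa_v(t)$, $M_v(t)$, and $b_{vv'}(t)$ are the time-varying node and link capacities from the profiler; $E_M$ is the model's layer graph; and $\mathcal{P}$ is the set of placements allowed by the privacy policy. Per-link bandwidth caps can be added analogously.

The two-phase heuristic is likewise described only at a high level. The sketch below, under assumed data structures, illustrates the "most-constrained-first" initial placement and the 20 % change-detection trigger; the lightweight graph cut that migrates only a few layers is omitted.

```python
# Illustrative two-phase heuristic sketch (assumed data model, not the authors' code).
from dataclasses import dataclass


@dataclass
class LayerCost:
    name: str
    flops: float                 # compute demand
    mem_bytes: int               # memory footprint
    out_bytes: int               # activation size sent downstream
    trusted_only: bool = False   # privacy pin: may only run inside the trusted zone


@dataclass
class Node:
    name: str
    flops_per_s: float
    mem_free: int
    trusted: bool                # is this node inside the trusted zone?


def greedy_place(layers, nodes):
    """Phase 1: place the most constrained layers first (privacy-pinned,
    then largest memory footprint) on the fastest feasible node."""
    placement, mem_left = {}, {n.name: n.mem_free for n in nodes}
    order = sorted(layers, key=lambda l: (not l.trusted_only, -l.mem_bytes))
    for layer in order:
        feasible = [n for n in nodes
                    if mem_left[n.name] >= layer.mem_bytes
                    and (n.trusted or not layer.trusted_only)]
        if not feasible:
            raise RuntimeError(f"no feasible node for layer {layer.name}")
        best = max(feasible, key=lambda n: n.flops_per_s)
        placement[layer.name] = best.name
        mem_left[best.name] -= layer.mem_bytes
    return placement


def needs_repartition(prev_latency_ms, cur_latency_ms, rise=0.20):
    """Phase 2 trigger: re-partition when any link latency rises by more
    than `rise` (the paper's example uses a 20 % threshold)."""
    return any(cur_latency_ms[link] > (1.0 + rise) * prev_latency_ms[link]
               for link in cur_latency_ms)
```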
Results & Findings
| Model | Static Baseline Latency | Dynamic Joint Latency (this work) | Latency Reduction | Bandwidth Savings |
|---|---|---|---|---|
| BERT‑large (text) | 210 ms | 68 ms | 3.1× | 48 % |
| ViT‑B/16 (vision) | 340 ms | 115 ms | 2.9× | 42 % |
| Whisper‑base (audio) | 480 ms | 150 ms | 3.2× | 45 % |
- Adaptivity: When a mobile device’s CPU load spiked (e.g., due to a background app), the framework automatically migrated the most compute‑heavy layers to a nearby edge server, keeping latency within the SLA.
- Privacy compliance: In scenarios where raw video frames must stay on‑device, the system kept early convolutional layers local and only off‑loaded abstract feature maps, satisfying the privacy constraint without a noticeable latency penalty.
- Overhead: The re‑partitioning decision loop runs in < 15 ms on a modest edge controller, making it suitable for real‑time workloads.
Practical Implications
- Edge AI developers can now ship large foundation models without over‑provisioning hardware – the framework dynamically balances load across the whole edge continuum, reducing the need for expensive on‑device accelerators.
- Service operators gain a unified control plane that respects privacy policies and SLA targets, simplifying compliance for regulated industries (healthcare, finance).
- Network operators can leverage the approach to smooth traffic peaks: by shifting heavy layers to under‑utilized edge nodes, the system reduces backhaul usage, which is especially valuable in bandwidth‑constrained 5G/6G deployments.
- Tooling integration: The authors released a Python SDK that plugs into existing model serving stacks (TensorRT, ONNX Runtime), meaning teams can adopt the technique with minimal code changes (an illustrative layer split using stock ONNX tooling follows this list).
- Potential for “model‑as‑a‑service” marketplaces where providers expose sliced models that automatically adapt to each consumer’s edge topology, opening new business models.
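The released SDK's API is not shown in this summary, so the snippet below does not use it; instead it illustrates, with stock ONNX tooling (onnx.utils.extract_model and ONNX Runtime), how a layer-level split of the kind the framework automates can be expressed. The model path and tensor names are placeholders.

```python
# Splitting an ONNX model at a tensor boundary and running the two halves
# on different hosts. Stock ONNX / ONNX Runtime only -- NOT the authors' SDK.
import numpy as np
import onnx.utils
import onnxruntime as ort

FULL_MODEL = "model.onnx"    # placeholder path
CUT_TENSOR = "encoder_out"   # placeholder: intermediate activation where we cut

# Device half (graph input -> cut tensor) and edge-server half (cut tensor -> output).
onnx.utils.extract_model(FULL_MODEL, "device_part.onnx",
                         input_names=["input_ids"], output_names=[CUT_TENSOR])
onnx.utils.extract_model(FULL_MODEL, "server_part.onnx",
                         input_names=[CUT_TENSOR], output_names=["logits"])

# On the device: run the early layers and ship only the intermediate tensor.
device_sess = ort.InferenceSession("device_part.onnx")
hidden = device_sess.run(None, {"input_ids": np.ones((1, 16), dtype=np.int64)})[0]

# On the edge server (after transferring `hidden`, e.g., over gRPC):
server_sess = ort.InferenceSession("server_part.onnx")
logits = server_sess.run(None, {CUT_TENSOR: hidden})[0]
```

In the paper's framework the cut point is chosen by the optimizer rather than hard-coded, and privacy-pinned layers would always land in the device half.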
Limitations & Future Work
- Scalability to thousands of nodes: The current prototype was evaluated on clusters of up to 20 edge nodes; the authors acknowledge that heuristic tuning will be needed for city‑scale deployments.
- Model granularity: The approach works best when layers are relatively independent; highly inter‑dependent architectures (e.g., tightly coupled attention heads) may incur extra synchronization overhead.
- Security assumptions: While data can be encrypted during transport, the framework does not yet incorporate secure multi‑party computation or homomorphic encryption for truly confidential inference.
- Future directions include extending the optimizer to handle model updates (e.g., continual learning), integrating reinforcement‑learning‑based placement policies, and open‑sourcing the full orchestration stack for community benchmarking.
Authors
- Aladin Djuhera
- Fernando Koch
- Alecio Binotto
Paper Information
- arXiv ID: 2512.01039v1
- Categories: cs.DC, cs.LG, cs.NI
- Published: November 30, 2025
- PDF: https://arxiv.org/pdf/2512.01039v1