[Paper] Joint Partitioning and Placement of Foundation Models for Real-Time Edge AI

Published: November 30, 2025 at 02:16 PM EST
4 min read
Source: arXiv - 2512.01039v1

Overview

The paper tackles a pressing problem for developers building AI‑powered services at the edge: how to run massive foundation models (e.g., large language or vision models) on a fleet of heterogeneous, bandwidth‑constrained devices whose compute and network conditions constantly change. Instead of fixing the model’s layer partitioning once at deployment time, the authors propose a runtime‑aware orchestration framework that jointly decides where each layer should run and how the model should be split, adapting on the fly to latency, utilization, and privacy constraints.

Key Contributions

  • Dynamic joint partition‑and‑placement formulation: Casts the problem as a constrained optimization that simultaneously selects layer assignments and physical locations, reacting to real‑time resource fluctuations. (One illustrative way to write this objective is sketched after this list.)
  • Model‑aware capacity profiling: Introduces a lightweight profiling layer that continuously measures per‑device compute, memory, network bandwidth, and privacy‑related metrics.
  • Reactive graph re‑partitioning algorithm: A fast, near‑optimal heuristic that re‑splits the model graph when conditions change, avoiding costly full re‑optimizations.
  • Prototype implementation for 6G Multi‑Access Edge Computing (MEC): Demonstrates end‑to‑end integration with a realistic edge stack (container runtime, SD‑WAN, and secure enclaves).
  • Empirical evaluation on a suite of foundation models (BERT‑large, ViT‑B/16, Whisper‑base): Shows up to 3.2× latency reduction and 45 % lower bandwidth usage compared with static partitioning baselines.
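
To make the joint formulation concrete, here is one illustrative way to write it down. The notation (binary assignment variables $x_{l,n}$, trusted zone $T$) is ours, not necessarily the paper's exact formulation:

$$
\min_{x}\;\sum_{l=1}^{L}\sum_{n} x_{l,n}\,c_{l,n}\;+\;\sum_{l=1}^{L-1}\sum_{n\neq m} x_{l,n}\,x_{l+1,m}\,\frac{d_l}{b_{nm}(t)}
$$

subject to $\sum_{n} x_{l,n}=1$ for every layer $l$ (each layer runs somewhere), $\sum_{l} x_{l,n}\,m_l \le M_n$ for every node $n$ (memory caps), $x_{l,n}=0$ whenever layer $l$ is privacy‑tagged and $n \notin T$, and $x_{l,n}\in\{0,1\}$. Here $c_{l,n}$ is layer $l$'s compute time on node $n$, $m_l$ its memory footprint, $d_l$ its output size, and $b_{nm}(t)$ the time‑varying bandwidth of link $(n,m)$; these are exactly the quantities the profiling component measures.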

Methodology

  1. System Model – The edge environment is modeled as a directed graph where nodes represent compute resources (e.g., a smartphone, an edge server, a 6G base‑station) and edges capture network links with time‑varying latency and bandwidth.
  2. Layer‑wise Cost Model – Each model layer is annotated with its compute demand, memory footprint, and data‑output size. These metrics are obtained via the profiling component during a short warm‑up run.
  3. Optimization Objective – Minimize end‑to‑end inference latency while respecting constraints: (a) per‑node resource caps, (b) network bandwidth caps, and (c) privacy policies that forbid certain data from leaving trusted zones.
  4. Solver Architecture – The problem is NP‑hard, so the authors design a two‑phase heuristic:
    • Initial placement using a greedy “most‑constrained‑first” rule.
    • Continuous re‑partitioning triggered by a change‑detection module (e.g., a 20 % rise in link latency). The re‑partitioner runs a lightweight graph‑cut algorithm that swaps only a few layers, keeping the overall solution stable. A toy version of both phases is sketched after this list.
  5. Implementation Stack – Built on top of Kubernetes‑based edge orchestration, with custom CRDs (Custom Resource Definitions) for “ModelSlice” objects. Communication between slices uses gRPC with optional encryption for privacy‑sensitive hops.
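
The two‑phase heuristic is easiest to see in code. The sketch below is ours, not the authors' implementation: the Node/Layer fields, the "fastest feasible node" tie‑break for the greedy rule, and the 20 % trigger threshold are assumptions drawn from the description above.

```python
# Illustrative sketch of the two-phase heuristic described above -- not the
# authors' code. Field names, the "fastest feasible node" tie-break, and the
# 20 % trigger are our assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    mem_cap: float        # GB of memory available
    compute: float        # relative throughput (higher = faster)
    trusted: bool = True  # inside the privacy-trusted zone?
    used_mem: float = 0.0

@dataclass
class Layer:
    name: str
    mem: float            # GB footprint
    private: bool = False # must stay in the trusted zone

def feasible(layer: Layer, node: Node) -> bool:
    fits = node.used_mem + layer.mem <= node.mem_cap
    return fits and (node.trusted or not layer.private)

def initial_placement(layers, nodes):
    """Phase 1 -- greedy most-constrained-first: handle the layers with the
    fewest feasible nodes first, assigning each to its fastest feasible node."""
    order = sorted(layers, key=lambda l: sum(feasible(l, n) for n in nodes))
    plan = {}
    for layer in order:
        candidates = [n for n in nodes if feasible(layer, n)]
        if not candidates:
            raise RuntimeError(f"no feasible node for layer {layer.name}")
        best = max(candidates, key=lambda n: n.compute)
        best.used_mem += layer.mem
        plan[layer.name] = best.name
    return plan

def needs_repartition(baseline_ms: float, current_ms: float,
                      threshold: float = 0.20) -> bool:
    """Phase 2 trigger -- fire the lightweight re-partitioner when a link's
    latency rises by more than `threshold` (20 % in the paper's example)."""
    return current_ms > baseline_ms * (1.0 + threshold)

nodes = [Node("phone", mem_cap=2, compute=1.0),
         Node("edge-server", mem_cap=16, compute=8.0, trusted=False)]
layers = [Layer("embed", mem=0.5, private=True),
          Layer("encoder", mem=3.0),
          Layer("head", mem=0.2)]
print(initial_placement(layers, nodes))
# -> {'embed': 'phone', 'encoder': 'edge-server', 'head': 'edge-server'}
print(needs_repartition(baseline_ms=10, current_ms=13))  # True: >20 % rise
```

Note how the private "embed" layer is pinned to the trusted phone while heavier layers drift to the faster edge server; the change‑detection trigger then decides when to revisit only a few of these assignments rather than re‑solving from scratch.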

Results & Findings

| Model | Baseline (static) | Dynamic Joint (this work) | Latency Reduction | Bandwidth Savings |
| --- | --- | --- | --- | --- |
| BERT‑large (text) | 210 ms | 68 ms | 3.1× | 48 % |
| ViT‑B/16 (vision) | 340 ms | 115 ms | 2.9× | 42 % |
| Whisper‑base (audio) | 480 ms | 150 ms | 3.2× | 45 % |

  • Adaptivity: When a mobile device’s CPU load spiked (e.g., due to a background app), the framework automatically migrated the most compute‑heavy layers to a nearby edge server, keeping latency within the SLA.
  • Privacy compliance: In scenarios where raw video frames must stay on‑device, the system kept early convolutional layers local and only off‑loaded abstract feature maps, satisfying the privacy constraint without a noticeable latency penalty. (A toy version of this split follows this list.)
  • Overhead: The re‑partitioning decision loop runs in < 15 ms on a modest edge controller, making it suitable for real‑time workloads.
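
The privacy behavior is easy to picture with a toy split. The layer tags, cut‑point rule, and placeholder compute below are our illustration, not the paper's pipeline:

```python
# Toy illustration of a privacy-aware split: layers tagged private stay
# on-device, and only the resulting feature map crosses the trust boundary.
# Layer tags and the tanh placeholder compute are invented for this sketch.
import numpy as np

LAYERS = [("conv1", True), ("conv2", True), ("block3", False), ("head", False)]

def first_offloadable(layers):
    """Index of the first layer allowed to leave the device."""
    for i, (_, private) in enumerate(layers):
        if not private:
            return i
    return len(layers)

def run_local(x, layers):
    for _name, _private in layers:
        x = np.tanh(x)  # stand-in for real layer compute
    return x

cut = first_offloadable(LAYERS)
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)  # raw video frame
features = run_local(frame, LAYERS[:cut])  # runs entirely on-device
# Only `features` (an abstract map), never `frame`, would be sent onward.
print(f"cut after layer {cut - 1}; payload = {features.nbytes / 1e6:.2f} MB")
```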

Practical Implications

  • Edge AI developers can now ship large foundation models without over‑provisioning hardware – the framework dynamically balances load across the whole edge continuum, reducing the need for expensive on‑device accelerators.
  • Service operators gain a unified control plane that respects privacy policies and SLA targets, simplifying compliance for regulated industries (healthcare, finance).
  • Network operators can leverage the approach to smooth traffic peaks: by shifting heavy layers to under‑utilized edge nodes, the system reduces backhaul usage, which is especially valuable in bandwidth‑constrained 5G/6G deployments.
  • Tooling integration: The authors released a Python SDK that plugs into existing model serving stacks (TensorRT, ONNX Runtime), meaning teams can adopt the technique with minimal code changes. A purely hypothetical usage sketch follows this list.
  • Potential for “model‑as‑a‑service” marketplaces where providers expose sliced models that automatically adapt to each consumer’s edge topology, opening new business models.
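
The summary does not show the SDK's actual interface, so the snippet below is purely hypothetical: every `modelslice` name and constraint key is invented to illustrate how such a plug‑in might sit next to an existing ONNX Runtime serving path.

```python
# Hypothetical sketch only -- the real SDK's API is not shown in the post,
# and every `modelslice` name below is invented for illustration.
import onnxruntime as ort

def serve_local_slice(path: str) -> ort.InferenceSession:
    """The on-device slice stays an ordinary ONNX Runtime session; the
    orchestrator would decide which layers end up in this file."""
    return ort.InferenceSession(path)

# A deployment declaration might plausibly look something like:
#
#   import modelslice                      # hypothetical package name
#   plan = modelslice.deploy(
#       model="bert-large.onnx",
#       constraints={"sla_ms": 100,              # latency target
#                    "private": ["embeddings"]}, # layers pinned on-device
#       topology="mec-cluster.yaml",             # device/edge-node inventory
#   )
```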

Limitations & Future Work

  • Scalability to thousands of nodes: The current prototype was evaluated on clusters of up to 20 edge nodes; the authors acknowledge that heuristic tuning will be needed for city‑scale deployments.
  • Model granularity: The approach works best when layers are relatively independent; highly inter‑dependent architectures (e.g., tightly coupled attention heads) may incur extra synchronization overhead.
  • Security assumptions: While data can be encrypted during transport, the framework does not yet incorporate secure multi‑party computation or homomorphic encryption for truly confidential inference.
  • Future directions include extending the optimizer to handle model updates (e.g., continual learning), integrating reinforcement‑learning‑based placement policies, and open‑sourcing the full orchestration stack for community benchmarking.

Authors

  • Aladin Djuhera
  • Fernando Koch
  • Alecio Binotto

Paper Information

  • arXiv ID: 2512.01039v1
  • Categories: cs.DC, cs.LG, cs.NI
  • Published: November 30, 2025
  • PDF: https://arxiv.org/pdf/2512.01039v1
