[Paper] Performance Isolation for Inference Processes in Edge GPU Systems

Published: January 12, 2026 at 09:49 AM EST
4 min read

Source: arXiv - 2601.07600v1

Overview

The paper evaluates how modern NVIDIA GPU isolation features—Multi‑Process Service (MPS), Multi‑Instance GPU (MIG), and the newly introduced Green Contexts—affect the predictability of deep‑learning inference on edge devices. By benchmarking both a data‑center‑class A100 and an edge‑focused Jetson Orin, the authors show which mechanisms can give safety‑critical applications the timing guarantees they need while still keeping GPU utilization high.

Key Contributions

  • Systematic comparison of MPS, MIG, and Green Contexts on two very different NVIDIA GPUs (A100 vs. Jetson Orin).
  • Quantitative isolation metrics: latency variance, throughput loss, and memory contention under mixed‑workload scenarios.
  • Demonstration that MIG delivers strong temporal and memory isolation on both platforms, albeit with a noticeable performance overhead for small partitions.
  • Introduction of Green Contexts as a low‑overhead, fine‑grained SM (Streaming Multiprocessor) allocation technique that works well on edge GPUs, though it lacks memory isolation.
  • Guidelines and best‑practice recommendations for developers building safety‑critical inference pipelines on shared GPUs.
  • Identification of open challenges (e.g., lack of memory protection in Green Contexts, coarse granularity of MIG on low‑power devices) and a roadmap for future research.

Methodology

  1. Platform selection – Experiments run on an NVIDIA A100 (PCIe) and an NVIDIA Jetson Orin (an Arm‑based SoC with an integrated GPU).
  2. Workloads – A set of representative inference models (ResNet‑50, BERT, YOLOv5) executed as separate processes or containers.
  3. Isolation configurations
    • MPS: multiple processes share the same GPU context.
    • MIG: the GPU is sliced into up to 7 (A100) or 4 (Orin) instances, each with dedicated SMs, memory, and cache.
    • Green Contexts: a recently introduced CUDA driver feature that pins a subset of SMs to a context without creating a full MIG instance.
  4. Metrics collected – End‑to‑end inference latency (mean, 95th‑percentile, jitter), throughput, GPU utilization, and memory bandwidth contention.
  5. Temporal isolation test – A “high‑priority” inference job runs concurrently with a “background” GPU‑heavy task (e.g., video encoding) to see how much the background load perturbs the latency of the critical job.
  6. Statistical analysis – Repeated runs (≥30 per configuration) to obtain tight confidence intervals and to isolate the variance introduced by the isolation mechanism itself; a minimal measurement harness in this style is sketched below.
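
To make steps 4–6 concrete, here is a minimal sketch of such a measurement harness, assuming PyTorch and torchvision are available (the paper reports using CUDA events and Nsight Systems; the model, batch size, and run counts here are illustrative):

```python
import numpy as np
import torch
from torchvision.models import resnet50

# Load one representative model; any of the paper's workloads would do.
model = resnet50().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

def measure(runs=30, warmup=5):
    """Time individual forward passes with CUDA events (milliseconds)."""
    times = []
    with torch.no_grad():
        for i in range(warmup + runs):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            model(x)
            end.record()
            torch.cuda.synchronize()  # block until the timed work finishes
            if i >= warmup:
                times.append(start.elapsed_time(end))
    return np.array(times)

t = measure()
print(f"mean={t.mean():.2f} ms  p95={np.percentile(t, 95):.2f} ms  "
      f"jitter={t.max() - t.min():.2f} ms")
```

Running the same loop while a background job saturates the GPU reproduces the temporal‑isolation test of step 5.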

Results & Findings

| Mechanism | Temporal Isolation | Memory Isolation | Avg. Latency Overhead | Notable Observations |
| --- | --- | --- | --- | --- |
| MPS | Moderate (jitter up to +30 ms) | No (shared memory) | ~5 % on A100, ~8 % on Orin | Simple to enable, but contention spikes when background jobs saturate the GPU. |
| MIG | Strong (jitter < 5 ms) | Yes (dedicated VRAM per instance) | 10–15 % for small slices, < 5 % for larger slices | Works on both platforms; fine‑grained slicing limited on Orin (max 4 instances). |
| Green Contexts | Good (jitter ≈ 10 ms) | No (shared memory) | < 3 % | Very low overhead; SM‑level allocation granularity; ideal for edge where MIG is unavailable or too coarse. |
  • MIG consistently delivered the most predictable latency, making it the safest choice for hard real‑time constraints, but the performance penalty grows when the GPU is split into many tiny instances.
  • Green Contexts achieved near‑zero overhead on the Jetson Orin, enabling developers to reserve just a few SMs for critical inference while leaving the rest for auxiliary tasks (e.g., sensor fusion).
  • MPS proved useful for workloads that can tolerate occasional latency spikes, offering the highest overall throughput when the GPU is fully utilized; a capped‑share MPS launch is sketched below.
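
As a concrete example of that trade‑off, the sketch below launches a critical job and a background job as MPS clients with capped SM shares, using the documented CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable; the two script names are hypothetical placeholders:

```python
import os
import subprocess

# Start the MPS control daemon (one-time, as the GPU's owning user);
# equivalent to running `nvidia-cuda-mps-control -d` in a shell.
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=False)

def launch(cmd, sm_percentage):
    """Launch an MPS client capped to a percentage of the GPU's SMs."""
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(sm_percentage))
    return subprocess.Popen(cmd, env=env)

# Hypothetical entry points: ~70% of SMs for the critical job, ~30% for
# the background task. The cap limits execution resources only; memory
# stays shared, so this is provisioning, not MIG-style isolation.
critical = launch(["python", "critical_inference.py"], 70)
background = launch(["python", "background_encoder.py"], 30)
critical.wait()
background.terminate()
```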

Practical Implications

  • Safety‑critical edge AI (autonomous drones, medical imaging, industrial robotics) can now choose a concrete isolation strategy rather than guessing. For strict timing guarantees, MIG is the go‑to, even on compact devices like the Orin.
  • Resource‑constrained deployments can leverage Green Contexts to carve out a “fast lane” for inference without the memory fragmentation that MIG introduces, keeping the rest of the GPU free for non‑critical tasks.
  • CI/CD pipelines for AI services can incorporate these isolation settings into Docker or Kubernetes GPU‑device plugins, ensuring that multi‑tenant inference servers do not interfere with each other.
  • Cost optimization: By partitioning a single high‑end GPU (A100) with MIG, multiple inference services can run side‑by‑side on the same hardware, reducing cloud GPU spend while still meeting SLAs; a partitioning sketch follows this list.
  • Developer tooling: The paper’s methodology can be replicated with open‑source scripts (CUDA events, Nsight Systems) to profile your own models and decide the right SM‑to‑process mapping.
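
As a starting point for that MIG workflow, here is a sketch of carving an A100 into two instances with stock nvidia-smi commands, driven from Python; the profile names and counts are illustrative and vary by GPU:

```python
import subprocess

def sh(cmd):
    """Run an nvidia-smi command and print its output."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(out.stdout or out.stderr)

# Enable MIG mode on GPU 0 (requires admin rights; may need a GPU reset).
sh("nvidia-smi -i 0 -mig 1")

# Create two 3g.20gb GPU instances with default compute instances (-C).
# Valid profile names differ per GPU; list them with `nvidia-smi mig -lgip`.
sh("nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb -C")

# Each instance appears with a MIG-<UUID>; pin an inference service to one
# by exporting CUDA_VISIBLE_DEVICES=MIG-<UUID> before launching it.
sh("nvidia-smi -L")
```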

Limitations & Future Work

  • Memory isolation missing in Green Contexts – without VRAM partitioning, a rogue process could still evict critical data from cache or cause page‑fault‑induced stalls.
  • Coarse MIG granularity on low‑power GPUs – the Jetson Orin only supports up to four instances, limiting flexibility for workloads that need many tiny slices.
  • Benchmark scope – only three models were tested; more diverse architectures (e.g., transformer‑based vision models) could reveal different contention patterns.
  • Dynamic re‑partitioning – the study kept partitions static; future work could explore runtime adaptation (e.g., scaling MIG instances on‑the‑fly based on workload).
  • Security aspects – while temporal isolation is addressed, the paper does not evaluate side‑channel leakage between contexts, an important consideration for multi‑tenant edge deployments.

Overall, the research provides a clear, data‑driven roadmap for engineers who need predictable GPU inference on the edge, and it opens several avenues for tighter, more flexible isolation mechanisms in the next generation of NVIDIA devices.

Authors

  • Juan José Martín
  • José Flich
  • Carles Hernández

Paper Information

  • arXiv ID: 2601.07600v1
  • Categories: cs.OS, cs.DC
  • Published: January 12, 2026