[Paper] Performance Isolation for Inference Processes in Edge GPU Systems

Published: January 12, 2026 at 09:49 AM EST
4 min read

Source: arXiv - 2601.07600v1

Overview

The paper evaluates how modern NVIDIA GPU isolation features—Multi‑Process Service (MPS), Multi‑Instance GPU (MIG), and the newly introduced Green Contexts—affect the predictability of deep‑learning inference on edge devices. By benchmarking both a data‑center‑class A100 and an edge‑focused Jetson Orin, the authors show which mechanisms can give safety‑critical applications the timing guarantees they need while still keeping GPU utilization high.

Key Contributions

  • Systematic comparison of MPS, MIG, and Green Contexts on two very different NVIDIA GPUs (A100 vs. Jetson Orin).
  • Quantitative isolation metrics: latency variance, throughput loss, and memory contention under mixed‑workload scenarios.
  • Demonstration that MIG delivers strong temporal and memory isolation on both platforms, albeit with a noticeable performance overhead for small partitions.
  • Introduction of Green Contexts as a low‑overhead, fine‑grained SM (Streaming Multiprocessor) allocation technique that works well on edge GPUs, though it lacks memory isolation.
  • Guidelines and best‑practice recommendations for developers building safety‑critical inference pipelines on shared GPUs.
  • Identification of open challenges (e.g., lack of memory protection in Green Contexts, coarse granularity of MIG on low‑power devices) and a roadmap for future research.

Methodology

  1. Platform selection – Experiments run on an NVIDIA A100 (PCIe) and an NVIDIA Jetson Orin (an Arm‑based SoC with an integrated GPU).
  2. Workloads – A set of representative inference models (ResNet‑50, BERT, YOLOv5) executed as separate processes or containers.
  3. Isolation configurations
    • MPS: multiple processes share the same GPU context.
    • MIG: the GPU is sliced into up to 7 (A100) or 4 (Orin) instances, each with dedicated SMs, memory, and cache.
    • Green Contexts: a recently introduced CUDA driver feature that pins a subset of SMs to a context without creating a full MIG instance.
  4. Metrics collected – End‑to‑end inference latency (mean, 95th‑percentile, jitter), throughput, GPU utilization, and memory bandwidth contention.
  5. Temporal isolation test – A “high‑priority” inference job runs concurrently with a “background” GPU‑heavy task (e.g., video encoding) to see how much the background load perturbs the latency of the critical job.
  6. Statistical analysis – Repeated runs (≥30 per configuration) to obtain tight confidence intervals and to isolate the variance introduced by the isolation mechanism itself; a minimal measurement harness in this style is sketched below.
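
To make steps 4–6 concrete, here is a minimal sketch of such a measurement harness, assuming PyTorch and torchvision are available (the paper reports using CUDA events and Nsight Systems; the model, batch size, and run counts here are illustrative):

```python
import numpy as np
import torch
from torchvision.models import resnet50

# Load one representative model; any of the paper's workloads would do.
model = resnet50().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

def measure(runs=30, warmup=5):
    """Time individual forward passes with CUDA events (milliseconds)."""
    times = []
    with torch.no_grad():
        for i in range(warmup + runs):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            model(x)
            end.record()
            torch.cuda.synchronize()  # block until the timed work finishes
            if i >= warmup:
                times.append(start.elapsed_time(end))
    return np.array(times)

t = measure()
print(f"mean={t.mean():.2f} ms  p95={np.percentile(t, 95):.2f} ms  "
      f"jitter={t.max() - t.min():.2f} ms")
```

Running the same loop while a background job saturates the GPU reproduces the temporal‑isolation test of step 5.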

Results & Findings

| Mechanism | Temporal Isolation | Memory Isolation | Avg. Latency Overhead | Notable Observations |
| --- | --- | --- | --- | --- |
| MPS | Moderate (jitter up to +30 ms) | No (shared memory) | ~5 % on A100, ~8 % on Orin | Simple to enable, but contention spikes when background jobs saturate the GPU. |
| MIG | Strong (jitter < 5 ms) | Yes (dedicated VRAM per instance) | 10–15 % for small slices, < 5 % for larger slices | Works on both platforms; fine‑grained slicing limited on Orin (max 4 instances). |
| Green Contexts | Good (jitter ≈ 10 ms) | No (shared memory) | < 3 % | Very low overhead; SM‑level allocation granularity; ideal for edge where MIG is unavailable or too coarse. |
  • MIG consistently delivered the most predictable latency, making it the safest choice for hard real‑time constraints, but the performance penalty grows when the GPU is split into many tiny instances.
  • Green Contexts achieved near‑zero overhead on the Jetson Orin, enabling developers to reserve just a few SMs for critical inference while leaving the rest for auxiliary tasks (e.g., sensor fusion).
  • MPS proved useful for workloads that can tolerate occasional latency spikes, offering the highest overall throughput when the GPU is fully utilized; a capped‑share MPS launch is sketched below.
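
As a concrete example of that trade‑off, the sketch below launches a critical job and a background job as MPS clients with capped SM shares, using the documented CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable; the two script names are hypothetical placeholders:

```python
import os
import subprocess

# Start the MPS control daemon (one-time, as the GPU's owning user);
# equivalent to running `nvidia-cuda-mps-control -d` in a shell.
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=False)

def launch(cmd, sm_percentage):
    """Launch an MPS client capped to a percentage of the GPU's SMs."""
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(sm_percentage))
    return subprocess.Popen(cmd, env=env)

# Hypothetical entry points: ~70% of SMs for the critical job, ~30% for
# the background task. The cap limits execution resources only; memory
# stays shared, so this is provisioning, not MIG-style isolation.
critical = launch(["python", "critical_inference.py"], 70)
background = launch(["python", "background_encoder.py"], 30)
critical.wait()
background.terminate()
```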

Practical Implications

  • Safety‑critical edge AI (autonomous drones, medical imaging, industrial robotics) can now choose a concrete isolation strategy rather than guessing. For strict timing guarantees, MIG is the go‑to, even on compact devices like the Orin.
  • Resource‑constrained deployments can leverage Green Contexts to carve out a “fast lane” for inference without the memory fragmentation that MIG introduces, keeping the rest of the GPU free for non‑critical tasks.
  • CI/CD pipelines for AI services can incorporate these isolation settings into Docker or Kubernetes GPU‑device plugins, ensuring that multi‑tenant inference servers do not interfere with each other.
  • Cost optimization: By partitioning a single high‑end GPU (A100) with MIG, multiple inference services can run side‑by‑side on the same hardware, reducing cloud GPU spend while still meeting SLAs; a partitioning sketch follows this list.
  • Developer tooling: The paper’s methodology can be replicated with open‑source scripts (CUDA events, Nsight Systems) to profile your own models and decide the right SM‑to‑process mapping.
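
As a starting point for that MIG workflow, here is a sketch of carving an A100 into two instances with stock nvidia-smi commands, driven from Python; the profile names and counts are illustrative and vary by GPU:

```python
import subprocess

def sh(cmd):
    """Run an nvidia-smi command and print its output."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(out.stdout or out.stderr)

# Enable MIG mode on GPU 0 (requires admin rights; may need a GPU reset).
sh("nvidia-smi -i 0 -mig 1")

# Create two 3g.20gb GPU instances with default compute instances (-C).
# Valid profile names differ per GPU; list them with `nvidia-smi mig -lgip`.
sh("nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb -C")

# Each instance appears with a MIG-<UUID>; pin an inference service to one
# by exporting CUDA_VISIBLE_DEVICES=MIG-<UUID> before launching it.
sh("nvidia-smi -L")
```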

Limitations & Future Work

  • Memory isolation missing in Green Contexts – without VRAM partitioning, a rogue process could still evict critical data from cache or cause page‑fault‑induced stalls.
  • Coarse MIG granularity on low‑power GPUs – the Jetson Orin only supports up to four instances, limiting flexibility for workloads that need many tiny slices.
  • Benchmark scope – only three models were tested; more diverse architectures (e.g., transformer‑based vision models) could reveal different contention patterns.
  • Dynamic re‑partitioning – the study kept partitions static; future work could explore runtime adaptation (e.g., scaling MIG instances on‑the‑fly based on workload).
  • Security aspects – while temporal isolation is addressed, the paper does not evaluate side‑channel leakage between contexts, an important consideration for multi‑tenant edge deployments.

Overall, the research provides a clear, data‑driven roadmap for engineers who need predictable GPU inference on the edge, and it opens several avenues for tighter, more flexible isolation mechanisms in the next generation of NVIDIA devices.

Authors

  • Juan José Martín
  • José Flich
  • Carles Hernández

Paper Information

  • arXiv ID: 2601.07600v1
  • Categories: cs.OS, cs.DC
  • Published: January 12, 2026