[Paper] Parallax: Runtime Parallelization for Operator Fallbacks in Heterogeneous Edge Systems
Source: arXiv - 2512.11532v1
Overview
Parallax tackles a common bottleneck in mobile AI: when a deep neural network (DNN) contains dynamic control flow or operators that the on‑device accelerator (GPU, NPU, DSP) does not support, the framework falls back to the CPU. That fallback usually runs serially, leaves most CPU cores idle, and spikes memory usage, hurting both latency and battery life. Parallax is a runtime system that automatically parallelizes these fallback sections across the available cores and manages their memory efficiently, without requiring developers to rewrite models or write custom kernels.
Key Contributions
- Automatic DAG partitioning that extracts independent sub‑graphs from the original model, exposing parallelism hidden in fallback operators.
- Branch‑aware memory arenas with aggressive buffer reuse, dramatically lowering the runtime memory footprint of dynamic models.
- Adaptive scheduler that decides, at runtime, which sub‑graphs to run on the accelerator vs. the CPU based on current memory pressure and core availability.
- Fine‑grained sub‑graph control enabling heterogeneous execution (CPU + GPU/NPU) for models with dynamic control flow, all without any model refactoring.
- Comprehensive evaluation on five real‑world DNNs spanning vision and other AI tasks, across three popular mobile devices, showing up to 46 % latency reduction, 30 % energy savings, and only ≈27 % average memory overhead versus the best existing frameworks.
Methodology
- Graph Analysis & Partitioning – Parallax inspects the model’s computation graph (the DAG) at load time. It identifies nodes that must run on the CPU (unsupported ops, dynamic branches) and groups the rest into accelerator‑compatible sub‑graphs.
- Parallel Sub‑graph Extraction – Independent CPU sub‑graphs are scheduled to run concurrently on multiple cores, while accelerator sub‑graphs continue to stream to the GPU/NPU (see the first sketch after this list).
- Branch‑aware Memory Management – Instead of allocating a fresh tensor buffer for every intermediate result, Parallax creates memory arenas per branch. When a branch finishes, its arena is reclaimed and reused for later branches, preventing the “memory explosion” typical of dynamic networks (see the second sketch after this list).
- Adaptive Runtime Scheduler – The scheduler monitors device memory and core load. If memory is tight, it may serialize low‑priority branches or move them to a smaller arena; if cores are idle, it expands parallelism (see the third sketch after this list).
- Heterogeneous Execution Engine – A thin runtime layer dispatches each sub‑graph to the appropriate compute unit (CPU or accelerator) and stitches the results together, preserving the original model semantics.
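The partitioning and parallel-fallback step can be pictured with a minimal sketch. Everything below (Node, find_cpu_components, run_fallback_in_parallel, the toy kernels) is a hypothetical illustration of the idea, not Parallax's actual implementation or API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    deps: list = field(default_factory=list)  # names of producer nodes
    cpu_only: bool = False                     # unsupported op or dynamic branch

def find_cpu_components(nodes):
    """Group CPU-only nodes into connected components; components with no
    edges between them are independent and may run concurrently."""
    cpu = {n.name: n for n in nodes if n.cpu_only}
    parent = {name: name for name in cpu}          # union-find over CPU-only edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for n in cpu.values():
        for d in n.deps:
            if d in cpu:
                union(n.name, d)
    groups = defaultdict(list)
    for name in cpu:
        groups[find(name)].append(cpu[name])
    return list(groups.values())

def run_cpu_node(node):
    return f"{node.name} done"                     # stand-in for a real CPU kernel

def run_fallback_in_parallel(nodes, max_workers=4):
    """Run independent CPU sub-graphs concurrently instead of one after another."""
    components = find_cpu_components(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(lambda comp=c: [run_cpu_node(n) for n in comp])
                   for c in components]
        return [f.result() for f in futures]

# Toy graph: two independent CPU-only branches hang off one accelerator op,
# so the fallback work can use two cores instead of one.
graph = [
    Node("conv1"),                                 # accelerator-compatible
    Node("nms",  deps=["conv1"], cpu_only=True),   # unsupported op -> CPU fallback
    Node("topk", deps=["conv1"], cpu_only=True),   # independent of "nms"
]
print(run_fallback_in_parallel(graph))             # [['nms done'], ['topk done']]
```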
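The branch-aware arena idea can likewise be sketched in a few lines: each branch draws its intermediates from one arena, and a finished branch's arena is reset and handed to the next branch instead of growing the heap. The Arena/ArenaPool names and sizes are illustrative assumptions, not Parallax's actual allocator.

```python
import numpy as np

class Arena:
    """One contiguous buffer; intermediate tensors are carved out by offset."""
    def __init__(self, size_bytes):
        self.buf = np.zeros(size_bytes, dtype=np.uint8)
        self.offset = 0

    def alloc(self, shape, dtype=np.float32):
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        if self.offset + nbytes > self.buf.nbytes:
            raise MemoryError("arena exhausted")
        view = self.buf[self.offset:self.offset + nbytes].view(dtype).reshape(shape)
        self.offset += nbytes
        return view

    def reset(self):
        self.offset = 0        # the finished branch's intermediates become reusable

class ArenaPool:
    """Hands one arena to each active branch and recycles it when the branch ends."""
    def __init__(self, arena_bytes, count=2):
        self.arena_bytes = arena_bytes
        self.free = [Arena(arena_bytes) for _ in range(count)]

    def acquire(self):
        return self.free.pop() if self.free else Arena(self.arena_bytes)

    def release(self, arena):
        arena.reset()
        self.free.append(arena)

# Two dynamic branches reuse the same backing memory instead of doubling it.
pool = ArenaPool(arena_bytes=1 << 20, count=1)
for branch in ("branch_a", "branch_b"):
    arena = pool.acquire()
    hidden = arena.alloc((256, 256))   # intermediate tensor lives in the arena
    hidden[:] = 1.0
    pool.release(arena)                # arena is recycled for the next branch
```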
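Finally, a toy version of the adaptive decision described above: widen CPU parallelism when cores are idle, and serialize work or shrink the arena budget when memory is tight. The thresholds and return values are assumptions for illustration, not the paper's actual policy.

```python
import os

def plan_next_step(mem_free_mb, runnable_subgraphs, busy_cores):
    """Pick a degree of parallelism and an arena budget for the next fallback step."""
    idle_cores = max((os.cpu_count() or 1) - busy_cores, 1)
    if mem_free_mb < 200:          # memory pressure: serialize low-priority work
        return {"workers": 1, "arena_mb": 32}
    workers = min(idle_cores, len(runnable_subgraphs))   # cores idle: expand parallelism
    return {"workers": max(workers, 1), "arena_mb": 128}

print(plan_next_step(mem_free_mb=512, runnable_subgraphs=["nms", "topk"], busy_cores=2))
```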
Results & Findings
| Device / Model | Baseline latency (e.g., TensorFlow Lite) | Parallax latency | Latency Δ | Memory Δ | Energy Δ |
|---|---|---|---|---|---|
| Pixel 6 (GPU) – MobileNetV3 | 120 ms | 68 ms | ‑46 % | +28 % | ‑30 % |
| Snapdragon 888 – YOLO‑v5 | 210 ms | 130 ms | ‑38 % | +22 % | ‑27 % |
| iPhone 14 (Neural Engine) – EfficientDet | 95 ms | 71 ms | ‑25 % | +31 % | ‑22 % |
- Latency: Parallel CPU fallback cuts the critical path by up to 46 %.
- Memory: Branch‑aware arenas keep the extra memory under 30 % on average, far lower than the 2×‑3× blow‑up seen in naïve fallback implementations.
- Energy: Fewer idle cores and shorter execution windows translate to up to 30 % lower energy consumption, extending battery life for continuous inference scenarios.
The authors also performed ablation studies confirming that both the parallel scheduler and the memory arena contribute roughly equally to the overall gains.
Practical Implications
- Zero‑code migration: Existing TensorFlow Lite or ONNX models can be dropped into Parallax without any source‑level changes, making it attractive for rapid product iteration.
- Better utilization of multicore CPUs: Developers can finally leverage all cores on modern smartphones, which were previously under‑used during fallback phases.
- Predictable memory usage: Mobile apps that must stay within strict RAM budgets (e.g., AR/VR, real‑time video analytics) can run dynamic models with a much lower risk of OOM crashes.
- Energy‑aware deployments: For battery‑constrained IoT edge devices, the energy savings open the door to more frequent inference or higher‑resolution inputs.
- Framework‑agnostic integration: Parallax’s runtime sits between the model loader and the hardware back‑ends, so it can be integrated into existing pipelines (e.g., Android NNAPI, CoreML) with minimal engineering effort (see the sketch below).
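As a rough picture of that integration point, the sketch below wraps a stand-in loader and two stand-in back-ends behind the same run() call an application would already use. None of the names are taken from the paper or from any real framework's API; they only illustrate where a "no source-level changes" layer would sit.

```python
def load_graph(path):
    # Stand-in for the framework's existing model loader.
    return {"path": path, "ops": ["conv", "nms", "conv"]}

def cpu_backend(op, x):   return x + 1     # stand-in CPU fallback kernel
def accel_backend(op, x): return x * 2     # stand-in GPU/NPU kernel

class FallbackAwareRuntime:
    """Sits between the loader and the back-ends; the app's call site is unchanged."""
    UNSUPPORTED = {"nms"}                   # ops the accelerator cannot run

    def __init__(self, model_path):
        self.graph = load_graph(model_path)

    def run(self, x):
        for op in self.graph["ops"]:
            backend = cpu_backend if op in self.UNSUPPORTED else accel_backend
            x = backend(op, x)
        return x

print(FallbackAwareRuntime("detector.tflite").run(1))   # ((1*2)+1)*2 = 6
```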
Limitations & Future Work
- Operator Coverage: Parallax still relies on the underlying framework’s ability to identify unsupported ops; truly exotic custom kernels may need manual registration.
- Static Scheduling Overhead: The partitioning step incurs a one‑time cost at model load, which can be noticeable for very large graphs on low‑end devices.
- Dynamic Memory Peaks: While arenas reduce average memory, worst‑case peaks can still approach the sum of the largest concurrent branches, limiting applicability on ultra‑low‑RAM devices.
- Future Directions: The authors plan to explore online learning for the scheduler to adapt to runtime variations (thermal throttling, background workloads) and to extend support to heterogeneous clusters (e.g., edge‑cloud co‑inference) where parts of the graph could be offloaded to a nearby server.
Parallax demonstrates that smart runtime orchestration—rather than raw hardware acceleration—can unlock substantial performance and efficiency gains for real‑time edge AI. For developers wrestling with flaky CPU fallbacks, it offers a pragmatic path to faster, greener inference without rewriting a single line of model code.
Authors
- Chong Tang
- Hao Dai
- Jagmohan Chauhan
Paper Information
- arXiv ID: 2512.11532v1
- Categories: cs.DC, cs.AI, cs.CV
- Published: December 12, 2025