[Paper] HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC
Source: arXiv - 2603.01661v1
Overview
The paper introduces HeRo, a framework that adaptively orchestrates the many moving parts of an agentic Retrieval‑Augmented Generation (RAG) pipeline on modern heterogeneous mobile System‑on‑Chips (SoCs). By profiling each model‑accelerator pairing and dynamically scheduling work across CPUs, GPUs, NPUs, and DSPs, HeRo cuts end‑to‑end latency by up to 10.94× versus static deployment, making on‑device LLM‑powered assistants feasible for everyday smartphones and wearables.
Key Contributions
- Profiling‑driven performance models for every sub‑stage of an agentic RAG workflow, capturing latency, input shape sensitivity, and memory‑bandwidth contention on each processing unit (PU).
- Shape‑aware sub‑stage partitioning that splits workloads (e.g., token‑level vs. document‑level inference) to match the sweet spot of each accelerator.
- Criticality‑based accelerator mapping that assigns latency‑critical stages (e.g., query generation) to the fastest PU while off‑loading less‑time‑sensitive parts (e.g., dense retrieval) to lower‑power units.
- Bandwidth‑aware concurrency control that throttles parallel execution to avoid shared‑memory bottlenecks, a common issue on mobile SoCs.
- Lightweight online scheduler that uses the above models to make per‑inference decisions with negligible overhead.
- Empirical validation on commercial smartphones showing up to 10.94× latency reduction compared with naïve static deployment strategies.
Methodology
- Decompose the RAG pipeline – The authors break the agentic RAG flow into distinct stages: (a) user query encoding, (b) dense/sparse retrieval, (c) reasoning/plan generation, (d) answer synthesis, and (e) optional tool calling.
- Profile each stage – For every stage they run micro‑benchmarks on each available PU (CPU cores, GPU, NPU, DSP) across a range of input shapes (batch size, sequence length). The profiling captures three key metrics: raw compute latency, shape‑dependent performance curves, and slowdown caused by concurrent memory accesses.
- Build analytical models – Using the profiling data, simple regression models predict latency for any (stage, PU, shape) tuple and estimate the extra penalty incurred when multiple PUs contend for the shared memory bus (see the first sketch after this list).
- Online scheduling algorithm – At runtime, HeRo (see the second sketch after this list):
- Partitions the current request into shape‑optimal sub‑tasks.
- Ranks sub‑tasks by criticality (how much they affect overall latency).
- Maps each sub‑task to the PU that minimizes the predicted end‑to‑end time while respecting a global bandwidth budget.
- Executes the plan, monitoring actual latency and adjusting the model on‑the‑fly if predictions drift.
- Evaluation – The framework is tested on several flagship Android devices (e.g., Snapdragon 8 Gen 2, MediaTek Dimensity 9200) using open‑source LLMs (LLaMA‑2‑7B, Mistral‑7B) and retrieval back‑ends (FAISS, BM25).
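To make the profiling and modeling steps concrete, below is a minimal Python sketch of how a per‑(stage, PU) latency model might be fit from micro‑benchmark samples. The linear‑in‑sequence‑length form, the multiplicative contention penalty, and all coefficients are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch: fit a per-(stage, PU) latency model from
# micro-benchmark samples. The linear-in-sequence-length form and the
# multiplicative contention penalty are assumptions for exposition,
# not the paper's exact formulation.
from dataclasses import dataclass

@dataclass
class LatencyModel:
    per_token_ms: float       # slope: cost per input token
    fixed_ms: float           # intercept: kernel-launch / setup cost
    contention_factor: float  # slowdown per extra PU on the shared bus

    def predict(self, seq_len: int, concurrent_pus: int = 0) -> float:
        base = self.per_token_ms * seq_len + self.fixed_ms
        # Each additional PU touching shared DRAM inflates latency.
        return base * (1.0 + self.contention_factor * concurrent_pus)

def fit(samples: list[tuple[int, float]]) -> LatencyModel:
    """Ordinary least squares over (seq_len, measured_ms) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = cov / var
    return LatencyModel(per_token_ms=slope,
                        fixed_ms=mean_y - slope * mean_x,
                        contention_factor=0.15)  # placeholder; measured in practice

# Example: hypothetical micro-benchmark of query encoding on an NPU.
npu_encoder = fit([(64, 21.0), (128, 38.0), (256, 73.0)])
print(f"{npu_encoder.predict(128, concurrent_pus=1):.1f} ms")
```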
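The runtime decision loop (partition, rank, map, execute) can be compressed into a similar sketch. Stage names, criticality scores, and the bandwidth budget below are invented for exposition; HeRo derives them from its profiles.

```python
# Illustrative sketch of the online decision loop. PU names,
# criticality scores, and the bandwidth budget are invented for
# exposition; the real system derives them from its profiles.
from dataclasses import dataclass

@dataclass
class SubTask:
    stage: str          # e.g. "query_encode", "retrieve", "synthesize"
    seq_len: int        # input shape after partitioning
    criticality: float  # estimated contribution to end-to-end latency

@dataclass
class PU:
    name: str                               # e.g. "NPU", "GPU", "DSP"
    bandwidth_gbps: float                   # share of the DRAM bus consumed
    coeffs: dict[str, tuple[float, float]]  # stage -> (per_token_ms, fixed_ms)

    def predict_ms(self, task: SubTask) -> float:
        per_tok, fixed = self.coeffs[task.stage]
        return per_tok * task.seq_len + fixed

BANDWIDTH_BUDGET_GBPS = 40.0  # assumed shared-memory ceiling

def schedule(tasks: list[SubTask], pus: list[PU]) -> dict[str, str]:
    """Greedy criticality-first mapping under a global bandwidth budget."""
    plan: dict[str, str] = {}
    used_bw = 0.0
    # Latency-critical sub-tasks claim their best PU first.
    for task in sorted(tasks, key=lambda t: t.criticality, reverse=True):
        feasible = [p for p in pus
                    if used_bw + p.bandwidth_gbps <= BANDWIDTH_BUDGET_GBPS]
        # If the budget is exhausted, fall back to the lightest PU rather
        # than risk the memory-wall slowdown of oversubscription.
        pool = feasible or [min(pus, key=lambda p: p.bandwidth_gbps)]
        best = min(pool, key=lambda p: p.predict_ms(task))
        plan[task.stage] = best.name
        used_bw += best.bandwidth_gbps
    return plan
```

A full implementation would also close the loop described in the last step above, adjusting the fitted models whenever measured latencies drift from predictions.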
Results & Findings
| Metric | Baseline (static CPU) | Static GPU | HeRo (adaptive) |
|---|---|---|---|
| End‑to‑end latency (avg) | 2.84 s | 1.96 s | 0.26 s |
| 90th‑percentile latency | 3.12 s | 2.21 s | 0.31 s |
| Energy per query | 1.84 J | 1.57 J | 0.92 J |
| Speedup vs. baseline | 1× | 1.45× | 10.94× |
- Latency reduction is most pronounced for the query‑encoding and reasoning stages, where HeRo places the LLM on the NPU (which excels at transformer kernels) while keeping retrieval on the DSP.
- Bandwidth awareness prevents the “memory wall” effect that would otherwise nullify the gains of parallel GPU+NPU execution.
- Energy efficiency improves because critical stages run on the most power‑efficient accelerator for their shape, and idle units are throttled.
Practical Implications
- On‑device AI assistants can now respond with sub‑second latency without offloading to the cloud, preserving user privacy and reducing data‑plan costs.
- Edge applications such as real‑time translation, code completion, or context‑aware AR overlays become feasible on smartphones, wearables, and even IoT gateways.
- Developers can integrate HeRo as a library (e.g., via a lightweight C++/JNI wrapper) to automatically get optimal scheduling for any new LLM or retrieval model they plug in, without hand‑tuning for each device (a hypothetical usage sketch follows this list).
- Hardware vendors gain a concrete use‑case for exposing fine‑grained performance counters and bandwidth‑control APIs, encouraging more open heterogeneous scheduling support in future SoCs.
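To illustrate what that integration surface could look like, here is a hypothetical sketch; the `hero` module and every call on it (`load_pipeline`, `calibrate`, `run`) are invented names, since the paper does not publish a public API.

```python
# Hypothetical integration sketch -- the `hero` module and every call
# on it are invented to illustrate the intended developer experience;
# the paper does not specify a public API.
import hero  # imagined Python facade over a C++/JNI core

pipeline = hero.load_pipeline(
    llm="mistral-7b-q4.gguf",         # any transformer checkpoint
    retriever="faiss:/sdcard/index",  # dense retrieval back-end
)

# Per-device profiling and PU mapping run once, then are cached.
pipeline.calibrate()

# Each query is partitioned, ranked, and mapped automatically.
answer = pipeline.run("What does the battery-saver setting do?")
print(answer.text, answer.latency_ms)
```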
Limitations & Future Work
- Model coverage – The current profiling suite focuses on transformer‑based LLMs up to 7B parameters; scaling to 30B‑plus models may require additional memory‑management strategies.
- Dynamic workloads – HeRo assumes relatively stable request patterns; bursty traffic could temporarily exceed the bandwidth budget, leading to queuing delays.
- Cross‑device portability – While the framework abstracts PU types, each SoC vendor’s driver stack differs, so a small amount of platform‑specific glue code is still needed.
- Future directions suggested by the authors include extending the scheduler to handle multi‑user scenarios, integrating reinforcement‑learning‑based policy refinement, and exploring compiler‑level operator fusion to further shrink latency.
Authors
- Maoliang Li
- Jiayu Chen
- Zihao Zheng
- Ziqian Li
- Xinhao Sun
- Guojie Luo
- Chenchen Liu
- Xiang Chen
Paper Information
- arXiv ID: 2603.01661v1
- Categories: cs.DC
- Published: March 2, 2026