[Paper] Deadline-Driven Hierarchical Agentic Resource Sharing for AI Services and RAN Functions in AI-RAN
Source: arXiv - 2605.07547v1
Overview
The paper introduces the Hierarchical Agentic Framework (HAF) for deadline‑driven resource sharing, a two‑layer control system that lets edge‑deployed AI services and real‑time Radio Access Network (RAN) functions coexist on the same GPU‑accelerated hardware. By pairing a slow‑timescale large language model (LLM) planner with a fast, deadline‑aware convex optimizer, HAF raises service‑level objective (SLO) fulfillment to about 90 % in the reported experiments while keeping migration overhead low.
Key Contributions
- Hierarchical Agentic Framework (HAF): Combines an LLM‑based placement agent (slow timescale) with a closed‑form convex allocator (fast timescale) to handle mismatched scheduling horizons.
- Predictive Migration Critic: A lightweight predictor that estimates whether the interruption caused by moving a service would outweigh the resulting SLO gain, preventing unnecessary migrations.
- Deadline‑Aware Convex Allocation: Derives a fast, analytically solvable resource‑allocation formula that respects per‑task deadlines on CPU/GPU slices.
- Comprehensive Evaluation: Shows 90 % overall SLO fulfillment (about 20 percentage points above the strongest baseline) and lifts AI request success from 51 % to 85.3 % across varied load patterns.
- Open‑Source LLM Compatibility: Demonstrates that the critic improves SLO outcomes for multiple publicly available LLM agents, highlighting the approach’s portability.
Methodology
- Problem Decomposition
  - Slow‑timescale (minutes to hours): Decide where each AI service and RAN function should run (which edge node).
  - Fast‑timescale (milliseconds to seconds): Decide how much CPU/GPU each active task receives to meet its deadline.
- LLM‑Based Placement Agent
  - The agent is prompted with a concise description of the current edge topology, workload mix, and SLO targets.
  - It outputs a placement plan (e.g., “move Service A to Node 3”). The LLM’s reasoning ability helps capture complex constraints (e.g., co‑location of related services).
- Predictive Migration Critic
  - Before any migration, the critic estimates the interruption time (e.g., container warm‑up, model loading).
  - It compares this cost against the projected SLO improvement from the new placement. Migration proceeds only if the net benefit is positive.
- Fast‑Timescale Convex Scheduler
  - The scheduler formulates each task’s deadline as a linear constraint on allocated compute cycles.
  - The objective minimizes total deadline violation while respecting the GPU/CPU capacity limits.
  - Because the problem is convex and has a closed‑form solution, the scheduler runs in well under a millisecond per scheduling interval, enabling real‑time adjustments.
- Integration Loop
  - The LLM agent runs periodically (e.g., every 5 min).
  - The critic filters its suggestions.
  - The convex scheduler continuously reallocates resources based on the current placement.
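
The critic's accept/reject rule, and where it sits between the planner and the scheduler, can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `MigrationProposal` fields, the `net_benefit` cost model (SLO‑compliant requests gained over a planning horizon minus requests lost during the interruption), and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MigrationProposal:
    """One placement change suggested by the slow-timescale planner (hypothetical schema)."""
    service: str
    target_node: str
    predicted_slo_gain: float        # assumed: extra fraction of requests meeting their SLO after the move
    predicted_interruption_s: float  # assumed: downtime from container warm-up and model loading


def net_benefit(p: MigrationProposal, requests_per_s: float, horizon_s: float) -> float:
    """Requests gained over the horizon minus requests lost while the service is down.

    Assumed cost model: every request arriving during the interruption misses its SLO.
    """
    serving_time = max(horizon_s - p.predicted_interruption_s, 0.0)
    gained = p.predicted_slo_gain * requests_per_s * serving_time
    lost = requests_per_s * p.predicted_interruption_s
    return gained - lost


def critic_filter(proposals, request_rates, horizon_s):
    """Keep only the proposals whose estimated net benefit is positive."""
    return [
        p for p in proposals
        if net_benefit(p, request_rates[p.service], horizon_s) > 0.0
    ]


# Made-up proposals, as if returned by the LLM placement agent.
proposals = [
    MigrationProposal("video-analytics", "node-3", 0.10, 2.0),
    MigrationProposal("predictive-maintenance", "node-1", 0.01, 10.0),
]
request_rates = {"video-analytics": 50.0, "predictive-maintenance": 50.0}

# A 300 s horizon mirrors the 5-minute planner period given as an example.
accepted = critic_filter(proposals, request_rates, horizon_s=300.0)
print([p.service for p in accepted])  # prints ['video-analytics']
```

The first proposal is accepted because its short interruption is repaid many times over within the horizon; the second is rejected because a 10 s outage costs more requests than a 1 % SLO gain recovers. Only accepted proposals would change the placement that the fast scheduler then works with.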
Results & Findings
| Metric | HAF | Best Baseline | Change (HAF vs. baseline) |
|---|---|---|---|
| Overall SLO fulfillment | 90.0 % | 69.5 % | +20.5 pp |
| AI service request success | 85.3 % | 51.0 % | +34.3 pp |
| RAN function deadline miss rate | 4.2 % | 12.8 % | −8.6 pp |
| Migration‑induced interruption (avg.) | 0.12 s | 0.31 s | −0.19 s |
- Robustness: HAF maintained its edge across low, medium, and high load scenarios, with only modest performance dips under extreme overload.
- Critic Effectiveness: Across three open‑source LLM agents (GPT‑2‑small, LLaMA‑7B, Falcon‑40B), the critic consistently added 3–7 % SLO gain by suppressing harmful migrations.
- Latency: The convex allocator solved the resource‑allocation problem in < 0.5 ms per scheduling interval, well within the real‑time requirements of 5G/6G RAN functions.
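
Sub‑millisecond solve times are plausible for allocators with an analytic solution. As an illustration only (the paper's actual formula is not reproduced in this summary), consider an assumed stand‑in problem: minimize the sum of deadline‑normalized completion times, Σ w_i / (d_i x_i), subject to Σ x_i = C, where w_i is the remaining work of task i, d_i its time to deadline, x_i the compute rate it receives, and C the node capacity. Setting the gradient of the Lagrangian to zero gives x_i proportional to sqrt(w_i / d_i), so the whole allocation is a single pass over the tasks. All function and variable names below are hypothetical.

```python
import math


def allocate(work, deadline, capacity):
    """Closed-form minimizer of sum_i work[i] / (deadline[i] * x[i]) s.t. sum(x) == capacity.

    Stationarity of the Lagrangian gives x[i] proportional to sqrt(work[i] / deadline[i]).
    """
    weights = [math.sqrt(w / d) for w, d in zip(work, deadline)]
    total = sum(weights)
    return [capacity * wt / total for wt in weights]


def normalized_latency(work, deadline, alloc):
    """Objective value: each task's completion time divided by its deadline, summed."""
    return sum(w / (d * x) for w, d, x in zip(work, deadline, alloc))


def missed_deadlines(work, deadline, alloc):
    """Indices of tasks whose completion time work/alloc exceeds the deadline."""
    return [i for i, (w, d, x) in enumerate(zip(work, deadline, alloc)) if w / x > d]


work = [4.0, 1.0]      # remaining giga-cycles per task (made-up values)
deadline = [4.0, 4.0]  # seconds until each deadline (made-up values)
alloc = allocate(work, deadline, capacity=3.0)  # capacity in giga-cycles per second

print(alloc)                                    # [2.0, 1.0]
print(missed_deadlines(work, deadline, alloc))  # []
```

This stand‑in treats deadlines as weights rather than hard constraints, so it can only report misses after the fact. It is meant to show why a closed‑form rule costs a handful of arithmetic operations per task, not to reproduce HAF's allocator.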
Practical Implications
- Edge Operators: Deploying HAF can let telecom operators host AI inference (e.g., video analytics, predictive maintenance) alongside latency‑critical RAN functions without over‑provisioning hardware.
- AI Service Providers: The framework offers a “plug‑and‑play” placement API that automatically decides the optimal edge node, reducing the need for manual capacity planning.
- Developer Tooling: The fast convex scheduler can be exposed as a library (e.g., a Rust crate or a Go module) for any edge‑native workload that needs deadline‑aware CPU/GPU throttling.
- Cost Savings: By avoiding unnecessary migrations and improving resource packing, operators can achieve up to 30 % lower hardware spend while still meeting 5G/6G SLOs.
- Standardization Path: HAF’s clear separation of placement (slow) and allocation (fast) aligns with emerging ETSI MEC and O‑RAN interfaces, making integration into existing orchestration stacks straightforward.
Limitations & Future Work
- LLM Prompt Engineering: The placement quality depends on well‑crafted prompts; suboptimal prompts can degrade decisions. Automating prompt generation is an open challenge.
- Model Loading Overheads: The current migration cost model assumes linear warm‑up time; real GPU memory fragmentation or large model checkpoints may introduce non‑linear delays.
- Scalability to Hundreds of Nodes: Experiments were limited to a 5‑node testbed. Scaling the hierarchical control loop to city‑wide edge clusters will require hierarchical aggregation or federated critics.
- Security & Trust: Relying on LLM reasoning raises concerns about explainability and potential policy violations; future work will explore verifiable reasoning traces.
Overall, HAF demonstrates a promising route to harmonize AI workloads with ultra‑low‑latency RAN functions at the edge, offering a practical blueprint for next‑generation AI‑RAN deployments.
Authors
- Haiyuan Li
- Yulei Wu
- Dimitra Simeonidou
Paper Information
- arXiv ID: 2605.07547v1
- Categories: cs.DC, cs.NI, eess.SY
- Published: May 8, 2026