[Paper] Deadline-Driven Hierarchical Agentic Resource Sharing for AI Services and RAN Functions in AI-RAN

Published: May 8, 2026 at 06:22 AM EDT

Source: arXiv - 2605.07547v1

Overview

The paper introduces HAF, a deadline‑driven Hierarchical Agentic Framework for resource sharing: a two‑layer control system that lets edge‑deployed AI services and real‑time Radio Access Network (RAN) functions coexist on the same GPU‑accelerated hardware. It pairs a slow‑timescale large language model (LLM) planner with a fast, deadline‑aware convex optimizer, reaching 90 % overall service‑level objective (SLO) fulfillment in the reported evaluation while keeping migration overhead low.

Key Contributions

Hierarchical Agentic Framework (HAF): Combines an LLM‑based placement agent (slow timescale) with a closed‑form convex allocator (fast timescale) to handle mismatched scheduling horizons.
Predictive Migration Critic: A lightweight predictor that evaluates whether moving a service would cause more interruption than SLO gain, preventing unnecessary migrations.
Deadline‑Aware Convex Allocation: Derives a fast, analytically solvable resource‑allocation formula that respects per‑task deadlines on CPU/GPU slices.
Comprehensive Evaluation: Shows 90 % overall SLO fulfillment (about 20 percentage points above the strongest baseline) and lifts AI request success from 51 % to 85.3 % across varied load patterns.
Open‑Source LLM Compatibility: Demonstrates that the critic improves SLO outcomes for multiple publicly available LLM agents, highlighting the approach’s portability.

Methodology

Problem Decomposition
- Slow‑timescale (minutes to hours): Decide where each AI service and RAN function should run (which edge node).
- Fast‑timescale (milliseconds to seconds): Decide how much CPU/GPU each active task receives to meet its deadline.
LLM‑Based Placement Agent
- The agent is prompted with a concise description of the current edge topology, workload mix, and SLO targets.
- It outputs a placement plan (e.g., “move Service A to Node 3”). The LLM’s reasoning ability helps capture complex constraints (e.g., co‑location of related services).
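The prompt-and-plan exchange described above could be sketched as follows. This is only an illustration under stated assumptions: the function names, the JSON reply format, and the field names (`moves`, `service`, `target_node`) are hypothetical and are not taken from the paper, which only says the agent receives a concise state description and returns a placement plan.

```python
import json


def build_placement_prompt(topology, workloads, slo_targets):
    """Serialize the current state into a prompt (hypothetical format)."""
    state = {"topology": topology, "workloads": workloads, "slo_targets": slo_targets}
    return (
        "You place AI services and RAN functions on edge nodes.\n"
        "Current state (JSON):\n"
        f"{json.dumps(state, sort_keys=True)}\n"
        'Reply with JSON only: {"moves": [{"service": "...", "target_node": "..."}]}'
    )


def parse_plan(reply, known_nodes):
    """Return (service, node) moves; drop malformed entries and unknown nodes."""
    try:
        moves = json.loads(reply).get("moves", [])
    except (json.JSONDecodeError, AttributeError):
        return []  # reply was not a JSON object
    if not isinstance(moves, list):
        return []
    plan = []
    for move in moves:
        if not isinstance(move, dict):
            continue
        service, node = move.get("service"), move.get("target_node")
        if isinstance(service, str) and isinstance(node, str) and node in known_nodes:
            plan.append((service, node))
    return plan
```

Validating the reply against the known node set is a defensive choice in this sketch, since free-form LLM output can name nodes that do not exist.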
Predictive Migration Critic
- Before any migration, the critic estimates the interruption time (e.g., container warm‑up, model loading).
- It compares this cost against the projected SLO improvement from the new placement. Migration proceeds only if the net benefit is positive.
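The net-benefit rule above can be illustrated with a minimal sketch. It assumes the critic's two estimates are already available as numbers and combines them with a linear penalty; the class, the field names, and the `penalty_per_second` weight are hypothetical, and the paper's actual predictor is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class MigrationProposal:
    service: str
    target_node: str
    predicted_interruption_s: float  # e.g. container warm-up plus model loading
    predicted_slo_gain: float        # projected gain in SLO fulfillment (fraction, 0..1)


def should_migrate(p: MigrationProposal, penalty_per_second: float = 0.05) -> bool:
    """Approve a move only when the projected SLO gain outweighs the interruption.

    penalty_per_second is a hypothetical weight that converts seconds of
    interruption into the same units as the SLO gain.
    """
    net_benefit = p.predicted_slo_gain - penalty_per_second * p.predicted_interruption_s
    return net_benefit > 0.0
```

With the default weight, a proposal with a 0.05 gain and 0.12 s of interruption is approved, while the same gain with 2 s of interruption is rejected.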
Fast‑Timescale Convex Scheduler
- Formulates each task’s deadline as a linear constraint on allocated compute cycles.
- The objective minimizes total deadline violation while respecting the GPU/CPU capacity limits.
- Because the problem is convex and has a closed‑form solution, the scheduler needs no iterative solver and runs in well under a millisecond, enabling real‑time adjustments.
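To show what a closed-form, deadline-aware split can look like, here is a small sketch. It is not the paper's formula: it swaps in plain proportional scaling, which gives every task the same ratio of completion time to deadline and so minimizes the worst relative lateness under a single capacity limit. The function name and the single-resource model are assumptions made for illustration.

```python
def allocate(work, deadlines, capacity):
    """Split `capacity` (cycles/s) across tasks in closed form.

    work[i] is the remaining work of task i in cycles and deadlines[i] the
    time left in seconds. Each task needs at least work[i] / deadlines[i]
    cycles/s to finish on time; shares are scaled in proportion to that need.
    """
    if len(work) != len(deadlines):
        raise ValueError("work and deadlines must have the same length")
    if any(d <= 0 for d in deadlines):
        raise ValueError("deadlines must be positive")
    demand = [w / d for w, d in zip(work, deadlines)]  # minimum on-time rates
    total = sum(demand)
    if total == 0:
        return [0.0] * len(work)
    scale = capacity / total  # >= 1: every deadline is met; < 1: all slip equally
    return [r * scale for r in demand]
```

Each task finishes after `deadlines[i] / scale` seconds, so when capacity is short every task overruns by the same factor instead of a few tasks missing badly.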
Integration Loop
- The LLM agent runs periodically (e.g., every 5 min).
- The critic filters its suggestions.
- The convex scheduler continuously reallocates resources based on the current placement.
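The interplay of the three components can be sketched as a two-timescale loop. The callables below are hypothetical stand-ins for the LLM agent, the critic, and the convex scheduler, and the loop counts simulated ticks rather than wall-clock time so that it stays self-contained.

```python
from typing import Callable, Dict, List, Tuple

Placement = Dict[str, str]  # service name -> node name
Move = Tuple[str, str]      # (service name, target node)


def run_control_loop(
    placement: Placement,
    propose_moves: Callable[[Placement], List[Move]],  # stand-in for the LLM agent
    approve: Callable[[Move], bool],                   # stand-in for the critic
    reallocate: Callable[[Placement], None],           # stand-in for the scheduler
    fast_ticks: int,
    ticks_per_plan: int,
) -> Placement:
    """Run `fast_ticks` scheduling intervals, replanning every `ticks_per_plan`."""
    placement = dict(placement)  # leave the caller's mapping untouched
    for tick in range(fast_ticks):
        if tick % ticks_per_plan == 0:
            # Slow timescale: ask for a plan and apply only the approved moves.
            for service, node in propose_moves(placement):
                if approve((service, node)):
                    placement[service] = node
        # Fast timescale: resources are re-split on every tick.
        reallocate(placement)
    return placement
```

Passing the components in as callables keeps the slow and fast paths decoupled, which mirrors the separation of placement and allocation described above.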

Results & Findings

Metric	HAF	Best Baseline	Change
Overall SLO fulfillment	90.0 %	69.5 %	+20.5 pp
AI service request success	85.3 %	51.0 %	+34.3 pp
RAN function deadline miss rate	4.2 %	12.8 %	‑8.6 pp
Migration‑induced interruption (avg.)	0.12 s	0.31 s	‑0.19 s
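The differences in the last column are absolute gaps between two percentages (percentage points), which is not the same as a relative improvement. The short sketch below recomputes both from the table values; the helper name is hypothetical.

```python
# (HAF value, best-baseline value) copied from the table above.
results = {
    "Overall SLO fulfillment (%)": (90.0, 69.5),
    "AI service request success (%)": (85.3, 51.0),
    "RAN function deadline miss rate (%)": (4.2, 12.8),
}


def absolute_and_relative(haf, baseline):
    """Return the gap in percentage points and the change relative to the baseline."""
    absolute = haf - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, (haf, base) in results.items():
    abs_pp, rel = absolute_and_relative(haf, base)
    print(f"{name}: {abs_pp:+.1f} pp absolute, {rel:+.1f} % relative")
```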

Robustness: HAF maintained its edge across low, medium, and high load scenarios, with only modest performance dips under extreme overload.
Critic Effectiveness: Across three open‑source LLM agents (GPT‑2‑small, LLaMA‑7B, Falcon‑40B), the critic consistently added 3–7 % SLO gain by suppressing harmful migrations.
Latency: The convex allocator solved the resource‑allocation problem in < 0.5 ms per scheduling interval, well within the real‑time requirements of 5G/6G RAN functions.

Practical Implications

Edge Operators: Deploying HAF can let telecom operators host AI inference (e.g., video analytics, predictive maintenance) alongside latency‑critical RAN functions without over‑provisioning hardware.
AI Service Providers: The framework offers a “plug‑and‑play” placement API that automatically decides the optimal edge node, reducing the need for manual capacity planning.
Developer Tooling: The fast convex scheduler can be exposed as a library (e.g., a Rust crate or Go module) for any edge‑native workload that needs deadline‑aware CPU/GPU throttling.
Cost Savings: By avoiding unnecessary migrations and improving resource packing, operators can achieve up to 30 % lower hardware spend while still meeting 5G/6G SLOs.
Standardization Path: HAF’s clear separation of placement (slow) and allocation (fast) aligns with emerging ETSI MEC and O‑RAN interfaces, making integration into existing orchestration stacks straightforward.

Limitations & Future Work

LLM Prompt Engineering: The placement quality depends on well‑crafted prompts; suboptimal prompts can degrade decisions. Automating prompt generation is an open challenge.
Model Loading Overheads: The current migration cost model assumes linear warm‑up time; real GPU memory fragmentation or large model checkpoints may introduce non‑linear delays.
Scalability to Hundreds of Nodes: Experiments were limited to a 5‑node testbed. Scaling the hierarchical control loop to city‑wide edge clusters will require hierarchical aggregation or federated critics.
Security & Trust: Relying on LLM reasoning raises concerns about explainability and potential policy violations; future work will explore verifiable reasoning traces.

Overall, HAF demonstrates a promising route to harmonize AI workloads with ultra‑low‑latency RAN functions at the edge, offering a practical blueprint for next‑generation AI‑RAN deployments.

Authors

Haiyuan Li
Yulei Wu
Dimitra Simeonidou

Paper Information

arXiv ID: 2605.07547v1
Categories: cs.DC, cs.NI, eess.SY
Published: May 8, 2026
