[Paper] Deadline-Driven Hierarchical Agentic Resource Sharing for AI Services and RAN Functions in AI-RAN

Published: May 8, 2026 at 06:22 AM EDT

Source: arXiv - 2605.07547v1

Overview

The paper introduces a deadline‑driven Hierarchical Agentic Framework (HAF) for resource sharing: a two‑layer control system that lets edge‑deployed AI services and real‑time Radio Access Network (RAN) functions coexist on the same GPU‑accelerated hardware. By pairing a slow‑timescale large language model (LLM) planner with a fast, deadline‑aware convex optimizer, HAF reaches 90 % overall service‑level objective (SLO) fulfillment in the reported experiments while keeping migration overhead low.

Key Contributions

  • Hierarchical Agentic Framework (HAF): Combines an LLM‑based placement agent (slow timescale) with a closed‑form convex allocator (fast timescale) to handle mismatched scheduling horizons.
  • Predictive Migration Critic: A lightweight predictor that evaluates whether moving a service would cause more interruption than SLO gain, preventing unnecessary migrations.
  • Deadline‑Aware Convex Allocation: Derives a fast, analytically solvable resource‑allocation formula that respects per‑task deadlines on CPU/GPU slices.
  • Comprehensive Evaluation: Shows 90 % overall SLO fulfillment (about 20 percentage points above the strongest baseline) and lifts AI request success from 51 % to 85.3 % across varied load patterns.
  • Open‑Source LLM Compatibility: Demonstrates that the critic improves SLO outcomes for multiple publicly available LLM agents, highlighting the approach’s portability.

Methodology

  1. Problem Decomposition

    • Slow‑timescale (minutes to hours): Decide where each AI service and RAN function should run (which edge node).
    • Fast‑timescale (milliseconds to seconds): Decide how much CPU/GPU each active task receives to meet its deadline.
  2. LLM‑Based Placement Agent

    • The agent is prompted with a concise description of the current edge topology, workload mix, and SLO targets.
    • It outputs a placement plan (e.g., “move Service A to Node 3”). The LLM’s reasoning ability helps capture complex constraints (e.g., co‑location of related services).
  3. Predictive Migration Critic

    • Before any migration, the critic estimates the interruption time (e.g., container warm‑up, model loading).
    • It compares this cost against the projected SLO improvement from the new placement. Migration proceeds only if the net benefit is positive.
  4. Fast‑Timescale Convex Scheduler

    • Formulates each task’s deadline as a linear constraint on allocated compute cycles.
    • The objective minimizes total deadline violation while respecting the GPU/CPU capacity limits.
    • Because the problem is convex and has a closed‑form solution, no iterative solver is needed; the scheduler finishes in well under a millisecond, enabling real‑time adjustments.
  5. Integration Loop

    • The LLM agent runs periodically (e.g., every 5 min).
    • The critic filters its suggestions.
    • The convex scheduler continuously reallocates resources based on the current placement.
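
The critic's gating rule in steps 3 and 5 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Proposal` structure, its field names, the `margin_s` parameter, and the choice to express both the SLO gain and the interruption cost in seconds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """One migration suggested by the placement agent (hypothetical structure)."""
    service: str
    target_node: str
    predicted_slo_gain_s: float      # assumed unit: deadline slack recovered, in seconds
    predicted_interruption_s: float  # assumed unit: warm-up plus model loading, in seconds


def critic_accepts(p: Proposal, margin_s: float = 0.0) -> bool:
    """Accept a migration only if predicted gain minus predicted cost exceeds the margin."""
    return p.predicted_slo_gain_s - p.predicted_interruption_s > margin_s


def filter_plan(plan: list[Proposal], margin_s: float = 0.0) -> list[Proposal]:
    """Keep only the proposals the critic judges to have a positive net benefit."""
    return [p for p in plan if critic_accepts(p, margin_s)]


# Illustrative numbers only; they are not taken from the paper.
plan = [
    Proposal("service-a", "node-3", predicted_slo_gain_s=0.40, predicted_interruption_s=0.12),
    Proposal("service-b", "node-1", predicted_slo_gain_s=0.05, predicted_interruption_s=0.31),
]
accepted = filter_plan(plan)
print([p.service for p in accepted])  # ['service-a']
```

In a full loop, the accepted proposals would update the placement, and the fast allocator would then redistribute CPU/GPU shares on each node under that placement.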

Results & Findings

Metric                                   HAF      Best Baseline   Improvement
Overall SLO fulfillment                  90.0 %   69.5 %          +20.5 pp
AI service request success               85.3 %   51.0 %          +34.3 pp
RAN function deadline miss rate          4.2 %    12.8 %          -8.6 pp
Migration‑induced interruption (avg.)    0.12 s   0.31 s          -0.19 s

  • Robustness: HAF maintained its edge across low, medium, and high load scenarios, with only modest performance dips under extreme overload.
  • Critic Effectiveness: Across three open‑source LLM agents (GPT‑2‑small, LLaMA‑7B, Falcon‑40B), the critic consistently added 3–7 % SLO gain by suppressing harmful migrations.
  • Latency: The convex allocator solved the resource‑allocation problem in < 0.5 ms per scheduling interval, well within the real‑time requirements of 5G/6G RAN functions.
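
To see why a closed‑form allocator can finish in well under a millisecond, consider the following sketch. It is a stand‑in, not the paper's formula (which this summary does not reproduce): it assumes the objective is to minimize the sum of each task's completion time divided by its deadline, subject to a single shared capacity. Setting the gradient of the Lagrangian to zero gives each task a share proportional to sqrt(work / deadline). The function name `allocate` and the units are assumptions.

```python
import math


def allocate(work: list[float], deadlines: list[float], capacity: float) -> list[float]:
    """Closed-form minimizer of sum_i (work_i / x_i) / deadline_i with sum_i x_i = capacity.

    The optimality condition yields x_i proportional to sqrt(work_i / deadline_i),
    so the result is a single pass over the tasks with no iterative solver.
    """
    if len(work) != len(deadlines):
        raise ValueError("work and deadlines must have the same length")
    if capacity <= 0 or any(w <= 0 for w in work) or any(d <= 0 for d in deadlines):
        raise ValueError("capacity, work, and deadlines must all be positive")
    urgency = [math.sqrt(w / d) for w, d in zip(work, deadlines)]
    total = sum(urgency)
    return [capacity * u / total for u in urgency]


# Assumed units: work in giga-cycles, deadlines in seconds, capacity in giga-cycles per second.
shares = allocate(work=[4.0, 1.0, 1.0], deadlines=[1.0, 1.0, 4.0], capacity=7.0)
print(shares)  # [4.0, 2.0, 1.0]
```

This objective only encourages tasks to finish early relative to their deadlines; it does not enforce hard deadline constraints, which the paper's formulation handles explicitly. In the example every task happens to meet its deadline (completion times 1.0 s, 0.5 s, and 1.0 s).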

Practical Implications

  • Edge Operators: Deploying HAF can let telecom operators host AI inference (e.g., video analytics, predictive maintenance) alongside latency‑critical RAN functions without over‑provisioning hardware.
  • AI Service Providers: The framework offers a “plug‑and‑play” placement API that automatically decides the optimal edge node, reducing the need for manual capacity planning.
  • Developer Tooling: The fast convex scheduler can be exposed as a library (e.g., a Rust crate or a Go module) for any edge‑native workload that needs deadline‑aware CPU/GPU throttling.
  • Cost Savings: By avoiding unnecessary migrations and improving resource packing, operators can achieve up to 30 % lower hardware spend while still meeting 5G/6G SLOs.
  • Standardization Path: HAF’s clear separation of placement (slow) and allocation (fast) aligns with emerging ETSI MEC and O‑RAN interfaces, making integration into existing orchestration stacks straightforward.

Limitations & Future Work

  • LLM Prompt Engineering: The placement quality depends on well‑crafted prompts; suboptimal prompts can degrade decisions. Automating prompt generation is an open challenge.
  • Model Loading Overheads: The current migration cost model assumes linear warm‑up time; real GPU memory fragmentation or large model checkpoints may introduce non‑linear delays.
  • Scalability to Hundreds of Nodes: Experiments were limited to a 5‑node testbed. Scaling the hierarchical control loop to city‑wide edge clusters will require hierarchical aggregation or federated critics.
  • Security & Trust: Relying on LLM reasoning raises concerns about explainability and potential policy violations; future work will explore verifiable reasoning traces.
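
The linear warm‑up assumption noted under Model Loading Overheads can be made concrete with a small sketch. It shows one possible reading of "linear", namely a fixed start‑up cost plus a term proportional to checkpoint size. The function name, parameters, and numbers are hypothetical and are not taken from the paper.

```python
def linear_interruption_s(checkpoint_gb: float, load_gb_per_s: float, fixed_s: float = 0.0) -> float:
    """Linear cost model: fixed container start-up plus size-proportional model loading."""
    if checkpoint_gb < 0 or fixed_s < 0 or load_gb_per_s <= 0:
        raise ValueError("sizes and times must be non-negative and the load rate positive")
    return fixed_s + checkpoint_gb / load_gb_per_s


# Illustrative values only: a 1 GB and a 9 GB checkpoint loaded at 4 GB/s with 0.5 s start-up.
small = linear_interruption_s(checkpoint_gb=1.0, load_gb_per_s=4.0, fixed_s=0.5)
large = linear_interruption_s(checkpoint_gb=9.0, load_gb_per_s=4.0, fixed_s=0.5)
print(small, large)  # 0.75 2.75
```

A model of this shape would underestimate the interruption whenever memory fragmentation or very large checkpoints make loading time grow faster than checkpoint size, which is the gap the limitation points to.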

Overall, HAF demonstrates a promising route to harmonize AI workloads with ultra‑low‑latency RAN functions at the edge, offering a practical blueprint for next‑generation AI‑RAN deployments.

Authors

  • Haiyuan Li
  • Yulei Wu
  • Dimitra Simeonidou

Paper Information

  • arXiv ID: 2605.07547v1
  • Categories: cs.DC, cs.NI, eess.SY
  • Published: May 8, 2026