[Paper] A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
Source: arXiv - 2604.21399v1
Overview
The paper introduces a framework that lets resource‑limited wireless devices tap into large language models (LLMs) by intelligently splitting and scheduling inference tasks across local hardware and nearby Wi‑Fi edge access points. By combining LLM‑driven task decomposition with a latency‑aware scheduler, the authors report up to 20 % lower response times and an 80 % gain in a composite latency–accuracy “reward” metric compared with baselines such as local‑only inference or nearest‑edge offloading.
Key Contributions
- LLM‑based Planner: A lightweight, distilled model that predicts how hard each subtask will be and estimates the number of output tokens it will generate.
- Task Decomposition Mechanism: Enables a single user request to be broken into multiple subtasks that can be processed in parallel on heterogeneous devices (phone, edge AP, or both).
- Decomposition‑Aware Scheduler: Jointly optimizes subtask placement, execution order, and result aggregation while respecting Wi‑Fi contention, queueing delays, and compute limits.
- Comprehensive Simulation Suite: Evaluates the framework in a multi‑user, multi‑edge Wi‑Fi environment, showing a 20 % latency reduction and an 80 % improvement in a latency‑accuracy reward metric over baseline approaches.
- Distillation Pipeline: Shows that a compact planner (≈10 % of the teacher LLM size) can achieve near‑teacher performance, making edge deployment feasible.
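As a rough illustration of the decomposition idea, a single request can be split into subtasks, fanned out to heterogeneous workers in parallel, and the partial results aggregated. The sketch below is hypothetical (the worker functions and the hard/easy flag are stand‑ins, not the paper's interfaces); in the actual system the two workers would be the phone's local engine and an edge AP's inference engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workers standing in for the phone's local LLM engine
# and an edge access point's more powerful inference engine.
def run_local(subtask: str) -> str:
    return f"local:{subtask}"

def run_on_edge_ap(subtask: str) -> str:
    return f"edge:{subtask}"

def answer(query_subtasks: list[tuple[str, bool]]) -> str:
    """Dispatch each (subtask, is_hard) pair to a worker and join the results."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(run_on_edge_ap if hard else run_local, sub)
            for sub, hard in query_subtasks
        ]
        # Aggregation step: here, a simple ordered concatenation of partial answers.
        return " | ".join(f.result() for f in futures)

print(answer([("summarize section 1", False), ("prove the lemma", True)]))
# → local:summarize section 1 | edge:prove the lemma
```

Because results are joined in submission order, aggregation stays deterministic even though the subtasks run concurrently.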
Methodology
- System Model – The authors model a Wi‑Fi network with several users, each equipped with a modest CPU/GPU, and multiple edge access points (APs) that host more powerful LLM inference engines.
- Planner Design – A large “teacher” LLM first learns to (a) split a user query into logical subtasks, (b) predict each subtask’s computational difficulty, and (c) estimate the number of output tokens. Knowledge distillation then creates a lightweight “student” planner that runs on the edge or even on the device itself.
- Decomposition‑Aware Scheduling – Using the planner’s predictions, the scheduler solves a mixed‑integer optimization problem that decides:
- Which subtasks stay local vs. go to which AP.
- The order of execution to respect communication bandwidth and queuing constraints.
- How to aggregate partial results into the final answer.
The optimization is approximated with a greedy heuristic that runs in real time.
- Evaluation – Simulations vary the number of users, Wi‑Fi contention levels, and LLM sizes. Baselines include (i) pure local inference, (ii) offloading the whole request to the nearest AP, and (iii) a random split‑and‑offload scheme.
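The paper's exact scheduler is a mixed‑integer optimization; the following is only a minimal sketch of how the planner's per‑subtask predictions (difficulty, output length) could drive a greedy, latency‑aware placement under a simple transmit + queue + compute latency model. All class names, fields, and numbers are illustrative assumptions, not the authors' formulation.

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    tok_per_s: float      # compute throughput in output tokens per second
    uplink_mbps: float    # 0 for the local device (nothing to transmit)
    queue_s: float = 0.0  # accumulated queueing delay on this target

@dataclass
class Subtask:
    prompt_kb: float      # kilobytes to ship if offloaded
    est_tokens: int       # planner's predicted output length
    difficulty: float     # planner's predicted compute multiplier

def finish_time(t: Target, s: Subtask) -> float:
    """Estimated completion time: queueing + transmission + compute."""
    tx = 0.0 if t.uplink_mbps == 0 else (s.prompt_kb * 8) / (t.uplink_mbps * 1000)
    compute = s.difficulty * s.est_tokens / t.tok_per_s
    return t.queue_s + tx + compute

def greedy_schedule(subtasks, targets):
    """Assign each subtask to the target with the earliest estimated finish."""
    plan = []
    for s in sorted(subtasks, key=lambda s: -s.difficulty):  # hardest first
        best = min(targets, key=lambda t: finish_time(t, s))
        best.queue_s = finish_time(best, s)  # work on one target is serialized
        plan.append((s, best.name))
    return plan
```

With a slow phone and a fast but congested AP, this heuristic reproduces the qualitative behavior described above: a long, difficult subtask is offloaded, while a short, easy one stays local because the fixed network and queueing cost outweighs the edge's compute advantage.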
Results & Findings
| Metric | Local‑Only | Nearest‑Edge | Proposed Framework |
|---|---|---|---|
| Avg. End‑to‑End Latency | 1.45 s | 1.20 s | 0.96 s (≈20 % ↓ vs. nearest‑edge) |
| Composite Reward (combining latency and accuracy) | 0.62 | 0.71 | 1.28 (≈80 % ↑) |
| Planner Size (parameters) | – | – | 12 M (student) vs. 120 M (teacher) |
| Planner Inference Time | – | – | 3 ms on edge CPU |
- Latency‑Accuracy Trade‑off: By sending only the “hard” subtasks to the edge and keeping simple ones local, the system avoids unnecessary network delays while still leveraging the edge’s superior compute power for the heavy lifting.
- Planner Efficiency: The distilled planner adds negligible overhead, confirming its suitability for real‑time edge deployment.
- Scalability: As the number of concurrent users grows, the scheduler gracefully balances load, preventing any single AP from becoming a bottleneck.
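The latency‑accuracy trade‑off above can be made concrete with a back‑of‑the‑envelope break‑even check (the function and all numbers here are illustrative, not taken from the paper): offloading a subtask pays off only when the edge's compute savings exceed the fixed transmission and queueing cost.

```python
def offload_wins(tokens, local_tok_s, edge_tok_s, tx_s, queue_s):
    """True when transmit + queue + edge compute beats pure local compute."""
    return tx_s + queue_s + tokens / edge_tok_s < tokens / local_tok_s

# A long, "hard" 300-token answer: the edge's 10x throughput dominates.
assert offload_wins(300, local_tok_s=5, edge_tok_s=50, tx_s=0.05, queue_s=2.0)
# A short 10-token answer: the fixed network cost outweighs the savings.
assert not offload_wins(10, local_tok_s=5, edge_tok_s=50, tx_s=0.05, queue_s=2.0)
```

This is exactly why sending only the hard subtasks to the edge beats both the local‑only and nearest‑edge baselines.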
Practical Implications
- Mobile Apps with LLM Features: Developers can embed sophisticated conversational or reasoning capabilities (e.g., code assistance, on‑device summarization) without requiring a high‑end GPU on the phone.
- Edge‑First AI Services: Enterprises deploying private Wi‑Fi networks (e.g., factories, campuses) can offer LLM‑powered tools with predictable latency, improving user experience for AR/VR, real‑time translation, or decision‑support dashboards.
- Cost Savings: By offloading only the most demanding subtasks, operators avoid over‑provisioning edge hardware, leading to lower CAPEX and energy consumption.
- Network‑Aware AI: The framework demonstrates a concrete path toward “network‑intelligent” AI pipelines that adapt to contention and bandwidth fluctuations, a capability increasingly important for 5G/6G edge ecosystems.
Limitations & Future Work
- Simulation‑Only Validation: The study relies on synthetic workloads and a simulated Wi‑Fi environment; real‑world experiments (e.g., on commercial routers) are needed to confirm robustness.
- Static Scheduler Heuristic: The current greedy algorithm may not capture long‑term optimality under highly dynamic traffic; reinforcement‑learning‑based schedulers could be explored.
- Security & Privacy: Splitting queries across devices raises concerns about data leakage; future work should integrate encryption or secure multi‑party computation.
- Generalization to Other Modalities: Extending the decomposition approach to multimodal models (vision‑language, audio) remains an open research direction.
Authors
- Mingqi Han
- Xinghua Sun
Paper Information
- arXiv ID: 2604.21399v1
- Categories: cs.DC, cs.NI
- Published: April 23, 2026