[Paper] A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
Source: arXiv - 2604.21399v1
Overview
The paper introduces a framework that lets resource‑limited wireless devices tap into large language models (LLMs) by intelligently splitting and scheduling inference tasks across local hardware and nearby Wi‑Fi edge access points. By combining LLM‑driven task decomposition with a latency‑aware scheduler, the authors report up to 20 % lower response times and an 80 % gain in a composite latency–accuracy “reward” metric compared with baselines such as local‑only inference or nearest‑edge offloading.
Key Contributions
- LLM‑based Planner: A lightweight, distilled model that predicts how hard each subtask will be and estimates the number of output tokens it will generate.
- Task Decomposition Mechanism: Enables a single user request to be broken into multiple subtasks that can be processed in parallel on heterogeneous devices (phone, edge AP, or both).
- Decomposition‑Aware Scheduler: Jointly optimizes subtask placement, execution order, and result aggregation while respecting Wi‑Fi contention, queueing delays, and compute limits.
- Comprehensive Simulation Suite: Evaluates the framework in a multi‑user, multi‑edge Wi‑Fi environment, showing a 20 % latency reduction and an 80 % improvement in a latency‑accuracy reward metric over baseline approaches.
- Distillation Pipeline: Shows that a compact planner (≈10 % of the teacher LLM size) can achieve near‑teacher performance, making edge deployment feasible.
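As a rough illustration of the decomposition idea, a single request can be split into subtasks, fanned out to heterogeneous workers in parallel, and the partial results aggregated. The sketch below is hypothetical (the worker functions and the hard/easy flag are stand‑ins, not the paper's interfaces); in the actual system the two workers would be the phone's local engine and an edge AP's inference engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workers standing in for the phone's local LLM engine
# and an edge access point's more powerful inference engine.
def run_local(subtask: str) -> str:
    return f"local:{subtask}"

def run_on_edge_ap(subtask: str) -> str:
    return f"edge:{subtask}"

def answer(query_subtasks: list[tuple[str, bool]]) -> str:
    """Dispatch each (subtask, is_hard) pair to a worker and join the results."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(run_on_edge_ap if hard else run_local, sub)
            for sub, hard in query_subtasks
        ]
        # Aggregation step: here, a simple ordered concatenation of partial answers.
        return " | ".join(f.result() for f in futures)

print(answer([("summarize section 1", False), ("prove the lemma", True)]))
# → local:summarize section 1 | edge:prove the lemma
```

Because results are joined in submission order, aggregation stays deterministic even though the subtasks run concurrently.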
Methodology
- System Model – The authors model a Wi‑Fi network with several users, each equipped with a modest CPU/GPU, and multiple edge access points (APs) that host more powerful LLM inference engines.
- Planner Design – A large “teacher” LLM first learns to (a) split a user query into logical subtasks, (b) predict each subtask’s computational difficulty, and (c) estimate the number of output tokens. Knowledge distillation then creates a lightweight “student” planner that runs on the edge or even on the device itself.
- Decomposition‑Aware Scheduling – Using the planner’s predictions, the scheduler solves a mixed‑integer optimization problem that decides:
- Which subtasks stay local vs. go to which AP.
- The order of execution to respect communication bandwidth and queuing constraints.
- How to aggregate partial results into the final answer.
The optimization is approximated with a greedy heuristic that runs in real time.
- Evaluation – Simulations vary the number of users, Wi‑Fi contention levels, and LLM sizes. Baselines include (i) pure local inference, (ii) offloading the whole request to the nearest AP, and (iii) a random split‑and‑offload scheme.
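The paper's exact scheduler is a mixed‑integer optimization; the following is only a minimal sketch of how the planner's per‑subtask predictions (difficulty, output length) could drive a greedy, latency‑aware placement under a simple transmit + queue + compute latency model. All class names, fields, and numbers are illustrative assumptions, not the authors' formulation.

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    tok_per_s: float      # compute throughput in output tokens per second
    uplink_mbps: float    # 0 for the local device (nothing to transmit)
    queue_s: float = 0.0  # accumulated queueing delay on this target

@dataclass
class Subtask:
    prompt_kb: float      # kilobytes to ship if offloaded
    est_tokens: int       # planner's predicted output length
    difficulty: float     # planner's predicted compute multiplier

def finish_time(t: Target, s: Subtask) -> float:
    """Estimated completion time: queueing + transmission + compute."""
    tx = 0.0 if t.uplink_mbps == 0 else (s.prompt_kb * 8) / (t.uplink_mbps * 1000)
    compute = s.difficulty * s.est_tokens / t.tok_per_s
    return t.queue_s + tx + compute

def greedy_schedule(subtasks, targets):
    """Assign each subtask to the target with the earliest estimated finish."""
    plan = []
    for s in sorted(subtasks, key=lambda s: -s.difficulty):  # hardest first
        best = min(targets, key=lambda t: finish_time(t, s))
        best.queue_s = finish_time(best, s)  # work on one target is serialized
        plan.append((s, best.name))
    return plan
```

With a slow phone and a fast but congested AP, this heuristic reproduces the qualitative behavior described above: a long, difficult subtask is offloaded, while a short, easy one stays local because the fixed network and queueing cost outweighs the edge's compute advantage.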
Results & Findings
| Metric | Local‑Only | Nearest‑Edge | Proposed Framework |
|---|---|---|---|
| Avg. End‑to‑End Latency | 1.45 s | 1.20 s | 0.96 s (≈20 % ↓ vs. nearest‑edge) |
| Composite Reward (combining latency and accuracy) | 0.62 | 0.71 | 1.28 (≈80 % ↑) |
| Planner Size (parameters) | – | – | 12 M (student) vs. 120 M (teacher) |
| Planner Inference Time | – | – | 3 ms on edge CPU |
- Latency‑Accuracy Trade‑off: By sending only the “hard” subtasks to the edge and keeping simple ones local, the system avoids unnecessary network delays while still leveraging the edge’s superior compute power for the heavy lifting.
- Planner Efficiency: The distilled planner adds negligible overhead, confirming its suitability for real‑time edge deployment.
- Scalability: As the number of concurrent users grows, the scheduler gracefully balances load, preventing any single AP from becoming a bottleneck.
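The latency‑accuracy trade‑off above can be made concrete with a back‑of‑the‑envelope break‑even check (the function and all numbers here are illustrative, not taken from the paper): offloading a subtask pays off only when the edge's compute savings exceed the fixed transmission and queueing cost.

```python
def offload_wins(tokens, local_tok_s, edge_tok_s, tx_s, queue_s):
    """True when transmit + queue + edge compute beats pure local compute."""
    return tx_s + queue_s + tokens / edge_tok_s < tokens / local_tok_s

# A long, "hard" 300-token answer: the edge's 10x throughput dominates.
assert offload_wins(300, local_tok_s=5, edge_tok_s=50, tx_s=0.05, queue_s=2.0)
# A short 10-token answer: the fixed network cost outweighs the savings.
assert not offload_wins(10, local_tok_s=5, edge_tok_s=50, tx_s=0.05, queue_s=2.0)
```

This is exactly why sending only the hard subtasks to the edge beats both the local‑only and nearest‑edge baselines.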
Practical Implications
- Mobile Apps with LLM Features: Developers can embed sophisticated conversational or reasoning capabilities (e.g., code assistance, on‑device summarization) without requiring a high‑end GPU on the phone.
- Edge‑First AI Services: Enterprises deploying private Wi‑Fi networks (e.g., factories, campuses) can offer LLM‑powered tools with predictable latency, improving user experience for AR/VR, real‑time translation, or decision‑support dashboards.
- Cost Savings: By offloading only the most demanding subtasks, operators avoid over‑provisioning edge hardware, leading to lower CAPEX and energy consumption.
- Network‑Aware AI: The framework demonstrates a concrete path toward “network‑intelligent” AI pipelines that adapt to contention and bandwidth fluctuations, a capability increasingly important for 5G/6G edge ecosystems.
Limitations & Future Work
- Simulation‑Only Validation: The study relies on synthetic workloads and a simulated Wi‑Fi environment; real‑world experiments (e.g., on commercial routers) are needed to confirm robustness.
- Static Scheduler Heuristic: The current greedy algorithm may not capture long‑term optimality under highly dynamic traffic; reinforcement‑learning‑based schedulers could be explored.
- Security & Privacy: Splitting queries across devices raises concerns about data leakage; future work should integrate encryption or secure multi‑party computation.
- Generalization to Other Modalities: Extending the decomposition approach to multimodal models (vision‑language, audio) remains an open research direction.
Authors
- Mingqi Han
- Xinghua Sun
Paper Information
- arXiv ID: 2604.21399v1
- Categories: cs.DC, cs.NI
- Published: April 23, 2026