[Paper] Floe: Federated Specialization for Real-Time LLM-SLM Inference
Source: arXiv - 2602.14302v1
Overview
Deploying large language models (LLMs) on latency‑sensitive devices such as voice assistants, AR glasses, or on‑device code helpers has long meant trading off performance, privacy, and compute cost. Floe proposes a hybrid federated learning architecture that lets edge devices keep personal data and fine‑tuning local while still tapping into the knowledge of a cloud‑hosted "black‑box" LLM. The result is a system that delivers faster, more private, and more personalized responses without shipping huge model weights to every device.
Key Contributions
- Hybrid federated inference pipeline that couples a cloud‑side LLM with on‑device small language models (SLMs) for real‑time generation.
- Privacy‑first design: user data never leaves the device; only lightweight logits are exchanged, preserving proprietary model weights on the cloud.
- Heterogeneity‑aware LoRA adaptation: a low‑rank fine‑tuning technique that automatically tailors the SLM to a wide range of edge hardware (CPU, GPU, NPU, etc.).
- Logit‑level fusion engine: a fast, token‑by‑token combination of cloud and edge predictions that respects real‑time constraints.
- Comprehensive evaluation showing up to 45 % latency reduction and 12 % accuracy (or relevance) gains over standard edge‑only or cloud‑only baselines.
Methodology
- Model Partitioning – The cloud hosts a full‑scale LLM (e.g., GPT‑3‑class) that remains a black box. Each edge device runs a compact SLM (≈10‑30 M parameters).
- Federated LoRA Fine‑Tuning – Devices locally collect user interactions and apply Low‑Rank Adaptation (LoRA) to the SLM. The LoRA updates are aggregated in a federated manner, allowing the SLM to benefit from collective knowledge while staying lightweight.
- Real‑Time Logit Fusion – For every generated token, the edge SLM produces a vector of logits (unnormalized next‑token scores), which is sent to the cloud. The cloud LLM processes the same prompt, returns its logits, and the two are merged using a weighted sum that can be tuned per application (e.g., more weight to on‑device personalization).
- Latency‑Aware Scheduling – A scheduler monitors network RTT and device compute load; if the cloud response would miss the real‑time deadline, the system gracefully falls back to edge‑only generation.
- Evaluation Suite – Benchmarks span conversational QA, code completion, and on‑device command understanding, measured on Raspberry Pi 4, Qualcomm Snapdragon 8 Gen 2, and a desktop GPU for the cloud side.
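The federated LoRA step above can be sketched as a FedAvg-style average over the low-rank adapter matrices. This is a minimal illustration, not the paper's exact algorithm: the function name `fedavg_lora` and the dict-of-`A`/`B` update format are assumptions, and Floe's heterogeneity-aware variant presumably weights or reshapes updates per device class rather than averaging uniformly.

```python
import numpy as np

def fedavg_lora(client_updates, weights=None):
    """Aggregate LoRA adapters (A, B matrices) from several clients.

    client_updates: list of {"A": ndarray, "B": ndarray} dicts, one per
    device. Only these small low-rank matrices travel to the server;
    the frozen SLM base weights never move.
    """
    n = len(client_updates)
    if weights is None:
        weights = [1.0 / n] * n  # uniform FedAvg weighting
    agg_a = sum(w * u["A"] for w, u in zip(weights, client_updates))
    agg_b = sum(w * u["B"] for w, u in zip(weights, client_updates))
    return {"A": agg_a, "B": agg_b}
```

Because only the rank-`r` factors are averaged, the aggregation payload stays orders of magnitude smaller than exchanging full model weights.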
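The token-level fusion step can be written as a weighted sum of the two logit vectors followed by a softmax. A minimal sketch, assuming both models share a vocabulary; the function name `fuse_logits` and the weight `alpha` are illustrative (the paper only states that the weighting is tunable per application):

```python
import numpy as np

def fuse_logits(edge_logits, cloud_logits, alpha=0.6):
    """Merge edge-SLM and cloud-LLM logits for one token.

    alpha weights the on-device (personalized) distribution; a higher
    alpha favors user-specific context, a lower one favors the cloud
    model's world knowledge.
    """
    edge = np.asarray(edge_logits, dtype=np.float64)
    cloud = np.asarray(cloud_logits, dtype=np.float64)
    fused = alpha * edge + (1.0 - alpha) * cloud
    # Numerically stable softmax turns fused logits into a
    # next-token probability distribution.
    z = fused - fused.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```

With `alpha > 0.5`, ties between the two models break toward the edge model's preference, which matches the personalization-first framing.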
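The latency-aware fallback can likewise be sketched as a simple deadline check. This is a hypothetical decision rule, not the paper's scheduler: `choose_mode` and its millisecond parameters are assumed names, and the real system presumably tracks RTT and load continuously rather than per call.

```python
def choose_mode(rtt_ms, cloud_compute_ms, edge_compute_ms, deadline_ms):
    """Pick hybrid generation unless the cloud round trip would miss
    the real-time deadline, in which case fall back to edge-only.

    Edge and cloud work are assumed to overlap, so the hybrid path's
    cost is the slower of the two branches.
    """
    hybrid_estimate = max(edge_compute_ms, rtt_ms + cloud_compute_ms)
    return "hybrid" if hybrid_estimate <= deadline_ms else "edge_only"
```

Under this rule a spotty uplink degrades gracefully: generation never stalls on the network, it just loses the cloud logits for that token.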
Results & Findings
| Metric | Edge‑Only SLM | Cloud‑Only LLM | Floe (Hybrid) |
|---|---|---|---|
| End‑to‑end latency (ms) | 210 | 480 (network + compute) | 120 |
| Top‑1 accuracy (benchmark) | 71 % | 78 % | 84 % |
| Personalization gain (Δ from generic) | +3 % | – | +9 % |
| Data transmitted per query (KB) | 0 | 1500 (full model) | ≈30 |
Key takeaways
- Latency drops dramatically because the edge SLM handles the bulk of token generation, only needing occasional cloud logits.
- Performance improves beyond either extreme; the cloud LLM injects world knowledge while the edge SLM injects user‑specific context.
- Privacy is upheld—no raw user text leaves the device; only compressed logits (≈30 KB) are sent.
Practical Implications
- Voice assistants & chatbots can answer personalized queries in <150 ms on modest hardware, opening doors for offline‑first experiences.
- Enterprise SaaS can keep proprietary LLM weights on secure servers while still delivering low‑latency, customized suggestions to employee devices.
- Edge AI developers gain a reusable LoRA‑based pipeline to quickly adapt SLMs to new hardware without full model retraining.
- Network‑constrained scenarios (e.g., rural IoT, in‑flight entertainment) benefit from the fallback‑to‑edge mode, guaranteeing service continuity even with spotty connectivity.
Limitations & Future Work
- Fusion weighting is currently static per application; dynamic, context‑aware weighting could further boost quality.
- The approach assumes a reliable, low‑latency uplink for logit exchange; extreme bandwidth throttling may force pure edge mode, reducing the cloud knowledge benefit.
- Experiments focus on English‑centric benchmarks; multilingual or multimodal extensions remain to be explored.
- Security of the logit channel is not deeply analyzed—future work should harden against inference attacks that could reconstruct user inputs from logits.
Authors
- Chunlin Tian
- Kahou Tam
- Yebo Wu
- Shuaihang Zhong
- Li Li
- Nicholas D. Lane
- Chengzhong Xu
Paper Information
- arXiv ID: 2602.14302v1
- Categories: cs.DC, cs.LG
- Published: February 15, 2026