[Paper] Floe: Federated Specialization for Real-Time LLM-SLM Inference

Published: February 15, 2026 at 03:28 PM EST

Source: arXiv - 2602.14302v1

Overview

Deploying large language models (LLMs) on latency‑sensitive devices—think voice assistants, AR glasses, or on‑device code helpers—has long forced a trade‑off between performance, privacy, and compute cost. Floe proposes a hybrid federated learning architecture that lets edge devices keep personal data and fine‑tuning local while still tapping into the knowledge of a cloud‑hosted “black‑box” LLM. The result is a system that delivers faster, more private, and more personalized responses without shipping huge model weights to every device.

Key Contributions

  • Hybrid federated inference pipeline that couples a cloud‑side LLM with on‑device small language models (SLMs) for real‑time generation.
  • Privacy‑first design: user data never leaves the device; only lightweight logits are exchanged, preserving proprietary model weights on the cloud.
  • Heterogeneity‑aware LoRA adaptation: a low‑rank fine‑tuning technique that automatically tailors the SLM to a wide range of edge hardware (CPU, GPU, NPU, etc.).
  • Logit‑level fusion engine: a fast, token‑by‑token combination of cloud and edge predictions that respects real‑time constraints.
  • Comprehensive evaluation showing up to a 45 % latency reduction and a 12 % accuracy (or relevance) gain over standard edge‑only and cloud‑only baselines.
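
The logit‑level fusion engine can be sketched as a per‑token weighted sum. The snippet below is a minimal illustration under that reading, not the paper's implementation; the function names and the `alpha` weight are assumptions.

```python
import numpy as np

def fuse_logits(edge_logits: np.ndarray, cloud_logits: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Blend per-token logits from the edge SLM and the cloud LLM.

    `alpha` weights the edge (personalized) side; 1 - alpha weights the
    cloud (world-knowledge) side. Both vectors must share one vocabulary.
    """
    return alpha * edge_logits + (1.0 - alpha) * cloud_logits

def next_token(edge_logits, cloud_logits, alpha: float = 0.5) -> int:
    # Greedy decode over the fused scores.
    fused = fuse_logits(np.asarray(edge_logits, dtype=float),
                        np.asarray(cloud_logits, dtype=float), alpha)
    return int(np.argmax(fused))
```

With `alpha` near 1 the device's personalized predictions dominate; near 0 the cloud's general knowledge does.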

Methodology

  1. Model Partitioning – The cloud hosts a full‑scale LLM (e.g., GPT‑3‑class) that remains a black box. Each edge device runs a compact SLM (≈10‑30 M parameters).
  2. Federated LoRA Fine‑Tuning – Devices locally collect user interactions and apply Low‑Rank Adaptation (LoRA) to the SLM. The LoRA updates are aggregated in a federated manner, allowing the SLM to benefit from collective knowledge while staying lightweight.
  3. Real‑Time Logit Fusion – For every generated token, the edge SLM produces a vector of logits (unnormalized next‑token scores), which is exchanged with the cloud. The cloud LLM processes the same prompt, returns its own logits, and the two are merged via a weighted sum that can be tuned per application (e.g., more weight on the device side for stronger personalization).
  4. Latency‑Aware Scheduling – A scheduler monitors network RTT and device compute load; if the cloud response would miss the real‑time deadline, the system gracefully falls back to edge‑only generation.
  5. Evaluation Suite – Benchmarks span conversational QA, code completion, and on‑device command understanding, measured on Raspberry Pi 4, Qualcomm Snapdragon 8 Gen 2, and a desktop GPU for the cloud side.

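Step 2's federated LoRA aggregation can be sketched as weighted averaging of each device's low‑rank factors. This is a simplified FedAvg‑style illustration, not the paper's exact rule (note that averaging the A and B factors separately only approximates averaging the full low‑rank deltas B @ A):

```python
import numpy as np

def aggregate_lora(updates, weights=None):
    """FedAvg-style aggregation of LoRA factors from multiple devices.

    updates: list of (A, B) pairs, where a device's low-rank delta is
             B @ A, with A of shape (r, d_in) and B of shape (d_out, r).
    weights: per-device aggregation weights (e.g. local sample counts);
             uniform if None.
    """
    n = len(updates)
    w = np.ones(n) / n if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights form a convex combination
    A = sum(wi * Ai for wi, (Ai, _) in zip(w, updates))
    B = sum(wi * Bi for wi, (_, Bi) in zip(w, updates))
    return A, B
```

Because only the small A and B factors travel, the exchanged payload stays a tiny fraction of the full SLM's weights.
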
Results & Findings

| Metric | Edge‑Only SLM | Cloud‑Only LLM | Floe (Hybrid) |
| --- | --- | --- | --- |
| End‑to‑end latency (ms) | 210 | 480 (network + compute) | 120 |
| Top‑1 accuracy (benchmark) | 71 % | 78 % | 84 % |
| Personalization gain (Δ from generic) | +3 % | — | +9 % |
| Data transmitted per query (KB) | 0 | 1500 (full model) | ≈30 |

Key takeaways

  • Latency drops dramatically because the edge SLM handles the bulk of token generation, only needing occasional cloud logits.
  • Performance improves beyond either extreme; the cloud LLM injects world knowledge while the edge SLM injects user‑specific context.
  • Privacy is upheld—no raw user text leaves the device; only compressed logits (≈30 KB) are sent.
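
The fallback behaviour behind these takeaways (methodology step 4) can be sketched as a per‑token deadline check. The function names and timeout API below are hypothetical, not the paper's interface:

```python
import time

def generate_token(edge_fn, cloud_fn, fuse_fn, deadline_ms: float):
    """Fuse cloud and edge logits when the cloud reply arrives in time;
    otherwise fall back gracefully to edge-only generation.

    edge_fn() -> local logits; cloud_fn(timeout=...) -> cloud logits or
    None on timeout; fuse_fn merges the two logit vectors.
    """
    start = time.monotonic()
    edge_logits = edge_fn()  # always computed locally first
    remaining = deadline_ms / 1000.0 - (time.monotonic() - start)
    cloud_logits = cloud_fn(timeout=remaining) if remaining > 0 else None
    if cloud_logits is None:
        # Cloud missed the real-time deadline: edge-only greedy decode.
        return max(range(len(edge_logits)), key=edge_logits.__getitem__)
    fused = fuse_fn(edge_logits, cloud_logits)
    return max(range(len(fused)), key=fused.__getitem__)
```

Because the edge logits are computed unconditionally, a missed cloud deadline costs no extra latency—the device simply decodes from what it already has.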

Practical Implications

  • Voice assistants & chatbots can answer personalized queries in <150 ms on modest hardware, opening doors for offline‑first experiences.
  • Enterprise SaaS can keep proprietary LLM weights on secure servers while still delivering low‑latency, customized suggestions to employee devices.
  • Edge AI developers gain a reusable LoRA‑based pipeline to quickly adapt SLMs to new hardware without full model retraining.
  • Network‑constrained scenarios (e.g., rural IoT, in‑flight entertainment) benefit from the fallback‑to‑edge mode, guaranteeing service continuity even with spotty connectivity.

Limitations & Future Work

  • Fusion weighting is currently static per application; dynamic, context‑aware weighting could further boost quality.
  • The approach assumes a reliable, low‑latency uplink for logit exchange; extreme bandwidth throttling may force pure edge mode, reducing the cloud knowledge benefit.
  • Experiments focus on English‑centric benchmarks; multilingual or multimodal extensions remain to be explored.
  • Security of the logit channel is not deeply analyzed—future work should harden against inference attacks that could reconstruct user inputs from logits.

Authors

  • Chunlin Tian
  • Kahou Tam
  • Yebo Wu
  • Shuaihang Zhong
  • Li Li
  • Nicholas D. Lane
  • Chengzhong Xu

Paper Information

  • arXiv ID: 2602.14302v1
  • Categories: cs.DC, cs.LG
  • Published: February 15, 2026
  • PDF: Download PDF