[Paper] Floe: Federated Specialization for Real-Time LLM-SLM Inference
Source: arXiv - 2602.14302v1
Overview
Deploying large language models (LLMs) on latency‑sensitive devices such as voice assistants, AR glasses, or on‑device code helpers has long meant trading off performance, privacy, and compute cost. Floe proposes a hybrid federated learning architecture that lets edge devices keep personal data and fine‑tuning local while still tapping into the knowledge of a cloud‑hosted "black‑box" LLM. The result is a system that delivers faster, more private, and more personalized responses without shipping huge model weights to every device.
Key Contributions
- Hybrid federated inference pipeline that couples a cloud‑side LLM with on‑device small language models (SLMs) for real‑time generation.
- Privacy‑first design: user data never leaves the device; only lightweight logits are exchanged, preserving proprietary model weights on the cloud.
- Heterogeneity‑aware LoRA adaptation: a low‑rank fine‑tuning technique that automatically tailors the SLM to a wide range of edge hardware (CPU, GPU, NPU, etc.).
- Logit‑level fusion engine: a fast, token‑by‑token combination of cloud and edge predictions that respects real‑time constraints.
- Comprehensive evaluation showing up to 45 % latency reduction and 12 % accuracy (or relevance) gains over standard edge‑only or cloud‑only baselines.
Methodology
- Model Partitioning – The cloud hosts a full‑scale LLM (e.g., GPT‑3‑class) that remains a black box. Each edge device runs a compact SLM (≈10‑30 M parameters).
- Federated LoRA Fine‑Tuning – Devices locally collect user interactions and apply Low‑Rank Adaptation (LoRA) to the SLM. The LoRA updates are aggregated in a federated manner, allowing the SLM to benefit from collective knowledge while staying lightweight.
- Real‑Time Logit Fusion – For every generated token, the edge SLM produces a vector of logits (unnormalized next‑token scores), which is sent to the cloud. The cloud LLM processes the same prompt, returns its logits, and the two are merged using a weighted sum that can be tuned per application (e.g., more weight to on‑device personalization).
- Latency‑Aware Scheduling – A scheduler monitors network RTT and device compute load; if the cloud response would miss the real‑time deadline, the system gracefully falls back to edge‑only generation.
- Evaluation Suite – Benchmarks span conversational QA, code completion, and on‑device command understanding, measured on Raspberry Pi 4, Qualcomm Snapdragon 8 Gen 2, and a desktop GPU for the cloud side.
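The federated LoRA step above can be sketched as a FedAvg-style average over the low-rank adapter matrices. This is a minimal illustration, not the paper's exact algorithm: the function name `fedavg_lora` and the dict-of-`A`/`B` update format are assumptions, and Floe's heterogeneity-aware variant presumably weights or reshapes updates per device class rather than averaging uniformly.

```python
import numpy as np

def fedavg_lora(client_updates, weights=None):
    """Aggregate LoRA adapters (A, B matrices) from several clients.

    client_updates: list of {"A": ndarray, "B": ndarray} dicts, one per
    device. Only these small low-rank matrices travel to the server;
    the frozen SLM base weights never move.
    """
    n = len(client_updates)
    if weights is None:
        weights = [1.0 / n] * n  # uniform FedAvg weighting
    agg_a = sum(w * u["A"] for w, u in zip(weights, client_updates))
    agg_b = sum(w * u["B"] for w, u in zip(weights, client_updates))
    return {"A": agg_a, "B": agg_b}
```

Because only the rank-`r` factors are averaged, the aggregation payload stays orders of magnitude smaller than exchanging full model weights.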
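The token-level fusion step can be written as a weighted sum of the two logit vectors followed by a softmax. A minimal sketch, assuming both models share a vocabulary; the function name `fuse_logits` and the weight `alpha` are illustrative (the paper only states that the weighting is tunable per application):

```python
import numpy as np

def fuse_logits(edge_logits, cloud_logits, alpha=0.6):
    """Merge edge-SLM and cloud-LLM logits for one token.

    alpha weights the on-device (personalized) distribution; a higher
    alpha favors user-specific context, a lower one favors the cloud
    model's world knowledge.
    """
    edge = np.asarray(edge_logits, dtype=np.float64)
    cloud = np.asarray(cloud_logits, dtype=np.float64)
    fused = alpha * edge + (1.0 - alpha) * cloud
    # Numerically stable softmax turns fused logits into a
    # next-token probability distribution.
    z = fused - fused.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```

With `alpha > 0.5`, ties between the two models break toward the edge model's preference, which matches the personalization-first framing.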
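The latency-aware fallback can likewise be sketched as a simple deadline check. This is a hypothetical decision rule, not the paper's scheduler: `choose_mode` and its millisecond parameters are assumed names, and the real system presumably tracks RTT and load continuously rather than per call.

```python
def choose_mode(rtt_ms, cloud_compute_ms, edge_compute_ms, deadline_ms):
    """Pick hybrid generation unless the cloud round trip would miss
    the real-time deadline, in which case fall back to edge-only.

    Edge and cloud work are assumed to overlap, so the hybrid path's
    cost is the slower of the two branches.
    """
    hybrid_estimate = max(edge_compute_ms, rtt_ms + cloud_compute_ms)
    return "hybrid" if hybrid_estimate <= deadline_ms else "edge_only"
```

Under this rule a spotty uplink degrades gracefully: generation never stalls on the network, it just loses the cloud logits for that token.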
Results & Findings
| Metric | Edge‑Only SLM | Cloud‑Only LLM | Floe (Hybrid) |
|---|---|---|---|
| End‑to‑end latency (ms) | 210 | 480 (network + compute) | 120 |
| Top‑1 accuracy (benchmark) | 71 % | 78 % | 84 % |
| Personalization gain (Δ from generic) | +3 % | – | +9 % |
| Data transmitted per query (KB) | 0 | 1500 (full model) | ≈30 |
Key takeaways
- Latency drops dramatically because the edge SLM handles the bulk of token generation, only needing occasional cloud logits.
- Performance improves beyond either extreme; the cloud LLM injects world knowledge while the edge SLM injects user‑specific context.
- Privacy is upheld—no raw user text leaves the device; only compressed logits (≈30 KB) are sent.
Practical Implications
- Voice assistants & chatbots can answer personalized queries in <150 ms on modest hardware, opening doors for offline‑first experiences.
- Enterprise SaaS can keep proprietary LLM weights on secure servers while still delivering low‑latency, customized suggestions to employee devices.
- Edge AI developers gain a reusable LoRA‑based pipeline to quickly adapt SLMs to new hardware without full model retraining.
- Network‑constrained scenarios (e.g., rural IoT, in‑flight entertainment) benefit from the fallback‑to‑edge mode, guaranteeing service continuity even with spotty connectivity.
Limitations & Future Work
- Fusion weighting is currently static per application; dynamic, context‑aware weighting could further boost quality.
- The approach assumes a reliable, low‑latency uplink for logit exchange; extreme bandwidth throttling may force pure edge mode, reducing the cloud knowledge benefit.
- Experiments focus on English‑centric benchmarks; multilingual or multimodal extensions remain to be explored.
- Security of the logit channel is not deeply analyzed—future work should harden against inference attacks that could reconstruct user inputs from logits.
Authors
- Chunlin Tian
- Kahou Tam
- Yebo Wu
- Shuaihang Zhong
- Li Li
- Nicholas D. Lane
- Chengzhong Xu
Paper Information
- arXiv ID: 2602.14302v1
- Categories: cs.DC, cs.LG
- Published: February 15, 2026