[Paper] LFM2 Technical Report
Source: arXiv - 2511.23404v1
Overview
The LFM2 technical report introduces Liquid Foundation Models (LFM2), a new family of language models engineered for fast, low‑memory inference on edge devices such as smartphones, laptops, and embedded CPUs. By combining hardware‑in‑the‑loop architecture search with new training techniques, the authors deliver models that are up to 2× faster than comparably sized alternatives while still achieving top‑tier benchmark scores.
Key Contributions
- Hybrid backbone design: Combines gated short‑range convolutions with a handful of grouped‑query attention blocks, dramatically cutting latency on CPUs (a block‑level sketch follows this list).
- Hardware‑in‑the‑loop NAS: Architecture search explicitly optimizes for edge latency and memory limits, rather than just FLOPs or parameter count.
- Scalable model family: Six dense variants (350 M – 2.6 B parameters) plus an 8.3 B mixture‑of‑experts (MoE) model that activates only 1.5 B parameters per token. All support a 32 K context window.
- Training pipeline innovations:
  - Tempered, decoupled Top‑K knowledge distillation that avoids “support mismatch” between teacher and student.
  - Curriculum learning that feeds data in increasing difficulty order.
  - Three‑stage post‑training recipe (supervised fine‑tuning → length‑normalized preference optimization → model merging).
- Multimodal extensions:
  - LFM2‑VL (vision‑language) with token‑efficient visual front‑ends for adjustable accuracy‑latency trade‑offs.
  - LFM2‑Audio (speech‑to‑speech) with separate audio encoder/decoder pipelines enabling real‑time interaction.
  - LFM2‑ColBERT (retrieval) offering low‑latency multilingual query/document encoding.
- Open‑source deployment bundles: Ready‑to‑run packages for ExecuTorch, llama.cpp, and vLLM, facilitating immediate edge deployment.
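To make the hybrid backbone concrete, here is a minimal PyTorch sketch of the pattern the report describes: gated, depthwise short convolutions for most layers, with an attention block interleaved every few layers. Class names, dimensions, and the use of standard multi‑head attention in place of grouped‑query attention are illustrative assumptions, not the report's exact operators.

```python
import torch
import torch.nn as nn

class GatedShortConvBlock(nn.Module):
    """Illustrative gated short-range convolution mixer (not LFM2's exact operator)."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)               # value and gate paths
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim,                     # depthwise: cheap on CPU
                              padding=kernel_size - 1)        # left-pad, then trim -> causal
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (batch, seq, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(v * torch.sigmoid(g))            # gate the conv output

class HybridBackbone(nn.Module):
    """Mostly conv blocks, with an attention block every `attn_every` layers."""
    def __init__(self, dim: int = 512, depth: int = 12, attn_every: int = 4, n_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(depth):
            if (i + 1) % attn_every == 0:
                # A real grouped-query attention block would share K/V projections
                # across head groups; standard multi-head attention stands in here.
                self.blocks.append(nn.MultiheadAttention(dim, n_heads, batch_first=True))
            else:
                self.blocks.append(GatedShortConvBlock(dim))

    def forward(self, x):
        for blk in self.blocks:
            if isinstance(blk, nn.MultiheadAttention):
                x = x + blk(x, x, x, need_weights=False)[0]   # residual attention
            else:
                x = x + blk(x)                                # residual conv mixer
        return x
```

Even in this toy form the design intuition carries through: depthwise short convolutions keep per‑token work local and cheap on CPUs, while the occasional attention layers supply global context.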
Methodology
Architecture Search with Real‑World Constraints
- The authors run a neural‑architecture‑search loop that measures actual inference time and memory usage on target CPUs.
- The search space mixes short convolutions (fast, local pattern capture) with grouped‑query attention (lightweight global context).
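What distinguishes this search from FLOP‑ or parameter‑count‑based NAS is the objective: each candidate is timed on the actual target hardware and discarded if it misses a latency or memory budget. The sketch below reuses the HybridBackbone class from the earlier sketch and uses made‑up budgets; it illustrates the idea rather than the report's actual search algorithm.

```python
import itertools
import statistics
import time

import torch

def measure_latency_ms(model, seq_len=1024, dim=512, trials=5):
    """Median wall-clock prefill latency on the current CPU (illustrative proxy)."""
    x = torch.randn(1, seq_len, dim)
    with torch.inference_mode():
        model(x)                                   # warm-up run
        times = []
        for _ in range(trials):
            t0 = time.perf_counter()
            model(x)
            times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

# Hypothetical search space: how often to place attention, and model width.
candidates = itertools.product([2, 3, 4, 6], [384, 512, 640])

best = None
for attn_every, dim in candidates:
    model = HybridBackbone(dim=dim, depth=12, attn_every=attn_every)
    latency_ms = measure_latency_ms(model, dim=dim)
    size_mb = sum(p.numel() for p in model.parameters()) * 4 / 2**20   # fp32 weights

    if latency_ms > 500 or size_mb > 300:          # made-up edge budgets (ms, MB)
        continue                                   # hard-reject over-budget candidates

    # Among in-budget candidates, prefer the widest model as a crude capacity proxy.
    if best is None or dim > best[1]:
        best = (attn_every, dim, latency_ms)

print("selected configuration (attn_every, dim, latency_ms):", best)
```

In the report's setting, measurements come from the intended edge targets rather than the development machine, and candidate quality is estimated properly instead of being proxied by width.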
Training Regimen
- Tempered Top‑K Distillation: The student learns from the teacher’s top‑K logits, with the temperature gradually annealed to keep the learning signal stable across training stages (a loss sketch follows this list).
- Curriculum Data Ordering: Training data are sorted by difficulty (e.g., token entropy) so the model first masters easy patterns before tackling harder ones.
- Post‑Training Three‑Stage Recipe:
  - Supervised fine‑tuning on task‑specific data.
  - Length‑normalized preference optimization (a lightweight RLHF‑style step that normalizes rewards by response length to avoid a bias toward longer outputs).
  - Model merging to blend multiple fine‑tuned checkpoints for robustness.
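As a concrete reference for the distillation step, below is a minimal sketch of a top‑K distillation loss with an annealed temperature. Restricting both teacher and student to the teacher's top‑K tokens and renormalizing over that shared support is one simple way to sidestep the support‑mismatch problem; the report's exact tempering and decoupling are not reproduced here.

```python
import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits, teacher_logits, k=64, temperature=2.0):
    """Match the student to the teacher on the teacher's top-K tokens only."""
    # (batch, seq, vocab) -> keep the K most likely teacher tokens per position.
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
    student_topk = student_logits.gather(-1, topk_idx)

    # Renormalize both distributions over the shared K-token support.
    teacher_p = F.softmax(topk_vals / temperature, dim=-1)
    student_logp = F.log_softmax(student_topk / temperature, dim=-1)

    # KL(teacher || student), scaled by T^2 as in standard distillation.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * temperature**2

def annealed_temperature(step, total_steps, t_start=4.0, t_end=1.0):
    """Illustrative linear anneal from a soft to a sharp teacher over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

# Usage inside a training loop (teacher_logits come from a frozen teacher forward pass):
#   t = annealed_temperature(step, total_steps)
#   loss = topk_distill_loss(student_logits, teacher_logits, k=64, temperature=t)
```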
Multimodal Adaptations
- Visual tokens are generated by a lightweight CNN‑based tokenizer that can be throttled to meet latency budgets.
- Audio pipelines split encoding/decoding, allowing streaming inference with sub‑second latency.
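On the vision side, the accuracy‑latency dial largely comes down to how many visual tokens reach the language model. A minimal sketch, assuming a square grid of patch embeddings and plain average pooling (the report's CNN‑based tokenizer is more sophisticated):

```python
import torch
import torch.nn.functional as F

def throttle_visual_tokens(patch_embeddings, token_budget):
    """Pool a square grid of patch embeddings down to roughly `token_budget` tokens."""
    b, n, d = patch_embeddings.shape               # (batch, num_patches, dim)
    side = int(n ** 0.5)                           # assumes a square patch grid
    grid = patch_embeddings.transpose(1, 2).reshape(b, d, side, side)

    target_side = max(int(token_budget ** 0.5), 1)
    pooled = F.adaptive_avg_pool2d(grid, target_side)    # (b, d, t, t)
    return pooled.flatten(2).transpose(1, 2)              # (b, t*t, d)

# Example: 576 patch embeddings reduced to 64 or 144 visual tokens.
patches = torch.randn(1, 576, 768)
fast = throttle_visual_tokens(patches, token_budget=64)       # lower latency, less detail
accurate = throttle_visual_tokens(patches, token_budget=144)  # higher fidelity
print(fast.shape, accurate.shape)  # torch.Size([1, 64, 768]) torch.Size([1, 144, 768])
```

Fewer visual tokens shrink both the prefill cost and the KV cache, which is the mechanism behind the adjustable accuracy‑latency trade‑off.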
Evaluation
- Trained on 10–12 trillion tokens across web text, code, and multimodal corpora.
- Benchmarked on standard language (IFEval, GSM8K), vision‑language (VQAv2, COCO), speech (LibriSpeech, VCTK), and retrieval (MS‑MARCO, multilingual BEIR) suites.
Results & Findings
| Model | Params | IFEval | GSM8K | VQAv2 (VL) | LibriSpeech WER (Audio) | Retrieval MRR (ColBERT) |
|---|---|---|---|---|---|---|
| LFM2‑350M | 0.35 B | 71.2% | 74.8% | 68.5% | 9.2% | 71.3% |
| LFM2‑2.6B | 2.6 B | 79.56% | 82.41% | 78.1% | 6.8% | 78.9% |
| LFM2‑MoE | 8.3 B (1.5 B active) | 81.3% | 84.7% | 80.4% | 5.9% | 81.2% |
- Latency: On a typical laptop CPU (Intel i7‑12700H), LFM2‑2.6B’s pre‑fill and decode are ~2× faster than a dense 2.6 B LLaMA‑2 baseline while using ~30 % less RAM.
- Multimodal trade‑offs: LFM2‑VL can drop visual token resolution by 50 % with only a 2‑3 % accuracy hit, enabling sub‑100 ms image‑conditioned generation on a phone‑class SoC.
- Real‑time speech: LFM2‑Audio achieves ≤ 150 ms end‑to‑end latency for speech‑to‑speech, comparable to models that are three times larger.
Overall, the study demonstrates that architectural co‑design with hardware constraints can deliver edge‑ready foundation models without sacrificing state‑of‑the‑art performance.
Practical Implications
- Edge AI products: Developers can embed a 2.6 B LFM2 model directly into mobile apps, wearables, or IoT gateways for on‑device chat, summarization, or code assistance, eliminating reliance on cloud APIs and reducing latency/privacy concerns.
- Real‑time multimodal assistants: LFM2‑VL’s tunable visual token pipeline makes it feasible to build AR assistants that answer visual queries instantly on a headset.
- Speech‑to‑speech bots: LFM2‑Audio’s streaming architecture enables low‑latency voice assistants or translation devices that run on a single CPU core.
- Search & retrieval services: LFM2‑ColBERT provides a fast, multilingual encoder that can be deployed in latency‑critical search back‑ends or personal knowledge‑base tools without GPU acceleration (a late‑interaction scoring sketch follows this list).
- Open‑source ecosystem: The provided ExecuTorch, llama.cpp, and vLLM packages let teams drop the models into existing inference stacks, accelerating prototyping and production rollout.
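For context on the retrieval component, ColBERT‑style encoders score documents with late interaction (MaxSim): every query token is matched against its most similar document token, and the maxima are summed. A minimal sketch of that scoring step, with random tensors standing in for per‑token embeddings from a model such as LFM2‑ColBERT:

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_embs, doc_embs):
    """ColBERT-style late-interaction score for one query/document pair."""
    # query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim),
    # both assumed L2-normalized so the dot product is a cosine similarity.
    sim = query_embs @ doc_embs.T              # token-by-token similarity matrix
    return sim.max(dim=1).values.sum()         # best document match per query token, summed

# Toy usage: rank three "documents" for one "query".
q = F.normalize(torch.randn(8, 128), dim=-1)
docs = [F.normalize(torch.randn(n, 128), dim=-1) for n in (40, 55, 32)]
scores = sorted(((maxsim_score(q, d).item(), i) for i, d in enumerate(docs)), reverse=True)
print(scores)                                  # documents ordered by MaxSim relevance
```

Because document token embeddings can be precomputed and indexed offline, only the short query has to be encoded at request time, which is why encoder latency is the number that matters for CPU‑only search back‑ends.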
Limitations & Future Work
- Scaling ceiling on pure CPU: While LFM2‑MoE reduces active parameters, the routing overhead still incurs a modest latency penalty compared to dense models of the same active size.
- Domain‑specific fine‑tuning: Extreme low‑resource domains (e.g., medical jargon) still benefit from additional supervised data; the current curriculum does not explicitly target such niches.
- Hardware diversity: The NAS was performed on a limited set of x86 CPUs; extending the search to ARM‑based SoCs, GPUs, or emerging NPUs could uncover even better trade‑offs.
- Robustness & Alignment: Preference optimization focuses on length‑normalized rewards; broader alignment (e.g., safety, factuality) remains an open research direction.
Future work is slated to explore dynamic sparsity that adapts attention patterns at runtime, cross‑modal curriculum learning, and automated deployment pipelines that tailor the model family to a developer’s exact hardware budget.
Authors
- Alexander Amini
- Anna Banaszak
- Harold Benoit
- Arthur Böök
- Tarek Dakhran
- Song Duong
- Alfred Eng
- Fernando Fernandes
- Marc Härkönen
- Anne Harrington
- Ramin Hasani
- Saniya Karwa
- Yuri Khrustalev
- Maxime Labonne
- Mathias Lechner
- Valentine Lechner
- Simon Lee
- Zetian Li
- Noel Loo
- Jacob Marks
- Edoardo Mosca
- Samuel J. Paech
- Paul Pak
- Rom N. Parnichkun
- Alex Quach
- Ryan Rogers
- Daniela Rus
- Nayan Saxena
- Bettina Schlager
- Tim Seyde
- Jimmy T. H. Smith
- Aditya Tadimeti
- Neehal Tumma
Paper Information
- arXiv ID: 2511.23404v1
- Categories: cs.LG, cs.AI
- Published: November 28, 2025
- PDF: https://arxiv.org/pdf/2511.23404v1