[Paper] LFM2 Technical Report
Source: arXiv - 2511.23404v1
Overview
The LFM2 technical report introduces Liquid Foundation Models (LFM2), a new family of language models engineered for fast, low‑memory inference on edge devices such as smartphones, laptops, and embedded CPUs. By combining hardware‑in‑the‑loop architecture search with new training techniques, the authors deliver models that are up to 2× faster than comparably sized alternatives while still achieving top‑tier benchmark scores.
Key Contributions
- Hybrid backbone design: Combines gated short‑range convolutions with a handful of grouped‑query attention blocks, dramatically cutting latency on CPUs (a block‑level sketch follows this list).
- Hardware‑in‑the‑loop NAS: Architecture search explicitly optimizes for edge latency and memory limits, rather than just FLOPs or parameter count.
- Scalable model family: Six dense variants (350 M – 2.6 B parameters) plus an 8.3 B mixture‑of‑experts (MoE) model that activates only 1.5 B parameters per token. All support a 32 K context window.
- Training pipeline innovations:
  - Tempered, decoupled Top‑K knowledge distillation that avoids “support mismatch” between teacher and student.
  - Curriculum learning that feeds data in increasing difficulty order.
  - Three‑stage post‑training recipe (supervised fine‑tuning → length‑normalized preference optimization → model merging).
- Multimodal extensions:
  - LFM2‑VL (vision‑language) with token‑efficient visual front‑ends for adjustable accuracy‑latency trade‑offs.
  - LFM2‑Audio (speech‑to‑speech) with separate audio encoder/decoder pipelines enabling real‑time interaction.
  - LFM2‑ColBERT (retrieval) offering low‑latency multilingual query/document encoding.
- Open‑source deployment bundles: Ready‑to‑run packages for ExecuTorch, llama.cpp, and vLLM, facilitating immediate edge deployment.
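To make the hybrid backbone concrete, here is a minimal PyTorch sketch of the pattern the report describes: gated, depthwise short convolutions for most layers, with an attention block interleaved every few layers. Class names, dimensions, and the use of standard multi‑head attention in place of grouped‑query attention are illustrative assumptions, not the report's exact operators.

```python
import torch
import torch.nn as nn

class GatedShortConvBlock(nn.Module):
    """Illustrative gated short-range convolution mixer (not LFM2's exact operator)."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)               # value and gate paths
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim,                     # depthwise: cheap on CPU
                              padding=kernel_size - 1)        # left-pad, then trim -> causal
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (batch, seq, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(v * torch.sigmoid(g))            # gate the conv output

class HybridBackbone(nn.Module):
    """Mostly conv blocks, with an attention block every `attn_every` layers."""
    def __init__(self, dim: int = 512, depth: int = 12, attn_every: int = 4, n_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(depth):
            if (i + 1) % attn_every == 0:
                # A real grouped-query attention block would share K/V projections
                # across head groups; standard multi-head attention stands in here.
                self.blocks.append(nn.MultiheadAttention(dim, n_heads, batch_first=True))
            else:
                self.blocks.append(GatedShortConvBlock(dim))

    def forward(self, x):
        for blk in self.blocks:
            if isinstance(blk, nn.MultiheadAttention):
                x = x + blk(x, x, x, need_weights=False)[0]   # residual attention
            else:
                x = x + blk(x)                                # residual conv mixer
        return x
```

Even in this toy form the design intuition carries through: depthwise short convolutions keep per‑token work local and cheap on CPUs, while the occasional attention layers supply global context.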
Methodology
Architecture Search with Real‑World Constraints
- The authors run a neural‑architecture‑search loop that measures actual inference time and memory usage on target CPUs.
- The search space mixes short convolutions (fast, local pattern capture) with grouped‑query attention (lightweight global context).
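What distinguishes this search from FLOP‑ or parameter‑count‑based NAS is the objective: each candidate is timed on the actual target hardware and discarded if it misses a latency or memory budget. The sketch below reuses the HybridBackbone class from the earlier sketch and uses made‑up budgets; it illustrates the idea rather than the report's actual search algorithm.

```python
import itertools
import statistics
import time

import torch

def measure_latency_ms(model, seq_len=1024, dim=512, trials=5):
    """Median wall-clock prefill latency on the current CPU (illustrative proxy)."""
    x = torch.randn(1, seq_len, dim)
    with torch.inference_mode():
        model(x)                                   # warm-up run
        times = []
        for _ in range(trials):
            t0 = time.perf_counter()
            model(x)
            times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

# Hypothetical search space: how often to place attention, and model width.
candidates = itertools.product([2, 3, 4, 6], [384, 512, 640])

best = None
for attn_every, dim in candidates:
    model = HybridBackbone(dim=dim, depth=12, attn_every=attn_every)
    latency_ms = measure_latency_ms(model, dim=dim)
    size_mb = sum(p.numel() for p in model.parameters()) * 4 / 2**20   # fp32 weights

    if latency_ms > 500 or size_mb > 300:          # made-up edge budgets (ms, MB)
        continue                                   # hard-reject over-budget candidates

    # Among in-budget candidates, prefer the widest model as a crude capacity proxy.
    if best is None or dim > best[1]:
        best = (attn_every, dim, latency_ms)

print("selected configuration (attn_every, dim, latency_ms):", best)
```

In the report's setting, measurements come from the intended edge targets rather than the development machine, and candidate quality is estimated properly instead of being proxied by width.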
Training Regimen
- Tempered Top‑K Distillation: The student learns from the teacher’s top‑K logits, with the temperature gradually annealed to keep the learning signal stable across training stages (a loss sketch follows this list).
- Curriculum Data Ordering: Training data are sorted by difficulty (e.g., token entropy) so the model first masters easy patterns before tackling harder ones.
- Post‑Training Three‑Stage Recipe:
  - Supervised fine‑tuning on task‑specific data.
  - Length‑normalized preference optimization (a lightweight RLHF‑style step that normalizes rewards by response length to avoid a bias toward longer outputs).
  - Model merging to blend multiple fine‑tuned checkpoints for robustness.
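As a concrete reference for the distillation step, below is a minimal sketch of a top‑K distillation loss with an annealed temperature. Restricting both teacher and student to the teacher's top‑K tokens and renormalizing over that shared support is one simple way to sidestep the support‑mismatch problem; the report's exact tempering and decoupling are not reproduced here.

```python
import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits, teacher_logits, k=64, temperature=2.0):
    """Match the student to the teacher on the teacher's top-K tokens only."""
    # (batch, seq, vocab) -> keep the K most likely teacher tokens per position.
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
    student_topk = student_logits.gather(-1, topk_idx)

    # Renormalize both distributions over the shared K-token support.
    teacher_p = F.softmax(topk_vals / temperature, dim=-1)
    student_logp = F.log_softmax(student_topk / temperature, dim=-1)

    # KL(teacher || student), scaled by T^2 as in standard distillation.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * temperature**2

def annealed_temperature(step, total_steps, t_start=4.0, t_end=1.0):
    """Illustrative linear anneal from a soft to a sharp teacher over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

# Usage inside a training loop (teacher_logits come from a frozen teacher forward pass):
#   t = annealed_temperature(step, total_steps)
#   loss = topk_distill_loss(student_logits, teacher_logits, k=64, temperature=t)
```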
Multimodal Adaptations
- Visual tokens are generated by a lightweight CNN‑based tokenizer that can be throttled to meet latency budgets.
- Audio pipelines split encoding/decoding, allowing streaming inference with sub‑second latency.
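On the vision side, the accuracy‑latency dial largely comes down to how many visual tokens reach the language model. A minimal sketch, assuming a square grid of patch embeddings and plain average pooling (the report's CNN‑based tokenizer is more sophisticated):

```python
import torch
import torch.nn.functional as F

def throttle_visual_tokens(patch_embeddings, token_budget):
    """Pool a square grid of patch embeddings down to roughly `token_budget` tokens."""
    b, n, d = patch_embeddings.shape               # (batch, num_patches, dim)
    side = int(n ** 0.5)                           # assumes a square patch grid
    grid = patch_embeddings.transpose(1, 2).reshape(b, d, side, side)

    target_side = max(int(token_budget ** 0.5), 1)
    pooled = F.adaptive_avg_pool2d(grid, target_side)    # (b, d, t, t)
    return pooled.flatten(2).transpose(1, 2)              # (b, t*t, d)

# Example: 576 patch embeddings reduced to 64 or 144 visual tokens.
patches = torch.randn(1, 576, 768)
fast = throttle_visual_tokens(patches, token_budget=64)       # lower latency, less detail
accurate = throttle_visual_tokens(patches, token_budget=144)  # higher fidelity
print(fast.shape, accurate.shape)  # torch.Size([1, 64, 768]) torch.Size([1, 144, 768])
```

Fewer visual tokens shrink both the prefill cost and the KV cache, which is the mechanism behind the adjustable accuracy‑latency trade‑off.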
Evaluation
- Trained on 10–12 trillion tokens across web text, code, and multimodal corpora.
- Benchmarked on standard language (IFEval, GSM8K), vision‑language (VQAv2, COCO), speech (LibriSpeech, VCTK), and retrieval (MS‑MARCO, multilingual BEIR) suites.
Results & Findings
| Model | Params | IFEval | GSM8K | VQAv2 (VL) | LibriSpeech WER (Audio) | Retrieval MRR (ColBERT) |
|---|---|---|---|---|---|---|
| LFM2‑350M | 0.35 B | 71.2% | 74.8% | 68.5% | 9.2% | 71.3% |
| LFM2‑2.6B | 2.6 B | 79.56% | 82.41% | 78.1% | 6.8% | 78.9% |
| LFM2‑MoE | 8.3 B (1.5 B active) | 81.3% | 84.7% | 80.4% | 5.9% | 81.2% |
- Latency: On a typical laptop CPU (Intel i7‑12700H), LFM2‑2.6B’s pre‑fill and decode are ~2× faster than a dense 2.6 B LLaMA‑2 baseline while using ~30 % less RAM.
- Multimodal trade‑offs: LFM2‑VL can drop visual token resolution by 50 % with only a 2‑3 % accuracy hit, enabling sub‑100 ms image‑conditioned generation on a phone‑class SoC.
- Real‑time speech: LFM2‑Audio achieves ≤ 150 ms end‑to‑end latency for speech‑to‑speech, comparable to models that are three times larger.
Overall, the study demonstrates that architectural co‑design with hardware constraints can deliver edge‑ready foundation models without sacrificing state‑of‑the‑art performance.
Practical Implications
- Edge AI products: Developers can embed a 2.6 B LFM2 model directly into mobile apps, wearables, or IoT gateways for on‑device chat, summarization, or code assistance, eliminating reliance on cloud APIs and reducing latency/privacy concerns.
- Real‑time multimodal assistants: LFM2‑VL’s tunable visual token pipeline makes it feasible to build AR assistants that answer visual queries instantly on a headset.
- Speech‑to‑speech bots: LFM2‑Audio’s streaming architecture enables low‑latency voice assistants or translation devices that run on a single CPU core.
- Search & retrieval services: LFM2‑ColBERT provides a fast, multilingual encoder that can be deployed in latency‑critical search back‑ends or personal knowledge‑base tools without GPU acceleration (a late‑interaction scoring sketch follows this list).
- Open‑source ecosystem: The provided ExecuTorch, llama.cpp, and vLLM packages let teams drop the models into existing inference stacks, accelerating prototyping and production rollout.
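For context on the retrieval component, ColBERT‑style encoders score documents with late interaction (MaxSim): every query token is matched against its most similar document token, and the maxima are summed. A minimal sketch of that scoring step, with random tensors standing in for per‑token embeddings from a model such as LFM2‑ColBERT:

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_embs, doc_embs):
    """ColBERT-style late-interaction score for one query/document pair."""
    # query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim),
    # both assumed L2-normalized so the dot product is a cosine similarity.
    sim = query_embs @ doc_embs.T              # token-by-token similarity matrix
    return sim.max(dim=1).values.sum()         # best document match per query token, summed

# Toy usage: rank three "documents" for one "query".
q = F.normalize(torch.randn(8, 128), dim=-1)
docs = [F.normalize(torch.randn(n, 128), dim=-1) for n in (40, 55, 32)]
scores = sorted(((maxsim_score(q, d).item(), i) for i, d in enumerate(docs)), reverse=True)
print(scores)                                  # documents ordered by MaxSim relevance
```

Because document token embeddings can be precomputed and indexed offline, only the short query has to be encoded at request time, which is why encoder latency is the number that matters for CPU‑only search back‑ends.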
Limitations & Future Work
- Scaling ceiling on pure CPU: While LFM2‑MoE reduces active parameters, the routing overhead still incurs a modest latency penalty compared to dense models of the same active size.
- Domain‑specific fine‑tuning: Extreme low‑resource domains (e.g., medical jargon) still benefit from additional supervised data; the current curriculum does not explicitly target such niches.
- Hardware diversity: The NAS was performed on a limited set of x86 CPUs; extending the search to ARM‑based SoCs, GPUs, or emerging NPUs could uncover even better trade‑offs.
- Robustness & Alignment: Preference optimization focuses on length‑normalized rewards; broader alignment (e.g., safety, factuality) remains an open research direction.
Future work is slated to explore dynamic sparsity that adapts attention patterns at runtime, cross‑modal curriculum learning, and automated deployment pipelines that tailor the model family to a developer’s exact hardware budget.
Authors
- Alexander Amini
- Anna Banaszak
- Harold Benoit
- Arthur Böök
- Tarek Dakhran
- Song Duong
- Alfred Eng
- Fernando Fernandes
- Marc Härkönen
- Anne Harrington
- Ramin Hasani
- Saniya Karwa
- Yuri Khrustalev
- Maxime Labonne
- Mathias Lechner
- Valentine Lechner
- Simon Lee
- Zetian Li
- Noel Loo
- Jacob Marks
- Edoardo Mosca
- Samuel J. Paech
- Paul Pak
- Rom N. Parnichkun
- Alex Quach
- Ryan Rogers
- Daniela Rus
- Nayan Saxena
- Bettina Schlager
- Tim Seyde
- Jimmy T. H. Smith
- Aditya Tadimeti
- Neehal Tumma
Paper Information
- arXiv ID: 2511.23404v1
- Categories: cs.LG, cs.AI
- Published: November 28, 2025
- PDF: https://arxiv.org/pdf/2511.23404v1