[Paper] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Published: March 19, 2026 at 01:59 PM EDT
Source: arXiv - 2603.19223v1

Overview

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a newly curated composite of 60 million publicly available, high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
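
The overview mentions matryoshka learning without showing what it buys at inference time. As a rough illustration only, the sketch below shows the usual way matryoshka-style embeddings are consumed: keep the first few dimensions of a vector and re-normalize it, trading some accuracy for a smaller index. This is not the F2LLM-v2 API or training code. The helper names (truncate_embedding, cosine_similarity), the 1024-dimension size, and the random vectors are all assumptions made for the example.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize.

    With matryoshka-trained models, leading coordinates are trained to be
    useful on their own, so a prefix can stand in for the full vector.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


def cosine_similarity(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Placeholder data: random vectors standing in for real model outputs.
rng = random.Random(0)
full_dim = 1024  # hypothetical embedding size, not taken from the paper
query = [rng.gauss(0.0, 1.0) for _ in range(full_dim)]
doc = [q + 0.5 * rng.gauss(0.0, 1.0) for q in query]  # noisy copy of query

# Compare similarity at several prefix lengths.
for dim in (64, 256, 1024):
    q = truncate_embedding(query, dim)
    d = truncate_embedding(doc, dim)
    print(dim, round(cosine_similarity(q, d), 3))
```

With real matryoshka-trained embeddings, the shorter prefixes would be expected to preserve most of the ranking quality of the full vector; the random data here only demonstrates the mechanics of truncation and re-normalization.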

Key Contributions

This paper is listed under the following arXiv subject areas:

  • cs.CL (Computation and Language)
  • cs.AI (Artificial Intelligence)

Methodology

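According to the abstract, the models are trained with a two-stage LLM-based embedding pipeline combined with matryoshka learning, model pruning, and knowledge distillation. Please refer to the full paper for the detailed methodology.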

Practical Implications

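This research targets multilingual text embedding in more than 200 languages, including mid- and low-resource ones, and offers smaller models aimed at resource-constrained applications. All models, data, code, and intermediate checkpoints are released.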

Authors

  • Ziyin Zhang
  • Zihan Liao
  • Hang Yu
  • Peng Di
  • Rui Wang

Paper Information

  • arXiv ID: 2603.19223v1
  • Categories: cs.CL, cs.AI
  • Published: March 19, 2026