[Paper] SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Published: (June 11, 2026 at 01:50 PM EDT)
Source: arXiv - 2606.13647v1

Overview

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

Key Contributions

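Based on the abstract, the paper makes the following contributions:

  • SkMTEB, an MTEB-style text embedding benchmark for Slovak with 31 datasets across 7 task types
  • An evaluation of 31 embedding models on the benchmark
  • e5-sk-small (45M parameters) and e5-sk-large (365M), derived from Multilingual E5 models through vocabulary trimming and fine-tuning
  • An open release of the benchmark, models, datasets, and code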

Methodology

Please refer to the full paper for detailed methodology.
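
As a rough illustration of the vocabulary trimming mentioned in the overview, the sketch below keeps only the tokens that occur in a target-language corpus (plus special tokens) and drops the matching embedding rows. This is not the authors' code, and this page does not describe their exact procedure. The function name trim_vocabulary, the min_count parameter, and the toy vocabulary are all assumptions made for illustration.

```python
from collections import Counter


def trim_vocabulary(vocab, embeddings, corpus_tokens, special_tokens=(), min_count=1):
    """Hypothetical helper: keep tokens seen in the corpus, plus special tokens.

    vocab:       dict mapping token -> row index in `embeddings`
    embeddings:  list of vectors, one per token id
    Returns (new_vocab, new_embeddings) with ids renumbered from 0.
    """
    counts = Counter(corpus_tokens)
    # Walk the vocabulary in original id order so relative order is preserved.
    keep = [
        tok
        for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
        if tok in special_tokens or counts[tok] >= min_count
    ]
    new_vocab = {tok: new_id for new_id, tok in enumerate(keep)}
    new_embeddings = [embeddings[vocab[tok]] for tok in keep]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary (illustrative values only).
vocab = {"<pad>": 0, "<unk>": 1, "dobrý": 2, "deň": 3, "hello": 4, "bonjour": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]
corpus_tokens = ["dobrý", "deň", "dobrý"]  # tokens observed in a Slovak corpus

new_vocab, new_embeddings = trim_vocabulary(
    vocab, embeddings, corpus_tokens, special_tokens=("<pad>", "<unk>")
)

# Fraction of embedding rows removed by trimming.
reduction = 1 - len(new_vocab) / len(vocab)
print(new_vocab, f"{reduction:.0%} of rows removed")
```

In a multilingual model the token embedding matrix can account for a large share of the parameters, so removing rows for tokens a single language never uses is one plausible way to shrink the model before fine-tuning.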

Practical Implications

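According to the abstract, the released models are intended for local deployment in semantic search and retrieval-augmented generation (RAG), and the authors hope the approach offers a replicable path for other under-resourced languages.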
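
The overview names semantic search as a target use case for the embedding models. A minimal sketch of that retrieval step, assuming documents and queries have already been converted to vectors, ranks documents by cosine similarity to the query. The vectors below are hand-written stand-ins rather than outputs of any model from the paper, and the names cosine_similarity and rank_documents are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stand-in embeddings; a real system would obtain these from an embedding model.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query_vec = [1.0, 0.1, 0.0]

results = rank_documents(query_vec, doc_vecs, top_k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In a RAG pipeline, the top-ranked documents would then be passed to a generator as context; that step is outside the scope of this sketch.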

Authors

  • Marek Šuppa
  • Andrej Ridzik
  • Daniel Hládek
  • Natália Kňažeková
  • Viktória Ondrejová

Paper Information

  • arXiv ID: 2606.13647v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: June 11, 2026