[Paper] SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Published: June 11, 2026 at 01:50 PM EDT

Source: arXiv - 2606.13647v1

Overview

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language. It comprises 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models perform competitively with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.
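The abstract names vocabulary trimming as one step in building the adapted models but gives no procedure. The sketch below illustrates the general idea only, under stated assumptions: it keeps the embedding rows for special tokens and for tokens that occur in a target-language corpus, then re-indexes them. The function names, the toy vocabulary, and the two-dimensional vectors are all hypothetical, and this is not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def trim_vocabulary(
    vocab: Dict[str, int],
    embeddings: Sequence[Sequence[float]],
    corpus_tokens: Sequence[str],
    special_tokens: Sequence[str] = ("<pad>", "<unk>"),
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep special tokens plus tokens observed in the corpus; drop the rest."""
    seen = set(corpus_tokens)
    # Walk tokens in original id order so the new ids are deterministic.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    kept = [tok for tok, _ in ordered if tok in seen or tok in special_tokens]
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    # Copy each surviving embedding row into its new position.
    new_embeddings = [list(embeddings[vocab[tok]]) for tok in kept]
    return new_vocab, new_embeddings


def size_reduction(before: int, after: int) -> float:
    """Fraction of rows (or parameters) removed."""
    return 1.0 - after / before


# Toy data (hypothetical): a six-token multilingual vocabulary.
toy_vocab = {"<pad>": 0, "<unk>": 1, "▁mačka": 2, "▁cat": 3, "▁pes": 4, "▁chien": 5}
toy_embeddings = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]]
slovak_corpus_tokens = ["▁mačka", "▁pes", "▁mačka"]

trimmed_vocab, trimmed_embeddings = trim_vocabulary(
    toy_vocab, toy_embeddings, slovak_corpus_tokens
)
rows_removed = size_reduction(len(toy_vocab), len(trimmed_vocab))
```

In a multilingual model the token embedding matrix accounts for a large share of the parameters, so dropping rows for tokens that never occur in the target language can shrink the model noticeably. In practice the tokenizer must be rebuilt to match the new ids before any fine-tuning.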

Key Contributions

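The paper's main contributions, as stated in the abstract:

SkMTEB, an MTEB-style text embedding benchmark for Slovak with 31 datasets across 7 task types
An evaluation of 31 embedding models on the benchmark
Two adapted models, e5-sk-small (45M parameters) and e5-sk-large (365M parameters), built by applying vocabulary trimming and fine-tuning to Multilingual E5
An open release of the benchmark, models, datasets, and code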

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

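This research contributes to computational linguistics (cs.CL) by giving Slovak an MTEB-style embedding benchmark and compact, locally deployable embedding models suited to semantic search and retrieval-augmented generation (RAG).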
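The abstract names semantic search as a target use for the adapted models. As a minimal sketch of what that involves once texts have been embedded, the code below ranks documents by cosine similarity to a query vector. The vectors are hand-written toy values rather than outputs of any real model, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values standing in for model outputs).
query_embedding = [1.0, 0.0]
document_embeddings = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

ranking = rank_documents(query_embedding, document_embeddings)
ranked_indices = [index for index, _ in ranking]
```

In a RAG pipeline, the top-ranked documents would then be passed to a generator as context.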

Authors

Marek Šuppa
Andrej Ridzik
Daniel Hládek
Natália Kňažeková
Viktória Ondrejová

Paper Information

arXiv ID: 2606.13647v1
Categories: cs.CL, cs.AI, cs.LG
Published: June 11, 2026
