[Paper] Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis
Source: arXiv - 2512.22100v1
Overview
The paper fills a long‑standing gap in Turkish NLP by introducing TrGLUE, a GLUE‑style benchmark covering a suite of natural language understanding (NLU) tasks, and SentiTurca, a dedicated sentiment‑analysis benchmark. By providing ready‑to‑use data, annotation pipelines, and evaluation scripts, the authors give the Turkish‑language community a common yardstick for comparing transformer models, LLMs, and other NLU systems.
Key Contributions
- TrGLUE benchmark: 8–10 Turkish‑native NLU tasks (e.g., sentence classification, textual entailment, paraphrase detection) modeled after the original GLUE suite.
- SentiTurca: a large‑scale, domain‑balanced sentiment‑analysis dataset covering product reviews, social media, and news comments.
- Semi‑automated annotation pipeline: combines provisional labels generated by a strong LLM, cross‑model agreement filtering, and a final human validation step to ensure high label quality while keeping costs low (a minimal sketch of the agreement filter follows this list).
- Open‑source tooling: end‑to‑end fine‑tuning and evaluation scripts for Hugging Face‑compatible transformer models, enabling reproducible experiments out of the box.
- Empirical baselines: comprehensive performance tables for BERT‑base, RoBERTa‑turkish, and several recent LLMs, highlighting the current state of Turkish NLU.
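The cross‑model agreement filtering at the heart of the annotation pipeline can be sketched in a few lines of Python. This is a plausible reconstruction from the paper's description, not the authors' released code; the voting threshold, the label set, and the toy rule‑based labelers (standing in for separate LLM runs) are all assumptions.

```python
from collections import Counter

def filter_by_agreement(examples, labelers, min_agreement=1.0):
    """Keep examples whose provisional labels agree across model runs.

    labelers: callables mapping an input to a label (e.g., separate LLM runs).
    min_agreement: fraction of labelers that must emit the majority label.
    """
    kept = []
    for ex in examples:
        labels = [label(ex) for label in labelers]
        majority_label, count = Counter(labels).most_common(1)[0]
        if count / len(labelers) >= min_agreement:
            kept.append((ex, majority_label))  # forwarded to human review
        # low-agreement examples are dropped or re-queued for annotation
    return kept

# Toy stand-ins for LLM labelers:
labelers = [
    lambda s: "positive" if "harika" in s else "negative",
    lambda s: "positive" if ("harika" in s or "iyi" in s) else "negative",
]
data = ["Bu film harika!", "Bu film iyi ama uzun."]
print(filter_by_agreement(data, labelers))
# -> [('Bu film harika!', 'positive')]  (only the unanimous example survives)
```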
Methodology
- Task selection & data sourcing – The authors curated existing Turkish corpora (news, forums, question‑answer sites) and reformulated them into GLUE‑style formats (single‑sentence, sentence‑pair, and multiple‑choice).
- Label generation – For tasks lacking human annotations, a strong Turkish LLM (e.g., a fine‑tuned mT5) generated provisional labels. Multiple model runs were compared; only examples with high inter‑model agreement were kept for human review.
- Human validation – A small team of native Turkish speakers performed spot‑checks and corrected noisy instances, ensuring that the final benchmark reflects natural language use rather than translation artifacts.
- Benchmark construction – Each task is split into train/dev/test sets following the GLUE convention, with balanced class distributions and domain diversity.
- Evaluation framework – The authors released a Python package that wraps the 🤗 Transformers Trainer, automatically computes task‑specific metrics (accuracy, F1, Matthews correlation, etc.), and logs results to TensorBoard or Weights & Biases.
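As a concrete illustration of that evaluation wrapper, the sketch below fine‑tunes and scores a model on a TrGLUE‑style single‑sentence task with Hugging Face Transformers. The dataset ID is a hypothetical placeholder (the real identifiers live in the released repo), and the metric set mirrors the task‑specific metrics listed above.

```python
# Minimal sketch of fine-tuning + evaluation on a TrGLUE-style task.
# NOTE: "trglue/example-task" is a hypothetical dataset ID; consult the
# released TrGLUE repository for the actual names and scripts.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "dbmdz/bert-base-turkish-cased"  # any Turkish encoder works here
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("trglue/example-task")  # hypothetical dataset ID
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)  # sentence-pair tasks would tokenize two text columns instead

def compute_metrics(eval_pred):
    # Task-specific metrics, as in the paper's evaluation framework.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "matthews": matthews_corrcoef(labels, preds),
    }

trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained(model_name,
                                                             num_labels=2),
    args=TrainingArguments(output_dir="trglue_out", num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```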
Results & Findings
| Model | Avg. TrGLUE Score* | SentiTurca F1 |
|---|---|---|
| BERT‑base (multilingual) | 68.2 | 71.4 |
| RoBERTa‑turkish (large) | 74.9 | 78.1 |
| mT5‑XL (fine‑tuned) | 72.3 | 75.6 |
| GPT‑3.5‑turkish (zero‑shot) | 61.5 | 64.2 |
*Average of normalized task scores (0–100).
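One plausible reading of that footnote, with made‑up task scores and an assumed linear rescaling for metrics such as Matthews correlation that do not natively live on a 0–100 scale:

```python
# Hypothetical per-task scores, already on a 0-100 scale.
task_scores = {"entailment": 76.0, "paraphrase": 71.5, "classification": 77.2}

def normalize_mcc(mcc):
    # Assumed rescaling of Matthews correlation from [-1, 1] to [0, 100];
    # the paper's exact normalization may differ.
    return (mcc + 1.0) / 2.0 * 100.0

task_scores["acceptability"] = normalize_mcc(0.42)  # hypothetical MCC
avg_trglue = sum(task_scores.values()) / len(task_scores)
print(f"Avg. TrGLUE score: {avg_trglue:.1f}")  # -> 73.9
```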
- Domain robustness: Models trained across the full TrGLUE suite generalized better to out‑of‑domain Turkish text than models fine‑tuned on a single task.
- Annotation pipeline payoff: The semi‑automated approach achieved >92% agreement with fully human‑annotated subsets, confirming that LLM‑assisted labeling can be reliable for low‑resource languages.
- Sentiment nuance: SentiTurca revealed that many models struggle with sarcasm and code‑switching (Turkish–English), indicating room for specialized pre‑training.
Practical Implications
- Standardized evaluation: Companies building Turkish chatbots, voice assistants, or content‑moderation pipelines now have a common benchmark to gauge model upgrades and compare vendor solutions.
- Faster dataset creation: The annotation pipeline can be repurposed for new Turkish tasks (e.g., intent detection), dramatically cutting time‑to‑data for product teams.
- Model selection guidance: Baseline results suggest that a Turkish‑specific RoBERTa model is currently the safest default for most NLU workloads, while larger multilingual LLMs still lag behind on nuanced tasks.
- Open‑source integration: The provided scripts plug directly into CI pipelines (GitHub Actions, Azure ML), enabling continuous benchmarking as models evolve (a sketch of such a regression gate follows).
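For instance, a minimal regression gate in such a pipeline might look like the following. The score‑file format, keys, and tolerance are assumptions for illustration, not part of the released tooling:

```python
# ci_benchmark_gate.py -- fail the CI job if the candidate model regresses.
# Assumes a prior step wrote per-task scores to JSON; file names and the
# tolerance below are hypothetical.
import json
import sys

TOLERANCE = 0.5  # allowed drop in points before the build fails

with open("baseline_scores.json") as f:
    baseline = json.load(f)
with open("candidate_scores.json") as f:
    candidate = json.load(f)

failures = [
    f"{task}: {candidate.get(task, 0.0):.1f} vs baseline {score:.1f}"
    for task, score in baseline.items()
    if candidate.get(task, 0.0) < score - TOLERANCE
]

if failures:
    print("Benchmark regression detected:\n" + "\n".join(failures))
    sys.exit(1)  # non-zero exit fails the GitHub Actions / Azure ML step
print("All tasks within tolerance.")
```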
If you’re building Turkish‑language AI products, consider cloning the TrGLUE repo, running the baseline scripts on your own models, and contributing back any new task data you generate. The benchmark is designed to evolve with the community, and early adopters will shape the next generation of Turkish NLU.
Limitations & Future Work
- Task coverage: While TrGLUE spans many core NLU tasks, it lacks structured prediction tasks such as named‑entity recognition and coreference resolution, which are important for downstream applications.
- Domain bias: The benchmark leans heavily on news and product‑review domains; under‑represented dialects and informal social‑media slang may still be under‑tested.
- Human validation scale: The final human review step was performed by a relatively small pool of annotators, which could limit the detection of subtle cultural or regional nuances.
- Future directions: The authors plan to expand TrGLUE with additional tasks (e.g., QA, NER), incorporate more diverse dialectal data, and open a leaderboard to foster community‑driven model improvements.
Authors
- Duygu Altinok
Paper Information
- arXiv ID: 2512.22100v1
- Categories: cs.CL, cs.AI
- Published: December 26, 2025