[Paper] PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation
Source: arXiv - 2605.05159v1
Overview
The paper describes a runner‑up system for SemEval‑2026 Task 9, which asks participants to detect political polarization in short texts across 22 languages. By fine‑tuning large multilingual Gemma models with low‑rank adapters and enriching the training data with filtered synthetic examples, the authors achieve a macro‑F1 of 0.811, placing second overall and first in several languages.
Key Contributions
- Per‑language fine‑tuning of two Gemma 3 models (12 B and 27 B parameters) using LoRA, allowing efficient adaptation without full model retraining.
- Synthetic data pipeline that creates three types of augmentations (direct generation, paraphrase, contrastive pairs) via GPT‑4o‑mini, followed by multi‑stage quality filtering and embedding‑based deduplication.
- Dynamic threshold tuning on the development set per language, delivering a consistent gain of 2–4 % absolute F1 without extra training.
- Weighted ensemble strategy that combines predictions from the 12 B and 27 B models, with language‑specific selection of the best‑performing configuration.
- Empirical insight that strong development‑set performers (e.g., XLM‑RoBERTa, Qwen‑3) can dramatically under‑perform on the blind test set, underscoring the need for robust generalization techniques.
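To illustrate why LoRA makes per‑language adaptation cheap, here is a minimal sketch of the low‑rank update for a single weight matrix. It is not the authors' training code: the toy dimensions, the rank `r`, and the scaling factor `alpha` are assumptions chosen for illustration, and this summary does not state the paper's actual hyperparameters. The zero initialization of `B` follows the usual LoRA convention, so the adapter starts out as a no‑op.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank.

    W: frozen pretrained weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Number of parameters LoRA trains for one (d_out, d_in) weight matrix."""
    return r * (d_in + d_out)


# Toy sizes (assumed, far smaller than any real transformer layer).
d_in, d_out, r, alpha = 8, 6, 2, 4.0
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: output equals the base model
x = [rng.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha)
full_count = d_out * d_in
lora_count = trainable_params(d_out, d_in, r)
print("full fine-tuning parameters:", full_count)
print("LoRA trainable parameters:", lora_count)
```

The saving grows with matrix size: the full matrix has `d_out * d_in` entries, while the adapter has only `r * (d_in + d_out)`, which is why one adapter per language stays affordable.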
Methodology
- Base Models – The authors start from the open‑source Gemma 3 family (12 B and 27 B parameters), which already support 100+ languages.
- LoRA Adaptation – Instead of full fine‑tuning, they inject low‑rank matrices into each transformer layer, drastically reducing GPU memory and training time while preserving the bulk of the pretrained knowledge.
- Synthetic Data Generation
  - Direct Generation: Prompt GPT‑4o‑mini to write new polarized / non‑polarized sentences in the target language.
  - Paraphrasing: Feed existing labeled sentences to the LLM and request paraphrases that keep the original label.
  - Contrastive Pairs: Ask the LLM to produce a minimally altered version that flips the label, creating hard negative examples.
- Quality Filtering – Each synthetic batch passes through:
  - Heuristic checks (language detection, profanity, length).
  - LLM‑based validation (prompt the LLM to re‑classify the sentence).
  - Embedding deduplication (FAISS index to drop near‑duplicates).
- Training – LoRA adapters are trained on the union of original and filtered synthetic data for each language separately.
- Inference Tweaks – After training, the authors sweep decision thresholds on the dev set per language, storing the optimal value for test‑time scoring.
- Ensembling – Predictions from the 12 B and 27 B adapters are combined using a weighted average; weights are chosen per language based on dev‑set performance.
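The last two steps (threshold sweep and weighted ensembling) can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' code: the threshold grid, the candidate weights, the toy probabilities, and the choice to tune weight and threshold jointly are all assumptions, and the function names are hypothetical.

```python
def f1_binary(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores of the two classes (0 and 1)."""
    return (f1_binary(y_true, y_pred, 1) + f1_binary(y_true, y_pred, 0)) / 2


def ensemble(p_small, p_large, w_large):
    """Weighted average of two models' positive-class probabilities."""
    return [(1 - w_large) * a + w_large * b for a, b in zip(p_small, p_large)]


def best_threshold(y_true, probs, grid):
    """Return (threshold, macro-F1) with the highest dev-set macro-F1."""
    scored = [(macro_f1(y_true, [int(p >= t) for p in probs]), t) for t in grid]
    best_f1, best_t = max(scored, key=lambda s: s[0])  # first maximum wins ties
    return best_t, best_f1


def tune_language(y_true, p_small, p_large, weights, grid):
    """Pick the ensemble weight and threshold that maximize dev macro-F1."""
    best = None
    for w in weights:
        t, f1 = best_threshold(y_true, ensemble(p_small, p_large, w), grid)
        if best is None or f1 > best["f1"]:
            best = {"weight_large": w, "threshold": t, "f1": f1}
    return best


# Toy dev set for one language (made-up labels and probabilities).
y_true = [1, 1, 1, 0, 0, 0]
p_small = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]   # stands in for the 12 B model
p_large = [0.8, 0.7, 0.45, 0.3, 0.5, 0.2]   # stands in for the 27 B model
weights = [0.0, 0.25, 0.5, 0.75, 1.0]       # 0.0 and 1.0 are the single models
grid = [i / 20 for i in range(1, 20)]       # thresholds 0.05 ... 0.95

config = tune_language(y_true, p_small, p_large, weights, grid)
print(config)
```

Because the candidate weights include 0.0 and 1.0, the selected configuration can fall back to a single model for a language where ensembling does not help on the dev set. The stored `threshold` and `weight_large` would then be reused unchanged at test time.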
Results & Findings
| Metric | Value |
|---|---|
| Overall macro‑F1 | 0.811 |
| Best per‑language macro‑F1 | 0.872 (language X) |
| Languages ranked first | 3 (languages A, B, C) |
| Overall rank (SemEval) | 2nd of 27 teams |
- Threshold tuning added +2–4 % absolute F1 across languages.
- Synthetic data contributed roughly +5 % F1 over a baseline trained only on the original dataset.
- Ensemble vs. single model: the weighted combo outperformed the best single Gemma model by ~1.8 % macro‑F1.
- Alternative architectures (XLM‑RoBERTa, Qwen‑3) showed 30–50 % F1 drops on the blind test set, highlighting over‑fitting to the development data.
Practical Implications
- Low‑cost multilingual adaptation – LoRA lets teams fine‑tune 27 B‑scale models on a single GPU, making high‑quality multilingual classifiers accessible to startups and research labs without massive compute budgets.
- Synthetic data as a universal booster – The three‑pronged augmentation strategy can be repurposed for any binary (or even multi‑class) text classification task, especially when labeled data is scarce in low‑resource languages.
- Per‑language thresholding – Simple post‑hoc calibration can squeeze out measurable gains without any extra training, a trick easily integrated into production pipelines.
- Robustness over “big‑model” hype – The stark performance gap between development and test sets for XLM‑RoBERTa/Qwen‑3 warns practitioners to validate on held‑out or out‑of‑distribution data rather than relying solely on development‑set scores.
- Ensemble flexibility – Weighted ensembles that switch per language can be deployed as a single API endpoint that internally selects the best model, delivering consistent quality across a multilingual user base.
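For teams reusing the augmentation pipeline on another task, the embedding‑based deduplication step is the easiest piece to port. The sketch below is an assumption‑laden stand‑in: it uses a brute‑force cosine comparison instead of the FAISS index mentioned in the methodology, the 0.95 similarity cutoff is a made‑up value, and the tiny hand‑written vectors replace real sentence embeddings (the summary does not say which embedding model was used).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_near_duplicates(texts, embeddings, max_similarity=0.95):
    """Greedily keep a text only if it is not too similar to any kept text.

    Runs in O(n^2) comparisons; an approximate nearest-neighbor index
    would replace the inner loop for large synthetic batches.
    """
    kept = []  # indices of retained examples, in original order
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < max_similarity for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]


# Placeholder texts and embeddings (hypothetical, for illustration only).
texts = ["example A", "example A, lightly reworded", "example B"]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # almost parallel to the first vector
    [0.0, 1.0, 0.0],
]

unique_texts = drop_near_duplicates(texts, embeddings)
print(unique_texts)
```

The greedy order matters: earlier examples win, so placing original (human‑labeled) data before synthetic data would ensure that only synthetic near‑copies are dropped.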
Limitations & Future Work
- Synthetic data quality dependence – The pipeline relies heavily on GPT‑4o‑mini; any biases or hallucinations in the LLM can propagate into the training set.
- Scalability to >22 languages – While LoRA reduces compute, maintaining separate adapters per language may become cumbersome as the language count grows.
- Threshold tuning overhead – Requires a dev set for each language; in truly zero‑resource scenarios this step may be infeasible.
- Model size constraints – Even with LoRA, inference with 27 B parameters can be latency‑heavy for real‑time applications; exploring quantization or distillation could mitigate this.
Future research directions include automating per‑language adapter selection, investigating multilingual LoRA that shares parameters across related languages, and extending the synthetic augmentation framework to multi‑label polarization or stance detection tasks.
Authors
- Srikar Kashyap Pulipaka
Paper Information
- arXiv ID: 2605.05159v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: May 6, 2026