[Paper] PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation

Published: May 6, 2026 at 01:29 PM EDT

Source: arXiv - 2605.05159v1

Overview

The paper describes a top‑ranking system for SemEval‑2026 Task 9, which challenges participants to detect political polarization in short texts across 22 languages. By fine‑tuning large multilingual Gemma models with low‑rank adapters and enriching the training data with carefully crafted synthetic examples, the authors achieve a macro‑F1 of 0.811, placing second overall and first in three languages.

Key Contributions

Per‑language fine‑tuning of two Gemma 3 models (12 B and 27 B parameters) using LoRA, allowing efficient adaptation without full model retraining.
Synthetic data pipeline that creates three types of augmentations (direct generation, paraphrase, contrastive pairs) via GPT‑4o‑mini, followed by multi‑stage quality filtering and embedding‑based deduplication.
Per‑language decision‑threshold tuning on the development set, delivering a consistent 2–4 % absolute F1 gain without extra training.
Weighted ensemble strategy that combines predictions from the 12 B and 27 B models, with language‑specific selection of the best‑performing configuration.
Empirical insight that strong development‑set performers (e.g., XLM‑RoBERTa, Qwen‑3) can dramatically under‑perform on the blind test set, underscoring the need for robust generalization techniques.
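
To make the LoRA contribution concrete, here is a minimal pure‑Python sketch of the low‑rank update itself, not the authors' training code. It assumes the standard LoRA formulation, in which a frozen weight matrix W is augmented by a scaled product of two small trainable matrices B and A. The toy dimensions, the rank r, the scaling factor alpha, and all function names are hypothetical and chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank."""
    r = len(a)            # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(b, a)  # low-rank update with shape (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Number of parameters in the two low-rank factors A and B."""
    return r * (d_in + d_out)


# Hypothetical toy sizes; real transformer layers are far larger.
d_out, d_in, r, alpha = 6, 8, 2, 4
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, random init
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero init

# With B initialised to zero, the adapted layer starts out identical to the base layer.
w_eff = lora_effective_weight(w, a, b, alpha)
```

At these toy sizes the adapter trains 28 parameters instead of the 48 in the full matrix; the saving grows as the matrix dimensions become much larger than the rank.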

Methodology

Base Models – The authors start from the open‑source Gemma 3 family (12 B and 27 B parameters), which already support 100+ languages.
LoRA Adaptation – Instead of full fine‑tuning, they inject low‑rank matrices into each transformer layer, drastically reducing GPU memory and training time while preserving the bulk of the pretrained knowledge.
Synthetic Data Generation
- Direct Generation: Prompt GPT‑4o‑mini to write new polarized / non‑polarized sentences in the target language.
- Paraphrasing: Feed existing labeled sentences to the LLM and request paraphrases that keep the original label.
- Contrastive Pairs: Ask the LLM to produce a minimally altered version that flips the label, creating hard negative examples.
Quality Filtering – Each synthetic batch passes through:
- Heuristic checks (language detection, profanity, length).
- LLM‑based validation (prompt the LLM to re‑classify the sentence).
- Embedding deduplication (FAISS index to drop near‑duplicates).
Training – LoRA adapters are trained on the union of original and filtered synthetic data for each language separately.
Inference Tweaks – After training, the authors sweep decision thresholds on the dev set per language, storing the optimal value for test‑time scoring.
Ensembling – Predictions from the 12 B and 27 B adapters are combined using a weighted average; weights are chosen per language based on dev‑set performance.
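
The embedding deduplication step in the quality filter can be sketched as a greedy near‑duplicate filter. The summary mentions a FAISS index; the sketch below deliberately swaps in a brute‑force cosine‑similarity comparison so that it runs with the standard library alone. The 0.95 similarity cutoff, the toy three‑dimensional embeddings, and the function names are assumptions made for illustration, not values reported by the authors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def deduplicate(embeddings, threshold=0.95):
    """Greedily keep the indices of embeddings that are not near-duplicates
    (similarity >= threshold) of any embedding already kept."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical sentence embeddings for five synthetic examples.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly identical to example 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # nearly identical to example 2
    [0.5, 0.5, 0.7],
]

kept = deduplicate(embeddings)
print(kept)  # [0, 2, 4]
```

The brute‑force loop compares every candidate against every kept item, which is quadratic in the worst case; an approximate nearest‑neighbour index is the usual replacement once the synthetic pool grows large.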

Results & Findings

Metric	Overall	Best language	Languages ranked first
Macro‑F1	0.811	0.872 (language X)	3 (languages A, B, C)
Rank (SemEval)	2nd of 27 teams	—	—

Threshold tuning added 2–4 % absolute F1 across languages.
Synthetic data contributed roughly 5 % F1 over a baseline trained only on the original dataset.
Ensemble vs. single model: the weighted combo outperformed the best single Gemma model by ~1.8 % macro‑F1.
Alternative architectures (XLM‑RoBERTa, Qwen‑3) showed 30–50 % F1 drops on the blind test set, highlighting over‑fitting to the development data.
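
The threshold‑tuning gain reported above comes from a simple search, which might look like the following sketch. It assumes binary labels, model outputs that are positive‑class probabilities, and a grid of candidate thresholds from 0.05 to 0.95; the grid, the toy dev‑set data, and the language keys are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for class 0 and class 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_threshold(y_true, probs, grid=None):
    """Return (threshold, macro-F1) for the grid value that maximises
    macro-F1 on the dev set; ties keep the lowest threshold."""
    if grid is None:
        grid = [i / 100 for i in range(5, 96)]
    best_t, best_score = 0.5, -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = macro_f1(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical dev-set labels and positive-class probabilities per language.
dev = {
    "lang_a": ([0, 0, 1, 1, 1], [0.10, 0.35, 0.40, 0.45, 0.90]),
    "lang_b": ([0, 0, 0, 1, 1], [0.20, 0.55, 0.60, 0.70, 0.95]),
}

# One stored threshold per language, reused unchanged at test time.
thresholds = {lang: best_threshold(y, p)[0] for lang, (y, p) in dev.items()}
```

In this toy data the default cutoff of 0.5 misclassifies examples in both languages, while the tuned thresholds separate them cleanly. Because the thresholds are fitted on the dev set, they can overfit when that set is small.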

Practical Implications

Low‑cost multilingual adaptation – LoRA lets teams fine‑tune 27 B‑scale models on a single GPU, making high‑quality multilingual classifiers accessible to startups and research labs without massive compute budgets.
Synthetic data as a universal booster – The three‑pronged augmentation strategy can be repurposed for any binary (or even multi‑class) text classification task, especially when labeled data is scarce in low‑resource languages.
Per‑language thresholding – Simple post‑hoc calibration can squeeze out measurable gains without any extra training, a trick easily integrated into production pipelines.
Robustness over “big‑model” hype – The stark performance gap between development and test sets for XLM‑RoBERTa/Qwen‑3 warns practitioners to validate on held‑out or out‑of‑distribution data rather than relying solely on development‑set scores.
Ensemble flexibility – Weighted ensembles that switch per language can be deployed as a single API endpoint that internally selects the best model, delivering consistent quality across a multilingual user base.
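
A per‑language ensemble behind a single entry point could be organised roughly as below. This is a sketch under stated assumptions: each model is taken to emit positive‑class probabilities, the ensemble is a convex combination of the two, and a weight of 1.0 stands for "use the 27 B model alone". The `CONFIG` table, its weights and thresholds, and the function names are all hypothetical; the summary does not report the actual per‑language values.

```python
def ensemble_probs(probs_12b, probs_27b, weight_27b):
    """Weighted average of two models' positive-class probabilities."""
    if len(probs_12b) != len(probs_27b):
        raise ValueError("both models must score the same examples")
    if not 0.0 <= weight_27b <= 1.0:
        raise ValueError("weight_27b must lie in [0, 1]")
    return [
        (1.0 - weight_27b) * p12 + weight_27b * p27
        for p12, p27 in zip(probs_12b, probs_27b)
    ]


# Hypothetical per-language settings that would be selected on the dev set.
CONFIG = {
    "lang_a": {"weight_27b": 0.5, "threshold": 0.40},
    "lang_b": {"weight_27b": 1.0, "threshold": 0.60},  # 27 B model only
}


def classify(lang, probs_12b, probs_27b):
    """Blend the two models with the language's weight, then apply its threshold."""
    cfg = CONFIG[lang]
    blended = ensemble_probs(probs_12b, probs_27b, cfg["weight_27b"])
    return [1 if p >= cfg["threshold"] else 0 for p in blended]


labels_a = classify("lang_a", [0.2, 0.6], [0.4, 0.8])
labels_b = classify("lang_b", [0.9, 0.1], [0.3, 0.7])
print(labels_a, labels_b)  # [0, 1] [0, 1]
```

Keeping the weights and thresholds in one lookup table means that adding a language only requires a new entry rather than a new code path.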

Limitations & Future Work

Synthetic data quality dependence – The pipeline relies heavily on GPT‑4o‑mini; any biases or hallucinations in the LLM can propagate into the training set.
Scalability to >22 languages – While LoRA reduces compute, maintaining separate adapters per language may become cumbersome as the language count grows.
Threshold tuning overhead – Requires a dev set for each language; in truly zero‑resource scenarios this step may be infeasible.
Model size constraints – Even with LoRA, inference with 27 B parameters can be latency‑heavy for real‑time applications; exploring quantization or distillation could mitigate this.

Future research directions include automating per‑language adapter selection, investigating multilingual LoRA that shares parameters across related languages, and extending the synthetic augmentation framework to multi‑label polarization or stance detection tasks.

Authors

Srikar Kashyap Pulipaka

Paper Information

arXiv ID: 2605.05159v1
Categories: cs.CL, cs.AI, cs.LG
Published: May 6, 2026
