[Paper] Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures
Source: arXiv - 2603.17952v1
Overview
The paper investigates how modern decoder‑only large language models (LLMs) handle gender when translating between languages that mark gender differently (e.g., English → French). While these models achieve top‑tier translation quality, they still inherit systematic gender biases. The authors propose a new diagnostic metric, Prior Bias, to expose a model's default gender assumptions, and evaluate whether post‑training via instruction tuning can mitigate those biases.
Key Contributions
- Prior Bias metric: a quantitative measure of a model’s “default” gender choice before any contextual clues are considered.
- Extension to decoder‑only MT: adapts a previously encoder‑decoder‑focused bias framework to decoder‑only models (e.g., GPT‑3.5, LLaMA) that generate translations directly from the source text.
- Comprehensive diagnostic suite: combines Prior Bias with existing gender‑specific evaluation sets (e.g., WinoMT, BUG) to capture both overt and subtle bias patterns.
- Empirical comparison: shows that raw decoder‑only models are not inherently better than encoder‑decoder systems on gender‑sensitive metrics.
- Impact of post‑training: demonstrates that instruction tuning (or other fine‑tuning regimes) reduces masculine Prior Bias and improves contextual gender awareness.
Methodology
- Data Construction – The authors curate a set of bilingual sentence pairs where the source language (English) contains ambiguous gender cues (e.g., “The doctor said …”) and the target language (French, Spanish, etc.) requires an explicit gendered noun or verb form.
- Prior Bias Estimation – For each ambiguous source sentence, they generate translations without any gender‑specific context (e.g., by stripping pronouns or using neutral prompts). The proportion of masculine vs. feminine forms in these outputs defines the Prior Bias.
- Model Families – Experiments cover:
  - Decoder‑only LLMs (GPT‑Neo, LLaMA, GPT‑3.5) in zero‑shot mode.
  - The same models after instruction tuning on translation‑oriented datasets.
  - Classic encoder‑decoder MT systems (Marian, mBART) as baselines.
- Evaluation Metrics – Besides Prior Bias, they report:
  - Accuracy on gender‑specific test sets (how often the correct gender is chosen).
  - BLEU/chrF for overall translation quality (to ensure bias fixes don't degrade fluency).
  - Error analysis by categorizing failures (e.g., pronouns vs. occupational nouns).
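The Prior Bias estimate described above can be sketched as follows. This is a minimal illustration under assumptions: the surface-form lookup table, the classifier, and the sample sentences are invented for this sketch; the paper does not prescribe this implementation, and a real system would need proper morphological analysis of the target language.

```python
from collections import Counter

# Illustrative lookup of gendered French surface forms (assumption:
# not the paper's actual classifier).
GENDERED_FORMS = {
    "le médecin": "masculine",
    "la médecin": "feminine",
    "un avocat": "masculine",
    "une avocate": "feminine",
}

def classify_gender(translation: str) -> str:
    """Label a French output masculine/feminine by surface form."""
    text = translation.lower()
    # Check longer patterns first so a prefix cannot shadow a longer match.
    for form, gender in sorted(GENDERED_FORMS.items(), key=lambda kv: -len(kv[0])):
        if form in text:
            return gender
    return "other"

def prior_bias(translations: list[str]) -> float:
    """Masculine fraction among gendered outputs generated for a
    gender-ambiguous source sentence: the model's default choice."""
    counts = Counter(classify_gender(t) for t in translations)
    gendered = counts["masculine"] + counts["feminine"]
    return counts["masculine"] / gendered if gendered else 0.0

# Four sampled translations of the ambiguous "The doctor said ..."
samples = [
    "Le médecin a dit ...",
    "Le médecin a dit ...",
    "La médecin a dit ...",
    "Le médecin a dit ...",
]
print(prior_bias(samples))  # 0.75
```

Sampling several translations per ambiguous sentence (rather than one) is what turns the default choice into a measurable proportion.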
Results & Findings
- Baseline Decoder‑Only Models: Exhibit a strong masculine Prior Bias (≈ 70‑80 % masculine forms) and only modest gains in gender accuracy over encoder‑decoder baselines.
- Instruction‑Tuned Models: Reduce Prior Bias substantially (to ≈ 45‑55 % masculine) and improve gender accuracy by 5‑10 percentage points, while maintaining comparable BLEU scores.
- No Universal Superiority: Even the largest decoder‑only models (e.g., GPT‑3.5) do not consistently outperform strong encoder‑decoder MT systems on gender‑specific metrics.
- Contextual Sensitivity: Post‑training improves the model’s ability to leverage explicit gender cues (pronouns, titles) but still struggles with subtle world‑knowledge cues (e.g., stereotypical occupations).
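The gender-accuracy score underlying these comparisons can be sketched in a few lines; the items below are invented WinoMT-style examples, not the paper's test set. Each item records the gender demanded by the source-side cue and the gender the model actually realized.

```python
# Hedged sketch (assumed data format, not the paper's): each item pairs
# the gender required by an explicit source cue with the gender the
# translation actually realized.
def gender_accuracy(items: list[tuple[str, str]]) -> float:
    """Fraction of translations whose realized gender matches the cue."""
    return sum(expected == produced for expected, produced in items) / len(items)

# (gender required by the cue, gender realized in the translation)
items = [
    ("feminine", "feminine"),   # "The doctor said she ..." -> "La médecin ..."
    ("feminine", "masculine"),  # cue ignored: the masculine default won out
    ("masculine", "masculine"),
    ("masculine", "masculine"),
]
print(gender_accuracy(items))  # 0.75
```

Scoring cue-bearing sentences separately from ambiguous ones is what lets the diagnosis distinguish a default preference (Prior Bias) from a failure to use available context.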
Practical Implications
- Product Teams: If you’re deploying LLM‑based translation (e.g., in chatbots or multilingual documentation tools), you can’t rely on model size alone to solve gender bias; targeted instruction tuning is essential.
- Prompt Engineering: Simple prompts that surface gender cues (e.g., “Translate, preserving the gender of the subject”) can help, but systematic fine‑tuning yields more reliable results.
- Compliance & Ethics: The Prior Bias metric offers a quick audit tool for compliance teams to flag models that default to masculine forms, supporting GDPR‑style fairness assessments.
- Tooling: The diagnostic suite can be integrated into CI pipelines for MT services, automatically surfacing regressions in gender handling after model updates.
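A CI-style gate of the kind suggested above might look like the following; the threshold values are hypothetical choices for illustration, not numbers from the paper.

```python
# Sketch of a CI regression guard (assumed thresholds, not from the
# paper): fail the pipeline if a model update pushes Prior Bias or
# gender accuracy past agreed limits.
MAX_PRIOR_BIAS = 0.60       # tolerated masculine fraction on ambiguous inputs
MIN_GENDER_ACCURACY = 0.85  # required accuracy on cue-bearing test sets

def gender_gate(prior_bias: float, gender_accuracy: float) -> bool:
    """Return True if the candidate model passes both gender checks."""
    return prior_bias <= MAX_PRIOR_BIAS and gender_accuracy >= MIN_GENDER_ACCURACY

print(gender_gate(0.52, 0.91))  # True: within both thresholds
print(gender_gate(0.78, 0.91))  # False: masculine default regressed
```

Running such a gate on every model update makes gender handling a tracked quality dimension rather than a one-off audit.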
Limitations & Future Work
- Language Scope: Experiments focus on a handful of gender‑marking target languages; extending to low‑resource languages, or to languages with neutral or non‑binary gender options, remains open.
- Metric Granularity: Prior Bias captures only the default tendency; it does not reflect how models handle intersecting biases (e.g., gender + race).
- Instruction Tuning Data: The study uses publicly available translation instruction data; custom domain‑specific instruction sets might yield different bias dynamics.
- Human Evaluation: While automatic metrics are informative, deeper human judgments on perceived fairness and naturalness are needed for production‑grade validation.
Bottom line: Decoder‑only LLMs are powerful, but without careful post‑training they inherit the same gender bias patterns as traditional MT systems. The new Prior Bias metric and the authors’ diagnostic framework give developers a practical way to measure and mitigate those biases before shipping multilingual products.
Authors
- Chiara Manna
- Hosein Mohebbi
- Afra Alishahi
- Frédéric Blain
- Eva Vanmassenhove
Paper Information
- arXiv ID: 2603.17952v1
- Categories: cs.CL
- Published: March 18, 2026