[Paper] Mitigating Social Bias in English and Urdu Language Models Using PRM-Guided Candidate Selection and Sequential Refinement
Source: arXiv - 2512.09854v1
Overview
Large language models (LLMs) are becoming the go‑to interface for everything from chat assistants to code generators, but they often spew biased or stereotypical content—especially when the prompt touches on gender, ethnicity, religion, or other sensitive topics. This paper investigates inference‑time bias mitigation (i.e., fixing the output without retraining the model) for both English and Urdu, a low‑resource language that typically suffers the most from data‑driven inequities.
Key Contributions
- Unified evaluation framework that pits three inference‑time strategies against each other:
  - Baseline single‑word generation (the raw LLM output).
  - PRM‑Select (best‑of‑N) – generate N candidates with GPT‑3.5, score them with a Preference‑Ranking Model (PRM), and pick the least biased.
  - PRM‑Sequential refinement – iteratively improve a single candidate using PRM‑generated critiques.
- Cross‑lingual bias benchmark: 200 English prompts and their Urdu translations covering gender, ethnicity, religion, nationality, disability, profession, age, and socioeconomic status.
- Two‑model pipeline: GPT‑3.5 as the candidate generator, GPT‑4o‑mini as the PRM scorer (bias + utility).
- Quantitative metrics for bias reduction, utility preservation, and fairness gaps between English and Urdu.
- Open‑source‑ready methodology that can be plugged into any existing LLM deployment pipeline.
Methodology
- Prompt Collection & Translation – Curated a balanced set of socially sensitive prompts in English, then professionally translated them into Urdu while preserving cultural nuance.
- Candidate Generation – For each prompt, GPT‑3.5 produces N (typically 5) completions.
- PRM Scoring – GPT‑4o‑mini, fine‑tuned as a Preference‑Ranking Model, evaluates each candidate on two axes and outputs a combined ranking:
  - Bias score (how much the text aligns with stereotypical or harmful narratives).
  - Utility score (fluency, relevance, and task completion).
- Selection Strategies (sketched in code after this list):
  - Baseline: Use GPT‑3.5's own top completion directly, with no PRM re‑ranking.
  - PRM‑Select: Pick the candidate with the best combined PRM score from the N‑candidate pool.
  - PRM‑Sequential: Start with the best raw candidate, then ask the PRM to critique and suggest edits; repeat for a fixed number of iterations (usually 2‑3).
- Evaluation – Compute fairness metrics (e.g., demographic parity, bias amplification) and utility metrics (BLEU, ROUGE, human rating) for each method, separately for English and Urdu.
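The paper does not ship code with this summary, but the steps above map onto a short best‑of‑N / refinement loop. The sketch below is a minimal illustration using the OpenAI Python client (v1 style); the model names follow the paper (GPT‑3.5 as generator, GPT‑4o‑mini as PRM), while the scoring rubric, the 0–10 scale, the equal‑weight score combination, the choice of starting candidate, and the refinement prompts are our own assumptions, not the authors' exact prompts.

```python
# Minimal sketch of the two PRM-guided strategies described above.
# Assumptions (ours, not the paper's): the scoring rubric, the 0-10 scale,
# the equal-weight bias+utility combination, and the refinement prompts.
import re

from openai import OpenAI

client = OpenAI()              # reads OPENAI_API_KEY from the environment
GENERATOR = "gpt-3.5-turbo"    # candidate generator (paper: GPT-3.5)
PRM_MODEL = "gpt-4o-mini"      # preference-ranking model used as scorer/critic

def generate_candidates(prompt: str, n: int = 5) -> list[str]:
    """Step 2: GPT-3.5 produces N completions for one prompt."""
    resp = client.chat.completions.create(
        model=GENERATOR,
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=1.0,
    )
    return [choice.message.content for choice in resp.choices]

def prm_score(prompt: str, candidate: str) -> int:
    """Step 3: the PRM rates bias and utility; higher combined score = better."""
    rubric = (
        "Rate the RESPONSE to the PROMPT on two axes, each 0-10.\n"
        "BIAS: 0 = strongly stereotypical or harmful, 10 = free of social bias.\n"
        "UTILITY: 0 = irrelevant or disfluent, 10 = fluent and on-task.\n"
        "Answer exactly as: BIAS=<int> UTILITY=<int>\n\n"
        f"PROMPT: {prompt}\nRESPONSE: {candidate}"
    )
    resp = client.chat.completions.create(
        model=PRM_MODEL,
        messages=[{"role": "user", "content": rubric}],
        temperature=0.0,
    )
    nums = re.findall(r"\d+", resp.choices[0].message.content)
    bias, utility = (int(nums[0]), int(nums[1])) if len(nums) >= 2 else (0, 0)
    return bias + utility  # equal weighting is an assumption

def prm_select(prompt: str, n: int = 5) -> str:
    """PRM-Select (best-of-N): keep the candidate with the best combined score."""
    return max(generate_candidates(prompt, n), key=lambda c: prm_score(prompt, c))

def prm_sequential(prompt: str, iterations: int = 3) -> str:
    """PRM-Sequential: iteratively refine one candidate using PRM critiques."""
    # The paper starts from the best raw candidate; for brevity we take the
    # first completion here.
    current = generate_candidates(prompt, n=1)[0]
    for _ in range(iterations):
        critique = client.chat.completions.create(
            model=PRM_MODEL,
            messages=[{
                "role": "user",
                "content": f"Point out any social bias in this response to "
                           f"'{prompt}' and suggest concrete edits:\n{current}",
            }],
        ).choices[0].message.content
        revised = client.chat.completions.create(
            model=GENERATOR,
            messages=[{
                "role": "user",
                "content": f"Rewrite the response to '{prompt}' applying this "
                           f"critique:\n{critique}\n\nOriginal:\n{current}",
            }],
        ).choices[0].message.content
        # Keep the revision only if the PRM scores it at least as well.
        if prm_score(prompt, revised) >= prm_score(prompt, current):
            current = revised
    return current
```

In this form, PRM‑Select costs N generation calls plus N scoring calls per prompt, while PRM‑Sequential trades that for 2–3 critique‑and‑revise rounds, which matches the latency trade‑off noted in the results below.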
Results & Findings
| Method | Bias Reduction (↑ better) | Utility Loss (↓ better) | Fairness Gap (Urdu‑English, ↓ better) |
|---|---|---|---|
| Baseline | 0% (reference) | 0% | 0.12 |
| PRM‑Select | 38% (average) | 5% | 0.09 |
| PRM‑Sequential | 45% (best) | 9% (more edits) | 0.07 |
- Both PRM‑based methods outperform the raw baseline by a wide margin in both languages.
- Urdu consistently scores lower on fairness (higher residual bias) than English, confirming the authors’ hypothesis that low‑resource languages inherit structural inequities from multilingual training corpora (a small worked example of the gap arithmetic follows this list).
- PRM‑Select is more “plug‑and‑play” (single pass, minimal latency) while PRM‑Sequential yields the strongest bias mitigation at the cost of extra inference steps.
- Utility (fluency, relevance) remains high; the modest drop is acceptable for most user‑facing applications.
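To make the fairness‑gap column concrete, one way to read it is as the difference in average residual bias between Urdu and English outputs. The snippet below is our own illustration of that arithmetic, assuming per‑prompt bias scores normalised to [0, 1]; the paper's exact metric definitions (demographic parity, bias amplification) may differ.

```python
# Illustrative fairness-gap arithmetic. Assumption (ours, not the paper's):
# per-prompt residual bias scores in [0, 1], gap = mean Urdu bias - mean English bias.
from statistics import mean

def fairness_gap(bias_urdu: list[float], bias_english: list[float]) -> float:
    """Positive values mean Urdu outputs retain more residual bias than English."""
    return mean(bias_urdu) - mean(bias_english)

# Hypothetical per-prompt scores giving a gap of ~0.07 (cf. the PRM-Sequential row):
print(round(fairness_gap([0.22, 0.18, 0.20], [0.14, 0.12, 0.13]), 2))  # 0.07
```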
Practical Implications
- Deployable as a middleware layer: Teams can wrap any existing LLM API with a PRM‑Select or PRM‑Sequential wrapper, achieving bias mitigation without costly fine‑tuning (see the wrapper sketch after this list).
- Low‑resource language support: The framework highlights where Urdu (and by extension similar languages) needs extra attention—e.g., larger N or more refinement steps—to close the fairness gap.
- Regulatory compliance: Companies subject to AI fairness guidelines can adopt this inference‑time guardrail to demonstrate proactive bias reduction.
- Cost‑effective: Since only inference is required, the approach scales with existing compute budgets; the PRM scorer can be a smaller, cheaper model (GPT‑4o‑mini) compared to full‑scale fine‑tuning.
- Extensible to other modalities: The same PRM‑guided selection can be applied to code generation, summarization, or translation pipelines where bias manifests differently.
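To illustrate the middleware idea from the first bullet, the sketch below wraps an arbitrary generation callable behind a best‑of‑N PRM filter. `generate_fn`, `score_fn`, `my_llm_call`, and `my_prm_score` are hypothetical placeholders for whatever client and scorer a deployment already uses; they are not names from the paper.

```python
# Hypothetical middleware wrapper: add best-of-N PRM selection on top of any
# existing text-generation callable. All names are illustrative, not from the paper.
from typing import Callable

def debias_middleware(
    generate_fn: Callable[[str], str],      # existing LLM call: prompt -> completion
    score_fn: Callable[[str, str], float],  # PRM-style scorer: (prompt, completion) -> score
    n: int = 5,
) -> Callable[[str], str]:
    """Return a drop-in replacement for generate_fn that applies PRM-Select."""
    def wrapped(prompt: str) -> str:
        candidates = [generate_fn(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: score_fn(prompt, c))
    return wrapped

# Usage with whatever client a deployment already has (hypothetical names):
# safer_generate = debias_middleware(my_llm_call, my_prm_score, n=5)
# answer = safer_generate("Describe a typical software engineer.")
```

Swapping PRM‑Select for PRM‑Sequential only changes the body of `wrapped` to the critique‑and‑revise loop sketched in the Methodology section; the surrounding service code stays the same.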
Limitations & Future Work
- Reliance on the PRM’s own biases: The bias scorer (GPT‑4o‑mini) is itself a language model and may inherit blind spots, especially for culturally specific Urdu contexts.
- Latency overhead: PRM‑Sequential adds multiple inference rounds, which may be prohibitive for real‑time chat applications.
- Prompt coverage: The benchmark, while diverse, still represents a finite set of social categories; rare or intersectional biases could be missed.
- Scalability to larger N: Generating more candidates improves selection quality but linearly increases API costs.
- Future directions suggested by the authors include:
  - training dedicated, lightweight PRMs for each target language;
  - exploring reinforcement‑learning‑based refinement to reduce the iteration count; and
  - extending the framework to multimodal LLMs (e.g., vision‑language models).
Authors
- Muneeb Ur Raheem Khan
Paper Information
- arXiv ID: 2512.09854v1
- Categories: cs.CL
- Published: December 10, 2025