[Paper] Uncovering Political Bias in Large Language Models using Parliamentary Voting Records

Published: January 13, 2026 at 01:18 PM EST
4 min read
Source: arXiv - 2601.08785v1

Overview

The paper presents a systematic way to measure political bias in large language models (LLMs) by comparing the models’ “voting” on legislative motions with actual parliamentary voting records from the Netherlands, Norway, and Spain. By grounding the evaluation in real‑world voting data, the authors expose consistent left‑leaning or centrist slants in state‑of‑the‑art LLMs and a noticeable negative bias toward right‑conservative parties.

Key Contributions

  • Benchmark construction pipeline – A reusable method that turns parliamentary motions and party‑level vote tallies into a bias‑testing suite for any LLM (a data‑shape sketch follows this list).
  • Three multilingual benchmarks – PoliBiasNL (Dutch, 2.7k motions), PoliBiasNO (Norwegian, 10.6k motions), and PoliBiasES (Spanish, 2.5k motions), covering 15, 9, and 10 parties respectively.
  • Ideology‑mapping visualisation – A technique that projects both LLMs and political parties onto the two‑dimensional CHES (economic vs. cultural) space, enabling direct visual comparison.
  • Empirical findings – Across all three datasets, leading LLMs (e.g., GPT‑4, Claude, Llama 2) show a systematic left‑centrist tilt and a measurable negative bias against right‑conservative parties.
  • Open‑source resources – The authors release the benchmark data, code for generating model predictions, and the visualisation toolkit, encouraging reproducibility and extension to other countries.
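
The pipeline's core artifact is easy to picture as a data shape. Below is a minimal sketch of one benchmark entry, assuming a simple motion‑plus‑votes schema; the field names and the example entry are illustrative, not taken from the released data.

```python
from dataclasses import dataclass

@dataclass
class MotionRecord:
    """One benchmark entry: a motion plus the recorded party-level votes.

    Field names are illustrative; the released benchmarks may use a
    different schema.
    """
    motion_id: str
    text: str                      # natural-language description of the motion
    country: str                   # "NL", "NO", or "ES"
    party_votes: dict[str, bool]   # party name -> True (yes) / False (no)

# A made-up entry in the spirit of PoliBiasNL:
record = MotionRecord(
    motion_id="nl-2021-00123",
    text="Motion to raise the statutory minimum wage.",
    country="NL",
    party_votes={"Party A": True, "Party B": False, "Party C": True},
)
```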

Methodology

  1. Data collection – The authors scraped official parliamentary archives to obtain every motion (bill, amendment, or resolution) and the corresponding yes/no vote of each party.
  2. Prompt design – For each motion, a concise natural‑language description is fed to the LLM together with a question along the lines of “Should a party support this motion? Answer yes or no.” The model’s answer is treated as a simulated vote (see the sketch after this list).
  3. Aggregation – Model votes are aggregated per party, yielding a synthetic voting record that can be directly compared to the real record.
  4. Bias metrics – two complementary scores (sketched in NumPy after the pipeline note below):
    • Ideological distance: Euclidean distance between a model’s party‑level vote vector and the CHES coordinates of that party.
    • Party bias score: Average difference between model‑predicted support for a given party and the party’s actual support across motions.
  5. Visualization – Both parties and models are plotted in the CHES space (economic left‑right, cultural progressive‑conservative), making bias patterns instantly readable.
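
Steps 2 and 3 reduce to a short loop over motions. A minimal sketch, assuming a generic `ask_llm(prompt) -> str` wrapper around the model under test and per‑party prompting; both the wrapper and the exact prompt wording are assumptions, not the authors’ released code.

```python
from collections import defaultdict

def simulated_votes(motions, parties, ask_llm):
    """Collect one simulated yes/no vote per (party, motion) pair.

    `ask_llm` is an assumed callable that sends a prompt to the LLM
    under test and returns its raw text answer.
    """
    votes = defaultdict(dict)  # party -> {motion_id: True/False}
    for m in motions:
        for party in parties:
            prompt = (
                f"Motion: {m.text}\n"
                f"Should the party {party} support this motion? "
                "Answer yes or no."
            )
            answer = ask_llm(prompt).strip().lower()
            votes[party][m.motion_id] = answer.startswith("yes")
    return votes

def support_rate(party_votes):
    """Aggregate one party's simulated votes into a single support rate."""
    values = list(party_votes.values())
    return sum(values) / len(values)
```

Parsing the answer with `startswith("yes")` is deliberately crude; a real harness would constrain decoding or reject malformed answers.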

The pipeline is deliberately model‑agnostic: any LLM that can answer yes/no questions can be slotted in, and the same code works for any parliamentary dataset that follows the same schema.
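
The two bias metrics then reduce to a few lines of NumPy. A minimal sketch, assuming votes are stored as `{motion_id: bool}` dictionaries and that both models and parties carry 2‑D CHES coordinates; the function and argument names are illustrative.

```python
import numpy as np

def party_bias_score(model_votes, real_votes):
    """Mean (model support - actual support) across motions for one party.

    Positive values mean the model over-predicts the party's support;
    negative values mean it under-predicts.
    """
    shared = sorted(set(model_votes) & set(real_votes))
    m = np.array([model_votes[i] for i in shared], dtype=float)
    r = np.array([real_votes[i] for i in shared], dtype=float)
    return float(np.mean(m - r))

def ideological_distance(model_point, party_point):
    """Euclidean distance in the 2-D CHES (economic, cultural) plane."""
    return float(np.linalg.norm(np.asarray(model_point) - np.asarray(party_point)))
```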

Results & Findings

| Model (example) | Overall ideological tilt | Bias toward right‑conservative parties |
| --- | --- | --- |
| GPT‑4 | Center‑left (≈ 0.3 on economic axis) | Consistently predicts lower support for right‑wing parties (‑0.12 average bias) |
| Claude 2.1 | Slightly left (≈ 0.2) | Negative bias of similar magnitude |
| Llama 2‑13B | Center (≈ 0.0) | Small but statistically significant negative bias |
  • Fine‑grained distinctions: Even within the “left‑leaning” cluster, models differ on cultural issues (e.g., immigration, civil liberties), mirroring the two‑dimensional CHES layout.
  • Cross‑national consistency: The left‑centrist tilt appears in all three countries despite differences in party systems and issue salience, suggesting a systematic artifact of training data or model architecture rather than a locale‑specific effect.
  • Statistical robustness: Bias scores remain significant under bootstrapping (the 95% CI excludes zero) and persist after controlling for motion length, topic, and voting turnout (a bootstrap sketch follows this list).
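
The significance test in the last bullet can be approximated with a standard nonparametric bootstrap over motions; a minimal sketch, with the resampling scheme assumed rather than taken from the paper:

```python
import numpy as np

def bootstrap_ci(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-motion bias difference.

    `diffs` holds (model support - actual support) per motion; if the
    resulting 95% interval excludes zero, the bias is called significant.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    means = np.array([
        rng.choice(diffs, size=diffs.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```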

Practical Implications

  • Product risk assessment – Companies embedding LLMs in recommendation engines, chatbots, or policy‑analysis tools can now run a quick “political bias audit” using the released benchmarks, flagging potential misalignments with user expectations or regulatory standards (see the audit sketch after this list).
  • Content moderation – Understanding a model’s predisposition toward certain ideological frames helps design guardrails that prevent inadvertent political persuasion or skewed fact‑checking.
  • Fine‑tuning & alignment – The voting‑based feedback loop offers a concrete, quantifiable target for reinforcement‑learning‑from‑human‑feedback (RLHF) pipelines: penalize predictions that deviate from a neutral party‑vote distribution.
  • Cross‑border deployments – Since the methodology works for any parliamentary dataset, multinational firms can evaluate bias in the local political context before launching LLM‑powered services in new markets.
  • Transparency for regulators – The visual CHES mapping provides an interpretable artifact that can be shared with auditors or policymakers to demonstrate compliance with fairness guidelines.
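
A quick audit of the kind suggested in the first bullet can reuse the sketches above: run the simulated votes, score each party, and flag large deviations. The 0.05 threshold and report format below are arbitrary illustrative choices, not from the paper.

```python
def audit_model(motions, real_records, parties, ask_llm, threshold=0.05):
    """Flag parties whose simulated votes deviate from the real record.

    `real_records` maps party -> {motion_id: bool}. Reuses the
    `simulated_votes` and `party_bias_score` sketches from above.
    """
    votes = simulated_votes(motions, parties, ask_llm)
    flagged = {}
    for party in parties:
        score = party_bias_score(votes[party], real_records[party])
        if abs(score) > threshold:
            flagged[party] = round(score, 3)
    return flagged  # non-empty result -> model fails the neutrality check
```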

Limitations & Future Work

  • Prompt sensitivity – The binary “yes/no” framing may oversimplify nuanced legislative language; alternative prompt styles could yield different bias profiles.
  • Coverage bias – Benchmarks rely on motions that are publicly documented and translated; less‑documented or highly localized issues might be under‑represented.
  • Static snapshot – The study evaluates models at a single point in time; continual re‑evaluation is needed as models are updated or retrained.
  • Cultural dimensions – CHES captures only two axes; other political spectra (e.g., environmentalism, populism) are not directly modeled.
  • Future directions – Extending the pipeline to non‑parliamentary political signals (e.g., party manifestos, social media discourse), exploring multi‑choice or graded voting scales, and integrating bias mitigation directly into the training loop.

Authors

  • Jieying Chen
  • Karen de Jong
  • Andreas Poole
  • Jan Burakowski
  • Elena Elderson Nosti
  • Joep Windt
  • Chendi Wang

Paper Information

  • arXiv ID: 2601.08785v1
  • Categories: cs.AI
  • Published: January 13, 2026