[Paper] Benchmarking Political Persuasion Risks Across Frontier Large Language Models

Published: March 10, 2026 at 12:42 PM EDT
4 min read
Source: arXiv - 2603.09884v1

Overview

A new study puts the most advanced large language models (LLMs) under the microscope to see how well they can sway political opinions. Running two large online surveys with roughly 19,000 participants, the authors compare seven frontier models, from Anthropic’s Claude to xAI’s Grok, against traditional political ads, and find that every tested system is more persuasive than the campaign messaging it is measured against.

Key Contributions

  • Large‑scale persuasion benchmark: First systematic evaluation of seven state‑of‑the‑art LLMs on political persuasion across bipartisan issues and opposing stances.
  • Empirical finding that LLMs beat conventional ads: All tested models outperform standard campaign advertisements, with notable variation across models (Claude strongest, Grok weakest).
  • Model‑specific prompt effects: Information‑rich prompts boost persuasiveness for Claude and Grok but hurt it for GPT‑style models, contradicting earlier work that treated prompts as universally beneficial.
  • LLM‑assisted conversation analysis toolkit: Introduces a data‑driven, strategy‑agnostic method to automatically surface the rhetorical tactics each model employs.
  • Cross‑model risk assessment framework: Provides a reusable benchmark for future studies to compare persuasive risks as new models are released.

Methodology

  1. Survey design – Two online experiments were fielded on a popular survey panel platform. Participants were randomly assigned to read a short persuasive text (generated by an LLM or drawn from a real‑world political ad) and then answered a stance‑change question.
  2. Model suite – Seven frontier LLMs were queried via their public APIs: Anthropic Claude 1/2, OpenAI GPT‑4/3.5, Google Gemini, and xAI Grok.
  3. Prompt variants – For each model, the authors tested a baseline prompt (plain persuasive request) and an information‑rich prompt that supplies factual context before asking the model to argue.
  4. Outcome metric – Persuasiveness was measured as the proportion of respondents who shifted their position toward the stance advocated in the text, controlling for demographics and prior ideology.
  5. Conversation analysis – Generated texts were fed into a lightweight LLM‑assisted pipeline that tags rhetorical devices (e.g., appeal to authority, fear, social proof) without hand‑crafted rules, enabling a comparative look at the strategies each model prefers (a minimal sketch follows this list).
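The paper’s pipeline code isn’t reproduced in this summary, but the tagging step is easy to sketch. Below is a minimal illustration, assuming a hypothetical complete() callable that wraps whatever LLM API serves as the annotator; the tactic labels are illustrative stand‑ins, since the paper derives its categories from the data rather than from a fixed list.

```python
import json

# Illustrative tactic labels only; the paper surfaces its categories
# data-driven rather than starting from a fixed taxonomy like this.
TACTICS = ["appeal_to_authority", "fear_appeal", "social_proof",
           "emotional_framing", "statistical_evidence"]

TAGGING_PROMPT = """You are annotating persuasive political text.
For each tactic in {tactics}, answer 1 if the text uses it, else 0.
Return only a JSON object mapping tactic -> 0 or 1.

Text:
{text}"""

def tag_rhetoric(text: str, complete) -> dict:
    """Tag one generated text with rhetorical devices.

    `complete` is a hypothetical callable (prompt in, completion out)
    wrapping the LLM used as the annotator.
    """
    raw = complete(TAGGING_PROMPT.format(tactics=TACTICS, text=text))
    tags = json.loads(raw)
    # Keep only known labels and coerce to booleans.
    return {t: bool(tags.get(t, 0)) for t in TACTICS}

def tactic_rates(texts: list[str], complete) -> dict:
    """Share of a model's texts that use each tactic."""
    counts = {t: 0 for t in TACTICS}
    for text in texts:
        for t, used in tag_rhetoric(text, complete).items():
            counts[t] += used
    return {t: c / len(texts) for t, c in counts.items()}
```

Comparing tactic_rates() across models is what makes claims like “Claude leans on social proof while Grok favors emotional framing” quantifiable.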

Results & Findings

Model                  | Persuasion lift vs. ad* | Effect of info‑rich prompt
-----------------------|-------------------------|---------------------------
Claude (both versions) | +12 pp (highest)        | +4 pp
GPT‑4                  | +8 pp                   | −5 pp
GPT‑3.5                | +6 pp                   | −3 pp
Gemini                 | +7 pp                   | +1 pp (neutral)
Grok                   | +3 pp (lowest)          | +2 pp

*Lift measured as the increase in stance change relative to the baseline political ad; pp = percentage points. A minimal sketch of how such a lift could be estimated follows.
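The paper reports covariate‑adjusted estimates; the snippet below sketches the two basic ingredients, a raw difference in stance‑change rates and a logistic regression adjusting for demographics and prior ideology. The column names (shifted, condition, age, party, ideology) are hypothetical, and the study’s exact estimator may differ.

```python
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per respondent, with hypothetical columns:
#   shifted   - 1 if the respondent moved toward the advocated stance
#   condition - "ad" (baseline) or a model name such as "claude"
#   age, party, ideology - covariates used for adjustment

def persuasion_lift(df: pd.DataFrame, model: str) -> float:
    """Raw lift in percentage points vs. the ad baseline."""
    p_model = df.loc[df.condition == model, "shifted"].mean()
    p_ad = df.loc[df.condition == "ad", "shifted"].mean()
    return 100 * (p_model - p_ad)

def adjusted_fit(df: pd.DataFrame):
    """Logistic regression adjusting for demographics and ideology."""
    return smf.logit(
        "shifted ~ C(condition, Treatment('ad')) + age + C(party) + ideology",
        data=df,
    ).fit()
```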

  • The advantage held across all issue domains (e.g., climate policy, gun control) and for both pro‑ and anti‑positions.
  • Claude’s texts leaned heavily on social proof and authority cues, while Grok relied more on emotional framing.
  • The divergent prompt effects suggest that “one‑size‑fits‑all” prompting strategies are unreliable for risk mitigation; a prompt that helps one model can backfire on another.

Practical Implications

  • Policy & compliance teams should treat LLM‑generated political content as a higher‑risk vector than traditional ads, especially when using Claude‑style models.
  • Developers of AI‑assisted content platforms need model‑aware safeguards: prompt sanitization, usage‑based throttling, or model‑specific persuasion detectors.
  • Ad‑tech and political consulting firms can leverage the findings to audit third‑party AI services, ensuring they don’t unintentionally amplify persuasive power beyond legal limits (e.g., FEC regulations).
  • Tooling – The paper’s conversation‑analysis pipeline can be integrated into moderation workflows to flag texts that employ high‑impact rhetorical tactics, regardless of the underlying model (a minimal flagging sketch follows this list).
  • Future LLM releases should be benchmarked against this dataset to certify that persuasive capabilities stay within acceptable bounds before public deployment.
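As a concrete example of that integration point, a moderation hook could threshold on the tactic tags produced by a tagger like the one sketched earlier; the weights and cutoff here are hypothetical placeholders that would need calibration against observed persuasion effects.

```python
# Hypothetical per-tactic risk weights; real values would be
# calibrated against measured persuasion lifts.
RISK_WEIGHTS = {
    "appeal_to_authority": 0.6,
    "fear_appeal": 1.0,
    "social_proof": 0.8,
    "emotional_framing": 0.7,
    "statistical_evidence": 0.3,
}

def flag_for_review(tags: dict, threshold: float = 1.5) -> bool:
    """Flag a text whose weighted tactic score crosses the cutoff.

    `tags` is the tactic -> bool mapping from tag_rhetoric() above;
    the threshold is an illustrative placeholder.
    """
    score = sum(RISK_WEIGHTS.get(t, 0.0) for t, used in tags.items() if used)
    return score >= threshold
```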

Limitations & Future Work

  • Sample bias: The surveys relied on a single online panel; results may differ with other demographics or offline populations.
  • Scope of issues: Only a handful of bipartisan topics were tested; niche or highly polarized issues could exhibit different dynamics.
  • Prompt space: While the study examined baseline vs. information‑rich prompts, many other prompt engineering techniques (e.g., chain‑of‑thought, persona‑setting) remain unexplored.
  • Model updates: Frontier models evolve rapidly; the benchmark will need periodic re‑evaluation to stay relevant.

Future research directions include expanding the issue set, testing multi‑turn conversational persuasion, and building open‑source detectors that adapt to the model‑specific rhetorical signatures uncovered in this work.

Authors

  • Zhongren Chen
  • Joshua Kalla
  • Quan Le

Paper Information

  • arXiv ID: 2603.09884v1
  • Categories: cs.CL, cs.CY
  • Published: March 10, 2026
  • PDF: Download PDF