[Paper] AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models

Published: February 18, 2026 at 12:28 PM EST
5 min read
Source: arXiv - 2602.16639v1

Overview

The paper introduces AREG (Adversarial Resource Extraction Game), a new benchmark that pits large language models (LLMs) against each other in a multi‑turn negotiation where one model tries to persuade the other to hand over a fictitious sum of money. By framing persuasion and resistance as opposite sides of a zero‑sum game, the authors can evaluate both offensive (how well a model can influence) and defensive (how well it can resist influence) abilities in a single, realistic dialogue setting.

Key Contributions

  • A unified adversarial benchmark – AREG combines persuasion and resistance evaluation in one interactive, turn‑based game, moving beyond static prompt‑response tests.
  • Round‑robin tournament framework – All frontier LLMs are pitted against each other, yielding a dense matrix of pairwise performance scores.
  • Empirical dissociation of skills – Persuasion and resistance show only a weak correlation (ρ ≈ 0.33), demonstrating that strong performance in one does not guarantee strength in the other.
  • Linguistic insight into successful tactics – Incremental, commitment‑seeking prompts boost extraction success, while verification‑seeking replies dominate effective defenses.
  • Open‑source implementation & data – The authors release the game engine, dialogue logs, and evaluation scripts for reproducibility and community extension.

Methodology

  1. Game design – Two agents (LLMs) engage in a bounded conversation. The attacker tries to convince the defender to “transfer” a specified amount of virtual money. The defender’s goal is to refuse or stall without violating policy constraints.
  2. Turn‑based interaction – Each dialogue consists of up to 10 turns (alternating speaker). The attacker may ask questions, offer incentives, or apply pressure; the defender can ask for clarification, request proof, or outright refuse.
  3. Scoring
    • Persuasion score: proportion of games where the attacker succeeds in extracting the target amount.
    • Resistance score: proportion of games where the defender prevents extraction.
    • Both scores are normalized to account for differing model capacities and token limits.
  4. Round‑robin tournament – Every evaluated model plays both roles against every other model, generating a full performance matrix.
  5. Analysis pipeline – The authors compute correlation metrics, perform linguistic pattern mining (e.g., n‑gram frequency, dialogue act tagging), and run statistical tests to link specific conversational moves to outcomes.
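
The game design, turn budget, and round-robin tournament described above can be sketched as a minimal driver loop. This is an illustrative skeleton, not the authors' released engine: `respond` and `judge` are placeholders for real model calls and the extraction check, and the scores here are plain win/hold rates without the paper's normalization for model capacity and token limits.

```python
import itertools
from collections import defaultdict

MAX_TURNS = 10  # the paper caps each dialogue at 10 alternating turns


def play_game(attacker, defender, respond, judge):
    """Run one bounded negotiation. `respond(model, history)` stands in
    for a real LLM call and `judge(history)` for the check that the
    target amount was 'transferred'; both are placeholders."""
    history = []
    for turn in range(MAX_TURNS):
        speaker = attacker if turn % 2 == 0 else defender
        history.append((speaker, respond(speaker, history)))
        if judge(history):
            return True   # extraction succeeded
    return False          # defender held out for all turns


def round_robin(models, respond, judge):
    """Every ordered pair plays once, so each model appears as both
    attacker and defender, yielding a dense pairwise matrix."""
    wins, holds = defaultdict(int), defaultdict(int)
    for atk, dfn in itertools.permutations(models, 2):
        if play_game(atk, dfn, respond, judge):
            wins[atk] += 1    # attacker extracted the target amount
        else:
            holds[dfn] += 1   # defender prevented extraction
    n = len(models) - 1  # games played in each role per model
    persuasion = {m: wins[m] / n for m in models}
    resistance = {m: holds[m] / n for m in models}
    return persuasion, resistance


# Toy demo: "weak" caves whenever it replies as defender; "strong" never does.
def toy_respond(model, history):
    return f"{model} speaks on turn {len(history)}"


def toy_judge(history):
    # The defender replies on odd turns, i.e. when history length is even.
    return len(history) % 2 == 0 and history[-1][0] == "weak"


persuasion, resistance = round_robin(["strong", "weak"], toy_respond, toy_judge)
```

With the toy stubs, "strong" extracts from "weak" and resists it in return, so both its persuasion and resistance rates come out at 1.0.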

Results & Findings

| Model (example)   | Persuasion % | Resistance % |
|-------------------|--------------|--------------|
| Model A (GPT‑4)   | 38           | 62           |
| Model B (Claude)  | 45           | 55           |
| Model C (Llama‑2) | 31           | 69           |
  • Defensive edge: Across all tested models, resistance scores consistently outpace persuasion scores, indicating a systematic advantage when the model is on the defensive.
  • Weak correlation: The Pearson correlation between a model’s persuasion and resistance scores is only 0.33, confirming that the two abilities are largely independent.
  • Strategic patterns:
    • Attackers who gradually increase commitment (“Would you be willing to transfer $10 now?”) achieve higher success rates than those who demand the full amount outright.
    • Defenders that ask verification questions (“Can you prove you own the account?”) succeed more often than those that simply say “No.”
  • Interaction structure matters: Turn order and the presence of clarification requests heavily influence outcomes, suggesting that evaluation should consider dialogue dynamics, not just isolated responses.
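
The weak-correlation finding is easy to reproduce on any score matrix with a plain Pearson coefficient. The sketch below uses hypothetical per-model scores (not the paper's data) purely to show the computation:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical persuasion/resistance percentages for five models,
# chosen only to illustrate that the two need not move together.
persuasion_scores = [38, 45, 31, 52, 40]
resistance_scores = [62, 55, 69, 64, 48]
r = pearson(persuasion_scores, resistance_scores)
```

Running this over a real tournament's persuasion and resistance columns is all the paper's dissociation analysis requires at its core; the authors additionally apply significance testing.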

Practical Implications

  • Security‑aware product design: Developers building chat‑based assistants should test both how easily their model can be coerced into revealing or “transferring” sensitive information and how robust it is at resisting such attempts. AREG provides a ready‑made stress test for this dual capability.
  • Fine‑tuning for balanced social intelligence: Since persuasion and resistance are not automatically aligned, teams may need separate training objectives (e.g., reinforcement learning from human feedback for defensive behavior) to shore up the weaker side.
  • Policy compliance monitoring: The benchmark surfaces failure modes where a model might comply with malicious requests after a series of incremental prompts—a pattern that mirrors real‑world social engineering attacks.
  • Benchmarking beyond static prompts: AREG’s multi‑turn setup can be integrated into existing evaluation pipelines (e.g., OpenAI’s evals, Hugging Face’s evaluate library) to complement traditional metrics like truthfulness or toxicity.
  • Tooling for red‑teamers: Security researchers can use the released game engine to craft custom adversarial scenarios (e.g., phishing, credential extraction) and measure the effectiveness of mitigation strategies.
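
For red-teamers mining the released dialogue logs, the linguistic-pattern analysis can start very simply. The keyword heuristics below are our own assumptions, not the paper's dialogue-act taggers; they only illustrate how one might flag the commitment-seeking, verification-seeking, and refusal moves the paper links to outcomes:

```python
import re

# Illustrative keyword patterns (assumptions, not the authors' taggers).
COMMITMENT = re.compile(
    r"\bwould you be willing\b|\bjust \$?\d+\b|\bsmall (first )?step\b", re.I
)
VERIFICATION = re.compile(r"\b(prove|verify|verification|evidence|confirm)\b", re.I)
REFUSAL = re.compile(r"^\s*no\b|\bI (won't|will not|cannot|can't)\b", re.I)


def tag_move(utterance: str) -> str:
    """Classify one turn into a coarse move type: commitment-seeking
    (attack), verification-seeking or flat refusal (defense), else 'other'."""
    if VERIFICATION.search(utterance):
        return "verification"
    if COMMITMENT.search(utterance):
        return "commitment"
    if REFUSAL.search(utterance):
        return "refusal"
    return "other"
```

Counting these tags per role across a log corpus gives a first-pass view of which moves co-occur with successful extraction or successful defense, which a fuller pipeline would refine with proper dialogue-act tagging.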

Limitations & Future Work

  • Synthetic stakes: The “money” in the game is fictional, so the psychological pressure may not fully capture real‑world high‑value social engineering.
  • Model‑specific policy filters: Some models have built‑in refusal mechanisms that could inflate resistance scores; disentangling learned defensive skill from hard‑coded safety layers remains a challenge.
  • Limited dialogue length: Ten turns may be insufficient for more elaborate negotiation tactics; extending the horizon could reveal deeper strategic behavior.
  • Generalization to other domains: The current setup focuses on financial extraction; future work could adapt AREG to data leakage, misinformation spread, or collaborative problem solving to broaden its applicability.

Overall, AREG opens a practical pathway for developers to probe the nuanced social dynamics of LLMs, highlighting that persuasion and resistance are distinct competencies that both deserve dedicated attention in safe AI development.

Authors

  • Adib Sakhawat
  • Fardeen Sadab

Paper Information

  • arXiv ID: 2602.16639v1
  • Categories: cs.CL
  • Published: February 18, 2026