[Paper] AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models

Published: February 18, 2026 at 12:28 PM EST
5 min read
Source: arXiv - 2602.16639v1

Overview

The paper introduces AREG (Adversarial Resource Extraction Game), a new benchmark that pits large language models (LLMs) against each other in a multi‑turn negotiation where one model tries to persuade the other to hand over a fictitious sum of money. By framing persuasion and resistance as opposite sides of a zero‑sum game, the authors can evaluate both offensive (how well a model can influence) and defensive (how well it can resist influence) abilities in a single, realistic dialogue setting.

Key Contributions

  • A unified adversarial benchmark – AREG combines persuasion and resistance evaluation in one interactive, turn‑based game, moving beyond static prompt‑response tests.
  • Round‑robin tournament framework – All frontier LLMs are pitted against each other, yielding a dense matrix of pairwise performance scores.
  • Empirical dissociation of skills – Persuasion and resistance show only a weak correlation (ρ ≈ 0.33), demonstrating that strong performance in one does not guarantee strength in the other.
  • Linguistic insight into successful tactics – Incremental, commitment‑seeking prompts boost extraction success, while verification‑seeking replies dominate effective defenses.
  • Open‑source implementation & data – The authors release the game engine, dialogue logs, and evaluation scripts for reproducibility and community extension.

Methodology

  1. Game design – Two agents (LLMs) engage in a bounded conversation. The attacker tries to convince the defender to “transfer” a specified amount of virtual money. The defender’s goal is to refuse or stall without violating policy constraints.
  2. Turn‑based interaction – Each dialogue consists of up to 10 turns (alternating speaker). The attacker may ask questions, offer incentives, or apply pressure; the defender can ask for clarification, request proof, or outright refuse.
  3. Scoring
    • Persuasion score: proportion of games where the attacker succeeds in extracting the target amount.
    • Resistance score: proportion of games where the defender prevents extraction.
    • Both scores are normalized to account for differing model capacities and token limits.
  4. Round‑robin tournament – Every evaluated model plays both roles against every other model, generating a full performance matrix.
  5. Analysis pipeline – The authors compute correlation metrics, perform linguistic pattern mining (e.g., n‑gram frequency, dialogue act tagging), and run statistical tests to link specific conversational moves to outcomes.
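
The game design, turn budget, and round-robin tournament described above can be sketched as a minimal driver loop. This is an illustrative skeleton, not the authors' released engine: `respond` and `judge` are placeholders for real model calls and the extraction check, and the scores here are plain win/hold rates without the paper's normalization for model capacity and token limits.

```python
import itertools
from collections import defaultdict

MAX_TURNS = 10  # the paper caps each dialogue at 10 alternating turns


def play_game(attacker, defender, respond, judge):
    """Run one bounded negotiation. `respond(model, history)` stands in
    for a real LLM call and `judge(history)` for the check that the
    target amount was 'transferred'; both are placeholders."""
    history = []
    for turn in range(MAX_TURNS):
        speaker = attacker if turn % 2 == 0 else defender
        history.append((speaker, respond(speaker, history)))
        if judge(history):
            return True   # extraction succeeded
    return False          # defender held out for all turns


def round_robin(models, respond, judge):
    """Every ordered pair plays once, so each model appears as both
    attacker and defender, yielding a dense pairwise matrix."""
    wins, holds = defaultdict(int), defaultdict(int)
    for atk, dfn in itertools.permutations(models, 2):
        if play_game(atk, dfn, respond, judge):
            wins[atk] += 1    # attacker extracted the target amount
        else:
            holds[dfn] += 1   # defender prevented extraction
    n = len(models) - 1  # games played in each role per model
    persuasion = {m: wins[m] / n for m in models}
    resistance = {m: holds[m] / n for m in models}
    return persuasion, resistance


# Toy demo: "weak" caves whenever it replies as defender; "strong" never does.
def toy_respond(model, history):
    return f"{model} speaks on turn {len(history)}"


def toy_judge(history):
    # The defender replies on odd turns, i.e. when history length is even.
    return len(history) % 2 == 0 and history[-1][0] == "weak"


persuasion, resistance = round_robin(["strong", "weak"], toy_respond, toy_judge)
```

With the toy stubs, "strong" extracts from "weak" and resists it in return, so both its persuasion and resistance rates come out at 1.0.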

Results & Findings

| Model (example)   | Persuasion % | Resistance % |
|-------------------|--------------|--------------|
| Model A (GPT‑4)   | 38           | 62           |
| Model B (Claude)  | 45           | 55           |
| Model C (Llama‑2) | 31           | 69           |
  • Defensive edge: Across all tested models, resistance scores consistently outpace persuasion scores, indicating a systematic advantage when the model is on the defensive.
  • Weak correlation: The Pearson correlation between a model’s persuasion and resistance scores is only 0.33, confirming that the two abilities are largely independent.
  • Strategic patterns:
    • Attackers who gradually increase commitment (“Would you be willing to transfer $10 now?”) achieve higher success rates than those who demand the full amount outright.
    • Defenders that ask verification questions (“Can you prove you own the account?”) succeed more often than those that simply say “No.”
  • Interaction structure matters: Turn order and the presence of clarification requests heavily influence outcomes, suggesting that evaluation should consider dialogue dynamics, not just isolated responses.
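
The weak-correlation finding is easy to reproduce on any score matrix with a plain Pearson coefficient. The sketch below uses hypothetical per-model scores (not the paper's data) purely to show the computation:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical persuasion/resistance percentages for five models,
# chosen only to illustrate that the two need not move together.
persuasion_scores = [38, 45, 31, 52, 40]
resistance_scores = [62, 55, 69, 64, 48]
r = pearson(persuasion_scores, resistance_scores)
```

Running this over a real tournament's persuasion and resistance columns is all the paper's dissociation analysis requires at its core; the authors additionally apply significance testing.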

Practical Implications

  • Security‑aware product design: Developers building chat‑based assistants should test both how easily their model can be coerced into revealing or “transferring” sensitive information and how robust it is at resisting such attempts. AREG provides a ready‑made stress test for this dual capability.
  • Fine‑tuning for balanced social intelligence: Since persuasion and resistance are not automatically aligned, teams may need separate training objectives (e.g., reinforcement learning from human feedback for defensive behavior) to shore up the weaker side.
  • Policy compliance monitoring: The benchmark surfaces failure modes where a model might comply with malicious requests after a series of incremental prompts—a pattern that mirrors real‑world social engineering attacks.
  • Benchmarking beyond static prompts: AREG’s multi‑turn setup can be integrated into existing evaluation pipelines (e.g., OpenAI’s evals, Hugging Face’s evaluate library) to complement traditional metrics like truthfulness or toxicity.
  • Tooling for red‑teamers: Security researchers can use the released game engine to craft custom adversarial scenarios (e.g., phishing, credential extraction) and measure the effectiveness of mitigation strategies.
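
For red-teamers mining the released dialogue logs, the linguistic-pattern analysis can start very simply. The keyword heuristics below are our own assumptions, not the paper's dialogue-act taggers; they only illustrate how one might flag the commitment-seeking, verification-seeking, and refusal moves the paper links to outcomes:

```python
import re

# Illustrative keyword patterns (assumptions, not the authors' taggers).
COMMITMENT = re.compile(
    r"\bwould you be willing\b|\bjust \$?\d+\b|\bsmall (first )?step\b", re.I
)
VERIFICATION = re.compile(r"\b(prove|verify|verification|evidence|confirm)\b", re.I)
REFUSAL = re.compile(r"^\s*no\b|\bI (won't|will not|cannot|can't)\b", re.I)


def tag_move(utterance: str) -> str:
    """Classify one turn into a coarse move type: commitment-seeking
    (attack), verification-seeking or flat refusal (defense), else 'other'."""
    if VERIFICATION.search(utterance):
        return "verification"
    if COMMITMENT.search(utterance):
        return "commitment"
    if REFUSAL.search(utterance):
        return "refusal"
    return "other"
```

Counting these tags per role across a log corpus gives a first-pass view of which moves co-occur with successful extraction or successful defense, which a fuller pipeline would refine with proper dialogue-act tagging.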

Limitations & Future Work

  • Synthetic stakes: The “money” in the game is fictional, so the psychological pressure may not fully capture real‑world high‑value social engineering.
  • Model‑specific policy filters: Some models have built‑in refusal mechanisms that could inflate resistance scores; disentangling learned defensive skill from hard‑coded safety layers remains a challenge.
  • Limited dialogue length: Ten turns may be insufficient for more elaborate negotiation tactics; extending the horizon could reveal deeper strategic behavior.
  • Generalization to other domains: The current setup focuses on financial extraction; future work could adapt AREG to data leakage, misinformation spread, or collaborative problem solving to broaden its applicability.

Overall, AREG opens a practical pathway for developers to probe the nuanced social dynamics of LLMs, highlighting that persuasion and resistance are distinct competencies that both deserve dedicated attention in safe AI development.

Authors

  • Adib Sakhawat
  • Fardeen Sadab

Paper Information

  • arXiv ID: 2602.16639v1
  • Categories: cs.CL
  • Published: February 18, 2026