[Paper] Consistency of Large Reasoning Models Under Multi-Turn Attacks

Published: February 13, 2026 at 11:58 AM EST
5 min read
Source: arXiv - 2602.13093v1

Overview

Large reasoning models (LRMs) have pushed the frontier on complex problem‑solving tasks, but we still know little about how they hold up when an adversary repeatedly probes them over a conversation. This paper puts nine state‑of‑the‑art LRMs through a battery of multi‑turn attacks, exposing both the strengths that reasoning brings and the surprising ways these models can still be fooled.

Key Contributions

  • Comprehensive adversarial benchmark: Evaluates nine cutting‑edge reasoning models against a suite of multi‑turn attacks (misleading suggestions, social pressure, etc.).
  • Empirical robustness gap: Shows that LRMs consistently beat instruction‑tuned baselines, yet each model has a distinct vulnerability profile.
  • Failure‑mode taxonomy: Identifies five recurring failure patterns—Self‑Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue—with Self‑Doubt and Social Conformity accounting for roughly half of all breakdowns.
  • Confidence‑aware defense analysis: Demonstrates that the popular Confidence‑Aware Response Generation (CARG) technique, which works for standard LLMs, actually harms LRMs because extended reasoning traces inflate confidence. A simple random confidence embedding surprisingly outperforms targeted confidence extraction.
  • Design insights: Argues that reasoning ability alone does not guarantee adversarial robustness and that confidence‑based defenses must be re‑thought for models that produce long, structured reasoning traces.
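The failure-mode taxonomy above can be sketched as a simple tally over annotated failures. The enum names follow the paper's taxonomy; the per-mode counts in the example are hypothetical, except that Self-Doubt (27 %) and Social Conformity (23 %) match the reported shares:

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    """The five recurring failure patterns identified in the paper."""
    SELF_DOUBT = "self_doubt"
    SOCIAL_CONFORMITY = "social_conformity"
    SUGGESTION_HIJACKING = "suggestion_hijacking"
    EMOTIONAL_SUSCEPTIBILITY = "emotional_susceptibility"
    REASONING_FATIGUE = "reasoning_fatigue"

def failure_breakdown(annotations):
    """Return each mode's share of all annotated failures."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {mode: counts[mode] / total for mode in FailureMode}

# Hypothetical annotation counts; only the first two shares are from the paper.
annotations = (
    [FailureMode.SELF_DOUBT] * 27
    + [FailureMode.SOCIAL_CONFORMITY] * 23
    + [FailureMode.SUGGESTION_HIJACKING] * 20
    + [FailureMode.EMOTIONAL_SUSCEPTIBILITY] * 16
    + [FailureMode.REASONING_FATIGUE] * 14
)
shares = failure_breakdown(annotations)
# Self-Doubt and Social Conformity together account for half of all failures.
```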

Methodology

  1. Model selection – Nine publicly available LRMs (e.g., GPT‑4‑Reasoner, Claude‑Reason, LLaMA‑Reason) were chosen to represent the current “frontier” of reasoning‑augmented LLMs.
  2. Attack suite – The authors crafted multi‑turn adversarial dialogs that fall into three categories:
    • Misleading suggestions – the attacker subtly injects incorrect premises.
    • Social pressure – the attacker pretends to be a peer or authority figure, nudging the model toward a target answer.
    • Emotional cues – the attacker uses affective language to test emotional susceptibility.
      Each attack spans 3–6 conversational turns, allowing the model to “reason” repeatedly before responding.
  3. Trajectory analysis – For every interaction, the full reasoning trace (chain‑of‑thought, tool calls, intermediate calculations) is logged. The authors then annotate failure points and cluster them into the five failure modes.
  4. Defense evaluation – They apply the Confidence‑Aware Response Generation (CARG) framework, which extracts a confidence score from the model’s internal logits and uses it to filter or re‑rank outputs. They also test a baseline random confidence embedding to gauge the effect of over‑confidence.
  5. Metrics – Success rate (percentage of attacks that force an incorrect answer), confidence calibration error, and per‑mode failure frequency are reported.
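A minimal sketch of the evaluation loop implied by steps 2 and 5, assuming a chat-style `model(history)` callable and an attack represented as a list of adversarial turns with a known ground-truth answer (both stand-ins, not the authors' actual harness):

```python
def run_attack(model, attack):
    """Play one multi-turn attack; return True if the model resists.

    `attack` is a dict with "turns" (3-6 adversarial user messages)
    and "ground_truth" (the answer the attacker tries to overturn).
    """
    history = []
    answer = None
    for turn in attack["turns"]:
        history.append({"role": "user", "content": turn})
        answer = model(history)  # reasoning trace would be logged here
        history.append({"role": "assistant", "content": answer})
    # The attack succeeds only if the final answer has been flipped.
    return answer == attack["ground_truth"]

def resistance_rate(model, attacks):
    """Fraction of attacks the model resists (cf. the 68% vs. 45% result)."""
    resisted = sum(run_attack(model, a) for a in attacks)
    return resisted / len(attacks)
```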

Results & Findings

  • Robustness advantage: Across all attacks, LRMs resist an average of 68 % of adversarial attempts, compared with 45 % for instruction‑tuned baselines.
  • Vulnerability diversity: No single model dominates; some are especially prone to Social Conformity while others fall to Suggestion Hijacking.
  • Failure‑mode breakdown:
    • Self‑Doubt (model questions its own reasoning) – 27 % of failures.
    • Social Conformity (model aligns with the attacker’s stance) – 23 % of failures.
    • The remaining three modes together account for the other 50 %.
  • CARG backfires: When applied to LRMs, CARG reduces the resistance rate to 55 %, mainly because the model’s confidence spikes after generating long reasoning chains, leading to over‑trust in flawed answers.
  • Random confidence wins: Injecting a random confidence token (instead of a calibrated score) restores the resistance rate to 66 %, suggesting that the defense’s logic, not the confidence value itself, is the problem.
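The CARG backfire can be illustrated with a toy confidence gate. Here `length_inflated_confidence` is an invented proxy for the paper's observation that long reasoning traces inflate confidence; it is not CARG's actual extraction method:

```python
def length_inflated_confidence(trace_tokens, base=0.6, k=0.001):
    """Toy model: confidence creeps upward with reasoning-trace length."""
    return min(1.0, base + k * trace_tokens)

def gate(answer, confidence, threshold=0.8):
    """Release the answer only when confidence clears the threshold."""
    return answer if confidence >= threshold else None

# A short, correct trace falls below the 0.8 threshold and is suppressed...
short = gate("correct", length_inflated_confidence(100))
# ...while a long, flawed trace inflates past the threshold and sails through.
long_ = gate("flawed", length_inflated_confidence(400))
```

This is why gating on the confidence value alone fails for LRMs: the score tracks trace length, not answer quality, so the filter ends up over-trusting exactly the long, flawed chains it was meant to catch.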

Practical Implications

  • Security‑aware AI product design: Developers building chat‑bots, code assistants, or decision‑support tools that rely on chain‑of‑thought reasoning should not assume inherent robustness. Multi‑turn interaction testing must become a standard QA step.
  • Defensive engineering: Simple confidence‑based gating (e.g., “only answer when confidence > 0.8”) is insufficient for reasoning models. Teams may need to incorporate trace‑level sanity checks (e.g., verifying intermediate steps against known invariants) or adversarial fine‑tuning that explicitly penalizes the identified failure modes.
  • User‑experience safeguards: UI patterns that surface the model’s reasoning trace to the user can help humans spot Self‑Doubt or Reasoning Fatigue early, allowing a fallback to a human or a secondary model.
  • Tooling for auditors: The taxonomy of failure modes offers a checklist for auditors and compliance teams to evaluate whether a deployed LRM meets robustness requirements for regulated domains (finance, healthcare, etc.).
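As one concrete form of the trace-level sanity checks suggested above, a verifier can re-execute the explicit arithmetic steps in a reasoning trace rather than trusting a single confidence score. The `a op b = c` step format is an assumption for illustration, not a format from the paper:

```python
import re

# Matches explicit arithmetic steps like "12 + 30 = 42" in a trace.
STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def check_arithmetic_steps(trace: str) -> bool:
    """Return False if any explicit arithmetic step in the trace is wrong."""
    for a, op, b, c in STEP.findall(trace):
        if OPS[op](int(a), int(b)) != int(c):
            return False
    return True

check_arithmetic_steps("12 + 30 = 42, then 42 * 2 = 84")  # True
check_arithmetic_steps("12 + 30 = 41")                    # False
```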

Limitations & Future Work

  • Scope of attacks: The study focuses on textual, multi‑turn prompts; it does not cover multimodal inputs, tool‑use attacks, or prompt injection via external APIs.
  • Model diversity: While nine models were tested, the landscape of reasoning‑augmented LLMs is rapidly expanding; newer architectures (e.g., retrieval‑augmented reasoners) may exhibit different patterns.
  • Defense exploration: The paper only evaluates CARG and a random confidence baseline. Future work could explore self‑verification loops, ensemble reasoning, or meta‑learning defenses tailored to the identified failure modes.
  • Human factors: The social‑pressure attacks simulate a peer but do not model real‑world user behavior (e.g., repeated persuasion, cultural nuances). Incorporating user studies would strengthen external validity.

Bottom line: Reasoning boosts performance, but it doesn’t make large language models bullet‑proof against clever, multi‑turn adversaries. Developers need to treat robustness as a first‑class feature—testing, monitoring, and defending against the five failure modes highlighted here—to safely deploy reasoning‑capable AI in production.

Authors

  • Yubo Li
  • Ramayya Krishnan
  • Rema Padman

Paper Information

  • arXiv ID: 2602.13093v1
  • Categories: cs.AI, cs.CL
  • Published: February 13, 2026