[Paper] Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Source: arXiv - 2603.04364v1
Overview
Multimodal web agents—systems that read both a page’s visual screenshot and its underlying accessibility tree—are becoming the backbone of automated browsing, testing, and assistive technologies. This paper uncovers a hidden vulnerability: an attacker who tampers with the page’s DOM can simultaneously poison both visual and textual inputs, creating a “cross‑modal” deception that defeats current safety measures. To counter this, the authors introduce Dual‑Modality Multi‑Stage Adversarial Safety Training (DMAST), a three‑phase co‑training regime that teaches the agent to stay task‑focused even when both channels are corrupted.
Key Contributions
- Cross‑modal attack taxonomy – Demonstrates that injecting malicious DOM elements that affect both screenshot and accessibility tree dramatically outperforms text‑only attacks on the MiniWob++ benchmark.
- Formal game‑theoretic framing – Models the agent–attacker interaction as a two‑player zero‑sum Markov game, enabling principled adversarial training.
- Three‑stage training pipeline (DMAST)
- Imitation learning from a high‑performing teacher model to bootstrap competence.
- Oracle‑guided supervised fine‑tuning with a novel zero‑acknowledgment loss that forces the agent to ignore deceptive cues and focus on the true task goal.
- Adversarial reinforcement learning using Group Relative Policy Optimization (GRPO) for self‑play between the agent and a learned attacker.
- Empirical gains – On out‑of‑distribution web tasks, DMAST cuts adversarial success rates by >70 % while doubling task‑completion efficiency compared to prior defenses.
- Generalization evidence – Shows robust performance on unseen, more complex web environments, indicating the method scales beyond the training suite.
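The two-player zero-sum framing above can be illustrated with a toy interaction loop. Everything here is a hypothetical sketch for intuition only: the environment, policies, and reward values are illustrative stand-ins, not the paper's actual setup. The key structural points are that the attacker moves first by tampering with the DOM (poisoning both modalities at once) and that the attacker's payoff is exactly the negative of the agent's.

```python
import random

# Toy sketch of the two-player zero-sum Markov game framing.
# Names and reward values are illustrative, not the paper's environment.
def play_episode(agent_policy, attacker_policy, max_steps=5, seed=0):
    rng = random.Random(seed)
    state = {"goal": "submit-form", "dom": ["button#submit"]}
    agent_return = 0.0
    for _ in range(max_steps):
        # Attacker moves first: it may inject a deceptive DOM element,
        # which corrupts both the screenshot and the accessibility tree.
        if attacker_policy(state, rng):
            state["dom"].append("button#fake-submit")
        # The agent then acts on its (possibly poisoned) observation.
        action = agent_policy(state, rng)
        if action == "click button#submit":
            agent_return += 1.0   # task success
            break
        if action == "click button#fake-submit":
            agent_return -= 1.0   # safety violation
            break
    # Zero-sum: the attacker's payoff is the negative of the agent's.
    return agent_return, -agent_return

def naive_agent(state, rng):
    # Clicks the last visible submit-like element -- easily fooled.
    target = [e for e in state["dom"] if "submit" in e][-1]
    return f"click {target}"

def always_inject(state, rng):
    return True

agent_r, attacker_r = play_episode(naive_agent, always_inject)
```

Running this, the naive agent clicks the injected fake button, illustrating why an undefended dual-stream agent loses the game whenever the attacker can rewrite the DOM.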
Methodology
- Problem setup – The web agent receives two streams: (a) a rendered screenshot and (b) an accessibility tree (a structured textual representation of UI elements). An attacker can inject arbitrary HTML/JS into the DOM, which instantly changes both streams.
- Markov game formulation – The interaction is cast as a turn‑based game: the attacker perturbs the page, then the agent selects an action (e.g., click, type). Rewards are defined by task success (positive) vs. safety violations (negative).
- Stage 1: Imitation Learning – A strong teacher policy (trained on clean data) generates expert trajectories. The agent learns to mimic these actions, giving it a solid baseline.
- Stage 2: Zero‑Acknowledgment Supervision – An oracle knows the true task goal regardless of the injected noise. The loss penalizes any agent response that “acknowledges” the attacker’s deceptive cues, encouraging the model to rely on invariant reasoning patterns.
- Stage 3: Adversarial RL with GRPO – Both agent and attacker are represented as neural policies. They are trained simultaneously via self‑play. GRPO adjusts the policy gradient to compare each group’s performance relative to a moving baseline, stabilizing learning in the highly non‑stationary adversarial setting.
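The group-relative baseline at the heart of Stage 3 can be sketched in a few lines. This follows the published GRPO formulation (normalize each sampled trajectory's reward against the mean and standard deviation of its own rollout group, avoiding a learned value baseline); the reward values below are made-up examples, and nothing here reflects this paper's specific implementation.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each trajectory in a sampled group
    relative to the group's own mean reward, so no learned value
    baseline is needed (a sketch of the published GRPO formulation)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical self-play rollouts against the learned attacker:
# two succeeded (+1), one timed out (0), one hit a safety violation (-1).
advs = group_relative_advantages([1.0, 1.0, 0.0, -1.0])
```

Because the baseline is recomputed per group, it tracks the attacker's current strength automatically, which is what makes this estimator comparatively stable in a non-stationary self-play setting.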
All components are built on top of a standard multimodal transformer (e.g., CLIP‑style encoder) that fuses visual and textual embeddings before feeding them to a decision head.
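A minimal late-fusion sketch of that architecture is shown below. The summary only says the two embeddings are fused before a decision head, so the fusion operator here (plain concatenation followed by linear scoring) is an assumption, and the tiny hand-written vectors stand in for real CLIP-style encoder outputs.

```python
# Minimal late-fusion sketch. The exact fusion op is an assumption:
# concatenation + per-action linear scoring, with toy 2-d "embeddings"
# standing in for real screenshot / accessibility-tree encoder outputs.
def fuse(visual_emb, text_emb):
    return visual_emb + text_emb  # list concatenation, not addition

def decision_head(fused, weights_per_action):
    # Score each candidate action with a dot product; pick the argmax.
    scores = {action: sum(f * w for f, w in zip(fused, ws))
              for action, ws in weights_per_action.items()}
    return max(scores, key=scores.get)

visual_emb = [0.9, 0.2]   # stand-in for screenshot features
text_emb = [0.7, 0.1]     # stand-in for accessibility-tree features
weights = {
    "click": [1.0, 0.0, 1.0, 0.0],
    "type":  [0.0, 1.0, 0.0, 1.0],
}
action = decision_head(fuse(visual_emb, text_emb), weights)
```

The single fused vector is exactly why cross-modal injection is so effective: one DOM edit shifts both halves of the representation at once, so a defense that sanitizes only the textual half cannot fully protect the decision head.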
Results & Findings
| Metric | Clean baseline | Text‑only attack | Cross‑modal attack | DMAST (clean) | DMAST (attack) |
|---|---|---|---|---|---|
| Task success rate | 84 % | 31 % | 12 % | 86 % | 71 % |
| Steps to completion (lower is better) | 18 | 45 | 62 | 16 | 22 |
| Adversarial success (attacker win) | 5 % | 38 % | 68 % | 4 % | 9 % |
- Cross‑modal attacks degrade performance far more than text‑only injections, confirming the hypothesis that dual‑stream agents have a larger attack surface.
- DMAST restores most of the lost capability: success rates under attack climb from 12 % to 71 %, while the number of steps needed drops by ~65 %.
- Compared with existing defenses (prompt‑tuning, adversarial text augmentation), DMAST yields 2–3× better robustness and ~2× higher efficiency on unseen tasks.
Practical Implications
- Safer web automation – Companies building bots for form‑filling, UI testing, or accessibility assistance can adopt DMAST to protect against malicious pages that try to mislead the bot (e.g., phishing‑style DOM injections).
- Robust assistive tools – Screen‑reader‑enhanced agents that help users with disabilities will be less prone to being hijacked by malicious web content, improving trustworthiness.
- Security‑by‑design for multimodal AI – The game‑theoretic training loop can be transplanted to other dual‑input systems (e.g., video + audio assistants), encouraging a broader shift toward co‑evolutionary safety pipelines.
- Developer tooling – The authors release a lightweight GRPO library and a MiniWob++‑style attack generator, enabling rapid prototyping of adversarial training for custom web agents.
Limitations & Future Work
- Scalability to full‑scale browsers – Experiments are confined to MiniWob++‑style synthetic pages; real‑world sites with heavy JavaScript and dynamic layout may introduce new failure modes.
- Attacker model expressiveness – The current attacker is limited to DOM injection; more sophisticated threats (e.g., timing attacks, CSS‑based visual tricks) are not covered.
- Compute cost – The three‑stage pipeline, especially the adversarial RL phase, requires several GPU‑days, which may be prohibitive for small teams.
- Future directions suggested by the authors include extending DMAST to continuous‑learning settings (online self‑play on live web traffic), incorporating richer multimodal cues (audio, haptic), and exploring curriculum‑based attacker schedules to further close the robustness gap.
Authors
- Haoyu Liu
- Dingcheng Li
- Lukas Rutishauser
- Zeyu Zheng
Paper Information
- arXiv ID: 2603.04364v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: March 4, 2026