[Paper] Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Source: arXiv - 2603.04364v1
Overview
Multimodal web agents—systems that read both a page’s visual screenshot and its underlying accessibility tree—are becoming the backbone of automated browsing, testing, and assistive technologies. This paper uncovers a hidden vulnerability: an attacker who tampers with the page’s DOM can simultaneously poison both visual and textual inputs, creating a “cross‑modal” deception that defeats current safety measures. To counter this, the authors introduce Dual‑Modality Multi‑Stage Adversarial Safety Training (DMAST), a three‑phase co‑training regime that teaches the agent to stay task‑focused even when both channels are corrupted.
Key Contributions
- Cross‑modal attack taxonomy – Demonstrates that injecting malicious DOM elements that affect both screenshot and accessibility tree dramatically outperforms text‑only attacks on the MiniWob++ benchmark.
- Formal game‑theoretic framing – Models the agent–attacker interaction as a two‑player zero‑sum Markov game, enabling principled adversarial training.
- Three‑stage training pipeline (DMAST)
- Imitation learning from a high‑performing teacher model to bootstrap competence.
- Oracle‑guided supervised fine‑tuning with a novel zero‑acknowledgment loss that forces the agent to ignore deceptive cues and focus on the true task goal.
- Adversarial reinforcement learning using Group Relative Policy Optimization (GRPO) for self‑play between the agent and a learned attacker.
- Empirical gains – On out‑of‑distribution web tasks, DMAST cuts adversarial success rates by >70 % while doubling task‑completion efficiency compared to prior defenses.
- Generalization evidence – Shows robust performance on unseen, more complex web environments, indicating the method scales beyond the training suite.
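The two-player zero-sum framing above can be illustrated with a toy interaction loop. Everything here is a hypothetical sketch for intuition only: the environment, policies, and reward values are illustrative stand-ins, not the paper's actual setup. The key structural points are that the attacker moves first by tampering with the DOM (poisoning both modalities at once) and that the attacker's payoff is exactly the negative of the agent's.

```python
import random

# Toy sketch of the two-player zero-sum Markov game framing.
# Names and reward values are illustrative, not the paper's environment.
def play_episode(agent_policy, attacker_policy, max_steps=5, seed=0):
    rng = random.Random(seed)
    state = {"goal": "submit-form", "dom": ["button#submit"]}
    agent_return = 0.0
    for _ in range(max_steps):
        # Attacker moves first: it may inject a deceptive DOM element,
        # which corrupts both the screenshot and the accessibility tree.
        if attacker_policy(state, rng):
            state["dom"].append("button#fake-submit")
        # The agent then acts on its (possibly poisoned) observation.
        action = agent_policy(state, rng)
        if action == "click button#submit":
            agent_return += 1.0   # task success
            break
        if action == "click button#fake-submit":
            agent_return -= 1.0   # safety violation
            break
    # Zero-sum: the attacker's payoff is the negative of the agent's.
    return agent_return, -agent_return

def naive_agent(state, rng):
    # Clicks the last visible submit-like element -- easily fooled.
    target = [e for e in state["dom"] if "submit" in e][-1]
    return f"click {target}"

def always_inject(state, rng):
    return True

agent_r, attacker_r = play_episode(naive_agent, always_inject)
```

Running this, the naive agent clicks the injected fake button, illustrating why an undefended dual-stream agent loses the game whenever the attacker can rewrite the DOM.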
Methodology
- Problem setup – The web agent receives two streams: (a) a rendered screenshot and (b) an accessibility tree (a structured textual representation of UI elements). An attacker can inject arbitrary HTML/JS into the DOM, which instantly changes both streams.
- Markov game formulation – The interaction is cast as a turn‑based game: the attacker perturbs the page, then the agent selects an action (e.g., click, type). Rewards are defined by task success (positive) vs. safety violations (negative).
- Stage 1: Imitation Learning – A strong teacher policy (trained on clean data) generates expert trajectories. The agent learns to mimic these actions, giving it a solid baseline.
- Stage 2: Zero‑Acknowledgment Supervision – An oracle knows the true task goal regardless of the injected noise. The loss penalizes any agent response that “acknowledges” the attacker’s deceptive cues, encouraging the model to rely on invariant reasoning patterns.
- Stage 3: Adversarial RL with GRPO – Both agent and attacker are represented as neural policies. They are trained simultaneously via self‑play. GRPO adjusts the policy gradient to compare each group’s performance relative to a moving baseline, stabilizing learning in the highly non‑stationary adversarial setting.
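The group-relative baseline at the heart of Stage 3 can be sketched in a few lines. This follows the published GRPO formulation (normalize each sampled trajectory's reward against the mean and standard deviation of its own rollout group, avoiding a learned value baseline); the reward values below are made-up examples, and nothing here reflects this paper's specific implementation.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each trajectory in a sampled group
    relative to the group's own mean reward, so no learned value
    baseline is needed (a sketch of the published GRPO formulation)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical self-play rollouts against the learned attacker:
# two succeeded (+1), one timed out (0), one hit a safety violation (-1).
advs = group_relative_advantages([1.0, 1.0, 0.0, -1.0])
```

Because the baseline is recomputed per group, it tracks the attacker's current strength automatically, which is what makes this estimator comparatively stable in a non-stationary self-play setting.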
All components are built on top of a standard multimodal transformer (e.g., CLIP‑style encoder) that fuses visual and textual embeddings before feeding them to a decision head.
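A minimal late-fusion sketch of that architecture is shown below. The summary only says the two embeddings are fused before a decision head, so the fusion operator here (plain concatenation followed by linear scoring) is an assumption, and the tiny hand-written vectors stand in for real CLIP-style encoder outputs.

```python
# Minimal late-fusion sketch. The exact fusion op is an assumption:
# concatenation + per-action linear scoring, with toy 2-d "embeddings"
# standing in for real screenshot / accessibility-tree encoder outputs.
def fuse(visual_emb, text_emb):
    return visual_emb + text_emb  # list concatenation, not addition

def decision_head(fused, weights_per_action):
    # Score each candidate action with a dot product; pick the argmax.
    scores = {action: sum(f * w for f, w in zip(fused, ws))
              for action, ws in weights_per_action.items()}
    return max(scores, key=scores.get)

visual_emb = [0.9, 0.2]   # stand-in for screenshot features
text_emb = [0.7, 0.1]     # stand-in for accessibility-tree features
weights = {
    "click": [1.0, 0.0, 1.0, 0.0],
    "type":  [0.0, 1.0, 0.0, 1.0],
}
action = decision_head(fuse(visual_emb, text_emb), weights)
```

The single fused vector is exactly why cross-modal injection is so effective: one DOM edit shifts both halves of the representation at once, so a defense that sanitizes only the textual half cannot fully protect the decision head.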
Results & Findings
| Metric | Clean baseline | Text‑only attack | Cross‑modal attack | DMAST (clean) | DMAST (attack) |
|---|---|---|---|---|---|
| Task success rate | 84 % | 31 % | 12 % | 86 % | 71 % |
| Steps to completion (lower is better) | 18 | 45 | 62 | 16 | 22 |
| Adversarial success (attacker win) | 5 % | 38 % | 68 % | 4 % | 9 % |
- Cross‑modal attacks degrade performance far more than text‑only injections, confirming the hypothesis that dual‑stream agents have a larger attack surface.
- DMAST restores most of the lost capability: success rates under attack climb from 12 % to 71 %, while the number of steps needed drops by ~65 %.
- Compared with existing defenses (prompt‑tuning, adversarial text augmentation), DMAST yields 2–3× better robustness and ~2× higher efficiency on unseen tasks.
Practical Implications
- Safer web automation – Companies building bots for form‑filling, UI testing, or accessibility assistance can adopt DMAST to protect against malicious pages that try to mislead the bot (e.g., phishing‑style DOM injections).
- Robust assistive tools – Screen‑reader‑enhanced agents that help users with disabilities will be less prone to being hijacked by malicious web content, improving trustworthiness.
- Security‑by‑design for multimodal AI – The game‑theoretic training loop can be transplanted to other dual‑input systems (e.g., video + audio assistants), encouraging a broader shift toward co‑evolutionary safety pipelines.
- Developer tooling – The authors release a lightweight GRPO library and a MiniWob++‑style attack generator, enabling rapid prototyping of adversarial training for custom web agents.
Limitations & Future Work
- Scalability to full‑scale browsers – Experiments are confined to MiniWob++‑style synthetic pages; real‑world sites with heavy JavaScript and dynamic layout may introduce new failure modes.
- Attacker model expressiveness – The current attacker is limited to DOM injection; more sophisticated threats (e.g., timing attacks, CSS‑based visual tricks) are not covered.
- Compute cost – The three‑stage pipeline, especially the adversarial RL phase, requires several GPU‑days, which may be prohibitive for small teams.
- Future directions suggested by the authors include extending DMAST to continuous‑learning settings (online self‑play on live web traffic), incorporating richer multimodal cues (audio, haptic), and exploring curriculum‑based attacker schedules to further close the robustness gap.
Authors
- Haoyu Liu
- Dingcheng Li
- Lukas Rutishauser
- Zeyu Zheng
Paper Information
- arXiv ID: 2603.04364v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: March 4, 2026