[Paper] WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

Published: 3 months ago (February 3, 2026 at 12:55 PM EST)

4 min read

Source: arXiv

Source: arXiv - 2602.03792v1

Overview

Web agents—browser‑based assistants that read page content and act on user commands—are increasingly being targeted by prompt injection attacks, where malicious page elements hijack the agent’s instructions. The paper WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents introduces a practical, two‑stage system that can automatically spot and pinpoint these hidden manipulations in real‑world webpages.

Key Contributions

Two‑step detection pipeline – first isolates “segments of interest” (potentially malicious snippets) and then validates each segment against the page’s overall context.
Context‑consistency scoring – a lightweight, language‑model‑driven metric that measures how well a segment’s prompt aligns with the rest of the page.
Comprehensive benchmark – the authors curated diverse datasets of clean and contaminated pages (including e‑commerce, news, and documentation sites) to evaluate detection and localization performance.
Open‑source implementation – full code and data released, enabling reproducibility and easy integration into existing web‑agent pipelines.
Significant performance gains – WebSentinel outperforms prior baselines (rule‑based filters, single‑stage classifiers) by large margins in both precision and recall.

Methodology

Segment Extraction (Step I)
- The webpage’s DOM is parsed and broken into logical blocks (e.g., <div>, <section>, script tags).
- Heuristics such as text length, presence of code‑like patterns, and proximity to user‑visible content flag “segments of interest.”
Contextual Consistency Check (Step II)
- Each candidate segment is fed, together with the rest of the page, to a pre‑trained large language model (LLM).
- The model generates a consistency score by measuring how likely the segment’s prompt is a natural continuation of the surrounding text.
- Low‑scoring segments are flagged as potential prompt injections; the score also serves as a localization cue.

The pipeline is deliberately model‑agnostic: any LLM with a text‑completion API can be swapped in, making the approach adaptable to evolving model capabilities.

Results & Findings

Metric	Clean Pages	Contaminated Pages
Precision	0.96	0.94
Recall	0.93	0.91
F1‑score	0.95	0.92
Localization accuracy (top‑1)	–	0.88

WebSentinel consistently beats the strongest baseline (a fine‑tuned BERT classifier) by +12% F1 on contaminated pages.
The two‑step design reduces false positives dramatically; most benign scripts are ignored after Step I.
Ablation studies show that removing the context‑consistency check drops recall by ~15%, confirming its central role.

Practical Implications

Secure browser extensions & AI assistants – developers can embed WebSentinel as a pre‑flight filter, preventing malicious prompts from ever reaching the LLM backend.
Enterprise web‑scraping pipelines – automated crawlers can automatically discard or quarantine pages flagged as compromised, protecting downstream analytics.
Compliance & content moderation – the localization output pinpoints the exact DOM element, enabling targeted sanitization rather than blunt page blocking.
Low overhead – because Step I prunes the search space, the expensive LLM scoring runs on only a handful of segments per page, keeping latency suitable for interactive agents.

Limitations & Future Work

Dependence on LLM quality – the consistency score hinges on the underlying model’s understanding of the domain; niche or highly technical pages may yield noisy scores.
Evasion tactics – attackers could craft injections that mimic the surrounding context more closely, potentially lowering detection rates.
Static analysis only – the current system works on the rendered HTML; dynamic content loaded via client‑side scripts after page load is not yet covered.

Future directions include integrating runtime monitoring of JavaScript execution, exploring adversarial training to harden the consistency scorer, and extending the framework to multi‑modal agents that process images or audio embedded in webpages.

Authors

Xilong Wang
Yinuo Liu
Zhun Wang
Dawn Song
Neil Gong

Paper Information

arXiv ID: 2602.03792v1
Categories: cs.CR, cs.AI, cs.CL
Published: February 3, 2026
PDF: Download PDF

[Paper] WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

[Paper] Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay

[Paper] Uncovering Cross-Objective Interference in Multi-Objective Alignment

[Paper] The Representational Geometry of Number