[Paper] Prune4Web: DOM Tree Pruning Programming for Web Agent

Published: November 26, 2025 at 08:49 AM EST
4 min read

Source: arXiv - 2511.21398v1

Overview

Web‑automation agents powered by large language models (LLMs) still stumble when faced with today’s massive web pages—DOM trees that can contain tens of thousands of nodes. Prune4Web flips the script: instead of forcing the LLM to read the whole DOM, it lets the model emit a tiny Python “filter script” that programmatically prunes the tree down to the elements that matter for the current sub‑task. The result is a dramatic speed‑up and a leap in grounding accuracy, making LLM‑driven agents far more practical for real‑world web tasks.

Key Contributions

  • DOM Tree Pruning Programming: A novel paradigm in which the LLM generates executable Python scoring scripts that filter DOM elements based on semantic cues from decomposed sub‑tasks (a sketch of such a script follows this list).
  • Two‑turn Dialogue Training: Joint optimization of a Planner (task decomposition), a Programmatic Filter (the pruning script), and a Grounder (action selection) within a unified framework.
  • Efficient Annotation Pipeline: A tailored data‑creation process that supplies high‑quality supervision for both the pruning scripts and grounding decisions.
  • Scalable Reduction: Achieves a 25×–50× reduction in the number of candidate DOM nodes, drastically cutting the attention load on the LLM.
  • State‑of‑the‑Art Performance: Boosts low‑level grounding accuracy from 46.8 % to 88.28 % on the authors’ benchmark, surpassing prior LLM‑based web agents.
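
To make the scoring‑script idea concrete, here is a minimal sketch of what a generated filter might look like. The node serialization (plain dicts of tag, text, and attributes), the `score_node`/`prune` names, and the keyword heuristics are illustrative assumptions, not the paper's actual generated code.

```python
# Hypothetical example of an LLM-generated scoring script (not from the paper).
# It assumes each DOM node has already been serialized to a plain dict.

SUBTASK_KEYWORDS = {"depart", "date", "calendar", "picker"}  # from "click the date picker"

def score_node(node: dict) -> float:
    """Return a relevance score for one serialized DOM node."""
    text = (node.get("text") or "").lower()
    attrs = " ".join(str(v).lower() for v in node.get("attributes", {}).values())
    score = 0.0
    # Lightweight text similarity: keyword overlap with the sub-task description.
    score += sum(1.0 for kw in SUBTASK_KEYWORDS if kw in text)
    # Attribute patterns: ids, classes, and aria labels often carry semantics.
    score += 0.5 * sum(1.0 for kw in SUBTASK_KEYWORDS if kw in attrs)
    # Prefer interactive elements for a "click" sub-task.
    if node.get("tag") in {"button", "input", "a", "select"}:
        score += 1.0
    return score

def prune(nodes: list[dict], top_k: int = 50) -> list[dict]:
    """Keep only the top-k highest-scoring candidates with a non-zero score."""
    ranked = sorted(nodes, key=score_node, reverse=True)
    return [n for n in ranked[:top_k] if score_node(n) > 0]
```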

Methodology

  1. Task Decomposition (Planner) – The LLM first breaks a high‑level user request (e.g., “book a flight”) into a sequence of concrete sub‑tasks (e.g., “click the date picker”, “select destination”).
  2. Program Generation (Programmatic Filter) – For each sub‑task, the same LLM emits a short Python script that scores every DOM node using lightweight heuristics (text similarity, attribute patterns, CSS classes, etc.). The script returns a ranked list of “relevant” elements.
  3. Pruning Execution – The generated script runs on the raw DOM outside the LLM, discarding the vast majority of nodes and leaving only a few hundred candidates.
  4. Grounding (Grounder) – A second LLM pass receives the pruned candidate set plus the sub‑task description and selects the exact element to interact with (click, type, etc.).
  5. Two‑Turn Dialogue – The system iterates: the Planner proposes the next sub‑task, the Filter prunes, the Grounder acts, and feedback (success/failure) flows into the next turn, allowing the model to refine its scripts on the fly (a minimal sketch of this loop follows the list).
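
The sketch below pulls steps 1–5 together into one control loop. It is a reading of the pipeline described above, not the authors' implementation: `planner`, `filter_llm`, `grounder`, `get_dom_nodes`, and `execute_action` are placeholder callables, and the convention that each generated script defines a `prune(nodes)` function is an assumption.

```python
# Minimal sketch of the Prune4Web loop as described above. The callables
# passed in (planner, filter_llm, grounder, get_dom_nodes, execute_action)
# stand in for LLM calls and browser actions, not the authors' API.

def run_episode(instruction, planner, filter_llm, grounder,
                get_dom_nodes, execute_action, max_steps=20):
    history = []  # (sub-task, outcome) pairs fed back to the planner
    for _ in range(max_steps):
        # 1. Planner: propose the next sub-task given the goal and prior feedback.
        subtask = planner(instruction, history)
        if subtask is None:              # planner signals the task is complete
            return True

        # 2. Programmatic Filter: the LLM emits a small Python scoring script.
        script_src = filter_llm(subtask)
        namespace = {}
        exec(script_src, namespace)      # assumption: the script defines prune(nodes)

        # 3. Pruning execution: run the script outside the LLM on the raw DOM.
        candidates = namespace["prune"](get_dom_nodes())

        # 4. Grounding: a second LLM pass picks the exact element and acts on it.
        action = grounder(subtask, candidates)
        outcome = execute_action(action)

        # 5. Feedback for the next turn of the dialogue.
        history.append((subtask, outcome))
    return False
```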

All components are trained end‑to‑end on a curated dataset of web‑automation episodes, using a mix of supervised signals (correct scripts, correct grounding) and reinforcement‑style feedback from execution outcomes.

Results & Findings

| Metric | Baseline (LLM‑only) | Prune4Web |
| --- | --- | --- |
| Low‑level grounding accuracy | 46.8 % | 88.28 % |
| Avg. candidate DOM nodes per step | ~30 k | ~600 (≈ 25×–50× reduction) |
| End‑to‑end task success (complex multi‑step) | 31 % | 57 % |
| Inference latency per step | 2.8 s | 0.4 s |

What it means: By offloading DOM traversal to tiny Python scripts, the LLM can focus its attention on a compact, semantically rich subset of the page, eliminating “attention dilution” that previously caused mis‑grounded actions. The accuracy jump shows that the pruned view is not only smaller but also more relevant.

Practical Implications

  • Faster Web Bots – Developers can embed Prune4Web into existing automation pipelines (e.g., Selenium, Playwright) and see order‑of‑magnitude speed gains without sacrificing reliability (see the Playwright sketch after this list).
  • Lower Compute Costs – Reducing the token count fed to the LLM cuts API usage and GPU memory, making large‑scale deployments (e.g., SaaS UI‑automation) economically viable.
  • Explainable Filters – The generated Python scripts are human‑readable, enabling debugging and compliance checks (e.g., ensuring a bot never clicks hidden ads).
  • Plug‑and‑Play with Any LLM – The approach is model‑agnostic; any instruction‑following LLM can be used to produce the filter scripts, opening the door to open‑source alternatives.
  • Robustness to Page Bloat – Modern web apps (single‑page frameworks, infinite scroll) often inflate the DOM; Prune4Web’s pruning stays effective regardless of size, improving reliability for e‑commerce, fintech, and internal dashboards.
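
As a rough illustration of the Playwright route mentioned above, the snippet below loads a page, flattens its DOM into plain dicts, and hands the nodes to a pruning function. The serialization format and the `prune` callable are assumptions carried over from the earlier sketches; Prune4Web does not prescribe this particular integration.

```python
# Sketch of wiring a generated pruning script into a Playwright pipeline.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def serialize_dom(html: str) -> list[dict]:
    """Flatten the page into the plain-dict node format used by the filter."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"tag": el.name, "text": el.get_text(strip=True), "attributes": el.attrs}
        for el in soup.find_all(True)
    ]

def pruned_candidates(url: str, prune) -> list[dict]:
    """Load a page, serialize its DOM, and apply a generated pruning script."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        nodes = serialize_dom(page.content())
        browser.close()
    return prune(nodes)
```

The grounder then only ever sees the few hundred surviving candidates, which is where the token and latency savings come from.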

Limitations & Future Work

  • Script Generation Errors – The LLM occasionally emits syntactically invalid or overly permissive Python filters, requiring a fallback or retry mechanism (sketched after this list).
  • Domain‑Specific Heuristics – The current scoring functions are generic; specialized sites (e.g., canvas‑based UIs) may need custom primitives.
  • Training Data Coverage – The annotation pipeline focuses on a curated set of web tasks; scaling to the full diversity of the web will demand larger, possibly semi‑automated datasets.
  • Dynamic Content – Rapidly changing DOMs (e.g., live feeds) may invalidate a previously generated filter; future work could explore incremental re‑pruning or continuous script adaptation.
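
For the first limitation, a fallback can be as simple as validating the generated script before running it and reverting to the unpruned node list when it fails or keeps too much. A minimal sketch, with illustrative names and thresholds:

```python
# Sketch of the kind of fallback the first limitation calls for: validate the
# generated script, cap what it keeps, and fall back to the full DOM on error.
import ast

def run_filter_safely(script_src: str, nodes: list[dict],
                      max_candidates: int = 200) -> list[dict]:
    try:
        ast.parse(script_src)            # reject syntactically invalid scripts early
        namespace = {}
        exec(script_src, namespace)      # assumption: the script defines prune(nodes)
        candidates = namespace["prune"](nodes)
        if not candidates or len(candidates) > max_candidates:
            raise ValueError("filter kept nothing or was too permissive")
        return candidates
    except Exception:
        # Fallback: hand the grounder the full node list (or retry script generation).
        return nodes
```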

Overall, Prune4Web demonstrates that moving heavy DOM processing out of the LLM’s “brain” and into lightweight, interpretable programs is a game‑changer for web‑automation agents, paving the way for faster, cheaper, and more trustworthy AI‑driven browsers.

Authors

  • Jiayuan Zhang
  • Kaiquan Chen
  • Zhihao Lu
  • Enshen Zhou
  • Qian Yu
  • Jing Zhang

Paper Information

  • arXiv ID: 2511.21398v1
  • Categories: cs.AI, cs.CL, cs.HC, cs.MA
  • Published: November 26, 2025
  • PDF: Download PDF