[Paper] Nested Browser-Use Learning for Agentic Information Seeking

Published: December 29, 2025 at 12:59 PM EST

Source: arXiv - 2512.23647v1

Overview

The paper “Nested Browser‑Use Learning for Agentic Information Seeking” tackles a practical bottleneck in modern AI assistants: most agents can only fetch raw snippets or URLs via APIs, so they miss information that is reachable only through interactive web pages. The authors introduce NestBrowse, a lightweight, hierarchical browser‑action framework that lets an agent control browsing at a high level while still exploring complex, dynamic sites in depth, which supports richer and more reliable information seeking.

Key Contributions

Nested Browser‑Action API – a minimal yet complete set of actions that separates control flow (e.g., “click this button”) from content exploration (e.g., “scroll and read the page”).
NestBrowse Learning Paradigm – trains agents to issue nested actions, allowing them to reason about “when to open a new page” versus “how to extract data from the current page.”
Empirical Validation on Deep‑Web Benchmarks – demonstrates consistent performance gains over traditional ReAct‑style agents on tasks that require multi‑step navigation, form filling, and pagination.
Efficiency & Flexibility Analyses – shows that the nested design reduces the number of required API calls and can be plugged into existing LLM‑based agents with minimal code changes.

Methodology

Action Space Design
- High‑level actions (open_page, close_page) manage the browser stack.
- Low‑level actions (click, type, scroll, extract) operate within the currently active page.
- The nesting creates a tree‑like execution trace: each new page becomes a child node, preserving context while keeping the parent’s reasoning intact.
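The page stack and tree-shaped trace described above can be illustrated with a small sketch. This is not the authors' implementation: `PageNode`, `NestedBrowser`, `act`, and `render` are hypothetical names, the URLs are placeholders, and the "browser" only records actions without loading anything.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PageNode:
    """One opened page in the nested trace (hypothetical structure)."""
    url: str
    parent: PageNode | None = field(default=None, repr=False, compare=False)
    children: list[PageNode] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)  # low-level actions run on this page


class NestedBrowser:
    """Toy stand-in for a browser: it records the trace and loads nothing."""

    LOW_LEVEL = {"click", "type", "scroll", "extract"}

    def __init__(self, start_url: str) -> None:
        self.root = PageNode(start_url)
        self.active = self.root

    # High-level actions manage the page stack.
    def open_page(self, url: str) -> None:
        child = PageNode(url, parent=self.active)
        self.active.children.append(child)
        self.active = child  # the new page becomes the active child node

    def close_page(self) -> None:
        if self.active.parent is None:
            raise RuntimeError("cannot close the root page")
        self.active = self.active.parent  # return to the parent's context

    # Low-level actions apply to the currently active page only.
    def act(self, name: str, arg: str = "") -> None:
        if name not in self.LOW_LEVEL:
            raise ValueError(f"unknown low-level action: {name}")
        self.active.actions.append(f"{name}({arg})")


def render(node: PageNode, depth: int = 0) -> list[str]:
    """Return the trace as indented lines, one per page."""
    lines = ["  " * depth + f"{node.url} {node.actions}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


browser = NestedBrowser("https://example.com/search")
browser.act("type", "query")
browser.open_page("https://example.com/result/1")
browser.act("scroll")
browser.act("extract", "main")
browser.close_page()
browser.act("click", "next")
print("\n".join(render(browser.root)))
```

After `close_page`, later low-level actions land on the parent node again, which is the sense in which the parent's context is preserved in this sketch.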
Training Loop
- The authors generate synthetic browsing trajectories using a rule‑based “oracle” that solves each benchmark task.
- These trajectories are converted into sequences of nested actions and used to fine‑tune a standard LLM (e.g., GPT‑4) with supervised learning.
- During inference, the model predicts the next action; the browser simulator executes it and returns a concise observation (e.g., extracted text or a DOM snapshot); the loop then repeats.
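The inference loop in the last bullet can be sketched as follows. Everything here is an assumption made for illustration: `run_episode`, the dictionary action format, and the `finish` stop action are hypothetical, and a scripted policy and a fake browser stand in for the fine-tuned model and the browser simulator.

```python
from typing import Callable

Action = dict  # hypothetical format, e.g. {"name": "click", "arg": "#next"}


def run_episode(
    policy: Callable[[list[str]], Action],
    execute: Callable[[Action], str],
    task: str,
    max_steps: int = 10,
) -> list[str]:
    """Alternate between predicting an action and observing its result."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action = policy(history)  # the model predicts the next action
        if action["name"] == "finish":  # hypothetical stop action
            history.append(f"answer: {action['arg']}")
            break
        observation = execute(action)  # the simulator runs it
        history.append(f"action: {action['name']}({action['arg']})")
        history.append(f"observation: {observation}")
    return history


def scripted_policy(history: list[str]) -> Action:
    """Stand-in for the model: replays a fixed script, one action per call."""
    script = [
        {"name": "open_page", "arg": "https://example.com/item"},
        {"name": "extract", "arg": "price"},
        {"name": "finish", "arg": "42 USD"},
    ]
    steps_taken = sum(1 for entry in history if entry.startswith("action:"))
    return script[steps_taken]


def fake_browser(action: Action) -> str:
    """Stand-in for the simulator: returns a canned, concise observation."""
    canned = {"open_page": "page loaded", "extract": "price: 42 USD"}
    return canned.get(action["name"], "ok")


trace = run_episode(scripted_policy, fake_browser, "find the item price")
print("\n".join(trace))
```

Keeping the policy and the executor as plain callables makes it easy to swap in a real model or browser later without touching the loop itself.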
Evaluation Setup
- Benchmarks include DeepWebQA, Multi‑Page Retrieval, and Form‑Filling Search, each requiring at least three navigation steps and interaction with dynamic content.
- Baselines: vanilla ReAct agents (API‑only), tool‑calling agents with flat browser actions, and a handcrafted rule‑based crawler.

Results & Findings

Benchmark (accuracy)	NestBrowse	ReAct‑API	Flat‑Browser	Rule‑Crawler
DeepWebQA	78.4 %	62.1 %	71.3 %	55.8 %
Multi‑Page Retrieval	84.7 %	68.9 %	77.5 %	61.2 %
Form‑Filling Search	81.2 %	65.4 %	73.0 %	58.9 %

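Higher accuracy on all three benchmarks: NestBrowse beats the strongest baseline (Flat‑Browser) by 7.1 to 8.2 percentage points and ReAct‑API by roughly 16 points, with the largest gains reported where deep navigation (>3 hops) is required.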
~30 % fewer API calls compared with flat‑browser agents, thanks to the nesting that avoids redundant page reloads.
Robustness to layout changes: the hierarchical context helps the model recover when a page’s DOM shifts after a click.
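The per-baseline margins can be recomputed directly from the table. The snippet below is only a convenience for readers: the scores are copied from the table above, while the dictionary layout and the `margins` helper are assumptions of this sketch.

```python
# Accuracy (%) copied from the results table.
scores = {
    "DeepWebQA": {
        "NestBrowse": 78.4, "ReAct-API": 62.1, "Flat-Browser": 71.3, "Rule-Crawler": 55.8,
    },
    "Multi-Page Retrieval": {
        "NestBrowse": 84.7, "ReAct-API": 68.9, "Flat-Browser": 77.5, "Rule-Crawler": 61.2,
    },
    "Form-Filling Search": {
        "NestBrowse": 81.2, "ReAct-API": 65.4, "Flat-Browser": 73.0, "Rule-Crawler": 58.9,
    },
}


def margins(row: dict[str, float], system: str = "NestBrowse") -> dict[str, float]:
    """Percentage-point lead of `system` over every other entry in the row."""
    return {
        name: round(row[system] - value, 1)
        for name, value in row.items()
        if name != system
    }


for benchmark, row in scores.items():
    print(benchmark, margins(row))
```

Rounding to one decimal place matches the precision of the table and hides floating-point noise from the subtraction.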

Practical Implications

Richer ChatGPT‑style assistants – developers can now embed a NestBrowse module to let the assistant “look up” information that lives behind login walls, infinite scrolls, or interactive charts, delivering more up‑to‑date answers.
Enterprise knowledge retrieval – internal tools that need to scrape data from legacy web portals (e.g., ticketing systems, inventory dashboards) can be automated without writing custom scrapers for each site.
Reduced engineering overhead – the API is intentionally small; integrating it into existing LangChain or LlamaIndex pipelines requires only a few wrapper functions.
Cost efficiency – fewer round‑trips to the browser mean lower compute time and lower API usage bills for hosted LLM services.

Limitations & Future Work

Simulation vs. Real Browsers – Experiments were run on a headless Chromium simulator; performance on heavily JavaScript‑driven sites (e.g., SPAs) may differ.
Scalability of Action Sequences – Very long navigation trees (>10 levels) can still overflow the context window of current LLMs.
Security & Ethics – Automated browsing raises concerns about unintended scraping of copyrighted or private content; the authors call for policy‑aware action filters.
Future directions include extending NestBrowse to multi‑agent collaboration (e.g., one agent handles navigation while another focuses on reasoning) and exploring reinforcement‑learning fine‑tuning to reduce reliance on synthetic oracle trajectories.
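The summary does not say what a policy-aware action filter would look like. One very simple reading, offered purely as an illustration, is a check that runs before any navigation action reaches the browser. The allowlist, the blocked path prefixes, the action format, and the `is_allowed` helper below are all hypothetical.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "docs.example.com"}  # hypothetical allowlist
BLOCKED_PATH_PREFIXES = ("/account", "/admin")  # hypothetical private areas


def is_allowed(action: dict) -> bool:
    """Return True if the action may be sent to the browser."""
    if action["name"] != "open_page":
        return True  # this sketch filters navigation only
    parsed = urlparse(action["arg"])
    if parsed.scheme != "https" or parsed.hostname not in ALLOWED_HOSTS:
        return False
    return not parsed.path.startswith(BLOCKED_PATH_PREFIXES)


candidates = [
    {"name": "open_page", "arg": "https://example.com/articles/1"},
    {"name": "open_page", "arg": "https://example.com/account/settings"},
    {"name": "open_page", "arg": "https://other.example.org/"},
    {"name": "scroll", "arg": ""},
]
for action in candidates:
    print(action["name"], action["arg"], "->", is_allowed(action))
```

A real filter would likely need more than URL checks (for example, honoring site terms or robots rules), which this sketch does not attempt.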

Authors

Baixuan Li
Jialong Wu
Wenbiao Yin
Kuan Li
Zhongwang Zhang
Huifeng Yin
Zhengwei Tao
Liwen Zhang
Pengjun Xie
Jingren Zhou
Yong Jiang

Paper Information

arXiv ID: 2512.23647v1
Categories: cs.CL, cs.AI, cs.IR, cs.MA
Published: December 29, 2025
