Why I Built a Filesystem for the Browser

Published: February 26, 2026 at 05:04 PM EST
6 min read
Source: Dev.to

The three dominant approaches today (and why they’re mismatched)

Approaches and their drawbacks:

Screenshots + vision models
  • Burns vision tokens on every action
  • Adds a full round‑trip per interaction
  • Silent failures when coordinates shift (e.g., a cookie banner moves a button)

CSS selectors / XPath
  • Structural fragility (#main > div:nth-child(3) … breaks when a wrapper is added)
  • Depends on developers adding test IDs ([data-testid="submit"])
  • Agent must reason over raw HTML – thousands of tokens of noise

Coordinate‑based clicks
  • Resolution‑, viewport‑, zoom‑, and responsive‑layout‑dependent
  • Any uncontrolled variable becomes a failure mode

Common problem: All three force the agent to work with a representation that wasn’t designed for programmatic navigation.

The Accessibility (AX) Tree – a ready‑made solution

Browsers already solved “navigate this page without looking at it.” The Accessibility Tree, which screen readers consume, is:

  • Deterministic
  • Semantic
  • Compact

Every button knows it’s a button, every link carries its href, every input has a label and type. No invisible wrapper <div>s, no CSS noise, no layout‑dependent coordinates.

The AX tree is the low‑entropy, structured signal agents need. The question was how to expose it.

Mapping the AX tree to a filesystem

The AX tree has a natural hierarchy:

  • Containers (navigation, main content, sidebars, forms) → directories
  • Interactive elements (buttons, links, inputs) → files

This maps cleanly to a filesystem – and every LLM already knows how to operate one.
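
As a rough illustration of that mapping, the sketch below sorts AX roles into directories and files. The role lists and the names VfsKind and classifyRole are assumptions made for this example, not DOMShell's actual tables.

```typescript
// Hypothetical kinds for this sketch; DOMShell's real types may differ.
type VfsKind = "directory" | "file";

// Illustrative role lists (assumptions), covering the examples in the text.
const CONTAINER_ROLES = new Set<string>([
  "navigation", "main", "complementary", "form", "contentinfo", "search",
]);
const INTERACTIVE_ROLES = new Set<string>([
  "button", "link", "textbox", "checkbox", "combobox",
]);

// Containers become directories, interactive elements become files.
// Anything else returns null, meaning "not a filesystem entry on its own".
function classifyRole(role: string): VfsKind | null {
  if (CONTAINER_ROLES.has(role)) return "directory";
  if (INTERACTIVE_ROLES.has(role)) return "file";
  return null;
}

const mainKind = classifyRole("main");       // "directory"
const buttonKind = classifyRole("button");   // "file"
const wrapperKind = classifyRole("generic"); // null
```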

Command   Purpose
ls        List what’s on a page
cd        Scope into a section
cat       Inspect an element
grep      Search
find      Discover by type
click     Interact
text      Bulk‑extract

These commands appear in every model’s training data → zero‑shot usability.
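
To make the ls and cd semantics concrete, here is a minimal in-memory sketch. The VfsNode shape and the helper names are assumptions for illustration; the [d] and [x] markers mirror the session output shown in this post.

```typescript
// Hypothetical node shape for this sketch.
interface VfsNode {
  name: string;
  kind: "directory" | "file";
  children: VfsNode[];
}

function dir(name: string, children: VfsNode[]): VfsNode {
  return { name, kind: "directory", children };
}

function file(name: string): VfsNode {
  return { name, kind: "file", children: [] };
}

// cd: resolve a slash-separated relative path, one directory at a time.
function cd(cwd: VfsNode, path: string): VfsNode {
  let node = cwd;
  for (const part of path.split("/").filter((p) => p.length > 0)) {
    const next = node.children.find((c) => c.name === part && c.kind === "directory");
    if (next === undefined) throw new Error(`cd: no such directory: ${part}`);
    node = next;
  }
  return node;
}

// ls: one line per entry, [d] for directories and [x] for interactive elements.
function ls(cwd: VfsNode): string[] {
  return cwd.children.map((c) => (c.kind === "directory" ? `[d] ${c.name}/` : `[x] ${c.name}`));
}

const root = dir("", [
  dir("main", [
    dir("search", [file("search_input")]),
    file("read_wikipedia_in_your_language_btn"),
  ]),
  dir("contentinfo", []),
]);

const top = ls(root);                        // ["[d] main/", "[d] contentinfo/"]
const scoped = ls(cd(root, "main/search"));  // ["[x] search_input"]
```

Because the current directory is just a node reference, scoping into a section shrinks every later listing to that subtree.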

Example session

dom@shell:$ cd %here%
 Entered tab 386872589
  Title: Wikipedia
  URL:   https://www.wikipedia.org/

dom@shell:$ ls
[d] main/
[d] contentinfo/

dom@shell:$ cd main
dom@shell:$ tree 2
main/
├── [d] top_languages/
│   ├── [x] english_7141000_articles_link
│   ├── [x] deutsch_3099000_artikel_link
│   ├── [x] français_2740000_articles_link
│   └──
├── [d] search/
│   └── [x] search_input
└── [x] read_wikipedia_in_your_language_btn

The page is now a directory tree.
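
A depth-limited listing like the tree 2 output above can be produced with a short recursive walk. This is a sketch under assumed names (TreeNode, renderTree), not DOMShell's renderer, and the sample entry names are shortened.

```typescript
// Hypothetical node shape for this sketch.
interface TreeNode {
  name: string;
  isDir: boolean;
  children: TreeNode[];
}

// Render `node` and its descendants down to `maxDepth` levels.
function renderTree(node: TreeNode, maxDepth: number): string[] {
  const lines: string[] = [`${node.name}/`];

  const walk = (parent: TreeNode, prefix: string, depth: number): void => {
    if (depth > maxDepth) return;
    parent.children.forEach((child, i) => {
      const last = i === parent.children.length - 1;
      const label = child.isDir ? `[d] ${child.name}/` : `[x] ${child.name}`;
      lines.push(`${prefix}${last ? "└── " : "├── "}${label}`);
      // Children of a non-last entry keep the vertical guide line.
      if (child.isDir) walk(child, prefix + (last ? "    " : "│   "), depth + 1);
    });
  };

  walk(node, "", 1);
  return lines;
}

const sample: TreeNode = {
  name: "main", isDir: true, children: [
    { name: "top_languages", isDir: true, children: [
      { name: "english_link", isDir: false, children: [] },
      { name: "deutsch_link", isDir: false, children: [] },
    ]},
    { name: "search", isDir: true, children: [
      { name: "search_input", isDir: false, children: [] },
    ]},
    { name: "read_wikipedia_in_your_language_btn", isDir: false, children: [] },
  ],
};

const depthTwo = renderTree(sample, 2); // 7 lines, nested entries included
const depthOne = renderTree(sample, 1); // 4 lines, top level only
```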

submit search_input "Artificial intelligence"

The page navigates, the tree auto‑refreshes, and you’re looking at the article’s filesystem.

No screenshots. No coordinates. No selectors.

Cleaning up the raw AX tree

The raw AX tree is noisy: hundreds of wrapper nodes (role=generic, role=none, unnamed <div>s) exist for CSS layout, not semantics. Without filtering you’d see generic_1, generic_2, … with no useful meaning.

DOMShell’s VFS mapper (vfs_mapper.ts) recursively flattens non‑semantic nodes, promoting their children up:

  • If a role=generic node has a single child, the child replaces it.
  • Visible elements get a name derived from their accessible name and role (submit_btn, contact_us_link, email_input).
  • Duplicates are disambiguated with _2, _3, etc.
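
The three rules above can be sketched in a few functions. This is not the code in vfs_mapper.ts: the AxNode and VfsEntry shapes, the suffix table, and the ASCII-only slug are assumptions made to keep the example small.

```typescript
// Hypothetical shape of a raw AX node; real CDP nodes carry many more fields.
interface AxNode {
  role: string;
  name: string; // accessible name, "" if none
  children: AxNode[];
}

interface VfsEntry {
  name: string;
  role: string;
  children: VfsEntry[];
}

const NON_SEMANTIC = new Set<string>(["generic", "none"]);
// Illustrative suffix table; the article only shows _btn, _link, and _input.
const SUFFIX = new Map<string, string>([["button", "btn"], ["link", "link"], ["textbox", "input"]]);

// ASCII-only simplification (assumption): the real mapper evidently keeps
// non-ASCII letters, as the "français_..." entry in the session shows.
function slug(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
}

// Name = accessible name + role suffix, falling back to the role alone.
function entryName(node: AxNode): string {
  const base = slug(node.name);
  const suffix = SUFFIX.get(node.role);
  if (suffix === undefined) return base.length > 0 ? base : slug(node.role);
  return base.length > 0 ? `${base}_${suffix}` : suffix;
}

// Disambiguate repeated sibling names with _2, _3, ...
function dedupe(entries: VfsEntry[]): VfsEntry[] {
  const seen = new Map<string, number>();
  return entries.map((e) => {
    const count = (seen.get(e.name) ?? 0) + 1;
    seen.set(e.name, count);
    return count === 1 ? e : { ...e, name: `${e.name}_${count}` };
  });
}

// Returns the entries that stand in for `node`: a non-semantic wrapper
// disappears and its (already flattened) children are promoted.
function flatten(node: AxNode): VfsEntry[] {
  const kids: VfsEntry[] = [];
  for (const child of node.children) kids.push(...flatten(child));
  if (NON_SEMANTIC.has(node.role)) return kids;
  return [{ name: entryName(node), role: node.role, children: dedupe(kids) }];
}

const raw: AxNode = {
  role: "main", name: "", children: [
    { role: "generic", name: "", children: [{ role: "button", name: "Submit", children: [] }] },
    { role: "none", name: "", children: [
      { role: "generic", name: "", children: [
        { role: "button", name: "Submit", children: [] },
        { role: "link", name: "Contact us", children: [] },
      ]},
    ]},
    { role: "textbox", name: "Email", children: [] },
  ],
};

const listing = flatten(raw).map((d) => `${d.name}: ${d.children.map((c) => c.name).join(" ")}`);
// → ["main: submit_btn submit_btn_2 contact_us_link email_input"]
```

Three wrapper nodes vanish from the listing, and the two Submit buttons stay individually addressable.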

Design decision: Minimizing node bloat maximizes the agent’s signal‑to‑noise ratio. Every flattened wrapper node is a token the model doesn’t waste reasoning about.

Architecture – three cleanly separated components

  1. Chrome Extension (the kernel)

    • Background service worker runs the shell: command parsing, AX tree traversal via CDP, filesystem mapping, DOM‑change detection.
    • Side‑panel is a thin terminal (React + Xterm.js) – only I/O, no logic.
    • Reads the AX tree through chrome.debugger (Chrome DevTools Protocol 1.3), including cross‑iframe discovery via Page.getFrameTree.
  2. MCP Server (the bridge)

    • Standalone Node.js HTTP server on localhost:3001.
    • Any MCP‑compatible client (Claude Desktop, Claude Code, Cursor, Windsurf, Gemini CLI) connects.
    • Translates MCP tool calls into shell commands, pipes them to the extension over WebSocket (localhost:9876), streams results back.
    • Supports multiple simultaneous clients.
  3. Security tiers

    • Read‑only by default – agents can browse but not act.
    • Write commands (click, type, scroll, js) require --allow-write.
    • Sensitive commands (e.g., whoami for cookies) require --allow-sensitive.
    • Domain allow‑lists restrict which sites agents can operate on.
    • Every command is audit‑logged with timestamps.
    • Auth tokens gate the WebSocket bridge.
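
One way to picture the tier checks is a single gate function, sketched below. The Policy shape and function names are assumptions; only the command-to-tier assignments named in this post are included, and this sketch denies any unlisted command (including submit, whose tier the post does not state). Treating an empty allow-list as "no restriction" and the two flags as independent are also assumptions.

```typescript
type Tier = "read" | "write" | "sensitive";

// Hypothetical policy object mirroring the flags described above.
interface Policy {
  allowWrite: boolean;      // --allow-write
  allowSensitive: boolean;  // --allow-sensitive
  allowedDomains: string[]; // empty = no domain restriction (assumption)
}

const COMMAND_TIER = new Map<string, Tier>([
  ["ls", "read"], ["cd", "read"], ["cat", "read"], ["grep", "read"],
  ["find", "read"], ["text", "read"], ["tree", "read"],
  ["click", "write"], ["type", "write"], ["scroll", "write"], ["js", "write"],
  ["whoami", "sensitive"],
]);

// A host matches an allow-list entry exactly or as a subdomain of it.
function domainAllowed(hostname: string, allowed: string[]): boolean {
  if (allowed.length === 0) return true;
  return allowed.some((d) => hostname === d || hostname.endsWith(`.${d}`));
}

function isAllowed(command: string, hostname: string, policy: Policy): boolean {
  const tier = COMMAND_TIER.get(command);
  if (tier === undefined) return false; // unknown commands are denied
  if (!domainAllowed(hostname, policy.allowedDomains)) return false;
  if (tier === "write") return policy.allowWrite;
  if (tier === "sensitive") return policy.allowSensitive;
  return true; // read tier is on by default
}

const readOnly: Policy = { allowWrite: false, allowSensitive: false, allowedDomains: ["wikipedia.org"] };
const canList = isAllowed("ls", "www.wikipedia.org", readOnly);     // true
const canClick = isAllowed("click", "www.wikipedia.org", readOnly); // false
```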

The separation is deliberate: you can use DOMShell interactively without the MCP server, or let an agent browse your tabs without giving it the ability to click “Delete Account”.
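
The bridge's translation step (MCP tool call in, shell command line out) might look roughly like the sketch below. The ToolCall shape and the quoting rules are assumptions; the post does not document DOMShell's tool schema or how its shell parses escapes.

```typescript
// Hypothetical tool-call payload; the real MCP tool schema may differ.
interface ToolCall {
  command: string;
  args: string[];
}

// Quote an argument only when it is empty or contains whitespace, quotes,
// or backslashes. Backslash-escaping inside quotes is an assumption here.
function quoteArg(arg: string): string {
  if (arg.length > 0 && !/[\s"\\]/.test(arg)) return arg;
  return `"${arg.replace(/(["\\])/g, "\\$1")}"`;
}

// Build the single command line that would be piped to the extension.
function toCommandLine(call: ToolCall): string {
  return [call.command, ...call.args.map(quoteArg)].join(" ");
}

const line = toCommandLine({ command: "submit", args: ["search_input", "Artificial intelligence"] });
// → submit search_input "Artificial intelligence"
```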

Performance results

I ran 8 trials across 4 tasks using Claude Opus 4.6 with both DOMShell and Anthropic’s built‑in browser automation (Claude in Chrome).

Metric: Tool‑call count. Each call is an API round‑trip, so the count tracks both latency and API cost.

System             Avg. calls per task
DOMShell           4.3
Claude in Chrome   8.6

Result: 50 % reduction in calls.
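
The 50 % figure follows directly from the two averages. As a quick check (the function name is only illustrative):

```typescript
// Averages reported above.
const domShellCalls = 4.3;
const claudeInChromeCalls = 8.6;

// Fraction of tool calls saved relative to the baseline.
function relativeReduction(baseline: number, candidate: number): number {
  return (baseline - candidate) / baseline;
}

const saved = relativeReduction(claudeInChromeCalls, domShellCalls); // 0.5, i.e. 50 %
```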

The biggest win was content extraction: DOMShell completed it in 2 calls (navigate + extract) where Claude‑in‑Chrome needed 5‑6.

TL;DR

  • The Accessibility Tree is a deterministic, semantic, low‑entropy representation of a page.
  • Mapping it to a virtual filesystem gives agents a familiar, zero‑shot interface (ls, cd, click, …).
  • Flattening non‑semantic nodes maximizes signal‑to‑noise.
  • A three‑part architecture (Chrome extension, MCP bridge, security tiers) keeps the system flexible and safe.
  • In practice, DOMShell halves the number of tool calls needed for typical browsing tasks.

Overview

The new approach lets the agent scope to the right section (cd main/article) and bulk‑extract (text) in a single call, instead of navigating through read_page results iteratively.

Why DOMShell shines

  • Raw JavaScript execution – javascript_exec can batch multiple DOM operations into one call.
  • Compound pipeline – The for + script + each pipeline collapses multi‑page workflows into 1‑2 calls by iterating over command output and replaying saved scripts across URLs.

Impact

  • 50 % reduction in tool calls → direct savings on cost and latency.
  • For production agents (where every tool call is an API round‑trip), halving the call count is a meaningful operational improvement.

Open‑source & Roadmap

  • DOMShell is open source (MIT) and free.
  • Roadmap: a headless mode – a self‑contained Chromium process that agents can launch directly for CI pipelines and server‑side automation where no visible browser is needed.

Compound Efficiency Gains

The for + script + each pipeline is where the real gains live:

  1. Save a command sequence as a script.
  2. Replay it across N URLs in a single call.

Result: roughly 2N tool calls (two per URL) collapse to about 2, so the call count goes from O(N) to O(1).
For any agent performing research, extraction, or monitoring across multiple pages, this is a step change in efficiency.
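
A small model of the call counts shows how quickly the gap widens. It assumes two calls per URL in the per-page workflow (for example navigate + extract) and exactly two calls in the batched one (save the script, replay it). The function names are illustrative only.

```typescript
// Per-page workflow: every URL costs a fixed number of tool calls.
function naiveCalls(urlCount: number, callsPerUrl: number = 2): number {
  return urlCount * callsPerUrl;
}

// Batched workflow: one call saves the script, one call replays it over all URLs.
function batchedCalls(urlCount: number): number {
  return urlCount > 0 ? 2 : 0;
}

const comparison = [1, 10, 100].map((n) => ({
  urls: n,
  naive: naiveCalls(n),
  batched: batchedCalls(n),
}));
// → naive: 2, 20, 200; batched: 2, 2, 2
```

At a single URL the two workflows cost the same, so the pipeline only pays off from the second page onward.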

Quick Reference

# Example: scope into the article section, then bulk-extract its text
cd main/article
text

The browser is your filesystem.
ls it.

GitHub

Project: DOMShell – Pireno
