Why I Built a Filesystem for the Browser
Source: Dev.to
The three dominant approaches today (and why they’re mismatched)
| Approach | Drawbacks |
|---|---|
| Screenshots + vision models | • Burns vision tokens on every action • Adds a full round‑trip per interaction • Silent failures when coordinates shift (e.g., a cookie banner moves a button) |
| CSS selectors / XPath | • Structural fragility (#main > div:nth-child(3) … breaks when a wrapper is added) • Depends on developers adding test IDs ([data-testid="submit"]) • Agent must reason over raw HTML – thousands of tokens of noise |
| Coordinate‑based clicks | • Resolution‑, viewport‑, zoom‑, and responsive‑layout‑dependent • Any uncontrolled variable becomes a failure mode |
Common problem: All three force the agent to work with a representation that wasn’t designed for programmatic navigation.
The Accessibility (AX) Tree – a ready‑made solution
Browsers already solved “navigate this page without looking at it.” The Accessibility Tree, which screen readers consume, is:
- Deterministic
- Semantic
- Compact
Every button knows it’s a button, every link carries its href, every input has a label and type. No invisible wrapper <div>s, no CSS noise, no layout‑dependent coordinates.
The AX tree is the low‑entropy, structured signal agents need. The question was how to expose it.
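To make "compact and semantic" concrete, here is a minimal TypeScript sketch of reading the tree over the Chrome DevTools Protocol. `Accessibility.enable` and `Accessibility.getFullAXTree` are real CDP methods; the `CdpSend` type, `fetchAxNodes`, and `describeNode` are hypothetical names chosen for illustration and are not taken from DOMShell's source. Dropping ignored nodes is also an assumption of this sketch.

```typescript
// Subset of the CDP Accessibility.AXNode shape used in this sketch.
interface AXValue { type: string; value?: unknown }
interface AXNode {
  nodeId: string;
  ignored: boolean;
  role?: AXValue;
  name?: AXValue;
  childIds?: string[];
}

// Hypothetical transport. In an extension this would wrap
// chrome.debugger.sendCommand({ tabId }, method, params).
type CdpSend = (method: string, params?: object) => Promise<unknown>;

// Fetch the AX tree of the attached tab and drop nodes the browser marks as ignored.
async function fetchAxNodes(send: CdpSend): Promise<AXNode[]> {
  await send("Accessibility.enable");
  const result = (await send("Accessibility.getFullAXTree")) as { nodes: AXNode[] };
  return result.nodes.filter((n) => !n.ignored);
}

// Render one node as "role: name", the two fields the rest of the post relies on.
function describeNode(node: AXNode): string {
  const role = typeof node.role?.value === "string" ? node.role.value : "unknown";
  const name = typeof node.name?.value === "string" ? node.name.value : "";
  return name ? `${role}: ${name}` : role;
}
```

Injecting `send` instead of calling `chrome.debugger` directly keeps the sketch runnable outside an extension, for example against a mock in a unit test.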
Mapping the AX tree to a filesystem
The AX tree has a natural hierarchy:
- Containers (navigation, main content, sidebars, forms) → directories
- Interactive elements (buttons, links, inputs) → files
This maps cleanly to a filesystem – and every LLM already knows how to operate one.
| Command | Purpose |
|---|---|
| ls | List what’s on a page |
| cd | Scope into a section |
| cat | Inspect an element |
| grep | Search |
| find | Discover by type |
| click | Interact |
| text | Bulk‑extract |
These commands appear in every model’s training data → zero‑shot usability.
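The directory/file split can be sketched as a simple role lookup. The role sets below are illustrative assumptions, the `[d]` and `[x]` markers are copied from the session output later in this post, and `VfsEntry`, `kindOf`, and `ls` are hypothetical names rather than DOMShell's actual code.

```typescript
// Roles treated as containers (directories) vs. interactive elements (files).
// These sets are assumptions for the sketch, not DOMShell's real tables.
const CONTAINER_ROLES = new Set(["navigation", "main", "complementary", "form", "contentinfo", "search"]);
const INTERACTIVE_ROLES = new Set(["button", "link", "textbox", "checkbox", "combobox"]);

interface VfsEntry {
  name: string;
  role: string;
  children: VfsEntry[];
}

type EntryKind = "dir" | "file" | "other";

function kindOf(entry: VfsEntry): EntryKind {
  if (CONTAINER_ROLES.has(entry.role)) return "dir";
  if (INTERACTIVE_ROLES.has(entry.role)) return "file";
  return "other"; // e.g. static text, which this sketch does not list
}

// List the immediate children of a directory, one formatted line per entry.
function ls(dir: VfsEntry): string[] {
  return dir.children
    .filter((c) => kindOf(c) !== "other")
    .map((c) => (kindOf(c) === "dir" ? `[d] ${c.name}/` : `[x] ${c.name}`));
}

// A tiny hand-built page to list.
const page: VfsEntry = {
  name: "",
  role: "RootWebArea",
  children: [
    { name: "main", role: "main", children: [] },
    { name: "contentinfo", role: "contentinfo", children: [] },
  ],
};
console.log(ls(page).join("\n"));
```

Because the classification is a pure lookup on the role, the same page should always produce the same listing, which is the determinism argument made earlier.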
Example session
dom@shell:$ cd %here%
✓ Entered tab 386872589
Title: Wikipedia
URL: https://www.wikipedia.org/
dom@shell:$ ls
[d] main/
[d] contentinfo/
dom@shell:$ cd main
dom@shell:$ tree 2
main/
├── [d] top_languages/
│ ├── [x] english_7141000_articles_link
│ ├── [x] deutsch_3099000_artikel_link
│ ├── [x] français_2740000_articles_link
│ └── …
├── [d] search/
│ └── [x] search_input
└── [x] read_wikipedia_in_your_language_btn
The page is now a directory tree.
submit search_input "Artificial intelligence"
Navigates, the tree auto‑refreshes, and you’re looking at the article’s filesystem.
No screenshots. No coordinates. No selectors.
Cleaning up the raw AX tree
The raw AX tree is noisy: hundreds of wrapper nodes (role=generic, role=none, unnamed <div>s) exist for CSS layout, not semantics. Without filtering you’d see generic_1, generic_2, … with no useful meaning.
DOMShell’s VFS mapper (vfs_mapper.ts) recursively flattens non‑semantic nodes, promoting their children up:
- If a role=generic node has a single child, the child replaces it.
- Visible elements get a name derived from their accessible name and role (submit_btn, contact_us_link, email_input).
- Duplicates are disambiguated with _2, _3, etc.
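The three rules above can be sketched in a few lines of TypeScript. This is not the contents of vfs_mapper.ts: the `RawNode` and `NamedNode` types, the suffix table, and the slug rules are assumptions, and the sketch promotes all children of an unnamed wrapper (which covers the single-child case described above).

```typescript
interface RawNode { role: string; name: string; children: RawNode[] }
interface NamedNode { file: string; role: string; children: NamedNode[] }

const NON_SEMANTIC = new Set(["generic", "none"]);
// Hypothetical role-to-suffix table, inferred from the example names above.
const SUFFIX: { [role: string]: string | undefined } = { button: "btn", link: "link", textbox: "input" };

// Replace unnamed wrapper nodes with their (already flattened) children.
function flatten(node: RawNode): RawNode[] {
  const kids: RawNode[] = [];
  for (const child of node.children) kids.push(...flatten(child));
  if (NON_SEMANTIC.has(node.role) && node.name === "") return kids;
  return [{ role: node.role, name: node.name, children: kids }];
}

// "Contact Us" + link -> "contact_us_link"; unnamed nodes fall back to the suffix alone.
function baseName(node: RawNode): string {
  const slug = node.name
    .toLowerCase()
    .replace(/[\s.,:;!?'"()\/-]+/g, "_")
    .replace(/^_+|_+$/g, "");
  const suffix = SUFFIX[node.role] ?? node.role;
  return slug ? `${slug}_${suffix}` : suffix;
}

// Name siblings, appending _2, _3, ... when a name repeats within the same parent.
function assignNames(nodes: RawNode[]): NamedNode[] {
  const seen = new Map<string, number>();
  return nodes.map((n) => {
    const base = baseName(n);
    const count = (seen.get(base) ?? 0) + 1;
    seen.set(base, count);
    return {
      file: count === 1 ? base : `${base}_${count}`,
      role: n.role,
      children: assignNames(n.children),
    };
  });
}
```

Calling `assignNames(flatten(root))` on a raw tree yields the named hierarchy; two "Submit" buttons under the same parent would come out as `submit_btn` and `submit_btn_2`.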
Design decision: Minimizing node bloat maximizes the agent’s signal‑to‑noise ratio. Every flattened wrapper node is a token the model doesn’t waste reasoning about.
Architecture – three cleanly separated components
- Chrome Extension (the kernel)
  - Background service worker runs the shell: command parsing, AX tree traversal via CDP, filesystem mapping, DOM‑change detection.
  - Side‑panel is a thin terminal (React + Xterm.js) – only I/O, no logic.
  - Reads the AX tree through chrome.debugger (Chrome DevTools Protocol 1.3), including cross‑iframe discovery via Page.getFrameTree.
- MCP Server (the bridge)
  - Standalone Node.js HTTP server on localhost:3001.
  - Any MCP‑compatible client (Claude Desktop, Claude Code, Cursor, Windsurf, Gemini CLI) connects.
  - Translates MCP tool calls into shell commands, pipes them to the extension over WebSocket (localhost:9876), streams results back.
  - Supports multiple simultaneous clients.
- Security tiers
  - Read‑only by default – agents can browse but not act.
  - Write commands (click, type, scroll, js) require --allow-write.
  - Sensitive commands (e.g., whoami for cookies) require --allow-sensitive.
  - Domain allow‑lists restrict which sites agents can operate on.
  - Every command is audit‑logged with timestamps.
  - Auth tokens gate the WebSocket bridge.
The separation is deliberate: you can use DOMShell interactively without the MCP server, or let an agent browse your tabs without giving it the ability to click “Delete Account”.
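The tiering can be pictured as a small gate in front of the shell. The sketch below is an assumption about how such a check might look, not DOMShell's implementation: the `Policy` shape and function names are hypothetical, the read-command list is taken from the command table and session in this post, and the choices to fail closed on unknown commands and to treat the two flags as independent are mine.

```typescript
interface Policy {
  allowWrite: boolean;      // mirrors --allow-write
  allowSensitive: boolean;  // mirrors --allow-sensitive
  allowedDomains: string[]; // an empty list denies every host in this sketch
}

const READ_COMMANDS = new Set(["ls", "cd", "cat", "grep", "find", "text", "tree"]);
const WRITE_COMMANDS = new Set(["click", "type", "scroll", "js"]);
const SENSITIVE_COMMANDS = new Set(["whoami"]);

// A host passes if it equals an allowed domain or is a subdomain of one.
function hostAllowed(host: string, allowed: string[]): boolean {
  return allowed.some((d) => host === d || host.endsWith(`.${d}`));
}

function isAllowed(command: string, host: string, policy: Policy): boolean {
  if (!hostAllowed(host, policy.allowedDomains)) return false;
  if (READ_COMMANDS.has(command)) return true;
  if (WRITE_COMMANDS.has(command)) return policy.allowWrite;
  if (SENSITIVE_COMMANDS.has(command)) return policy.allowSensitive;
  return false; // unknown commands fail closed (assumption)
}

const readOnly: Policy = { allowWrite: false, allowSensitive: false, allowedDomains: ["wikipedia.org"] };
console.log(isAllowed("ls", "www.wikipedia.org", readOnly));    // browsing is permitted
console.log(isAllowed("click", "www.wikipedia.org", readOnly)); // acting is not
```

Keeping the decision in one pure function also gives a natural place to hang the audit log: every call has the command, the host, and the verdict in hand.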
Performance results
I ran 8 trials across 4 tasks using Claude Opus 4.6 with both DOMShell and Anthropic’s built‑in browser automation (Claude in Chrome).
Metric: Tool‑call count – directly proportional to latency and API cost.
| System | Avg. calls per task |
|---|---|
| DOMShell | 4.3 |
| Claude in Chrome | 8.6 |
Result: 50 % reduction in calls.
The biggest win was content extraction: DOMShell completed it in 2 calls (navigate + extract) where Claude‑in‑Chrome needed 5‑6.
TL;DR
- The Accessibility Tree is a deterministic, semantic, low‑entropy representation of a page.
- Mapping it to a virtual filesystem gives agents a familiar, zero‑shot interface (ls, cd, click, …).
- Flattening non‑semantic nodes maximizes signal‑to‑noise.
- A three‑part architecture (Chrome extension, MCP bridge, security tiers) keeps the system flexible and safe.
- In the benchmark above (8 trials across 4 tasks), DOMShell halved the number of tool calls per task.
Overview
The new approach lets the agent scope to the right section (cd main/article) and bulk‑extract (text) in a single call, instead of navigating through read_page results iteratively.
Why DOMShell shines
- Raw JavaScript execution – javascript_exec can batch multiple DOM operations into one call.
- Compound pipeline – The for + script + each pipeline collapses multi‑page workflows into 1‑2 calls by iterating over command output and replaying saved scripts across URLs.
Impact
- 50 % reduction in tool calls → direct savings on cost and latency.
- For production agents (where every tool call is an API round‑trip), halving the call count is a meaningful operational improvement.
Open‑source & Roadmap
- DOMShell is open source (MIT) and free.
- Roadmap: a headless mode – a self‑contained Chromium process that agents can launch directly for CI pipelines and server‑side automation where no visible browser is needed.
Compound Efficiency Gains
The for + script + each pipeline is where the real gains live:
- Save a command sequence as a script.
- Replay it across N URLs in a single call.
Result: roughly 2N tool calls become about 2, so the call count goes from O(N) to O(1).
For any agent performing research, extraction, or monitoring across multiple pages, this is a step change in efficiency.
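A back-of-the-envelope model of that claim, assuming the per-page approach costs one navigate call plus one extract call per URL (as in the content-extraction task), and the compound pipeline costs one call to save the script and one to replay it. The function names are hypothetical and the model ignores retries and pagination.

```typescript
// Tool calls when each URL is handled separately: navigate + extract per page.
function perPageCalls(urlCount: number): number {
  return 2 * urlCount;
}

// Tool calls with a saved script replayed across all URLs:
// one call to save the script, one call to run it over the list (assumed).
function pipelineCalls(urlCount: number): number {
  return urlCount === 0 ? 0 : 2;
}

for (const n of [1, 10, 50]) {
  console.log(`${n} URLs: ${perPageCalls(n)} calls per-page vs ${pipelineCalls(n)} with the pipeline`);
}
```

Under these assumptions the saving grows linearly with the number of pages, so the pipeline matters most for monitoring and bulk-extraction jobs.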
Quick Reference
# Example: scope into the article, then bulk‑extract its text
cd main/article
text
The browser is your filesystem. ls it.
GitHub
Project: DOMShell – Pireno