Inside domharvest-playwright: How I Architected a Production-Ready Web Scraping Tool
Source: Dev.to
The Core Architecture
domharvest-playwright is built around three main components:
- DOMHarvester Class – The main orchestrator
- Browser Management – Playwright lifecycle handling
- Data Extraction Pipeline – Selector‑based harvesting
Design Principles
Simplicity First
Every architectural decision prioritized simplicity over cleverness. No over‑abstraction, no unnecessary patterns.
Fail Fast, Fail Clear
Errors should be obvious and actionable. No silent failures.
Composability
Small, focused methods that can be combined for complex workflows.
Browser Lifecycle Management
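const playwright = require('playwright')

class DOMHarvester {
  // Launch the browser and open one context; call this before harvesting.
  async init(options = {}) {
    this.browser = await playwright.chromium.launch({
      headless: options.headless ?? true,
      ...options.browserOptions
    })
    this.context = await this.browser.newContext(options.contextOptions)
  }

  // Close the context first, then the browser; safe to call even if init() never ran.
  async close() {
    await this.context?.close()
    await this.browser?.close()
  }
}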
Why this approach?
- Explicit initialization gives users control.
- Separate context management enables multiple sessions.
- Clean shutdown prevents resource leaks.
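One way to make the clean shutdown hard to forget is to wrap the lifecycle in a small helper. The sketch below is only an illustration: `withHarvester` is a hypothetical name, not part of the library, and it assumes nothing beyond the `init()` and `close()` methods shown above.

```javascript
// Hypothetical helper, not part of domharvest-playwright.
// `harvester` is any object exposing the init() and close() methods shown above.
async function withHarvester(harvester, options, task) {
  try {
    await harvester.init(options)
    // Run the caller's work while the browser is open.
    return await task(harvester)
  } finally {
    // Always release the context and browser, even if init() or task() throws.
    await harvester.close()
  }
}

// Usage sketch (assumes a DOMHarvester instance):
// const hrefs = await withHarvester(new DOMHarvester(), { headless: true },
//   (h) => h.harvest('https://example.com', 'a', (el) => el.getAttribute('href')))
```

Because `close()` uses optional chaining, calling it after a failed `init()` is harmless, which is what allows `init()` to sit inside the `try` block.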
The Harvesting Pipeline
The core harvest() method follows a straightforward flow:
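async harvest(url, selector, extractor) {
  const page = await this.context.newPage()
  try {
    await page.goto(url, { waitUntil: 'networkidle' })
    // Collect a handle for every element matching the selector.
    const elements = await page.$$(selector)
    const results = []
    // Run the extractor on each element in turn, inside the page.
    for (const element of elements) {
      const data = await element.evaluate(extractor)
      results.push(data)
    }
    return results
  } finally {
    // Close the page whether extraction succeeded or threw.
    await page.close()
  }
}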
Key decisions
- waitUntil: 'networkidle' balances speed and reliability.
- Sequential processing prevents race conditions.
- finally block ensures cleanup even on errors.
- Extractor function runs in the browser context for performance.
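To make the last point concrete, here is a sketch of what an extractor might look like. `extractLink` is a hypothetical example, not something the library ships. Because Playwright serializes the function and runs it in the page, it can only rely on its element argument and browser globals, and it should return JSON-serializable data.

```javascript
// Hypothetical extractor for anchor elements. It runs inside the page, so it
// cannot reference variables from the surrounding Node.js scope.
const extractLink = (el) => ({
  text: (el.textContent || '').trim(),
  href: el.getAttribute('href')
})

// Usage sketch, assuming an initialized harvester:
// const links = await harvester.harvest('https://example.com', 'a[href]', extractLink)
// `links` would then hold one { text, href } object per matched element.
```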
Error Handling Strategy
try {
  await page.goto(url, {
    waitUntil: 'networkidle',
    timeout: 30000
  })
} catch (error) {
  if (error.name === 'TimeoutError') {
    throw new Error(`Failed to load ${url}: timeout after 30s`)
  }
  throw error
}
I wrap Playwright errors with context‑specific messages to help users debug without diving into stack traces.
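The same idea can be taken one step further with a dedicated error type that keeps the original Playwright error attached. This is a sketch under assumptions: `NavigationTimeoutError` and `wrapNavigationError` are hypothetical names, and the actual contents of utils/errors.js are not shown in this article.

```javascript
// Hypothetical error type; carries the URL and timeout alongside the message.
class NavigationTimeoutError extends Error {
  constructor(url, timeoutMs, cause) {
    // `cause` keeps the original Playwright error reachable for debugging.
    super(`Failed to load ${url}: timeout after ${timeoutMs / 1000}s`, { cause })
    this.name = 'NavigationTimeoutError'
    this.url = url
    this.timeoutMs = timeoutMs
  }
}

// Translate a Playwright timeout into the friendlier error;
// return any other error unchanged so nothing is silently swallowed.
function wrapNavigationError(error, url, timeoutMs) {
  if (error && error.name === 'TimeoutError') {
    return new NavigationTimeoutError(url, timeoutMs, error)
  }
  return error
}

// Usage sketch inside a catch block:
// throw wrapNavigationError(error, url, 30000)
```

Deriving the message from `timeoutMs` avoids the text drifting out of sync with the configured timeout.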
Custom Extraction Support
Beyond selector‑based harvesting, harvestCustom() allows arbitrary page evaluation:
async harvestCustom(url, evaluator) {
  const page = await this.context.newPage()
  try {
    await page.goto(url, { waitUntil: 'networkidle' })
    return await page.evaluate(evaluator)
  } finally {
    await page.close()
  }
}
This enables complex scenarios like:
- Multi‑step interactions
- Conditional logic based on page state
- Aggregating data from multiple sources
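As an illustration of conditional logic and aggregation, an evaluator could look like the sketch below. `summarizePage` and the selectors it uses are hypothetical, chosen only for the example. The function executes in the page, so `document` refers to the page's document.

```javascript
// Hypothetical evaluator for harvestCustom(); runs in the browser, not in Node.js.
const summarizePage = () => {
  // Aggregate the text of every second-level heading.
  const headings = Array.from(document.querySelectorAll('h2'), (h) => h.textContent.trim())
  const nextLink = document.querySelector('a[rel="next"]')
  return {
    title: document.title,
    headings,
    // Conditional logic based on page state: report a next page only if one exists.
    nextPage: nextLink ? nextLink.getAttribute('href') : null
  }
}

// Usage sketch, assuming an initialized harvester:
// const summary = await harvester.harvestCustom('https://example.com', summarizePage)
```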
Testing Architecture
Tests are organized by concern:
test/
├── unit/
│   ├── harvester.test.js
│   └── browser-management.test.js
├── integration/
│   └── harvest-workflow.test.js
└── fixtures/
    └── sample-pages/
Using real HTML fixtures instead of mocks means the tests exercise real page loading and selector matching, so they catch issues that mocks would hide.
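One possible way to point a test at a fixture is to turn its path into a file:// URL, which Playwright can load without a network. The helper name `fixtureUrl` and the file name in the usage comment are assumptions made for this sketch; only the directory layout comes from the tree above.

```javascript
const path = require('node:path')
const { pathToFileURL } = require('node:url')

// Hypothetical helper: resolve a file under test/fixtures/sample-pages/
// (relative to the project root) to a file:// URL.
function fixtureUrl(name, root = process.cwd()) {
  const file = path.join(root, 'test', 'fixtures', 'sample-pages', name)
  return pathToFileURL(file).href
}

// Usage sketch in an integration test ('products.html' is a made-up fixture name):
// const items = await harvester.harvest(fixtureUrl('products.html'), '.product', extractor)
```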
Performance Considerations
Page Reuse vs. Clean State
I create a new page for every harvest so that no DOM or script state carries over from one harvest to the next. This costs a little performance, but it eliminates entire classes of stale-state bugs.
Parallel vs. Sequential
Sequential processing is the default for predictability. Users can parallelize at the application level if needed.
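Application-level parallelism could look like the sketch below, which caps how many pages are open at once. `harvestAll` is a hypothetical helper, not part of the library; it assumes only the `harvest(url, selector, extractor)` method described earlier.

```javascript
// Hypothetical application-level helper; not part of domharvest-playwright.
// Calls harvester.harvest() for many URLs with at most `concurrency` in flight.
async function harvestAll(harvester, urls, selector, extractor, concurrency = 3) {
  const results = new Array(urls.length)
  let next = 0

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker() {
    while (next < urls.length) {
      const i = next++
      results[i] = await harvester.harvest(urls[i], selector, extractor)
    }
  }

  const workerCount = Math.min(Math.max(1, concurrency), urls.length)
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results // same order as `urls`
}

// Usage sketch:
// const pages = await harvestAll(harvester, urls, 'a[href]', (el) => el.getAttribute('href'), 4)
```

In this version the first failed harvest rejects the whole call. Collecting per-URL errors instead would be a reasonable variation if partial results matter.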
Memory Management
Explicit page cleanup in finally blocks prevents memory leaks during long‑running sessions.
Code Organization
src/
├── index.js          # Public API
├── harvester.js      # DOMHarvester class
└── utils/
    ├── validators.js # Input validation
    └── errors.js     # Custom error types
Flat structure, no deep nesting—easy to navigate.
Lessons Learned
- Don’t Abstract Too Early – Resisted a “Strategy Pattern” for different scraping modes; YAGNI was right.
- Explicit > Implicit – Requiring init() before use feels verbose but prevents confusing initialization bugs.
- Browser Automation is I/O Heavy – Network latency dominates; focus on reliability over micro‑optimizations.
- Error Messages Matter – Users see errors more than code; make them helpful.
What’s Next?
Future architectural improvements under consideration:
- Plugin system for custom middleware
- Built‑in retry logic with exponential backoff
- Request/response interception hooks
- Streaming results for large datasets
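Until something like built-in retries exists, the same effect can be approximated at the application level. The sketch below is not the planned design: `withRetry` and its option names are hypothetical, and it simply retries any failing async task with an exponentially growing delay.

```javascript
// Hypothetical application-level wrapper; the library does not ship retry logic.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt)
    } catch (error) {
      lastError = error
      if (attempt === retries) break
      // Exponential backoff: baseDelayMs, then 2x, 4x, and so on.
      const delayMs = baseDelayMs * 2 ** attempt
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
  // Every attempt failed; surface the last error instead of hiding it.
  throw lastError
}

// Usage sketch:
// const data = await withRetry(() => harvester.harvest(url, 'a', (el) => el.getAttribute('href')))
```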
Try It Yourself
npm install domharvest-playwright
The architecture is intentionally simple. Read the source—it’s under 500 lines.