Inside domharvest-playwright: How I Architected a Production-Ready Web Scraping Tool

Published: January 9, 2026 at 09:10 AM EST
Source: Dev.to

The Core Architecture

domharvest-playwright is built around three main components:

  • DOMHarvester Class – The main orchestrator
  • Browser Management – Playwright lifecycle handling
  • Data Extraction Pipeline – Selector‑based harvesting

Design Principles

Simplicity First

Every architectural decision prioritized simplicity over cleverness. No over‑abstraction, no unnecessary patterns.

Fail Fast, Fail Clear

Errors should be obvious and actionable. No silent failures.

Composability

Small, focused methods that can be combined for complex workflows.
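As a rough illustration of composability, the sketch below chains two calls to the harvest(url, selector, extractor) method described later in this post into a two-stage crawl. The helper name harvestArticles, the selectors, and the page structure are assumptions made up for the example.

```javascript
// Sketch: compose two harvest() calls into a two-stage crawl.
// `harvester` is any object exposing harvest(url, selector, extractor).
// The selectors are placeholders, not taken from the library.
async function harvestArticles(harvester, indexUrl) {
  // Stage 1: collect article links from the index page.
  const links = await harvester.harvest(indexUrl, 'a.article-link', (el) => el.href)

  // Stage 2: visit each link in turn and take the first headline, if any.
  const articles = []
  for (const link of links) {
    const [title] = await harvester.harvest(link, 'h1', (el) => el.textContent.trim())
    articles.push({ link, title: title ?? null })
  }
  return articles
}
```

Because the helper only depends on the shape of harvest(), it can be exercised with a stub object in unit tests.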

Browser Lifecycle Management

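const playwright = require('playwright')

class DOMHarvester {
  // Launch the browser and create one shared context. Call this before harvesting.
  async init(options = {}) {
    this.browser = await playwright.chromium.launch({
      headless: options.headless ?? true,
      ...options.browserOptions
    })
    this.context = await this.browser.newContext(options.contextOptions)
  }

  // Close the context, then the browser. Safe to call even if init() never ran.
  async close() {
    await this.context?.close()
    await this.browser?.close()
  }
}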

Why this approach?

  • Explicit initialization gives users control.
  • Separate context management enables multiple sessions.
  • Clean shutdown prevents resource leaks.
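One way to make the clean-shutdown point concrete is to pair init() and close() in a try/finally wrapper. This is a minimal sketch: withHarvester is a hypothetical helper, not part of the library, and the usage comment assumes the package exports DOMHarvester.

```javascript
// Sketch: guarantee close() runs even when init() or the task throws.
// `withHarvester` is a hypothetical helper, not part of the library's API.
async function withHarvester(harvester, options, task) {
  try {
    await harvester.init(options)
    return await task(harvester)
  } finally {
    // close() uses optional chaining, so it is safe after a failed init().
    await harvester.close()
  }
}

// Usage (assumes DOMHarvester is exported by the package):
// const { DOMHarvester } = require('domharvest-playwright')
// const titles = await withHarvester(new DOMHarvester(), { headless: true }, (h) =>
//   h.harvest('https://example.com', 'h1', (el) => el.textContent))
```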

The Harvesting Pipeline

The core harvest() method follows a straightforward flow:

async harvest(url, selector, extractor) {
  const page = await this.context.newPage()

  try {
    await page.goto(url, { waitUntil: 'networkidle' })

    const elements = await page.$$(selector)
    const results = []

    for (const element of elements) {
      const data = await element.evaluate(extractor)
      results.push(data)
    }

    return results
  } finally {
    await page.close()
  }
}

Key decisions

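  • waitUntil: 'networkidle' waits until the page has had no network connections for at least 500 ms. This favors reliability on JavaScript‑rendered pages over speed; pages that poll or stream continuously may never go idle and will hit the navigation timeout instead.
  • Sequential processing keeps results in document order and makes a failure easy to trace to a single element.
  • The finally block ensures the page is closed even when navigation or extraction throws.
  • The extractor function is serialized and runs inside the page, next to the DOM it reads, so only the extracted data crosses back to Node.js. It cannot reference variables from the surrounding Node.js scope, and its return value must be serializable.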
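To show what an extractor might look like in practice, here is a sketch for a hypothetical product card. The class names (.name, .price) and the URL in the usage comment are invented for illustration. The function relies only on its element argument, which is what lets it run inside the page.

```javascript
// Hypothetical extractor for a product card. It is executed inside the page,
// so it may only use its argument and browser globals.
const extractProduct = (el) => {
  const name = el.querySelector('.name')?.textContent.trim() ?? null
  const priceText = el.querySelector('.price')?.textContent ?? ''
  // Strip currency symbols and thousands separators before parsing.
  const price = Number.parseFloat(priceText.replace(/[^0-9.]/g, ''))
  const url = el.querySelector('a')?.href ?? null
  return { name, price: Number.isFinite(price) ? price : null, url }
}

// Usage:
// const products = await harvester.harvest(
//   'https://example.com/products', '.product-card', extractProduct)
```

Returning null for missing fields keeps every result the same shape, which tends to simplify downstream processing.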

Error Handling Strategy

try {
  await page.goto(url, {
    waitUntil: 'networkidle',
    timeout: 30000
  })
} catch (error) {
  if (error.name === 'TimeoutError') {
    throw new Error(`Failed to load ${url}: timeout after 30s`)
  }
  throw error
}

I wrap Playwright errors with context‑specific messages to help users debug without diving into stack traces.
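A possible refinement, sketched below, is to wrap the failure in a dedicated error type that keeps the original error as its cause, so the friendly message and the underlying stack trace are both available. HarvestError and wrapNavigationError are hypothetical names; the library's own errors.js may look different.

```javascript
// Hypothetical error type; the library's own errors.js may differ.
class HarvestError extends Error {
  constructor(message, { url, cause } = {}) {
    super(message, { cause })
    this.name = 'HarvestError'
    this.url = url
  }
}

// Translate a navigation timeout into a message naming the URL and the limit.
// Any other error is returned unchanged so nothing is hidden.
function wrapNavigationError(error, url, timeoutMs) {
  if (error?.name === 'TimeoutError') {
    return new HarvestError(
      `Failed to load ${url}: timeout after ${timeoutMs / 1000}s`,
      { url, cause: error }
    )
  }
  return error
}

// Usage inside a catch block:
// throw wrapNavigationError(error, url, 30000)
```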

Custom Extraction Support

Beyond selector‑based harvesting, harvestCustom() allows arbitrary page evaluation:

async harvestCustom(url, evaluator) {
  const page = await this.context.newPage()

  try {
    await page.goto(url, { waitUntil: 'networkidle' })
    return await page.evaluate(evaluator)
  } finally {
    await page.close()
  }
}

This enables complex scenarios like:

  • Multi‑step interactions
  • Conditional logic based on page state
  • Aggregating data from multiple sources
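As an illustration of conditional logic and aggregation, the evaluator below summarizes a listing page in a single round trip. The selectors and the name summarizeListing are assumptions for the example. Like an extractor, the function runs inside the page, so document refers to the page's DOM.

```javascript
// Hypothetical evaluator: runs inside the page, so `document` is the page's DOM.
// Selectors are placeholders for illustration.
const summarizeListing = () => {
  const items = Array.from(document.querySelectorAll('.item'))
  const nextLink = document.querySelector('a[rel="next"]')
  return {
    count: items.length,
    titles: items.map((item) => item.textContent.trim()),
    // Conditional logic based on page state: report a next page only if one exists.
    nextPage: nextLink ? nextLink.href : null
  }
}

// Usage:
// const summary = await harvester.harvestCustom('https://example.com/list', summarizeListing)
```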

Testing Architecture

Tests are organized by concern:

test/
├── unit/
│   ├── harvester.test.js
│   └── browser-management.test.js
├── integration/
│   └── harvest-workflow.test.js
└── fixtures/
    └── sample-pages/

Using real HTML fixtures instead of mocks means the tests exercise real browser parsing and selector matching, so they catch issues that a mocked page would hide.
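One way to load such fixtures is through file:// URLs, which page.goto() accepts. The helper below is a sketch: fixtureUrl and the fixture file name are hypothetical, and the default root assumes tests are run from the project root with the directory layout shown in the tree.

```javascript
const path = require('path')
const { pathToFileURL } = require('url')

// Resolve a fixture file to a file:// URL that page.goto() can load.
// The default root follows the test/ tree shown earlier and is resolved
// against the current working directory.
function fixtureUrl(name, root = path.resolve('test', 'fixtures', 'sample-pages')) {
  return pathToFileURL(path.join(root, name)).href
}

// Usage in an integration test (file name is hypothetical):
// const rows = await harvester.harvest(fixtureUrl('products.html'), '.product', (el) => el.textContent)
```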

Performance Considerations

Page Reuse vs. Clean State

I create a new page for each harvest so that every run starts from a clean state. This costs a little performance, but it eliminates entire classes of bugs caused by state leaking from one harvest into the next.

Parallel vs. Sequential

Sequential processing is the default for predictability. Users can parallelize at the application level if needed.
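Application-level parallelism could look something like the sketch below, which runs several harvest() calls at once with a cap on how many are in flight. harvestAll and the shape of the jobs array are assumptions for the example, not library features.

```javascript
// Sketch: run harvest jobs with a bounded number in flight.
// `jobs` is an array of { url, selector, extractor }; results keep input order.
async function harvestAll(harvester, jobs, concurrency = 3) {
  const results = new Array(jobs.length)
  let next = 0

  // Each worker claims the next unprocessed job until none remain.
  async function worker() {
    while (next < jobs.length) {
      const index = next++
      const { url, selector, extractor } = jobs[index]
      results[index] = await harvester.harvest(url, selector, extractor)
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, jobs.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}
```

Since each harvest() opens its own page, concurrent calls do not share page state. A failure in any job rejects the whole batch, which keeps errors visible; callers who prefer partial results could catch errors per job instead.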

Memory Management

Explicit page cleanup in finally blocks prevents memory leaks during long‑running sessions.

Code Organization

src/
├── index.js          # Public API
├── harvester.js      # DOMHarvester class
└── utils/
    ├── validators.js # Input validation
    └── errors.js     # Custom error types

Flat structure, no deep nesting—easy to navigate.
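To give a feel for what might live in utils/validators.js, here is a guess at an argument check for harvest(). The function name and the messages are hypothetical; the actual file may validate different things.

```javascript
// Hypothetical validator in the spirit of utils/validators.js; the real file may differ.
function validateHarvestArgs(url, selector, extractor) {
  try {
    new URL(url) // throws on malformed or relative URLs
  } catch {
    throw new TypeError(`Invalid url: ${JSON.stringify(url)}`)
  }
  if (typeof selector !== 'string' || selector.trim() === '') {
    throw new TypeError('selector must be a non-empty string')
  }
  if (typeof extractor !== 'function') {
    throw new TypeError('extractor must be a function')
  }
}
```

Checking arguments before opening a page means bad input fails immediately, without paying for a navigation first.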

Lessons Learned

  1. Don’t Abstract Too Early – I resisted adding a Strategy pattern for different scraping modes; YAGNI ("you aren't gonna need it") was right.
  2. Explicit > Implicit – Requiring init() before use feels verbose but prevents confusing initialization bugs.
  3. Browser Automation is I/O Heavy – Network latency dominates; focus on reliability over micro‑optimizations.
  4. Error Messages Matter – Users see errors more than code; make them helpful.

What’s Next?

Future architectural improvements under consideration:

  • Plugin system for custom middleware
  • Built‑in retry logic with exponential backoff
  • Request/response interception hooks
  • Streaming results for large datasets
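Until retry support lands in the library, something similar can be done at the application level. The sketch below is one possible shape for it: withRetry and its option names are hypothetical and say nothing about how a built-in version would be designed.

```javascript
// Sketch of application-level retry with exponential backoff.
// withRetry and its option names are hypothetical; the library does not ship this.
async function withRetry(task, { retries = 3, baseDelayMs = 500, factor = 2 } = {}) {
  let lastError
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt)
    } catch (error) {
      lastError = error
      if (attempt === retries) break
      // Delay grows as baseDelayMs * factor^attempt (500 ms, 1 s, 2 s by default).
      const delay = baseDelayMs * factor ** attempt
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
  throw lastError
}

// Usage:
// const rows = await withRetry(() => harvester.harvest(url, '.row', (el) => el.textContent))
```

Rethrowing the last error after the final attempt keeps failures loud, in line with the fail-fast principle.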

Try It Yourself

npm install domharvest-playwright

The architecture is intentionally simple. Read the source—it’s under 500 lines.
