Stop Scraping HTML - There's a better way.

Published: December 16, 2025 at 02:16 PM EST
4 min read
Source: Dev.to

The “API‑First” Reverse Engineering Method

One of the most common mistakes developers make is firing up their code editor too early. They open VS Code, run pip install requests beautifulsoup4, and immediately start trying to parse HTML tags.

If you are scraping a modern e‑commerce site or Single Page Application (SPA), this is the wrong approach. It’s brittle, it’s slow, and it breaks the moment the site updates its CSS.

The secret to scalable scraping isn’t better parsing; it’s finding the API that the website uses to populate itself. Below is the exact workflow I use to turn a complex parsing job into a clean, reliable JSON pipeline.

Phase 1: The Discovery (XHR Filtering)

Modern websites are rarely static. They typically use a frontend/backend architecture where the browser loads a skeleton page and then fetches the actual data via a background API call. Your goal is to find that call and use it directly.

1. Open Developer Tools

  • Right‑click → Inspect, then open the Network tab.

2. Filter the Noise

  • Click the Fetch/XHR filter.
  • Ignore CSS, images, fonts – focus on data requests.

3. Trigger the Request

  • Refresh the page.
  • Watch the waterfall.

If nothing notable appears, try different pages, pagination, or button clicks and watch for new requests.

4. Identify JSON Endpoints

  • Look for requests that return JSON (often named graphql, search, products, api, etc.).
  • Click Preview – you should see a structured object containing prices, descriptions, SKU numbers, etc., already parsed and clean.

Pro tip: Once you find a candidate URL, test it directly in the browser console or address bar. Change query parameters (e.g., page=1 → page=2). If the JSON response changes accordingly, you’ve found your golden endpoint.
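
If you’d rather script that sanity check, here is a minimal sketch using requests. The URL, the page parameter, and the products key are placeholders for whatever your candidate endpoint actually returns:

# Minimal sketch of the endpoint sanity check described above.
# The URL, parameter, and response keys are placeholders.
import requests

URL = "https://example.com/api/products"  # your candidate endpoint

for page in (1, 2):
    resp = requests.get(URL, params={"page": page}, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # If the payload changes with the page number,
    # you've found your golden endpoint.
    print(page, len(data.get("products", [])))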

5. Copy as cURL

# Example – replace with the actual request you captured
curl 'https://example.com/api/products?page=1' \
  -H 'User-Agent: Mozilla/5.0 …' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Referer: https://example.com/shop' \
  -H 'Cookie: session=abc123; …' \
  --compressed

6. Import to an API Client

Open a client such as Postman, Insomnia, or Bruno and import the cURL command.

7. Baseline Test

  • Hit Send.
  • It should work because you’re sending every cookie, header, and the exact session token your browser generated.

Trimming the Request

Efficient scrapers don’t send 2 KB of headers. Strip the request down by unchecking headers one by one and resending (a sketch automating this pass follows the checklist below):

  • Cookie – Does it break? (Usually, yes.)
  • Referer – Does it break? (Often, yes.)
  • User‑Agent – Does it break?

Check the parameters: can you increase limit=10 to limit=100 to fetch more data per call?
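
Here is a hedged sketch of that trimming pass. The URL and captured headers are placeholders for the request you copied as cURL, and headers that interact with each other may still need a manual second pass:

import requests

URL = "https://example.com/api/products?page=1"  # placeholder
captured = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://example.com/shop",
    "Cookie": "session=abc123",
}

required = {}
for name, value in captured.items():
    # Resend the request with this one header removed.
    trial = {k: v for k, v in captured.items() if k != name}
    resp = requests.get(URL, headers=trial, timeout=10)
    if resp.status_code != 200:
        # Removing it broke the request, so keep it.
        required[name] = value

print("Skeleton key:", list(required))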

Eventually you’ll be left with the skeleton key – the absolute minimum headers required for a 200 OK. Typically this consists of:

  1. User‑Agent
  2. Referer (or none, depending on the site)
  3. A specific Auth Token or Session Cookie

Why 403 Forbidden Happens

Even with the right URL and minimal headers, you may receive a 403. The cause is usually a cryptographic binding check performed by the API:

  • IP‑bound token: The auth token/cookie you copied was generated for the IP address of your browser. When your script runs on a different server, VPN, or proxy, the site detects a mismatch and rejects the request.
  • Expiry clock: Tokens are often short‑lived. A static token copied from your browser will stop working once its expiry window passes, no matter how carefully you reuse it.

Building a Hybrid Architecture

To scrape at scale you need more than a single script; you need a system that manages state, token generation, and IP coordination.

Storage Unit

A database (e.g., Redis) stores a Session Object containing:

  • The Auth Token (Cookie)
  • The IP address used to generate it
  • The creation timestamp
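
As a sketch, assuming Redis as the store and a single shared session, the key name and field names below are illustrative:

import json
import time

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)

def save_session(token: str, ip: str) -> None:
    """Persist the auth token together with the IP it was minted on."""
    session = {
        "auth_token": token,
        "ip_address": ip,
        "creation_time": time.time(),
    }
    r.set("scraper:session", json.dumps(session))  # illustrative key name

def load_session() -> dict:
    raw = r.get("scraper:session")
    return json.loads(raw) if raw else {}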

Browser Worker

A headless browser (e.g., Playwright, Puppeteer, Selenium) visits the site, executes JavaScript, generates a fresh token, and saves it to the storage unit.
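
A sketch of the Browser Worker using Playwright’s sync API. The target URL and the cookie name session are assumptions; harvest whatever token the site actually sets:

from playwright.sync_api import sync_playwright

def generate_token(proxy: str | None = None) -> str:
    """Load the page in a real browser and harvest the session cookie."""
    with sync_playwright() as p:
        launch_args = {"proxy": {"server": proxy}} if proxy else {}
        browser = p.chromium.launch(**launch_args)
        page = browser.new_page()
        page.goto("https://example.com/shop")  # placeholder URL
        cookies = page.context.cookies()
        browser.close()
    # "session" is a placeholder cookie name.
    return next(c["value"] for c in cookies if c["name"] == "session")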

HTTP Worker

The actual scraper. It never opens a browser; it pulls the token + IP combination from storage and calls the API directly.
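
A sketch of the HTTP Worker, reusing load_session from the storage sketch above. The endpoint, cookie name, and proxy URL format are assumptions; the point is that the worker must exit through the same IP the token was generated on:

import requests

def fetch_page(page: int) -> dict:
    session = load_session()  # from the storage sketch above
    headers = {
        "User-Agent": "Mozilla/5.0 ...",
        "Cookie": f"session={session['auth_token']}",  # placeholder name
    }
    # Exit through the same proxy/IP the token was minted on.
    proxies = {"https": f"http://{session['ip_address']}:8080"}  # illustrative
    resp = requests.get(
        "https://example.com/api/products",  # placeholder endpoint
        params={"page": page},
        headers=headers,
        proxies=proxies,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()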

Rotation Logic

  1. Check the token’s age.
  2. If it is older than 5 minutes, spin up the Browser Worker, generate a new token, update the storage, and resume scraping.

import time

def should_refresh(token_info, max_age_seconds=300):
    """Return True if the token is older than `max_age_seconds`."""
    age = time.time() - token_info["creation_time"]
    return age > max_age_seconds
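
Tying the pieces together, a minimal sketch of the full loop, reusing the hypothetical helpers from the sketches above:

def run_scraper(pages: range) -> None:
    for page in pages:
        token_info = load_session()
        if not token_info or should_refresh(token_info):
            # Browser Worker: mint a fresh token and persist it.
            token = generate_token()
            save_session(token, ip="203.0.113.7")  # placeholder IP
        # HTTP Worker: hit the API with the stored token + IP.
        data = fetch_page(page)
        print(page, "->", len(data.get("products", [])))

run_scraper(range(1, 11))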

This architecture implies:

  • Proxy Management – Ensure the Browser and HTTP workers share the same IP.
  • Browser Management – Handle the heavy lifting of token generation.
  • State Management – Keep token lifecycles in sync.

A Managed Solution

At Zyte, we abstract this entire architecture. Our API handles browser fingerprinting, IP coordination, and session rotation automatically. You simply send us the target URL, and we return a clean JSON response without the infrastructure headache.

Want more? Join our community.
