Stop Scraping HTML - There's a better way.

Published: December 16, 2025 at 02:16 PM EST
4 min read
Source: Dev.to

The “API‑First” Reverse Engineering Method

One of the most common mistakes developers make is firing up their code editor too early. They open VS Code, run pip install requests beautifulsoup4, and immediately start trying to parse HTML tags.

If you are scraping a modern e‑commerce site or Single Page Application (SPA), this is the wrong approach. It’s brittle, it’s slow, and it breaks the moment the site updates its CSS.

The secret to scalable scraping isn’t better parsing; it’s finding the API that the website uses to populate itself. Below is the exact workflow I use to turn a complex parsing job into a clean, reliable JSON pipeline.

Phase 1: The Discovery (XHR Filtering)

Modern websites are rarely static. They typically use a frontend/backend architecture where the browser loads a skeleton page and then fetches the actual data via a background API call. Your goal is to find that call and use it directly.

1. Open Developer Tools

  • Right‑click → Inspect, then open the Network tab.

2. Filter the Noise

  • Click the Fetch/XHR filter.
  • Ignore CSS, images, fonts – focus on data requests.

3. Trigger the Request

  • Refresh the page.
  • Watch the waterfall.

If nothing notable appears, try different pages, pagination, or button clicks and watch for new requests.

4. Identify JSON Endpoints

  • Look for requests that return JSON (often named graphql, search, products, api, etc.).
  • Click Preview – you should see a structured object containing prices, descriptions, SKU numbers, etc., already parsed and clean.

Pro tip: Once you find a candidate URL, test it directly in the browser console or address bar. Change query parameters (e.g., page=1 → page=2). If the JSON response changes accordingly, you’ve found your golden endpoint.
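
If you’d rather script that sanity check, here is a minimal sketch using requests. The URL, the page parameter, and the products key are placeholders for whatever your candidate endpoint actually returns:

# Minimal sketch of the endpoint sanity check described above.
# The URL, parameter, and response keys are placeholders.
import requests

URL = "https://example.com/api/products"  # your candidate endpoint

for page in (1, 2):
    resp = requests.get(URL, params={"page": page}, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # If the payload changes with the page number,
    # you've found your golden endpoint.
    print(page, len(data.get("products", [])))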

5. Copy as cURL

# Example – replace with the actual request you captured
curl 'https://example.com/api/products?page=1' \
  -H 'User-Agent: Mozilla/5.0 …' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Referer: https://example.com/shop' \
  -H 'Cookie: session=abc123; …' \
  --compressed

6. Import to an API Client

Open a client such as Postman, Insomnia, or Bruno and import the cURL command.

7. Baseline Test

  • Hit Send.
  • It should work because you’re sending every cookie, header, and the exact session token your browser generated.

Trimming the Request

Efficient scrapers don’t send 2 KB of headers. Strip the request down by unchecking headers one by one and resending (a sketch automating this pass follows the checklist below):

  • Cookie – Does it break? (Usually, yes.)
  • Referer – Does it break? (Often, yes.)
  • User‑Agent – Does it break?

Check the parameters: can you increase limit=10 to limit=100 to fetch more data per call?
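
Here is a hedged sketch of that trimming pass. The URL and captured headers are placeholders for the request you copied as cURL, and headers that interact with each other may still need a manual second pass:

import requests

URL = "https://example.com/api/products?page=1"  # placeholder
captured = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://example.com/shop",
    "Cookie": "session=abc123",
}

required = {}
for name, value in captured.items():
    # Resend the request with this one header removed.
    trial = {k: v for k, v in captured.items() if k != name}
    resp = requests.get(URL, headers=trial, timeout=10)
    if resp.status_code != 200:
        # Removing it broke the request, so keep it.
        required[name] = value

print("Skeleton key:", list(required))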

Eventually you’ll be left with the skeleton key – the absolute minimum headers required for a 200 OK. Typically this consists of:

  1. User‑Agent
  2. Referer (or none, depending on the site)
  3. A specific Auth Token or Session Cookie

Why 403 Forbidden Happens

Even with the right URL and minimal headers, you may receive a 403. The cause is usually a cryptographic binding check performed by the API:

  • IP‑bound token: The auth token/cookie you copied was generated for the IP address of your browser. When your script runs on a different server, VPN, or proxy, the site detects a mismatch and rejects the request.
  • Expiry clock: Tokens are often short‑lived. A static token copied from your browser will stop working once its expiry window passes, no matter how carefully you reuse it.

Building a Hybrid Architecture

To scrape at scale you need more than a single script; you need a system that manages state, token generation, and IP coordination.

Storage Unit

A database (e.g., Redis) stores a Session Object containing:

  • The Auth Token (Cookie)
  • The IP address used to generate it
  • The creation timestamp
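
As a sketch, assuming Redis as the store and a single shared session, the key name and field names below are illustrative:

import json
import time

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)

def save_session(token: str, ip: str) -> None:
    """Persist the auth token together with the IP it was minted on."""
    session = {
        "auth_token": token,
        "ip_address": ip,
        "creation_time": time.time(),
    }
    r.set("scraper:session", json.dumps(session))  # illustrative key name

def load_session() -> dict:
    raw = r.get("scraper:session")
    return json.loads(raw) if raw else {}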

Browser Worker

A headless browser (e.g., Playwright, Puppeteer, Selenium) visits the site, executes JavaScript, generates a fresh token, and saves it to the storage unit.
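
A sketch of the Browser Worker using Playwright’s sync API. The target URL and the cookie name session are assumptions; harvest whatever token the site actually sets:

from playwright.sync_api import sync_playwright

def generate_token(proxy: str | None = None) -> str:
    """Load the page in a real browser and harvest the session cookie."""
    with sync_playwright() as p:
        launch_args = {"proxy": {"server": proxy}} if proxy else {}
        browser = p.chromium.launch(**launch_args)
        page = browser.new_page()
        page.goto("https://example.com/shop")  # placeholder URL
        cookies = page.context.cookies()
        browser.close()
    # "session" is a placeholder cookie name.
    return next(c["value"] for c in cookies if c["name"] == "session")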

HTTP Worker

The actual scraper. It never opens a browser; it pulls the token + IP combination from storage and calls the API directly.
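
A sketch of the HTTP Worker, reusing load_session from the storage sketch above. The endpoint, cookie name, and proxy URL format are assumptions; the point is that the worker must exit through the same IP the token was generated on:

import requests

def fetch_page(page: int) -> dict:
    session = load_session()  # from the storage sketch above
    headers = {
        "User-Agent": "Mozilla/5.0 ...",
        "Cookie": f"session={session['auth_token']}",  # placeholder name
    }
    # Exit through the same proxy/IP the token was minted on.
    proxies = {"https": f"http://{session['ip_address']}:8080"}  # illustrative
    resp = requests.get(
        "https://example.com/api/products",  # placeholder endpoint
        params={"page": page},
        headers=headers,
        proxies=proxies,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()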

Rotation Logic

  1. Check the token’s age.
  2. If it is older than 5 minutes, spin up the Browser Worker, generate a new token, update the storage, and resume scraping.

import time

def should_refresh(token_info, max_age_seconds=300):
    """Return True if the token is older than `max_age_seconds`."""
    age = time.time() - token_info["creation_time"]
    return age > max_age_seconds
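
Tying the pieces together, a minimal sketch of the full loop, reusing the hypothetical helpers from the sketches above:

def run_scraper(pages: range) -> None:
    for page in pages:
        token_info = load_session()
        if not token_info or should_refresh(token_info):
            # Browser Worker: mint a fresh token and persist it.
            token = generate_token()
            save_session(token, ip="203.0.113.7")  # placeholder IP
        # HTTP Worker: hit the API with the stored token + IP.
        data = fetch_page(page)
        print(page, "->", len(data.get("products", [])))

run_scraper(range(1, 11))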

This architecture implies:

  • Proxy Management – Ensure the Browser and HTTP workers share the same IP.
  • Browser Management – Handle the heavy lifting of token generation.
  • State Management – Keep token lifecycles in sync.

A Managed Solution

At Zyte, we abstract this entire architecture. Our API handles browser fingerprinting, IP coordination, and session rotation automatically. You simply send us the target URL, and we return a clean JSON response without the infrastructure headache.

Want more? Join our community.
