The Waterfall Pattern: A Tiered Strategy for Reliable Data Extraction

Published: February 14, 2026 at 03:05 PM EST
7 min read
Source: Dev.to

The Waterfall Method – Building Resilient Scrapers

It’s 3:00 AM, and your production scraper just crashed. The logs reveal a common culprit: a developer at the target website renamed a CSS class from product-price to price‑v2‑red. It was a cosmetic change that took five seconds, but it broke your entire data pipeline.

If you rely solely on visual CSS selectors, you are building on shifting sand. Websites change constantly, and every redesign becomes a maintenance nightmare. To build resilient scrapers, use a “Waterfall” approach—a tiered priority system that falls back through multiple extraction methods before giving up.

The Hierarchy of Stability

A webpage is more than just a visual document; it consists of different layers of data, each with varying levels of stability. The Waterfall Method prioritizes these layers from most stable to least stable.

Tier 1 – Hidden Data (JSON‑LD / Script Tags): Structured data used for SEO or internal JavaScript frameworks is designed for machines, not humans. It rarely changes when the UI is redesigned.

Tier 2 – Semantic Anchors (IDs / Data Attributes): Unique identifiers like id="product-123" or data-testid="price-display" are usually tied to database keys or automated‑testing suites. Developers rarely change these because it breaks their own internal tools.

Tier 3 – Relational XPath: If specific IDs are missing, look for labels. While CSS classes change, the word “Price:” usually stays “Price:”. XPath can find that text and grab the element next to it.

Tier 4 – Visual Selectors (CSS Classes): This is the last resort. CSS classes like .blue-text change whenever a designer wants a new look. Use these only if every other method fails.

By starting at Tier 1 and descending through the waterfall, you maximize success while minimizing maintenance.

Setting Up the Environment

We’ll use parsel, the selection library that powers Scrapy, because it allows you to use CSS, XPath, and regular expressions within a single object.

pip install parsel

Mock HTML snippet

The following HTML fragment will be used throughout the guide. It represents a typical e‑commerce page with multiple data layers:

html_content = """
<html>
<head>
    <script type="application/ld+json">
    {
        "@context": "https://schema.org/",
        "@type": "Product",
        "name": "Ultimate Coffee Grinder",
        "sku": "GRND-99",
        "offers": {
            "price": "89.99",
            "priceCurrency": "USD"
        }
    }
    </script>
</head>
<body>
    <div class="container">
        <h1>Ultimate Coffee Grinder</h1>
        <div class="product-info">
            <span class="label">Price:</span>
            <span id="price-id-123" data-testid="product-price">$89.99</span>
        </div>
    </div>
</body>
</html>
"""

Tier 1 – The Gold Standard (Hidden JSON)

Modern websites often embed structured data in <script> tags (usually JSON‑LD for SEO or a “window state” object for frameworks like Next.js). This source is highly stable because it is independent of the HTML layout.

import json
from parsel import Selector

def extract_tier_1(selector):
    # Locate the script tag containing JSON‑LD
    json_data = selector.css('script[type="application/ld+json"]::text').get()
    if json_data:
        try:
            data = json.loads(json_data)
        except json.JSONDecodeError:
            return None
        # Navigate the dictionary safely
        return data.get('offers', {}).get('price')
    return None

sel = Selector(text=html_content)
print(f"Tier 1 Result: {extract_tier_1(sel)}")
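JSON‑LD isn’t the only hidden payload. Frameworks like Next.js serialize the entire page state into a script tag with the id __NEXT_DATA__, which you can parse the same way. A minimal stdlib-only sketch (the sample markup here is illustrative, not from any real site):

```python
import json
import re

def extract_next_data(html):
    # Grab the body of the __NEXT_DATA__ script tag, then parse it as JSON.
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else None

sample = '<script id="__NEXT_DATA__" type="application/json">{"props": {"price": "89.99"}}</script>'
data = extract_next_data(sample)
print(data["props"]["price"])  # 89.99
```

The same data.get(...) navigation shown above applies once the blob is loaded.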

Tier 2 – Semantic Anchors (IDs & Data Attributes)

If JSON‑LD isn’t available, look for Semantic Anchors. These attributes describe what the data is rather than how it looks. IDs and data‑* attributes are frequently used for state management or end‑to‑end testing and change far less often than styling classes.

def extract_tier_2(selector):
    # Try an ID first. If IDs are dynamic, use a “starts‑with” selector.
    price = selector.css('[id^="price-id-"]::text').get()

    # Fall back to data attributes often used in modern frameworks
    if not price:
        price = selector.css('[data-testid="product-price"]::text').get()

    return price.replace('$', '').strip() if price else None
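Both selectors return display text like “$89.99”. If you need a numeric value downstream, a shared normalizer keeps each extractor simple; a small sketch (the clean_price helper is our own, not part of parsel):

```python
import re

def clean_price(raw):
    """Strip currency symbols and separators; return a float, or None."""
    if not raw:
        return None
    match = re.search(r'[\d,]+(?:\.\d{1,2})?', raw)
    return float(match.group(0).replace(',', '')) if match else None

print(clean_price("$89.99"))      # 89.99
print(clean_price("  1,299.00"))  # 1299.0
print(clean_price(None))          # None
```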

Tier 3 – Text‑Based Relational Logic (XPath)

When clean IDs are missing, rely on the visible text labels. On an e‑commerce site, the word “Price:” is almost always present next to the actual value.

Using XPath axes, you can locate the element containing the label “Price:” and navigate to its neighbor. This label‑to‑value relationship usually persists even if the tag types change.

def extract_tier_3(selector):
    # Find a <span> containing "Price:", then get the next sibling <span>
    xpath_query = "//span[contains(text(), 'Price:')]/following-sibling::span/text()"
    price = selector.xpath(xpath_query).get()
    return price.replace('$', '').strip() if price else None

Tier 4 – The Last Resort (Regex)

Sometimes the DOM is a mess: obfuscated classes, no IDs, and deeply nested structures. In these cases, treat the HTML as a plain string and use regular expressions. Regex ignores the DOM tree entirely, allowing you to pull out values based on patterns.

import re

def extract_tier_4(html):
    # Look for a "$", optional whitespace, then a dollar amount
    match = re.search(r'\$\s*([0-9]+(?:\.[0-9]{2})?)', html)
    return match.group(1) if match else None

Putting It All Together

def waterfall_extract(html):
    selector = Selector(text=html)

    # Tier 1
    price = extract_tier_1(selector)
    if price:
        return price

    # Tier 2
    price = extract_tier_2(selector)
    if price:
        return price

    # Tier 3
    price = extract_tier_3(selector)
    if price:
        return price

    # Tier 4
    return extract_tier_4(html)

print("Final price:", waterfall_extract(html_content))

Running the script against the mock HTML yields:

Final price: 89.99

Recap

  1. Start with hidden, machine‑readable data (JSON‑LD, API payloads).
  2. Fall back to semantic anchors (id, data‑*).
  3. Use relational XPath based on stable text labels.
  4. Resort to regex only when the DOM offers no reliable hooks.
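The hard‑coded if‑chain in waterfall_extract works, but the same idea can be expressed as a loop over an ordered list of extractors, which makes adding or reordering tiers trivial. A minimal stdlib-only sketch (the stand‑in extractors below are placeholders, not the real tier functions):

```python
def waterfall(html, extractors):
    """Try each extractor in priority order; return the first hit with its tier name."""
    for name, fn in extractors:
        result = fn(html)
        if result is not None:
            return name, result
    return None, None

# Stand-in extractors: the first two fail, the third succeeds.
tiers = [
    ("tier1_json", lambda html: None),
    ("tier2_attrs", lambda html: None),
    ("tier3_xpath", lambda html: "89.99" if "Price:" in html else None),
]

print(waterfall("<span>Price:</span><span>$89.99</span>", tiers))
# ('tier3_xpath', '89.99')
```

Returning the tier name alongside the value also gives you a free hook for the fallback monitoring discussed below.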

By following this Waterfall Method, your scrapers become far more resilient to redesigns, class renames, and other superficial changes—saving you countless late‑night debugging sessions. Happy scraping!

Additional Regex Fallback (Tier 4 – Alternative)

When all else fails, you can search for a price pattern hidden inside JavaScript variables or deeply‑nested strings.

import re

def extract_tier_4_alt(html_string):
    # Search for a pattern like "price": "89.99" anywhere in the raw HTML.
    # Named _alt so it does not shadow the extract_tier_4 defined earlier.
    match = re.search(r'price":\s*"([\d.]+)"', html_string)
    if match:
        return match.group(1)
    return None
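Because this variant targets serialized JSON rather than visible text, it still works when the price never appears in the rendered DOM at all. A standalone example (the helper name and sample blob are ours):

```python
import re

def find_serialized_price(blob):
    # Same regex idea as the alternative fallback above, repeated here
    # so the snippet runs on its own.
    match = re.search(r'price":\s*"([\d.]+)"', blob)
    return match.group(1) if match else None

js_blob = 'window.__STATE__ = {"product": {"price": "89.99", "currency": "USD"}};'
print(find_serialized_price(js_blob))  # 89.99
```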

Comprehensive Waterfall Function with Logging

Combine the tiered methods into a single function. Prioritize the most stable methods and log warnings when you have to fall back to lower tiers. This alert system tells you when a site has changed before your scraper actually breaks.

import logging
from parsel import Selector   # or any selector library you use

logging.basicConfig(level=logging.INFO)

def get_product_price(html):
    sel = Selector(text=html)

    # Tier 1: JSON‑LD
    price = extract_tier_1(sel)
    if price:
        return price
    logging.warning("Tier 1 failed. Falling back to Tier 2 (Attributes).")

    # Tier 2: Semantic Attributes
    price = extract_tier_2(sel)
    if price:
        return price
    logging.warning("Tier 2 failed. Falling back to Tier 3 (XPath Relational).")

    # Tier 3: XPath Relational
    price = extract_tier_3(sel)
    if price:
        return price
    logging.error("Tier 1‑3 failed. Attempting Tier 4 (Regex) as last resort.")

    # Tier 4: Regex on raw string
    return extract_tier_4(html)

final_price = get_product_price(html_content)
print(f"Final Extracted Price: {final_price}")

Why This Matters

Imagine the website owners update their site: they delete the JSON‑LD (Tier 1) and change all their CSS classes (Tier 2).

In a traditional scraper, your code would return None and crash. With the Waterfall Method, your Tier 3 logic would still find the data. You would receive a warning in your logs, allowing you to update the primary selectors during work hours rather than dealing with an emergency at midnight.
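You can simulate that redesign in a test: delete the JSON‑LD block from a page and confirm a lower tier still produces the price. A self-contained sketch using a trimmed‑down mock page and regex stand‑ins for the tiers (not the parsel-based functions above):

```python
import re

MOCK = '''
<script type="application/ld+json">{"offers": {"price": "89.99"}}</script>
<span>Price:</span> <span>$89.99</span>
'''

def tier1(html):
    # Stand-in for the JSON-LD tier.
    m = re.search(r'"price":\s*"([\d.]+)"', html)
    return m.group(1) if m else None

def tier_fallback(html):
    # Stand-in for the label-based lower tiers.
    m = re.search(r'Price:.*?\$\s*([\d.]+)', html, re.DOTALL)
    return m.group(1) if m else None

# Simulate the redesign: the JSON-LD disappears.
redesigned = re.sub(
    r'<script type="application/ld\+json">.*?</script>', '', MOCK, flags=re.DOTALL
)

assert tier1(MOCK) == "89.99"                 # Before: Tier 1 works.
assert tier1(redesigned) is None              # After: Tier 1 fails...
assert tier_fallback(redesigned) == "89.99"   # ...but the fallback survives.
print("fallback survived the redesign")
```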

To Wrap Up

Resilient scraping requires accepting that websites are dynamic. The Waterfall Method provides a safety net for your data extraction.

  • Prioritize Machine‑Readable Data: Check for JSON‑LD or <script> tags first.
  • Use Semantic Anchors: Favor data- attributes and id tags over CSS classes.
  • Use XPath Relationships: Use human‑readable labels as anchors to find neighboring data.
  • Monitor Fallbacks: Log when your scraper hits lower tiers to address selector changes proactively.
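Monitoring fallbacks can be as simple as counting which tier served each extraction; a spike in the lower tiers signals a site change before anything breaks. A tiny sketch with collections.Counter (the tier names are illustrative):

```python
from collections import Counter

tier_hits = Counter()

def record(tier):
    # In production you might emit this to StatsD/Prometheus instead.
    tier_hits[tier] += 1

for tier in ["tier1", "tier1", "tier3", "tier1"]:
    record(tier)

print(tier_hits.most_common())  # [('tier1', 3), ('tier3', 1)]
```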

By moving away from fragile, class‑based selectors, you spend less time fixing broken code and more time using your data. For more advanced examples, see the Homedepot.com Scrapers repository.
