Stop Silent Failures: Using LLMs to Validate Web Scraper Output

Published: February 9, 2026 at 12:43 AM EST
7 min read
Source: Dev.to

The Problem: Structural vs. Semantic Validation

In a traditional data pipeline, we use structural validation. Tools like pydantic in Python or JSON Schema are excellent at ensuring a field named price is a float and a field named sku is a string.

from pydantic import BaseModel

class Product(BaseModel):
    title: str
    price: float
    sku: str

If your scraper extracts the string "Free Shipping" into the price field, Pydantic will raise a ValidationError because "Free Shipping" cannot be coerced to a float. This is helpful, but it doesn’t solve the semantic problem.
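Concretely, the structural check looks like this (a minimal sketch reusing the Product model above; the sample values are illustrative):

```python
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    title: str
    price: float
    sku: str

try:
    # "Free Shipping" cannot be coerced to a float, so Pydantic rejects it
    Product(title="Sony Alpha 7 IV", price="Free Shipping", sku="ILCE-7M4")
except ValidationError:
    print("Structural validation caught the bad price")
```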

What if the scraper extracts "$19.99" from a “Recommended Products” sidebar instead of the main product price? Structurally, it’s a valid float. Semantically, it’s a failure. Traditional code cannot easily “read” the page to know if a piece of text is the correct piece of text. This is where an AI Judge comes in.

The Solution: The “AI Judge” Architecture

The AI Judge pattern introduces a secondary validation step in your scraping loop. Instead of trusting the parser implicitly, take a small sample of the raw HTML and the extracted JSON and pass them to an LLM.

Workflow

  1. Extraction – Your scraper (Playwright, BeautifulSoup, etc.) extracts data using selectors.
  2. Contextual Sampling – Isolate the HTML block where the data was found.
  3. Verification – An LLM compares the raw HTML to the JSON.
  4. Decision – If the LLM flags a mismatch, the system alerts the developer or triggers a retry.

By using an LLM, you leverage its ability to understand unstructured text and document structure without writing thousands of lines of fragile regex or manual checks.

Step 1: The Setup (The Fragile Scraper)

Let’s start with a standard extraction script. We’ll target a typical e‑commerce product page, similar to those in the BestBuy.com‑Scrapers repository.

import requests
from bs4 import BeautifulSoup

def extract_product_data(html_content):
    soup = BeautifulSoup(html_content, "html.parser")

    # These selectors break easily if the site updates
    return {
        "title": soup.select_one(".product-title").get_text(strip=True),
        "price": soup.select_one(".price-value").get_text(strip=True),
        "sku":   soup.select_one(".model-number").get_text(strip=True),
    }

# Imagine this HTML is fetched via requests
sample_html = (
    '<div class="product-main-area">'
    '<h1 class="product-title">Sony Alpha 7 IV</h1>'
    '<span class="price-value">$2,499.99</span>'
    '<span class="model-number">ILCE-7M4</span>'
    '</div>'
)
data = extract_product_data(sample_html)
print(data)

This works today, but if the site changes .price-value to .price-display-v2, your scraper will return None or pull data from an unrelated element.
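Before reaching for an LLM, a defensive wrapper at least turns that crash into an explicit None (safe_text and extract_product_data_safe are illustrative names, not part of the original script):

```python
from bs4 import BeautifulSoup

def safe_text(soup, selector):
    """Return the element's stripped text, or None if the selector no longer matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

def extract_product_data_safe(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    return {
        "title": safe_text(soup, ".product-title"),
        "price": safe_text(soup, ".price-value"),
        "sku":   safe_text(soup, ".model-number"),
    }

# A renamed price class now yields None instead of an AttributeError
print(extract_product_data_safe('<h1 class="product-title">Sony Alpha 7 IV</h1>'))
# → {'title': 'Sony Alpha 7 IV', 'price': None, 'sku': None}
```

A None in the output is exactly the kind of local signal the AI Judge can be asked to investigate.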

Step 2: Building the AI Validator

To build the validator, construct a prompt that asks the LLM to act as a data quality auditor. The LLM should return a structured response: a boolean verdict and a reason for any failure.

We’ll use the openai library and JSON Mode to ensure the output is machine‑readable.

import openai
import json

client = openai.OpenAI(api_key="YOUR_API_KEY")

def validate_extraction(html_snippet: str, extracted_data: dict) -> dict:
    prompt = f"""
    You are a Data Quality Auditor. Compare extracted JSON data 
    against a raw HTML snippet to ensure accuracy.

    RAW HTML:
    {html_snippet}

    EXTRACTED JSON:
    {json.dumps(extracted_data, ensure_ascii=False, indent=2)}

    Rules:
    1. Check if the 'title' in JSON matches the main product title in HTML.
    2. Check if the 'price' in JSON matches the actual product price.
    3. Ignore minor whitespace or formatting differences.
    4. If the data is missing or incorrect, set 'is_valid' to false.

    Return ONLY a JSON object with this structure:
    {{"is_valid": boolean, "reason": "string explaining the error if invalid"}}
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )

    return json.loads(response.choices[0].message.content)

Why This Works

  • Context Isolation – Sending the entire 100 KB HTML file is expensive and noisy. We only send the relevant container.
  • Semantic Comparison – The LLM understands that "$2,499.99" in the HTML is the same as "2499.99" in your JSON, even if the formatting changed.
  • Reasoning – If it fails, the "reason" field provides an immediate debugging hint.
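For the sidebar mix-up described earlier, a failing check might come back looking like this (an illustrative response, not captured from a real API call):

```json
{
  "is_valid": false,
  "reason": "The extracted price $19.99 appears in a 'Recommended Products' sidebar, not next to the main product title."
}
```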

Step 3: Implementing the Feedback Loop

Now, let’s integrate the validator into the scraping logic. In production you shouldn’t stop the entire crawl for a single error, but you should log it and halt the spider if the error rate exceeds a specific threshold.

def run_scraper(url: str, error_threshold: float = 0.05):
    html = requests.get(url).text
    extracted_data = extract_product_data(html)

    # Grab only the relevant HTML snippet (e.g., the product container)
    # For demonstration we just reuse the whole page; replace with a proper selector.
    html_snippet = html  # TODO: narrow this down

    validation = validate_extraction(html_snippet, extracted_data)

    if not validation["is_valid"]:
        # Log the failure and optionally retry or flag for manual review
        print(f"Validation failed for {url}: {validation['reason']}")
        # Increment error counter, etc.
    else:
        # Persist the clean data
        print("✅ Data validated:", extracted_data)

    # Example of error‑rate handling (pseudo‑code)
    # if error_rate > error_threshold:
    #     raise RuntimeError("Error rate exceeded – stopping crawl")
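The commented pseudo-code above can be made concrete with a small counter. This sketch (the ErrorTracker class is my own invention) only halts once enough samples have accumulated, so a single flaky page doesn’t kill the crawl:

```python
class ErrorTracker:
    """Track validation failures and signal when the failure rate crosses a threshold."""

    def __init__(self, threshold: float = 0.05, min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples  # avoid tripping on the very first failure
        self.total = 0
        self.failures = 0

    def record(self, is_valid: bool) -> None:
        self.total += 1
        if not is_valid:
            self.failures += 1

    def should_halt(self) -> bool:
        if self.total < self.min_samples:
            return False
        return (self.failures / self.total) > self.threshold

tracker = ErrorTracker(threshold=0.05)
for is_valid in [True] * 18 + [False, False]:  # 2 failures in 20 pages = 10%
    tracker.record(is_valid)
print("halt crawl:", tracker.should_halt())  # → halt crawl: True
```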

Production Tips

  • Batch validation – Validate a batch of items together to reduce API calls (e.g., send 10 snippets in one request).
  • Caching – Cache LLM responses for identical HTML snippets to save cost.
  • Rate limiting – Respect OpenAI rate limits; use exponential back‑off on 429 responses.
  • Observability – Store is_valid and reason fields in a monitoring dashboard to spot drift early.
  • Fallback – If the LLM is unavailable, fall back to structural validation and flag for later review.
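The caching tip can be as simple as keying on a hash of the snippet plus the extracted data. A minimal in-memory sketch (cached_validate is my own wrapper, not part of the openai library):

```python
import hashlib
import json

_validation_cache: dict = {}

def cached_validate(html_snippet: str, extracted_data: dict, validate_fn) -> dict:
    """Call validate_fn only for snippet/data pairs we haven't judged before."""
    key = hashlib.sha256(
        (html_snippet + json.dumps(extracted_data, sort_keys=True)).encode()
    ).hexdigest()
    if key not in _validation_cache:
        _validation_cache[key] = validate_fn(html_snippet, extracted_data)
    return _validation_cache[key]

# Demo with a stub judge: the second identical call never hits the "API"
calls = []
def fake_judge(snippet, data):
    calls.append(1)
    return {"is_valid": True, "reason": ""}

cached_validate("<p>$9.99</p>", {"price": "9.99"}, fake_judge)
cached_validate("<p>$9.99</p>", {"price": "9.99"}, fake_judge)
print(len(calls))  # → 1
```

In production you would swap the dict for Redis or similar so the cache survives restarts.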

Recap

  1. Structural validation catches type mismatches but not context errors.
  2. AI‑driven semantic validation lets an LLM verify that the extracted value truly belongs to the intended element.
  3. Integrate the validator as a lightweight, optional step in your pipeline, logging failures and acting on them only when a threshold is crossed.

By adding an AI Judge to your scraper, you turn silent failures into actionable alerts, dramatically reducing the time spent debugging broken selectors in production. Happy scraping!

Narrowing the Snippet to Save Tokens

Here is the context‑isolation step in practice: grab only the product container, validate against it, and return None on failure so bad rows never reach storage.

def scrape_and_validate(url: str):
    html = requests.get(url).text
    extracted_data = extract_product_data(html)

    # Send only the product container to the LLM, not the whole page
    soup = BeautifulSoup(html, "html.parser")
    container = str(soup.select_one(".product-main-area"))

    validation_result = validate_extraction(container, extracted_data)

    if not validation_result["is_valid"]:
        print(f"CRITICAL: Validation failed for {url}")
        print(f"Reason: {validation_result['reason']}")
        # Log to your monitoring system (e.g., Sentry or ScrapeOps)
        return None

    return extracted_data

Optimization: Cost and Performance

Sending every request to an LLM makes your scraper slow and expensive. If you scrape 100,000 pages, a $0.01 API call per page adds up to $1,000. Use Statistical Sampling to optimize this.

1. Sampling

You don’t need to validate every row. Checking 1 % of your data is often enough to catch site‑wide layout changes.

import random

def should_validate(rate=0.01):
    return random.random() < rate

# In your loop
if should_validate(rate=0.05):   # Validate 5 % of requests
    validation_result = validate_extraction(html, data)

2. Model Selection

Avoid using GPT‑4o for simple comparisons. Models like gpt-4o-mini or claude-3-haiku are significantly cheaper and more than capable of comparing JSON to HTML. They also have much lower latency.

3. Confidence‑Based Triggers

Trigger the AI Judge only when your local code is “unsure.” For example, if a selector returns an empty string or a regex pattern fails, pass the HTML to the LLM and ask it to find the missing data.
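That trigger can be expressed as a simple guard. A sketch (needs_ai_review is a hypothetical helper name) that escalates only when a field came back empty or None:

```python
def needs_ai_review(extracted_data: dict) -> bool:
    """Escalate to the AI Judge only when local extraction looks suspicious."""
    return any(
        value is None or (isinstance(value, str) and not value.strip())
        for value in extracted_data.values()
    )

print(needs_ai_review({"title": "Sony Alpha 7 IV", "price": "$2,499.99"}))  # → False
print(needs_ai_review({"title": "Sony Alpha 7 IV", "price": ""}))           # → True
```

Combined with sampling, this means the LLM only sees pages that are either randomly chosen spot checks or genuinely suspect.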

To Wrap Up

Automating schema validation with AI moves web scraping from a “fingers crossed” approach to a rigorous engineering discipline. By using LLMs as a semantic QA layer, you can catch silent failures before they corrupt your datasets.

Key Takeaways

  • Structural validation (Pydantic) catches data‑type errors, while semantic validation (AI) catches context errors.
  • Context isolation is vital – only send relevant HTML snippets to the LLM to save on costs and improve accuracy.
  • Use sampling to keep your pipeline performant and cost‑effective.
  • Structured outputs let you integrate AI feedback directly into your code logic.

Next Step

Consider using the ScrapeOps Proxy Provider to ensure you’re getting high‑quality HTML back from your targets before you begin the validation process. Successful data extraction starts with the right tools and ends with reliable verification.
