Stop Silent Failures: Using LLMs to Validate Web Scraper Output
Source: Dev.to
The Problem: Structural vs. Semantic Validation
In a traditional data pipeline, we use structural validation. Tools like `pydantic` in Python or JSON Schema are excellent at ensuring a field named `price` is a float and a field named `sku` is a string.
```python
from pydantic import BaseModel

class Product(BaseModel):
    title: str
    price: float
    sku: str
```
If your scraper extracts the string "Free Shipping" into the price field, Pydantic will throw an error because "Free Shipping" cannot be cast to a float. This is helpful, but it doesn’t solve the semantic problem.
What if the scraper extracts "$19.99" from a “Recommended Products” sidebar instead of the main product price? Structurally, it’s a valid float. Semantically, it’s a failure. Traditional code cannot easily “read” the page to know if a piece of text is the correct piece of text. This is where an AI Judge comes in.
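To make the gap concrete, here is a stdlib-only sketch of what a structural check actually does: it strips formatting and verifies the type, with no idea where on the page the value came from. The helper name below is illustrative, not part of any library.

```python
def structurally_valid_price(raw: str) -> float:
    # Mirrors what a schema validator does: normalize formatting, cast to float.
    # It cannot know whether the value came from the main price or a sidebar.
    return float(raw.replace("$", "").replace(",", ""))

# Scraped from a "Recommended Products" sidebar, not the main product block:
sidebar_price = structurally_valid_price("$19.99")
print(sidebar_price)  # 19.99 -- passes every type check, still the wrong value
```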
The Solution: The “AI Judge” Architecture
The AI Judge pattern introduces a secondary validation step in your scraping loop. Instead of trusting the parser implicitly, take a small sample of the raw HTML and the extracted JSON and pass them to an LLM.
Workflow
- Extraction – Your scraper (Playwright, BeautifulSoup, etc.) extracts data using selectors.
- Contextual Sampling – Isolate the HTML block where the data was found.
- Verification – An LLM compares the raw HTML to the JSON.
- Decision – If the LLM flags a mismatch, the system alerts the developer or triggers a retry.
By using an LLM, you leverage its ability to understand unstructured text and visual hierarchy without writing thousands of lines of fragile regex or manual checks.
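The four-step workflow can be sketched as a single orchestration function. Every callable here (`fetch`, `extract`, `isolate_snippet`, `llm_judge`, `on_failure`) is a placeholder for the components built in the following sections; this is a shape sketch, not a finished implementation.

```python
def scrape_with_judge(url, fetch, extract, isolate_snippet, llm_judge, on_failure):
    """Orchestrates the four-step AI Judge workflow; every callable is injected."""
    html = fetch(url)                  # 1. Extraction (fetch + parse)
    data = extract(html)
    snippet = isolate_snippet(html)    # 2. Contextual sampling
    verdict = llm_judge(snippet, data) # 3. Verification
    if not verdict["is_valid"]:        # 4. Decision
        on_failure(url, verdict["reason"])
        return None
    return data
```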
Step 1: The Setup (The Fragile Scraper)
Let’s start with a standard extraction script. We’ll target a typical e‑commerce product page, similar to those in the BestBuy.com‑Scrapers repository.
```python
import requests
from bs4 import BeautifulSoup

def extract_product_data(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    # These selectors break easily if the site updates
    return {
        "title": soup.select_one(".product-title").get_text(strip=True),
        "price": soup.select_one(".price-value").get_text(strip=True),
        "sku": soup.select_one(".model-number").get_text(strip=True),
    }

# Imagine this HTML is fetched via requests
sample_html = (
    '<h1 class="product-title">Sony Alpha 7 IV</h1>'
    '<span class="price-value">$2,499.99</span>'
    '<span class="model-number">ILCE-7M4</span>'
)

data = extract_product_data(sample_html)
print(data)
```
This works today, but if the site changes .price-value to .price-display-v2, your scraper will return None or pull data from an unrelated element.
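One cheap local guard, a sketch rather than part of the original script, is to make selector misses explicit instead of letting `.get_text()` raise an `AttributeError` on `None`. The extracted `None` values then become exactly the kind of anomaly the AI Judge (or a confidence trigger, covered later) can flag.

```python
from bs4 import BeautifulSoup

def safe_select_text(soup, selector):
    """Return the element's text, or None if the selector no longer matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None

def extract_product_data_safe(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    return {
        "title": safe_select_text(soup, ".product-title"),
        "price": safe_select_text(soup, ".price-value"),  # None after a redesign
        "sku": safe_select_text(soup, ".model-number"),
    }
```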
Step 2: Building the AI Validator
To build the validator, construct a prompt that asks the LLM to act as a QA Engineer. The LLM should return a structured response – a boolean and a reason for failure.
We’ll use the openai library and JSON Mode to ensure the output is machine‑readable.
```python
import openai
import json

client = openai.OpenAI(api_key="YOUR_API_KEY")

def validate_extraction(html_snippet: str, extracted_data: dict) -> dict:
    prompt = f"""
You are a Data Quality Auditor. Compare extracted JSON data
against a raw HTML snippet to ensure accuracy.

RAW HTML:
{html_snippet}

EXTRACTED JSON:
{json.dumps(extracted_data, ensure_ascii=False, indent=2)}

Rules:
1. Check if the 'title' in JSON matches the main product title in HTML.
2. Check if the 'price' in JSON matches the actual product price.
3. Ignore minor whitespace or formatting differences.
4. If the data is missing or incorrect, set 'is_valid' to false.

Return ONLY a JSON object with this structure:
{{"is_valid": boolean, "reason": "string explaining the error if invalid"}}
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Why This Works
- Context Isolation – Sending the entire 100 KB HTML file is expensive and noisy. We only send the relevant container.
- Semantic Comparison – The LLM understands that `"$2,499.99"` in the HTML is the same as `"2499.99"` in your JSON, even if the formatting changed.
- Reasoning – If it fails, the `"reason"` field provides an immediate debugging hint.
Step 3: Implementing the Feedback Loop
Now, let’s integrate the validator into the scraping logic. In production you shouldn’t stop the entire crawl for a single error, but you should log it and halt the spider if the error rate exceeds a specific threshold.
```python
def run_scraper(url: str, error_threshold: float = 0.05):
    html = requests.get(url).text
    extracted_data = extract_product_data(html)

    # Grab only the relevant HTML snippet (e.g., the product container).
    # For demonstration we just reuse the whole page; replace with a proper selector.
    html_snippet = html  # TODO: narrow this down

    validation = validate_extraction(html_snippet, extracted_data)
    if not validation["is_valid"]:
        # Log the failure and optionally retry or flag for manual review
        print(f"Validation failed for {url}: {validation['reason']}")
        # Increment error counter, etc.
    else:
        # Persist the clean data
        print("✅ Data validated:", extracted_data)

# Example of error-rate handling (pseudo-code)
# if error_rate > error_threshold:
#     raise RuntimeError("Error rate exceeded – stopping crawl")
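The threshold check sketched in the comments above can be made concrete with a small counter. The class name and the `min_samples` warm-up window below are assumptions for illustration, not part of the original script.

```python
class ErrorRateTracker:
    """Tracks validation failures and trips once the rate crosses a threshold."""

    def __init__(self, threshold: float = 0.05, min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples  # avoid halting on the very first failure
        self.total = 0
        self.failures = 0

    def record(self, is_valid: bool) -> None:
        self.total += 1
        if not is_valid:
            self.failures += 1

    def should_halt(self) -> bool:
        if self.total < self.min_samples:
            return False
        return (self.failures / self.total) > self.threshold

# In the scraping loop:
#   tracker.record(validation["is_valid"])
#   if tracker.should_halt():
#       raise RuntimeError("Error rate exceeded – stopping crawl")
```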
Production Tips
| Tip | Description |
|---|---|
| Batch validation | Validate a batch of items together to reduce API calls (e.g., send 10 snippets in one request). |
| Caching | Cache LLM responses for identical HTML snippets to save cost. |
| Rate limiting | Respect OpenAI rate limits; use exponential back‑off on 429 responses. |
| Observability | Store `is_valid` and `reason` fields in a monitoring dashboard to spot drift early. |
| Fallback | If the LLM is unavailable, fall back to structural validation and flag for later review. |
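The caching tip from the table can be as simple as keying responses on a hash of the snippet plus the extracted data. This is an in-memory sketch (a real pipeline would use Redis or a database), and the validator is injected so it works with the `validate_extraction` function from Step 2 or any stand-in.

```python
import hashlib
import json

_validation_cache: dict[str, dict] = {}

def cached_validate(html_snippet: str, extracted_data: dict, validator) -> dict:
    """Memoize validator calls for identical (snippet, data) pairs to save cost."""
    key = hashlib.sha256(
        (html_snippet + json.dumps(extracted_data, sort_keys=True)).encode()
    ).hexdigest()
    if key not in _validation_cache:
        _validation_cache[key] = validator(html_snippet, extracted_data)
    return _validation_cache[key]
```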
Recap
- Structural validation catches type mismatches but not context errors.
- AI‑driven semantic validation lets an LLM verify that the extracted value truly belongs to the intended element.
- Integrate the validator as a lightweight, optional step in your pipeline, logging failures and acting on them only when a threshold is crossed.
By adding an AI Judge to your scraper, you turn silent failures into actionable alerts, dramatically reducing the time spent debugging broken selectors in production. Happy scraping!
For reference, here is `run_scraper` with the `TODO` from Step 3 filled in: the validation context is narrowed to the product container before calling the judge (the `.product-main-area` selector is site-specific).

```python
def run_scraper(url: str):
    html = requests.get(url).text
    extracted_data = extract_product_data(html)

    # Narrow the validation context to the product container
    soup = BeautifulSoup(html, "html.parser")
    container = str(soup.select_one(".product-main-area"))

    validation_result = validate_extraction(container, extracted_data)
    if not validation_result["is_valid"]:
        print(f"CRITICAL: Validation failed for {url}")
        print(f"Reason: {validation_result['reason']}")
        # Log to your monitoring system (e.g., Sentry or ScrapeOps)
        return None
    return extracted_data
```
Optimization: Cost and Performance
Sending every request to an LLM makes your scraper slow and expensive. If you scrape 100,000 pages, a $0.01 API call per page adds up to $1,000. Use Statistical Sampling to optimize this.
1. Sampling
You don’t need to validate every row. Checking 1 % of your data is often enough to catch site‑wide layout changes.
```python
import random

def should_validate(rate=0.01):
    return random.random() < rate

# In your loop
if should_validate(rate=0.05):  # Validate 5 % of requests
    validation_result = validate_extraction(html, data)
```
2. Model Selection
Avoid using GPT‑4o for simple comparisons. Models like gpt-4o-mini or claude-3-haiku are significantly cheaper and more than capable of comparing JSON to HTML. They also have much lower latency.
3. Confidence‑Based Triggers
Trigger the AI Judge only when your local code is “unsure.” For example, if a selector returns an empty string or a regex pattern fails, pass the HTML to the LLM and ask it to find the missing data.
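A minimal version of this trigger only escalates to the LLM when local extraction looks suspicious. The heuristics below (empty fields, a price that does not look like a currency amount) are illustrative assumptions; tune them to your schema.

```python
import re

def looks_suspicious(data: dict) -> bool:
    """Local heuristics that suggest the selectors silently failed."""
    if any(v is None or v == "" for v in data.values()):
        return True  # a selector matched nothing
    # Price should look like a currency amount, e.g. "$2,499.99"
    if not re.match(r"^\$?\d[\d,]*(\.\d{2})?$", str(data.get("price", ""))):
        return True
    return False

data = {"title": "Sony Alpha 7 IV", "price": "Free Shipping", "sku": "ILCE-7M4"}
if looks_suspicious(data):
    # Only now pay for an LLM call:
    # validation = validate_extraction(html_snippet, data)
    print("Escalating to the AI Judge")
```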
To Wrap Up
Automating schema validation with AI moves web scraping from a “fingers crossed” approach to a rigorous engineering discipline. By using LLMs as a semantic QA layer, you can catch silent failures before they corrupt your datasets.
Key Takeaways
- Structural validation (Pydantic) catches data‑type errors, while semantic validation (AI) catches context errors.
- Context isolation is vital – only send relevant HTML snippets to the LLM to save on costs and improve accuracy.
- Use sampling to keep your pipeline performant and cost‑effective.
- Structured outputs let you integrate AI feedback directly into your code logic.
Next Step
Consider using the ScrapeOps Proxy Provider to ensure you’re getting high‑quality HTML back from your targets before you begin the validation process. Successful data extraction starts with the right tools and ends with reliable verification.