Feed Rescue: Converting Raw Ulta Scrapes into Google Merchant Center XML
Source: Dev.to
Phase 1 – Analyzing the Source Data
Before we generate any XML we need to understand the raw material.
The Ulta.com‑Scrapers repository (Selenium & Playwright versions) outputs a JSONL file where each line conforms to the same ScrapedData dataclass.
Typical raw record:
{
  "productId": "2583561",
  "name": "CeraVe - Hydrating Facial Cleanser",
  "brand": "CeraVe",
  "price": 17.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Hydrating Facial Cleanser gently removes dirt...",
  "images": [
    {
      "url": "https://media.ulta.com/i/ulta/2583561?w=2000&h=2000",
      "altText": "Product Image"
    }
  ],
  "url": "https://www.ulta.com/p/hydrating-facial-cleanser-pimprod2001719"
}
Why JSONL?
Because it is line‑delimited, we can stream the file product‑by‑product without loading the entire (potentially 500 MB) dataset into memory.
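The streaming pattern is simple: parse one line at a time and hand each record downstream as a generator. A minimal sketch (the function name `iter_products` is illustrative, not from the repository):

```python
import json

def iter_products(path: str):
    """Yield one parsed product dict per non-empty JSONL line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Because it yields records lazily, memory usage stays proportional to one record, not the whole file.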
Phase 2 – Google Merchant Center Specification
Google Merchant Center accepts product feeds as RSS 2.0 (or Atom) XML documents that use the g: namespace (tab-delimited text feeds are also supported, but we target XML here). Below is the required mapping from the Ulta scraper fields to the GMC XML tags.
| Ulta Scraper Field | GMC XML Tag | Requirement / Notes |
|---|---|---|
| `productId` | `g:id` | Unique identifier |
| `name` | `g:title` | Max 150 characters |
| `description` | `g:description` | Plain text, no broken HTML |
| `url` | `g:link` | Absolute URL |
| `images[0]['url']` | `g:image_link` | High-resolution primary image |
| `price` + `currency` | `g:price` | Format: `17.99 USD` |
| `availability` | `g:availability` | Must be one of `in_stock`, `out_of_stock`, `preorder` |
| `brand` | `g:brand` | Optional but recommended |
| (static) | `g:condition` | Set to `new` for all products |
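The table above can be expressed as a single transformation step. Here is a minimal sketch of that mapping as a plain dict (the helper name `to_gmc_fields` is illustrative, not part of the repository):

```python
def to_gmc_fields(product: dict) -> dict:
    """Map a raw scraper record to GMC field names (illustrative sketch)."""
    images = product.get("images") or []
    price = product.get("price")
    return {
        "g:id": str(product.get("productId", "")),
        "g:title": (product.get("name") or "")[:150],       # 150-char cap
        "g:description": product.get("description", ""),
        "g:link": product.get("url", ""),
        "g:image_link": images[0].get("url") if images else None,
        "g:price": (f'{float(price):.2f} {product.get("currency", "USD")}'
                    if price is not None else None),
        "g:availability": product.get("availability", "out_of_stock"),
        "g:brand": product.get("brand"),
        "g:condition": "new",                                # static value
    }
```

Keeping the mapping in one place makes it easy to audit against Google's spec before any XML is generated.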
Phase 3 – Field Mapping & Transformation Logic
1. Price Normalization
def format_gmc_price(amount: float, currency: str) -> str:
    """
    Return a price string in the format required by Google Merchant Center,
    e.g. "17.99 USD".
    """
    return f"{amount:.2f} {currency}"
Ensures two‑decimal precision and appends the ISO‑4217 currency code.
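Float formatting is usually fine for two-decimal retail prices, but if an upstream source ever delivers prices as strings or with extra precision, `decimal.Decimal` sidesteps binary-float rounding surprises. A variant sketch (this function is an assumption, not part of the repository):

```python
from decimal import Decimal, ROUND_HALF_UP

def format_gmc_price_decimal(amount, currency: str) -> str:
    """Like format_gmc_price, but exact for string or Decimal inputs."""
    # str() first so a float like 17.99 is interpreted at face value
    quantized = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"{quantized} {currency}"
```

For example, `format_gmc_price_decimal("17.995", "USD")` rounds half-up to `18.00 USD` instead of depending on the nearest binary float.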
2. Availability Mapping
The scraper already returns in_stock, out_of_stock, or preorder.
For safety we map any unexpected value to out_of_stock:
def map_availability(value: str) -> str:
    allowed = {"in_stock", "out_of_stock", "preorder"}
    return value if value in allowed else "out_of_stock"
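If you ever point this pipeline at a source that has not already normalized its availability strings, a slightly more forgiving normalizer can absorb cased or spaced variants before the allowed-set check. A sketch (the synonym list here is an assumption, not exhaustive):

```python
def normalize_availability(value) -> str:
    """Map loose availability strings onto GMC's three enums (illustrative)."""
    allowed = {"in_stock", "out_of_stock", "preorder"}
    if not isinstance(value, str):
        return "out_of_stock"          # None or non-string: fail safe
    cleaned = value.strip().lower().replace(" ", "_")
    synonyms = {
        "available": "in_stock",
        "instock": "in_stock",
        "sold_out": "out_of_stock",
        "pre-order": "preorder",
    }
    cleaned = synonyms.get(cleaned, cleaned)
    return cleaned if cleaned in allowed else "out_of_stock"
```

As with `map_availability`, anything unrecognized degrades to `out_of_stock`, which is the safe default for an ads feed.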
3. Image Handling
Google requires one primary image (g:image_link) and up to 10 additional images (g:additional_image_link).
We take the first image in the list as the primary link and the next nine (if present) as additional links.
def split_images(images: list) -> tuple[str | None, list[str]]:
    """
    Return (primary_image_url, list_of_additional_image_urls).
    """
    if not images:
        return None, []
    primary = images[0].get("url")
    additional = [img.get("url") for img in images[1:11] if img.get("url")]
    return primary, additional
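A quick usage example with the record from Phase 1 (the second image URL is invented for illustration; the third entry shows that records without a `url` key are dropped):

```python
def split_images(images):
    if not images:
        return None, []
    primary = images[0].get("url")
    additional = [img.get("url") for img in images[1:11] if img.get("url")]
    return primary, additional

images = [
    {"url": "https://media.ulta.com/i/ulta/2583561?w=2000&h=2000"},
    {"url": "https://media.ulta.com/i/ulta/2583561alt1"},  # hypothetical extra
    {"altText": "no url here"},                            # missing url: dropped
]
primary, additional = split_images(images)
# primary is the first URL; additional holds only entries that have a url
```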
Phase 4 – Building the Converter Script
The pipeline streams the JSONL file, transforms each record, and writes a pretty‑printed XML feed using xml.etree.ElementTree.
import json
import xml.etree.ElementTree as ET
from xml.dom import minidom

# ----------------------------------------------------------------------
# Helper functions (price, availability, image handling)
# ----------------------------------------------------------------------
def format_gmc_price(amount, currency):
    return f"{float(amount):.2f} {currency}"

def map_availability(value):
    allowed = {"in_stock", "out_of_stock", "preorder"}
    return value if value in allowed else "out_of_stock"

def split_images(images):
    if not images:
        return None, []
    primary = images[0].get("url")
    additional = [img.get("url") for img in images[1:11] if img.get("url")]
    return primary, additional

# ----------------------------------------------------------------------
def create_gmc_feed(input_jsonl: str, output_xml: str) -> None:
    """Read a JSONL file line-by-line and write a Google Merchant Center RSS feed."""
    # Namespace registration
    g_ns = "http://base.google.com/ns/1.0"
    ET.register_namespace("g", g_ns)

    # Root element
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Ulta Product Feed"
    ET.SubElement(channel, "link").text = "https://www.ulta.com"
    ET.SubElement(channel, "description").text = "Daily product updates from Ulta"

    # Stream the JSONL file
    with open(input_jsonl, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            product = json.loads(line)

            # Container
            item = ET.SubElement(channel, "item")

            # Basic required fields
            ET.SubElement(item, f"{{{g_ns}}}id").text = str(product.get("productId"))
            ET.SubElement(item, f"{{{g_ns}}}title").text = product.get("name", "")[:150]
            ET.SubElement(item, f"{{{g_ns}}}description").text = product.get("description", "")
            ET.SubElement(item, f"{{{g_ns}}}link").text = product.get("url")
            ET.SubElement(item, f"{{{g_ns}}}brand").text = product.get("brand")
            ET.SubElement(item, f"{{{g_ns}}}condition").text = "new"

            # Price
            price = product.get("price")
            currency = product.get("currency", "USD")
            if price is not None:
                ET.SubElement(item, f"{{{g_ns}}}price").text = format_gmc_price(price, currency)

            # Availability
            ET.SubElement(item, f"{{{g_ns}}}availability").text = map_availability(
                product.get("availability", "out_of_stock")
            )

            # Images
            primary_img, additional_imgs = split_images(product.get("images", []))
            if primary_img:
                ET.SubElement(item, f"{{{g_ns}}}image_link").text = primary_img
            for img_url in additional_imgs:
                ET.SubElement(item, f"{{{g_ns}}}additional_image_link").text = img_url

    # Serialize and pretty-print
    raw_xml = ET.tostring(rss, encoding="utf-8")
    pretty_xml = minidom.parseString(raw_xml).toprettyxml(indent="  ")

    # Write to file
    with open(output_xml, "w", encoding="utf-8") as out_f:
        out_f.write(pretty_xml)

# ----------------------------------------------------------------------
# Example usage
# ----------------------------------------------------------------------
if __name__ == "__main__":
    create_gmc_feed("ulta_products.jsonl", "ulta_gmc_feed.xml")
What the script does
- Registers the `g:` namespace required by Google.
- Streams the input JSONL file line-by-line, so the raw input never has to fit in memory at once (the output XML tree is still built in memory before serialization).
- Maps each field according to the table in Phase 2, applying the helper functions for price, availability, and images.
- Writes a nicely indented XML file (`ulta_gmc_feed.xml`) ready for upload to Google Merchant Center.
Phase 5 – Handling Edge Cases
Web data is messy. Here are three common issues encountered when processing Ulta scrapes:
- HTML in Descriptions – Ulta's descriptions sometimes contain raw HTML tags or entities. While the scraper cleans most of this, it is safer to wrap the description in a CDATA section or strip any remaining tags before inserting it into the XML.
- Absolute URLs – Ensure your scraper uses the `make_absolute_url` logic from the repository. Google rejects relative URLs like `/p/product-name`.
- Zero or Missing Prices – Occasionally a product shows "Price Varies" or "Out of Stock" without a numerical value. The `:.2f` formatting will fail if `price` is `None`. Always default to `0.00` or skip the item if the price is missing.
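For the first edge case, here is a minimal sanitizer, under the assumption that any leftover markup is simple tags and entities (no scripts or nested CDATA):

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def clean_description(text) -> str:
    """Strip leftover HTML tags and decode entities before XML insertion."""
    if not text:
        return ""
    no_tags = TAG_RE.sub(" ", text)       # drop tags, keep their text content
    decoded = html.unescape(no_tags)      # &amp; -> &, &nbsp; -> non-breaking space
    return " ".join(decoded.split())      # collapse runs of whitespace
```

Running the description through `clean_description` before assigning it to `g:description` keeps the feed's text plain, which is what Google expects.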
To Wrap Up
Converting raw scraper output into a Google Merchant Center feed turns data collection into business value: bridging the gap between JSONL and GMC XML lets you automate inventory updates directly from your scraping pipeline.
Key Takeaways
- Stream your data: Use JSONL and line‑by‑line processing to handle large datasets.
- Respect the schema: Google is strict about formatting. Always include the currency code in the price and map availability to their three specific enums.
- Automate the pipeline: Trigger this script immediately after your scraper finishes to create a hands‑off data‑to‑ads pipeline.
For more information on the initial extraction, check out the ScrapeOps Residential Proxy Aggregator and the full range of implementations in the Ulta.com‑Scrapers repository.