Feed Rescue: Converting Raw Ulta Scrapes into Google Merchant Center XML

Published: February 27, 2026 at 09:45 PM EST
6 min read
Source: Dev.to

Phase 1 – Analyzing the Source Data

Before we generate any XML we need to understand the raw material.
The Ulta.com‑Scrapers repository (Selenium & Playwright versions) outputs a JSONL file where each line conforms to the same ScrapedData dataclass.

Typical raw record:

{
  "productId": "2583561",
  "name": "CeraVe - Hydrating Facial Cleanser",
  "brand": "CeraVe",
  "price": 17.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Hydrating Facial Cleanser gently removes dirt...",
  "images": [
    {
      "url": "https://media.ulta.com/i/ulta/2583561?w=2000&h=2000",
      "altText": "Product Image"
    }
  ],
  "url": "https://www.ulta.com/p/hydrating-facial-cleanser-pimprod2001719"
}

Why JSONL?
Because it is line‑delimited, we can stream the file product‑by‑product without loading the entire (potentially 500 MB) dataset into memory.
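The streaming pattern is simple enough to sketch in a few lines (iter_products is an illustrative helper name, not part of the scraper repository):

```python
import json

def iter_products(path: str):
    """Yield one parsed product dict per JSONL line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Memory use stays constant regardless of file size:
# for product in iter_products("ulta_products.jsonl"):
#     process(product)
```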

Phase 2 – Google Merchant Center Specification

Google Shopping accepts product data as an RSS 2.0 (or Atom) feed that uses the g: namespace (tab-delimited files and the Content API are the main alternatives). Below is the mapping from the Ulta scraper fields to the required GMC XML tags.

| Ulta Scraper Field | GMC XML Tag | Requirement / Notes |
|---|---|---|
| productId | g:id | Unique identifier |
| name | g:title | Max 150 characters |
| description | g:description | Plain text, no broken HTML |
| url | g:link | Absolute URL |
| images[0]['url'] | g:image_link | High-resolution primary image |
| price + currency | g:price | Format: 17.99 USD |
| availability | g:availability | One of in_stock, out_of_stock, preorder |
| brand | g:brand | Optional but recommended |
| (static) | g:condition | Set to new for all products |
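Applied to the sample record from Phase 1, that mapping yields an item like the following (written out by hand here for illustration; the Phase 4 script generates it):

```xml
<item>
  <g:id>2583561</g:id>
  <g:title>CeraVe - Hydrating Facial Cleanser</g:title>
  <g:description>Hydrating Facial Cleanser gently removes dirt...</g:description>
  <g:link>https://www.ulta.com/p/hydrating-facial-cleanser-pimprod2001719</g:link>
  <g:image_link>https://media.ulta.com/i/ulta/2583561?w=2000&amp;h=2000</g:image_link>
  <g:price>17.99 USD</g:price>
  <g:availability>in_stock</g:availability>
  <g:brand>CeraVe</g:brand>
  <g:condition>new</g:condition>
</item>
```

Note that the ampersand in the image URL must be escaped as &amp;amp; — ElementTree handles this automatically.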

Phase 3 – Field Mapping & Transformation Logic

1. Price Normalization

def format_gmc_price(amount: float, currency: str) -> str:
    """
    Return a price string in the format required by Google Merchant Center:
    e.g. 17.99 USD
    """
    return f"{amount:.2f} {currency}"

Ensures two‑decimal precision and appends the ISO‑4217 currency code.

2. Availability Mapping

The scraper already returns in_stock, out_of_stock, or preorder.
For safety we map any unexpected value to out_of_stock:

def map_availability(value: str) -> str:
    allowed = {"in_stock", "out_of_stock", "preorder"}
    return value if value in allowed else "out_of_stock"

3. Image Handling

Google requires one primary image (g:image_link) and up to 10 additional images (g:additional_image_link).
We take the first image in the list as the primary link and the next nine (if present) as additional links.

def split_images(images: list) -> tuple[str | None, list[str]]:
    """
    Returns (primary_image_url, list_of_additional_image_urls)
    """
    if not images:
        return None, []
    primary = images[0].get("url")
    additional = [img.get("url") for img in images[1:11] if img.get("url")]
    return primary, additional
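As a quick sanity check of the three helpers (bodies repeated here so the snippet runs on its own), the expected outputs for values from the sample record:

```python
def format_gmc_price(amount, currency):
    return f"{float(amount):.2f} {currency}"

def map_availability(value):
    return value if value in {"in_stock", "out_of_stock", "preorder"} else "out_of_stock"

def split_images(images):
    if not images:
        return None, []
    return images[0].get("url"), [i.get("url") for i in images[1:11] if i.get("url")]

assert format_gmc_price(17.99, "USD") == "17.99 USD"
assert map_availability("in_stock") == "in_stock"
assert map_availability("backorder") == "out_of_stock"   # unexpected value -> safe default
assert split_images([]) == (None, [])
assert split_images([{"url": "https://example.com/a.jpg"}]) == ("https://example.com/a.jpg", [])
```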

Phase 4 – Building the Converter Script

The pipeline streams the JSONL file, transforms each record, and writes a pretty‑printed XML feed using xml.etree.ElementTree.

import json
import xml.etree.ElementTree as ET
from xml.dom import minidom

# ----------------------------------------------------------------------
# Helper functions (price, availability, image handling)
# ----------------------------------------------------------------------
def format_gmc_price(amount, currency):
    return f"{float(amount):.2f} {currency}"

def map_availability(value):
    allowed = {"in_stock", "out_of_stock", "preorder"}
    return value if value in allowed else "out_of_stock"

def split_images(images):
    if not images:
        return None, []
    primary = images[0].get("url")
    additional = [img.get("url") for img in images[1:11] if img.get("url")]
    return primary, additional

# ----------------------------------------------------------------------
def create_gmc_feed(input_jsonl: str, output_xml: str) -> None:
    """Read a JSONL file line‑by‑line and write a Google Merchant Center RSS feed."""
    # Namespace registration
    g_ns = "http://base.google.com/ns/1.0"
    ET.register_namespace('g', g_ns)

    # Root element
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Ulta Product Feed"
    ET.SubElement(channel, "link").text = "https://www.ulta.com"
    ET.SubElement(channel, "description").text = "Daily product updates from Ulta"

    # Stream the JSONL file
    with open(input_jsonl, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            product = json.loads(line)

            # Container
            item = ET.SubElement(channel, "item")

            # Basic required fields
            ET.SubElement(item, f"{{{g_ns}}}id").text = str(product.get("productId"))
            ET.SubElement(item, f"{{{g_ns}}}title").text = product.get("name", "")[:150]
            ET.SubElement(item, f"{{{g_ns}}}description").text = product.get("description", "")
            ET.SubElement(item, f"{{{g_ns}}}link").text = product.get("url")
            brand = product.get("brand")
            if brand:  # g:brand is optional; omit the tag entirely when absent
                ET.SubElement(item, f"{{{g_ns}}}brand").text = brand
            ET.SubElement(item, f"{{{g_ns}}}condition").text = "new"

            # Price
            price = product.get("price")
            currency = product.get("currency", "USD")
            if price is not None:
                ET.SubElement(item, f"{{{g_ns}}}price").text = format_gmc_price(price, currency)

            # Availability
            ET.SubElement(item, f"{{{g_ns}}}availability").text = map_availability(
                product.get("availability", "out_of_stock")
            )

            # Images
            primary_img, additional_imgs = split_images(product.get("images", []))
            if primary_img:
                ET.SubElement(item, f"{{{g_ns}}}image_link").text = primary_img
            for img_url in additional_imgs:
                ET.SubElement(item, f"{{{g_ns}}}additional_image_link").text = img_url

    # Serialize and pretty‑print
    raw_xml = ET.tostring(rss, encoding="utf-8")
    pretty_xml = minidom.parseString(raw_xml).toprettyxml(indent="  ")

    # Write to file
    with open(output_xml, "w", encoding="utf-8") as out_f:
        out_f.write(pretty_xml)

# ----------------------------------------------------------------------
# Example usage
# ----------------------------------------------------------------------
if __name__ == "__main__":
    create_gmc_feed("ulta_products.jsonl", "ulta_gmc_feed.xml")

What the script does

  1. Registers the g: namespace required by Google.
  2. Streams the input JSONL file line‑by‑line (constant memory).
  3. Maps each field according to the table in Phase 2, applying the helper functions for price, availability, and images.
  4. Writes a nicely indented XML file (ulta_gmc_feed.xml) ready for upload to Google Merchant Center.
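Before uploading, it is cheap to parse the generated feed back and confirm every item carries the required tags. A minimal validation sketch (the tag list mirrors the Phase 2 table; validate_feed is an illustrative helper, not part of the script above):

```python
import xml.etree.ElementTree as ET

G = "{http://base.google.com/ns/1.0}"
REQUIRED = ["id", "title", "description", "link", "image_link", "price", "availability"]

def validate_feed(path: str) -> list[str]:
    """Return a list of human-readable problems found in a GMC feed file."""
    problems = []
    root = ET.parse(path).getroot()
    items = root.findall("./channel/item")
    if not items:
        problems.append("feed contains no <item> elements")
    for i, item in enumerate(items):
        for tag in REQUIRED:
            el = item.find(f"{G}{tag}")
            if el is None or not (el.text or "").strip():
                problems.append(f"item {i}: missing or empty g:{tag}")
    return problems
```

An empty returned list means the feed passes this basic check; anything else is worth fixing before Google sees it.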

Phase 5 – Handling Edge Cases

Web data is messy. Here are three common issues encountered when processing Ulta scrapes:

  1. HTML in Descriptions – Ulta’s descriptions sometimes contain raw HTML tags or entities such as &nbsp;. While the scraper cleans most of this, it is safer to strip remaining tags and decode entities before the text goes into the XML. (ElementTree escapes special characters on output, so well‑formedness is not the risk – leftover markup reaching shoppers is.)
  2. Absolute URLs – Ensure your scraper uses the make_absolute_url logic from the repository. Google rejects relative URLs like /p/product-name.
  3. Zero or Missing Prices – Occasionally a product shows “Price Varies” or “Out of Stock” with no numeric value. float(amount) inside format_gmc_price raises a TypeError when price is None (the script guards against this with if price is not None). Skip such items rather than defaulting to 0.00: g:price is required, and Merchant Center disapproves zero‑priced offers.
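For the first issue, a defensive cleaner is easy to bolt on. A sketch (clean_description is a hypothetical helper, not part of the repository) that strips leftover tags, decodes entities, and collapses whitespace:

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def clean_description(text: str) -> str:
    """Strip leftover HTML tags, decode entities, and collapse whitespace."""
    if not text:
        return ""
    text = TAG_RE.sub(" ", text)    # drop tags like <b> or <br/>
    text = html.unescape(text)      # &nbsp; &amp; etc. -> plain characters
    return " ".join(text.split())   # collapse runs of whitespace
```

Call it as clean_description(product.get("description", "")) when building each item.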

To Wrap Up

Converting raw scraper output into a functional marketing asset turns data collection into business value. Bridging the gap between JSONL and GMC XML allows you to automate inventory updates directly from your scraping pipeline.

Key Takeaways

  • Stream your data: Use JSONL and line‑by‑line processing to handle large datasets.
  • Respect the schema: Google is strict about formatting. Always include the currency code in the price and map availability to their three specific enums.
  • Automate the pipeline: Trigger this script immediately after your scraper finishes to create a hands‑off data‑to‑ads pipeline.

For more information on the initial extraction, check out the ScrapeOps Residential Proxy Aggregator and the full range of implementations in the Ulta.com‑Scrapers repository.
