Building a Resilient Meta Tag Analyzer with DOMParser and Serverless
Building SEO tools often sounds straightforward—until you hit the two walls of modern web scraping: Cross‑Origin Resource Sharing (CORS) and the messiness of parsing arbitrary HTML.
I recently built a Meta Tag Analyzer to help developers debug their Open Graph and Twitter Card tags. The goal was simple: take a URL, fetch its source code, and visualize exactly how social platforms see the page.
Below is a technical breakdown of the data‑fetching architecture and, more importantly, how to parse HTML safely in the browser without heavy libraries like Cheerio or JSDOM.
The Problem: CORS and the “Regex for HTML” Trap
1. The CORS Block
You cannot simply call

```js
fetch('https://example.com')
```

from the browser. The browser's security policy blocks your script from reading the response, because the target domain does not send an Access-Control-Allow-Origin header that permits your site.
2. Parsing Strategy
Once you obtain the HTML (usually via a proxy), you have a massive string of text. Beginners often try to use regex to extract <meta> tags. As the famous Stack Overflow answer warns, parsing HTML with regex is a bad idea: it breaks on unclosed tags, comments, or unexpected line breaks.
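To make the pitfall concrete, here is a naive regex failing on perfectly valid HTML (an illustrative example, not code from the actual tool):

```js
// Naive approach: assumes a fixed attribute order and a single-line tag.
const naiveRegex = /<meta property="og:title" content="([^"]*)"\s*\/?>/;

// Perfectly valid HTML that defeats it: attributes swapped and split across lines.
const html = `<meta content="Hello World"
  property="og:title">`;

console.log(naiveRegex.exec(html)); // null: the tag exists, but the regex misses it
```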
The Solution: A Proxy + DOMParser Architecture
1. Serverless Proxy
A lightweight serverless function acts as a tunnel (sketched below):
- Accepts a target URL.
- Fetches the content server‑side (where CORS doesn’t apply).
- Returns the raw HTML string to the frontend.
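Here is a minimal sketch of such a proxy. I'm assuming a Vercel/Next.js-style Node handler; the endpoint name, runtime, and error handling are illustrative, not the exact production code:

```js
// Hypothetical serverless handler (e.g. api/fetch-html.js on a Node 18+ runtime).
export default async function handler(req, res) {
  const target = req.query.url;
  if (!target) {
    return res.status(400).json({ error: "Missing ?url= parameter" });
  }

  try {
    // Server-to-server request: the browser's CORS policy does not apply here.
    const upstream = await fetch(target);
    const html = await upstream.text();

    // Allow the frontend to read the response.
    res.setHeader("Access-Control-Allow-Origin", "*");
    res.status(200).send(html);
  } catch (err) {
    res.status(502).json({ error: "Could not fetch the target URL" });
  }
}
```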
2. Native DOMParser
On the client side, instead of pulling in a heavy parsing library, I use the browser’s built‑in DOMParser API. It converts an HTML string into a manipulable DOM document without executing scripts or loading external resources (images, CSS, etc.).
The Code: Parsing HTML Strings Safely
```js
/**
 * Extracts meta tags from a raw HTML string using the DOMParser API.
 *
 * @param {string} rawHtml - The HTML string fetched from the proxy.
 * @returns {object} - An object containing standard, OG, and Twitter metadata.
 */
const extractMetaData = (rawHtml) => {
  // 1. Initialise the DOMParser
  const parser = new DOMParser();

  // 2. Parse the string into a Document.
  // 'text/html' ensures it parses as HTML, forgiving syntax errors.
  const doc = parser.parseFromString(rawHtml, "text/html");

  // Helper to safely get content from a selector
  const getMeta = (selector, attribute = "content") => {
    const element = doc.querySelector(selector);
    return element ? element.getAttribute(attribute) : null;
  };

  // 3. Extract Data
  // querySelector handles fallback logic efficiently
  const data = {
    title: doc.title || getMeta('meta[property="og:title"]'),
    description:
      getMeta('meta[name="description"]') ||
      getMeta('meta[property="og:description"]'),

    // Open Graph specifics
    og: {
      image: getMeta('meta[property="og:image"]'),
      url: getMeta('meta[property="og:url"]'),
      type: getMeta('meta[property="og:type"]'),
    },

    // Twitter Card specifics
    twitter: {
      card: getMeta('meta[name="twitter:card"]'),
      creator: getMeta('meta[name="twitter:creator"]'),
    },

    // Technical SEO
    robots: getMeta('meta[name="robots"]'),
    viewport: getMeta('meta[name="viewport"]'),
    canonical: getMeta('link[rel="canonical"]', "href"),
  };

  return data;
};
```
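To show how the two halves fit together, here is a hedged example of the frontend wiring; the /api/fetch-html endpoint is the hypothetical proxy sketched earlier, not necessarily the real route name:

```js
// Fetch raw HTML through the proxy, then parse it entirely client-side.
const analyze = async (targetUrl) => {
  const response = await fetch(`/api/fetch-html?url=${encodeURIComponent(targetUrl)}`);
  const rawHtml = await response.text();
  return extractMetaData(rawHtml);
};

analyze("https://github.com").then((meta) => {
  console.log(meta.title);
  console.log(meta.og.image);
});
```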
Why This Approach Works Well
| Aspect | Benefit |
|---|---|
| Security | DOMParser creates an inert document. Scripts inside rawHtml are marked non‑executable, preventing XSS during analysis. |
| Performance | Only the HTML string is parsed; no network requests for images, CSS, or fonts are made. |
| Resilience | Browsers are tolerant of malformed HTML. DOMParser handles missing closing tags just like a real browser, so the scraper doesn’t crash on broken pages. |
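If you want to verify the security claim yourself, a quick console experiment (illustrative only) shows that parsed scripts never run:

```js
const inert = new DOMParser().parseFromString(
  '<p>Hello</p><script>window.hacked = true;</script><img src="tracker.png">',
  "text/html"
);

console.log(typeof window.hacked);                 // "undefined": the script never ran
console.log(inert.querySelector("p").textContent); // "Hello": the markup is still queryable
// The <img> element exists in the inert document, but no network request is made for it.
```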
Live Demo
Try it yourself: NasajTools – Meta Tag Analyzer
Enter any URL (e.g., github.com) to see the DOMParser extraction in real time.
Performance Considerations
While testing, I ran into massive HTML pages (some legacy sites serve 2 MB+ files). To keep the UI responsive, I applied two optimisations:
1. Request Abort
On the proxy side, I set a strict 3‑second timeout. SEO bots rarely wait longer, so aborting after that is realistic.
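As a sketch of that server-side cutoff (assuming a Node 18+ runtime where fetch and AbortController are globals; fetchWithTimeout is a hypothetical helper name):

```js
// Give up on the upstream request after 3 seconds.
const fetchWithTimeout = async (target, ms = 3000) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    const upstream = await fetch(target, { signal: controller.signal });
    return await upstream.text();
  } finally {
    clearTimeout(timer); // clear the timer on fast responses
  }
};
```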
2. Content‑Length Check & Head‑Only Parsing
Meta tags are almost always located inside the <head>. If the HTML string exceeds a safe threshold, I slice it to the first 100 KB (or up to </head> if present) before feeding it to DOMParser.
```js
// Optimization: only parse the head if the file is massive
const MAX_SIZE = 100_000; // 100 KB

if (rawHtml.length > MAX_SIZE) {
  // Cut off after the closing </head> tag to keep it valid
  const headEnd = rawHtml.indexOf('</head>');
  if (headEnd !== -1) {
    rawHtml = rawHtml.substring(0, headEnd + 7); // include '</head>'
  } else {
    // No </head> found: fall back to a hard cap at the first 100 KB
    rawHtml = rawHtml.substring(0, MAX_SIZE);
  }
}
```
This truncation strategy dramatically reduced processing time on low‑end mobile devices during my testing.
TL;DR
- CORS → solve with a serverless proxy.
- HTML parsing → avoid regex; use the native DOMParser.
- Performance → abort long requests, and parse only the <head> when the payload is huge.
Give it a spin, and you’ll see how a few lines of vanilla JavaScript can power a reliable, secure, and fast SEO meta‑tag analyzer. 🚀
Hopefully, this helps you if you are looking to build client‑side scrapers or analyzers!