Building a Resilient Meta Tag Analyzer with DOMParser and Serverless

Published: January 18, 2026 at 01:23 PM EST
4 min read
Source: Dev.to

Building SEO tools often sounds straightforward—until you hit the two walls of modern web scraping: Cross‑Origin Resource Sharing (CORS) and the messiness of parsing arbitrary HTML.

I recently built a Meta Tag Analyzer to help developers debug their Open Graph and Twitter Card tags. The goal was simple: take a URL, fetch its source code, and visualize exactly how social platforms see the page.

Below is a technical breakdown of the data‑fetching architecture and, more importantly, how to parse HTML safely in the browser without heavy libraries like Cheerio or JSDOM.

The Problem: CORS and the “Regex for HTML” Trap

1. The CORS Block

You cannot simply do:

fetch('https://example.com')

from the browser. The browser’s security policy will block the request because the target domain does not send an Access-Control-Allow-Origin header for your site.

2. Parsing Strategy

Once you obtain the HTML (usually via a proxy), you have a massive string of text. Beginners often try to use regex to extract <meta> tags. As the famous Stack Overflow answer warns, parsing HTML with regex is a bad idea: it breaks on unclosed tags, comments, or unexpected line breaks.
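
A quick illustration of the failure mode (the pattern below is deliberately naive and purely for demonstration):

// A naive pattern that assumes a fixed attribute order on a single line
const naive = /<meta name="description" content="([^"]*)"/;

// Perfectly valid HTML: the attributes are reordered and split across lines
const html = `<meta content="Hello world"
  name="description">`;

console.log(naive.exec(html)); // null: the tag is silently missed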

The Solution: A Proxy + DOMParser Architecture

1. Serverless Proxy

A lightweight serverless function acts as a tunnel:

  • Accepts a target URL.
  • Fetches the content server‑side (where CORS doesn’t apply).
  • Returns the raw HTML string to the frontend.
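
For reference, here is a minimal sketch of such a proxy, assuming a Vercel‑style Node serverless function (the route, handler signature, and User‑Agent string are illustrative, not the production code behind the tool):

/**
 * Minimal proxy sketch (assumes a Vercel-style Node runtime with global fetch).
 */
export default async function handler(req, res) {
  const target = req.query.url;

  // Only proxy absolute http(s) URLs
  if (!target || !/^https?:\/\//i.test(target)) {
    return res.status(400).json({ error: "A valid ?url= parameter is required" });
  }

  try {
    // Server-side fetch: CORS does not apply here.
    // The 3-second cutoff is covered in the Performance section below.
    const response = await fetch(target, {
      signal: AbortSignal.timeout(3000),
      headers: { "User-Agent": "MetaTagAnalyzerBot/1.0 (illustrative)" },
    });
    const html = await response.text();

    // We control this header, so the frontend is allowed to read the response
    res.setHeader("Access-Control-Allow-Origin", "*");
    res.status(200).send(html);
  } catch (err) {
    res.status(502).json({ error: "Failed to fetch the target URL" });
  }
}

The frontend then calls this route instead of the target page and receives the raw HTML as text; a full client‑side example follows the parser code below.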

2. Native DOMParser

On the client side, instead of pulling in a heavy parsing library, I use the browser’s built‑in DOMParser API. It converts an HTML string into a manipulable DOM document without executing scripts or loading external resources (images, CSS, etc.).

The Code: Parsing HTML Strings Safely

/**
 * Extracts meta tags from a raw HTML string using the DOMParser API.
 *
 * @param {string} rawHtml - The HTML string fetched from the proxy.
 * @returns {object} - An object containing standard, OG, and Twitter metadata.
 */
const extractMetaData = (rawHtml) => {
  // 1. Initialise the DOMParser
  const parser = new DOMParser();

  // 2. Parse the string into a Document.
  //    'text/html' ensures it parses as HTML, forgiving syntax errors.
  const doc = parser.parseFromString(rawHtml, "text/html");

  // Helper to safely get content from a selector
  const getMeta = (selector, attribute = "content") => {
    const element = doc.querySelector(selector);
    return element ? element.getAttribute(attribute) : null;
  };

  // 3. Extract Data
  //    querySelector handles fallback logic efficiently
  const data = {
    title: doc.title || getMeta('meta[property="og:title"]'),
    description:
      getMeta('meta[name="description"]') ||
      getMeta('meta[property="og:description"]'),

    // Open Graph specifics
    og: {
      image: getMeta('meta[property="og:image"]'),
      url: getMeta('meta[property="og:url"]'),
      type: getMeta('meta[property="og:type"]'),
    },

    // Twitter Card specifics
    twitter: {
      card: getMeta('meta[name="twitter:card"]'),
      creator: getMeta('meta[name="twitter:creator"]'),
    },

    // Technical SEO
    robots: getMeta('meta[name="robots"]'),
    viewport: getMeta('meta[name="viewport"]'),
    canonical: getMeta('link[rel="canonical"]', "href"),
  };

  return data;
};
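
Wiring it together on the client might look like this (the /api/proxy route is the illustrative endpoint from the proxy sketch above):

const analyze = async (targetUrl) => {
  // Fetch the raw HTML through the serverless proxy, not the target directly
  const response = await fetch(`/api/proxy?url=${encodeURIComponent(targetUrl)}`);
  const rawHtml = await response.text();

  return extractMetaData(rawHtml);
};

// Example:
analyze("https://github.com").then((meta) => console.log(meta.title, meta.og.image));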

Why This Approach Works Well

  • Security: DOMParser creates an inert document. Scripts inside rawHtml are marked non‑executable, preventing XSS during analysis.
  • Performance: Only the HTML string is parsed; no network requests are made for images, CSS, or fonts.
  • Resilience: Browsers are tolerant of malformed HTML. DOMParser handles missing closing tags just like a real browser, so the scraper doesn’t crash on broken pages.
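
You can sanity‑check the security claim in the console with a snippet like this (purely illustrative; window.pwned is just a marker variable):

const hostile = `
  <head>
    <script>window.pwned = true<\/script>
    <img src="https://tracker.example/pixel.gif">
  </head>`;

const inert = new DOMParser().parseFromString(hostile, "text/html");

console.log(inert.querySelector("script") !== null); // true: the tag exists in the parsed DOM
console.log(window.pwned);                           // undefined: the script never ran
// No request is fired for the <img> either, since the document is inert.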

Live Demo

Try it yourself: NasajTools – Meta Tag Analyzer

Enter any URL (e.g., github.com) to see the DOMParser extraction in real time.

Performance Considerations

While testing, I ran into massive HTML pages (some legacy sites serve 2 MB+ files). To keep the UI responsive, I applied two optimisations:

1. Request Abort

On the proxy side, I set a strict 3‑second timeout. SEO bots rarely wait longer, so aborting after that is realistic.
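
The proxy sketch above uses AbortSignal.timeout(); for runtimes without it, the same cutoff can be wired up manually with an AbortController (the helper name is just for illustration):

// Illustrative helper: abort the upstream fetch after `ms` milliseconds
const fetchWithTimeout = async (targetUrl, ms = 3000) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);

  try {
    const response = await fetch(targetUrl, { signal: controller.signal });
    return await response.text();
  } finally {
    clearTimeout(timer); // always clear the timer, even if the fetch threw
  }
};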

2. Content‑Length Check & Head‑Only Parsing

Meta tags are almost always located inside the <head>. If the HTML string exceeds a safe threshold, I slice it to the first 100 KB (or up to </head> if present) before feeding it to DOMParser.

// Optimization: only parse the head if the file is massive
const MAX_SIZE = 100_000; // 100 KB
if (rawHtml.length > MAX_SIZE) {
  const headEnd = rawHtml.indexOf('</head>');
  rawHtml = headEnd !== -1
    ? rawHtml.substring(0, headEnd + '</head>'.length) // cut right after the closing </head>
    : rawHtml.substring(0, MAX_SIZE);                  // no </head> found: keep the first 100 KB
}

This truncation strategy dramatically reduced processing time on low‑end mobile devices during my testing.

TL;DR

  • CORS → solve with a serverless proxy.
  • HTML parsing → avoid regex; use the native DOMParser.
  • Performance → abort long requests, and parse only the <head> when the payload is huge.

Give it a spin, and you’ll see how a few lines of vanilla JavaScript can power a reliable, secure, and fast SEO meta‑tag analyzer. 🚀

Hopefully, this helps you if you are looking to build client‑side scrapers or analyzers!

https://nasajtools.com/tools/seo/meta-tag-analyzer.html
