HTML Parsing Algorithm and Memory Structure

Published: (March 23, 2026 at 05:19 PM EDT)
7 min read
Source: Dev.to

Source: Dev.to

Ever wonder what actually happens between the moment your browser receives raw HTML bytes and the moment you see a page?

Most of us just load HTML files all day and never think about the machinery underneath. This is the first article in a series where we dig into that machinery. Our end goal is a working HTML parser and static site generator, written from scratch, for the pure joy of understanding how things work. No frameworks, no libraries—just us and the spec!


The HTML parser is a state machine

Rather than building a tree directly, the browser reads the HTML character‑by‑character and switches between states as it goes.

Analogy: Think of a traffic light. The light is always in one state—red, yellow, or green. Based on what happens (timer expires, car approaches), it transitions. The HTML parser works the same way: it is always in a specific state, and the next character determines where it goes.

  1. Data state – the default.
  2. When a is left open before another starts, the parser automatically closes the first one.
  • Illegal nesting is rearranged into a valid structure.

Special cases:

  • Void elements (, “, …) never have closing tags.
  • Elements like “ have special insertion modes that affect how subsequent elements are treated.

Reference: WHATWG HTML Spec — Tree Construction


How the DOM is stored in memory

The DOM tree lives as a graph data structure in the browser’s heap. Each element becomes a node object.

ConceptDescription
Node ObjectsContain tag name, attributes, text content, and pointers to related nodes.
Parent‑Child RelationshipsEach node holds a reference to its parent; parents keep references to their children.
Sibling RelationshipsNodes track nextSibling and previousSibling via pointers (exposed as node.nextSibling, node.previousSibling, node.firstChild).
Linked‑List StructureChildren of a parent are stored as a linked list; nodes are scattered across the heap, linked by pointers.
Text NodesText is wrapped in separate node objects (`
Hello
` → a “ element node + a text node containing “Hello”).
Attributes StorageEach element has an internal map of attributes. element.getAttribute('class') looks up this map.
Memory EfficiencyEngines like V8 intern strings: identical strings are stored once in a StringTable, so multiple elements can reference the same string object.

Reference: V8 StringTable implementation — Zhenghao He, “JavaScript memory model demystified”


End‑to‑end parsing flow

  1. Initial State – a buffer holds the raw HTML; a pointer starts at position 0.
  2. Tokenization – the state machine reads characters and emits tokens (e.g., StartTag: div, attributes: class="container"). Tokens are consumed immediately; they don’t linger.
  3. Tree Building – each token is fed to the tree‑construction algorithm:
    • Start tag → create a node, append it to the current open element, push it onto the stack.
    • End tag → pop elements until a match is found.
  4. Incremental Growth – the tree expands node by node: document → → → … deeper nesting follows.
  5. Memory Allocation – every new node gets heap memory; its parent receives a pointer to that location; sibling linked‑list pointers are updated.
  6. Script Execution – encountering a “ pauses tokenization, downloads & runs the script, which may modify the DOM (add/remove/rearrange nodes). Parsing resumes after the script finishes.
  7. Incremental Parsing – browsers don’t wait for the whole file. As bytes arrive, they’re fed straight into the parser, allowing progressive rendering. The spec calls this speculative HTML parsing.
  8. Whitespace Handling – (details omitted for brevity).

TL;DR

  • The browser parses HTML with a state‑machine tokenizer (≈80 states).
  • Tokens feed a tree‑construction algorithm that builds the DOM while handling errors and auto‑closing tags.
  • The resulting DOM is a graph of node objects linked via pointers, stored efficiently in the heap (string interning, linked‑list children, etc.).
  • Parsing is incremental and can be paused for script execution, enabling progressive page rendering.

Feel free to dive into the referenced specifications for deeper details!

Text Nodes and Whitespace

xt nodes get created for whitespace between tags. A line break and spaces between and produce a text node containing just whitespace. Some browsers optimise by collapsing insignificant whitespace, but the spec preserves it.


Document Fragment

During parsing, the tree is a mutable data structure in memory. After parsing completes, scripts can still modify the tree in real‑time — the in‑memory structure updates immediately.


Separation of Concerns

The elegance here is the separation of concerns:

  • State machine – handles the syntactic complexity of HTML.
  • Tree‑construction algorithm – handles semantic relationships and error recovery.
  • In‑memory tree structure – provides efficient access for rendering and scripting.

Reference: WHATWG HTML Spec — Speculative HTML Parsing; MDN — Memory Management


Example HTML Page


  Simple Page

  Header Content
  Div One
  Div Two
  Footer Content

How the DOM Tree Is Connected in Memory

Each block below represents a separate object on the heap, linked by pointers. The addresses are illustrative; real addresses aren’t this tidy, and real node objects are much larger than 8 bytes each.

HEAP:

[0xA00: Document node]
  └─ children: [0xA10]

[0xA10: html-node { parent: 0xA00, children: [0xB00, 0xC00] }]
  ├─ firstChild → 0xB00 (head)
  └─ lastChild  → 0xC00 (body)

[0xB00: head-node { parent: 0xA10, children: [0xB20] }]
  └─ [0xB20: title-node { parent: 0xB00, children: [0xB30] }]
       └─ [0xB30: text "Simple Page" { parent: 0xB20 }]

[0xC00: body-node { parent: 0xA10, children: [0xD00, 0xD40, 0xD80, 0xDC0] }]
  └─ firstChild → 0xD00

[0xD00: header-node { parent: 0xC00, nextSibling: 0xD40 }]
  └─ [0xD10: text "Header Content" { parent: 0xD00 }]

[0xD40: div-node { parent: 0xC00, prevSibling: 0xD00, nextSibling: 0xD80 }]
  └─ [0xD50: text "Div One" { parent: 0xD40 }]

[0xD80: div-node { parent: 0xC00, prevSibling: 0xD40, nextSibling: 0xDC0 }]
  └─ [0xD90: text "Div Two" { parent: 0xD80 }]

[0xDC0: footer-node { parent: 0xC00, prevSibling: 0xD80 }]
  └─ [0xDD0: text "Footer Content" { parent: 0xDC0 }]

Each block is a separate object allocated somewhere on the heap. The hex addresses (e.g., 0xD40) are what the pointers actually store in memory.

  • The html-node at 0xA10 has no parent beyond the document root. It points to two children: head at 0xB00 and body at 0xC00.
  • nextSibling: 0xD80 on the first means “the next sibling lives at address `0xD80`”, which is the second.
  • The text node “Div One” at 0xD50 is a separate object; its parent pointer leads to 0xD40 (the first “).

When the browser needs to go from to, it follows the pointer 0xC00, jumps to that address, and finds the body node object there. These objects are scattered across the heap, but the pointers weave them into one coherent tree. The physical layout in memory doesn’t matter; the logical structure is defined entirely by the pointers.


What’s Next?

Now that we understand how browsers do it, it’s time to get our hands dirty.

Roadmap for the Series

  1. HTML Parsing Algorithms and Memory StructureYou are here.
  2. Building a Tokenizer from Scratch – Write a state machine that turns raw HTML into tokens, character by character.
  3. From Tokens to Tree: Implementing a DOM Builder – Use the tokens to construct an in‑memory tree with the stack‑based algorithm and basic error recovery.
  4. A Minimal Static Site Generator – Read HTML templates, inject content, and output a complete static site.
  5. Rendering to the Terminal (Bonus) – Walk the DOM tree and render a simplified version of the page right in the console.

The goal isn’t to compete with Chrome. It’s to build something small, understand every line, and have fun doing it. See you in the next one!


Which Language Should We Use?

  • C99?
  • Python?
  • FASM?
  • Forth?
0 views
Back to Blog

Related posts

Read more »