HTML Parsing Algorithm and Memory Structure
Source: Dev.to
Ever wonder what actually happens between the moment your browser receives raw HTML bytes and the moment you see a page?
Most of us just load HTML files all day and never think about the machinery underneath. This is the first article in a series where we dig into that machinery. Our end goal is a working HTML parser and static site generator, written from scratch, for the pure joy of understanding how things work. No frameworks, no libraries—just us and the spec!
The HTML parser is a state machine
Rather than building a tree directly, the browser reads the HTML character‑by‑character and switches between states as it goes.
Analogy: Think of a traffic light. The light is always in one state—red, yellow, or green. Based on what happens (timer expires, car approaches), it transitions. The HTML parser works the same way: it is always in a specific state, and the next character determines where it goes.
- Data state – the default.
- When a
is left open before anotherstarts, the parser automatically closes the first one.
- Illegal nesting is rearranged into a valid structure.
Special cases:
- Void elements (
, “, …) never have closing tags. - Elements like “ have special insertion modes that affect how subsequent elements are treated.
Reference: WHATWG HTML Spec — Tree Construction
How the DOM is stored in memory
The DOM tree lives as a graph data structure in the browser’s heap. Each element becomes a node object.
| Concept | Description |
|---|---|
| Node Objects | Contain tag name, attributes, text content, and pointers to related nodes. |
| Parent‑Child Relationships | Each node holds a reference to its parent; parents keep references to their children. |
| Sibling Relationships | Nodes track nextSibling and previousSibling via pointers (exposed as node.nextSibling, node.previousSibling, node.firstChild). |
| Linked‑List Structure | Children of a parent are stored as a linked list; nodes are scattered across the heap, linked by pointers. |
| Text Nodes | Text is wrapped in separate node objects (` |
| Hello | |
| ` → a “ element node + a text node containing “Hello”). | |
| Attributes Storage | Each element has an internal map of attributes. element.getAttribute('class') looks up this map. |
| Memory Efficiency | Engines like V8 intern strings: identical strings are stored once in a StringTable, so multiple elements can reference the same string object. |
Reference: V8 StringTable implementation — Zhenghao He, “JavaScript memory model demystified”
End‑to‑end parsing flow
- Initial State – a buffer holds the raw HTML; a pointer starts at position 0.
- Tokenization – the state machine reads characters and emits tokens (e.g.,
StartTag: div, attributes: class="container"). Tokens are consumed immediately; they don’t linger. - Tree Building – each token is fed to the tree‑construction algorithm:
- Start tag → create a node, append it to the current open element, push it onto the stack.
- End tag → pop elements until a match is found.
- Incremental Growth – the tree expands node by node:
document → → →… deeper nesting follows. - Memory Allocation – every new node gets heap memory; its parent receives a pointer to that location; sibling linked‑list pointers are updated.
- Script Execution – encountering a “ pauses tokenization, downloads & runs the script, which may modify the DOM (add/remove/rearrange nodes). Parsing resumes after the script finishes.
- Incremental Parsing – browsers don’t wait for the whole file. As bytes arrive, they’re fed straight into the parser, allowing progressive rendering. The spec calls this speculative HTML parsing.
- Whitespace Handling – (details omitted for brevity).
TL;DR
- The browser parses HTML with a state‑machine tokenizer (≈80 states).
- Tokens feed a tree‑construction algorithm that builds the DOM while handling errors and auto‑closing tags.
- The resulting DOM is a graph of node objects linked via pointers, stored efficiently in the heap (string interning, linked‑list children, etc.).
- Parsing is incremental and can be paused for script execution, enabling progressive page rendering.
Feel free to dive into the referenced specifications for deeper details!
Text Nodes and Whitespace
xt nodes get created for whitespace between tags. A line break and spaces between and produce a text node containing just whitespace. Some browsers optimise by collapsing insignificant whitespace, but the spec preserves it.
Document Fragment
During parsing, the tree is a mutable data structure in memory. After parsing completes, scripts can still modify the tree in real‑time — the in‑memory structure updates immediately.
Separation of Concerns
The elegance here is the separation of concerns:
- State machine – handles the syntactic complexity of HTML.
- Tree‑construction algorithm – handles semantic relationships and error recovery.
- In‑memory tree structure – provides efficient access for rendering and scripting.
Reference: WHATWG HTML Spec — Speculative HTML Parsing; MDN — Memory Management
Example HTML Page
Simple Page
Header Content
Div One
Div Two
Footer Content
How the DOM Tree Is Connected in Memory
Each block below represents a separate object on the heap, linked by pointers. The addresses are illustrative; real addresses aren’t this tidy, and real node objects are much larger than 8 bytes each.
HEAP:
[0xA00: Document node]
└─ children: [0xA10]
[0xA10: html-node { parent: 0xA00, children: [0xB00, 0xC00] }]
├─ firstChild → 0xB00 (head)
└─ lastChild → 0xC00 (body)
[0xB00: head-node { parent: 0xA10, children: [0xB20] }]
└─ [0xB20: title-node { parent: 0xB00, children: [0xB30] }]
└─ [0xB30: text "Simple Page" { parent: 0xB20 }]
[0xC00: body-node { parent: 0xA10, children: [0xD00, 0xD40, 0xD80, 0xDC0] }]
└─ firstChild → 0xD00
[0xD00: header-node { parent: 0xC00, nextSibling: 0xD40 }]
└─ [0xD10: text "Header Content" { parent: 0xD00 }]
[0xD40: div-node { parent: 0xC00, prevSibling: 0xD00, nextSibling: 0xD80 }]
└─ [0xD50: text "Div One" { parent: 0xD40 }]
[0xD80: div-node { parent: 0xC00, prevSibling: 0xD40, nextSibling: 0xDC0 }]
└─ [0xD90: text "Div Two" { parent: 0xD80 }]
[0xDC0: footer-node { parent: 0xC00, prevSibling: 0xD80 }]
└─ [0xDD0: text "Footer Content" { parent: 0xDC0 }]Each block is a separate object allocated somewhere on the heap. The hex addresses (e.g., 0xD40) are what the pointers actually store in memory.
- The
html-nodeat0xA10has no parent beyond the document root. It points to two children:headat0xB00andbodyat0xC00. nextSibling: 0xD80on the firstmeans “the next sibling lives at address `0xD80`”, which is the second.- The text node “Div One” at
0xD50is a separate object; itsparentpointer leads to0xD40(the first “).
When the browser needs to go from to, it follows the pointer 0xC00, jumps to that address, and finds the body node object there. These objects are scattered across the heap, but the pointers weave them into one coherent tree. The physical layout in memory doesn’t matter; the logical structure is defined entirely by the pointers.
What’s Next?
Now that we understand how browsers do it, it’s time to get our hands dirty.
Roadmap for the Series
- HTML Parsing Algorithms and Memory Structure – You are here.
- Building a Tokenizer from Scratch – Write a state machine that turns raw HTML into tokens, character by character.
- From Tokens to Tree: Implementing a DOM Builder – Use the tokens to construct an in‑memory tree with the stack‑based algorithm and basic error recovery.
- A Minimal Static Site Generator – Read HTML templates, inject content, and output a complete static site.
- Rendering to the Terminal (Bonus) – Walk the DOM tree and render a simplified version of the page right in the console.
The goal isn’t to compete with Chrome. It’s to build something small, understand every line, and have fun doing it. See you in the next one!
Which Language Should We Use?
- C99?
- Python?
- FASM?
- Forth?