Columnar Storage Is Normalization

Published: 2 days ago (April 22, 2026 at 08:30 AM EDT)

3 min read

Source: Hacker News

Row‑oriented vs. column‑oriented representation

Consider this data:

data = [
    { "name": "Smudge", "colour": "black" },
    { "name": "Sissel", "colour": "grey" },
    { "name": "Hamlet", "colour": "black" }
]

This is a typical row‑oriented table. Adding a new row is straightforward:

{ "name": "Petee", "colour": "black" }

Only a few pages need to be touched on disk, regardless of how many columns the row has. Looking up a row is also fast because all of its columns are stored together.

However, computing a histogram of colour requires scanning the entire row data, even the name fields we don’t need.

A column‑oriented representation flips these trade‑offs:

data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

Now we can read only the colour column to build the histogram, but inserting a new row or retrieving a specific row requires touching multiple column vectors.

Columnar storage as extreme normalization

Think of columnar data as a set of very narrow tables, each containing a primary key (or implicit position) plus a single attribute.

Denormalized table

id	name	age
12	Bob	30
93	Tom	35
27	Kim	28

Normalized tables

Name

id	name
12	Bob
93	Tom
27	Kim

Age

id	age
12	30
93	35
27	28

Reconstructing the original table is simply a join on id.

In a column‑stored table, the primary key corresponds to the ordinal position of each value.

Example with implicit IDs

Original columnar data:

data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

Explicitly with IDs:

id	name
0	Smudge
1	Sissel
2	Hamlet

id	colour
0	black
1	grey
2	black

Since the id is implied by the array index, the tables can also be shown without it:

name column

name
Smudge
Sissel
Hamlet

colour column

colour
black
grey
black

Why this perspective matters

Viewing columnar storage as an extreme form of normalization unifies many query‑processing concepts—projections, joins, and data‑format manipulation. While queries are logically blind to the physical format, recognizing that “reconstructing a row from columnar storage” is essentially a join can provide a useful mental model for understanding performance characteristics.

Columnar Storage Is Normalization

Row‑oriented vs. column‑oriented representation

Columnar storage as extreme normalization

Denormalized table

Normalized tables

Name

Age

Example with implicit IDs

name column

colour column

Why this perspective matters

Related posts

🚀 REST vs gRPC Performance in Go: A Practical Benchmark-Driven Guide

Building VidDrop. What I learned about performance and user experience.

Ubuntu 26.04 LTS Released

I Built a Cosmetic Ingredients Dataset — Here’s What I Found