Columnar Storage Is Normalization

Published: April 22, 2026 at 08:30 AM EDT

Source: Hacker News

Row‑oriented vs. column‑oriented representation

Consider this data:

data = [
    { "name": "Smudge", "colour": "black" },
    { "name": "Sissel", "colour": "grey" },
    { "name": "Hamlet", "colour": "black" }
]

This is a typical row‑oriented table. Adding a new row is straightforward:

{ "name": "Petee", "colour": "black" }

Only a few pages need to be touched on disk, regardless of how many columns the row has. Looking up a row is also fast because all of its columns are stored together.

However, computing a histogram of colour requires scanning the entire row data, even the name fields we don’t need.
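As a rough sketch of both operations in Python (the in‑memory list stands in for rows on disk, and the variable names `rows` and `histogram` are illustrative, not from the original):

```python
from collections import Counter

# Row-oriented: each record keeps all of its fields together.
rows = [
    {"name": "Smudge", "colour": "black"},
    {"name": "Sissel", "colour": "grey"},
    {"name": "Hamlet", "colour": "black"},
]

# Inserting a row is a single append to one structure.
rows.append({"name": "Petee", "colour": "black"})

# Building the histogram visits every record, name field included,
# even though only "colour" is actually used.
histogram = Counter(row["colour"] for row in rows)
```

The generator has to walk whole records to reach `colour`, which is the in‑memory analogue of reading pages full of fields the query never uses.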

A column‑oriented representation flips these trade‑offs:

data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

Now we can read only the colour column to build the histogram, but inserting a new row or retrieving a specific row requires touching multiple column vectors.
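A minimal sketch of the flipped trade‑offs, assuming plain Python lists as column vectors; `insert_row` and `get_row` are hypothetical helpers written for this illustration:

```python
from collections import Counter

# Column-oriented: one list per attribute; position i is row i.
columns = {
    "name": ["Smudge", "Sissel", "Hamlet"],
    "colour": ["black", "grey", "black"],
}

# The histogram reads a single column and never touches the names.
histogram = Counter(columns["colour"])


def insert_row(columns, row):
    """Append one value to every column vector (hypothetical helper)."""
    for field, values in columns.items():
        values.append(row[field])


def get_row(columns, index):
    """Gather one value from every column vector (hypothetical helper)."""
    return {field: values[index] for field, values in columns.items()}


insert_row(columns, {"name": "Petee", "colour": "black"})
petee = get_row(columns, 3)
```

Both helpers loop over every column, so their cost grows with the number of columns, while the histogram touches exactly one.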

Columnar storage as extreme normalization

Think of columnar data as a set of very narrow tables, each containing a primary key (or implicit position) plus a single attribute.

Denormalized table

id  name  age
12  Bob   30
93  Tom   35
27  Kim   28

Normalized tables

Name

id  name
12  Bob
93  Tom
27  Kim

Age

id  age
12  30
93  35
27  28

Reconstructing the original table is simply a join on id.
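The join can be sketched in Python as follows. This assumes the ids in the example tables are 12, 93, and 27, and it models each narrow table as a dict keyed by id; `join_on_id` is an illustrative helper, not a database API:

```python
# Narrow tables keyed by id, one attribute each.
name_table = {12: "Bob", 93: "Tom", 27: "Kim"}
age_table = {12: 30, 93: 35, 27: 28}


def join_on_id(names, ages):
    """Inner join of two id-keyed tables (illustrative helper)."""
    return [
        {"id": key, "name": names[key], "age": ages[key]}
        for key in names
        if key in ages
    ]


people = join_on_id(name_table, age_table)
```

Each output record needs one lookup per narrow table, so a wider original table means more lookups per reconstructed row.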

In a column‑stored table, the primary key is implicit: it is the ordinal position of each value within its column.

Example with implicit IDs

Original columnar data:

data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

Explicitly with IDs:

id  name
0   Smudge
1   Sissel
2   Hamlet

id  colour
0   black
1   grey
2   black

Since the id is implied by the array index, the tables can also be shown without it:

name column

name
Smudge
Sissel
Hamlet

colour column

colour
black
grey
black
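One way to see the implicit key at work, sketched in Python: `enumerate` makes the positional id explicit, and `zip` acts as the join on that position. The variable names are illustrative, and a real column store would do this over encoded or compressed vectors rather than Python lists.

```python
columns = {
    "name": ["Smudge", "Sissel", "Hamlet"],
    "colour": ["black", "grey", "black"],
}

# Make the implicit id explicit: each column becomes an (id, value) table.
name_table = list(enumerate(columns["name"]))
colour_table = list(enumerate(columns["colour"]))

# Join on position: zip pairs up the values that share an index,
# which rebuilds the original rows.
rows = [
    {"name": name, "colour": colour}
    for name, colour in zip(columns["name"], columns["colour"])
]
```

Because the key is just the position, this join needs no lookup structure, only that every column keeps its values in the same order.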

Why this perspective matters

Viewing columnar storage as an extreme form of normalization brings several query‑processing concepts, such as projections, joins, and data‑format manipulation, under one model. Queries are logically blind to the physical format, but recognizing that “reconstructing a row from columnar storage” is essentially a join gives a useful mental model for reasoning about performance: reading a single column is a cheap projection, while reading whole rows pays for a join across every column.
