Columnar Storage Is Normalization
Source: Hacker News
Row‑oriented vs. column‑oriented representation
Consider this data:
data = [
{ "name": "Smudge", "colour": "black" },
{ "name": "Sissel", "colour": "grey" },
{ "name": "Hamlet", "colour": "black" }
]
This is a typical row‑oriented table. Adding a new row is straightforward:
{ "name": "Petee", "colour": "black" }
Only a few pages need to be touched on disk, regardless of how many columns the row has. Looking up a row is also fast because all of its columns are stored together.
However, computing a histogram of colour requires scanning the entire row data, even the name fields we don’t need.
A column‑oriented representation flips these trade‑offs:
data = {
"name": [
"Smudge",
"Sissel",
"Hamlet"
],
"colour": [
"black",
"grey",
"black"
],
}
Now we can read only the colour column to build the histogram, but inserting a new row or retrieving a specific row requires touching multiple column vectors.
Columnar storage as extreme normalization
Think of columnar data as a set of very narrow tables, each containing a primary key (or implicit position) plus a single attribute.
Denormalized table
| id | name | age |
|---|---|---|
| 12 | Bob | 30 |
| 93 | Tom | 35 |
| 27 | Kim | 28 |
Normalized tables
Name
| id | name |
|---|---|
| 12 | Bob |
| 93 | Tom |
| 27 | Kim |
Age
| id | age |
|---|---|
| 12 | 30 |
| 93 | 35 |
| 27 | 28 |
Reconstructing the original table is simply a join on id.
In a column‑stored table, the primary key corresponds to the ordinal position of each value.
Example with implicit IDs
Original columnar data:
data = {
"name": [
"Smudge",
"Sissel",
"Hamlet"
],
"colour": [
"black",
"grey",
"black"
],
}
Explicitly with IDs:
| id | name |
|---|---|
| 0 | Smudge |
| 1 | Sissel |
| 2 | Hamlet |
| id | colour |
|---|---|
| 0 | black |
| 1 | grey |
| 2 | black |
Since the id is implied by the array index, the tables can also be shown without it:
name column
| name |
|---|
| Smudge |
| Sissel |
| Hamlet |
colour column
| colour |
|---|
| black |
| grey |
| black |
Why this perspective matters
Viewing columnar storage as an extreme form of normalization unifies many query‑processing concepts—projections, joins, and data‑format manipulation. While queries are logically blind to the physical format, recognizing that “reconstructing a row from columnar storage” is essentially a join can provide a useful mental model for understanding performance characteristics.