Building a B-tree in Go: The gap between theory and implementation
Source: Dev.to
Why a B‑Tree?
- Most database indexes are B‑trees (or B+‑trees).
- Unlike binary trees (at most 2 children per node), a B‑tree node can have many children; the maximum fan‑out is determined by the tree's order.
- This multi‑way branching makes B‑trees ideal for disk storage where each block read is expensive.
I won’t re‑explain B‑trees here – chapter 2 of Database Internals does that brilliantly. This article assumes you already know the basics and focuses on what I learned while building one.
Implementation Highlights
Core Operations
- Binary search inside nodes → recurse to the appropriate child (see the lookup sketch after this list).
- Node split – keep the tree balanced: find the median, split into two halves, promote the median key.
- Root split – special case: create a brand‑new root containing the promoted key, making the tree taller rather than wider.
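To make the lookup path concrete, here is a minimal sketch of the node layout and the binary-search-then-descend step. The `node` type, field names, package name, and `Contains` method are illustrative choices for this post, not a definitive implementation:

```go
package btree

// node is an in-memory B-tree node (illustrative layout).
type node struct {
	keys     []int
	children []*node // len(children) == len(keys)+1 for internal nodes
	parent   *node
}

func (n *node) isLeaf() bool { return len(n.children) == 0 }

// search runs a binary search over n.keys. It returns either the position of
// key (found == true) or the index of the child to descend into.
func (n *node) search(key int) (idx int, found bool) {
	lo, hi := 0, len(n.keys)
	for lo < hi {
		mid := (lo + hi) / 2
		switch {
		case n.keys[mid] == key:
			return mid, true
		case n.keys[mid] < key:
			lo = mid + 1
		default:
			hi = mid
		}
	}
	return lo, false
}

// Contains walks down the tree: binary search in the node, then recurse into
// the chosen child. The search stops as soon as the key is found.
func (n *node) Contains(key int) bool {
	idx, found := n.search(key)
	if found {
		return true
	}
	if n.isLeaf() {
		return false
	}
	return n.children[idx].Contains(key)
}
```

The invariant that an internal node with n keys has n+1 children shows up again in the gotchas and the bug list below.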
Gotchas
| Issue | What I learned |
|---|---|
| Split cascade | Inserting a key can trigger a cascade of splits up the tree. You must update every parent‑child pointer correctly. |
| Internal node split | Not just moving one or two keys – you must rewire the entire sub‑graph of children and fix their parent pointers (a split sketch follows this table). |
| B‑tree vs B+‑tree | B‑trees store data in every node; B+‑trees store data only in leaves (with linked leaves for fast range scans). My implementation stopped the search as soon as the key was found, which is elegant but makes range queries expensive. |
| Skewed inserts | Inserting sequential keys (1, 2, 3, …) produced a 7‑level deep tree with only 50 keys – a pathological case that highlighted the need for bulk‑load strategies and REINDEX. |
| Random inserts | Produced a nicely balanced 3‑level tree, showing the natural self‑balancing property of B‑trees. |
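Here is a rough sketch of an internal-node split using the illustrative `node` type above. The helper name `split` and the exact slicing are assumptions; the point is the child rewiring and parent-pointer fix-up in the non-leaf branch:

```go
// split is called when a node exceeds the maximum key count for the chosen
// order. It divides the node around its median key and returns the median
// plus the new right sibling; the caller inserts the median into the parent
// (possibly triggering the next split in the cascade).
func split(n *node) (median int, right *node) {
	mid := len(n.keys) / 2
	median = n.keys[mid]

	right = &node{
		keys:   append([]int(nil), n.keys[mid+1:]...),
		parent: n.parent,
	}
	n.keys = n.keys[:mid]

	if !n.isLeaf() {
		// Internal split: the right half of the children moves too,
		// and every moved child must point at its new parent.
		right.children = append([]*node(nil), n.children[mid+1:]...)
		n.children = n.children[:mid+1]
		for _, c := range right.children {
			c.parent = right
		}
	}
	return median, right
}
```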
Test‑Driven Development (TDD)
I wrote 20+ unit tests (100 % coverage) before any implementation code. The tests forced me to think through edge cases and defined “done” up front.
Bugs Caught by Tests
- Duplicate‑key insertion (forgot to check for existing keys).
- Off‑by‑one errors on index bounds.
- Children‑count mismatch after splits (n keys → n+1 children).
- Incorrect parent pointers after a split cascade (the invariant test sketched below targets these last two).
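For flavor, this is the kind of invariant-checking test that catches the children-count and parent-pointer bugs. The `New` constructor, `Insert` method, and `root` field are assumed to match the sketches above; treat it as a template rather than my exact test file:

```go
package btree

import "testing"

// validate recursively checks the two structural invariants that bit me:
// an internal node with n keys has n+1 children, and every child points
// back at its actual parent.
func validate(t *testing.T, n *node) {
	t.Helper()
	if !n.isLeaf() && len(n.children) != len(n.keys)+1 {
		t.Fatalf("node %v: %d keys but %d children", n.keys, len(n.keys), len(n.children))
	}
	for _, c := range n.children {
		if c.parent != n {
			t.Fatalf("child %v has wrong parent pointer", c.keys)
		}
		validate(t, c)
	}
}

func TestSplitKeepsInvariants(t *testing.T) {
	tr := New() // assumed constructor
	for i := 1; i <= 50; i++ {
		tr.Insert(i) // sequential inserts force repeated splits
	}
	validate(t, tr.root)
}
```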
When all tests finally turned green, the confidence was immediate – no crossed fingers, just certainty.
Stats
| Metric | Value |
|---|---|
| Time spent | ~15 hours across 4 days |
| Code size | ~200 lines of Go |
| Tests | 20+ unit tests, 100 % coverage |
| Bugs found | Too many to count (all caught by TDD) |
| Coffee consumed | Significant ☕️ |
Reflections
- Hands‑on learning beats passive reading. Seeing the tree’s pointers change in memory made the abstract concept concrete.
- TDD isn’t just a best practice – it’s a thinking tool. Writing the test first forced me to articulate the exact behavior I wanted.
- B+‑trees vs B‑trees – now I understand why PostgreSQL uses B+‑trees for indexes (better range scans) while still appreciating the elegance of a pure B‑tree.
What’s Next?
I’m planning to dive into LSM‑trees (Log‑Structured Merge trees), which flip the trade‑off:
- B‑trees → optimize for reads (in‑place updates).
- LSM‑trees → optimize for writes (append‑only).
Implementing compaction strategies will be the next level of fun. 🤩
TL;DR
If you’re curious about database internals:
- Read Database Internals by Alex Petrov.
- Pick a data structure (B‑tree, LSM‑tree, etc.) and build it.
- Use TDD to lock down edge cases early.
The gap between reading and implementing is where the real magic happens.
You can see the full implementation on my self‑hosted GitLab (yes, I run my own instance: infra nerd life). Pull requests welcome if you spot ways to improve it! 😊