I Got Obsessed With 2-Byte Deltas and Built a Delta Compressor That Actually Hits Them
TL;DR
Needed tiny deltas for an encrypted document format. xdelta3 uses 40+ bytes for simple edits. Built xpatch with 7 specialized algorithms and a tag system that references old versions. Result: median delta of 2 bytes on real code repositories (tested on 1.2 million real‑world changes). It’s fast and actually works.
The median delta is 2 bytes. Not a typo: that’s a single uint16 header and literally nothing else, because the files being compared are identical and the delta just says “hey, go look at version 5, it’s the same.”
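As a rough illustration (this is my sketch, not the actual xpatch header layout), a “this version is identical to an earlier one” delta can be nothing more than a tag byte plus a back-reference, which is how you land on two bytes:

// Illustrative only: a hypothetical "same as an earlier version" delta.
// The real xpatch header encoding may differ.
fn encode_same_as(versions_back: u8) -> [u8; 2] {
    const TAG_SAME_AS: u8 = 0x00; // hypothetical tag value
    [TAG_SAME_AS, versions_back]
}

// encode_same_as(5) == [0x00, 0x05]: "reuse version 5 unchanged"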
The Problem That Broke Me
I’m building a document format with a wish‑list of requirements: end‑to‑end encryption, realtime collaboration, full version history, offline‑first sync, and all without a dedicated server farm.
The catch: you can’t compress after encryption. Once a delta is encrypted, there are no more compression opportunities, so every delta must be tiny or storage costs will kill the product. I tried the obvious candidates first:
- Myers diff – classic, well‑tested, but far too verbose for micro‑edits captured every 500 ms. Storing “user added 5 characters” took dozens of bytes.
- xdelta3 – better for larger changes, but still burned 40+ bytes just to encode “hello” → “hello world” (8× overhead on a 6‑byte change).
- gdelta – promising research paper (2021) with fast compression on small files, but the only implementation was three‑year‑old research code with Python bindings that barely compiled.
I rewrote gdelta in Rust (see the repo). It’s fast and efficient, but simple edits still produced ~20 bytes. I kept asking myself: Why do I need 20 bytes to store the fact that someone typed “hello”?
The Realization That Changed Everything
Most document edits follow predictable patterns:
- Adding text – “hello” → “hello world”
- Deleting text – “hello world” → “hello”
- Reverting changes – “hello” → “hello world” → “hello”
- Repeating patterns – “test test test”, indentation, decorators
- Complex structural changes – everything else
General‑purpose algorithms like xdelta3 or gdelta try to handle all cases with a single approach, optimizing for the worst case (arbitrary binary changes).
What if we didn’t need one algorithm for everything? What if we tried several specialized algorithms for each pattern and picked the smallest result? That idea led to xpatch.
Building It: The Simple Stuff First
Algorithm 1: Just Store The Damn Characters
The most common case is adding text, so we simply store the characters.
// Algorithm 1: store where the text went in and the raw bytes themselves.
fn encode_add(position: usize, data: &[u8]) -> Vec<u8> {
    let mut encoded = encode_varint(position);
    encoded.extend_from_slice(data);
    encoded
}
Position (variable‑length integer) + the raw characters. Adding “world” at position 6 costs about 6 bytes, already beating gdelta’s 20+ bytes.
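These snippets lean on an encode_varint helper that isn’t shown here. A minimal LEB128-style sketch (my assumption; xpatch’s exact varint flavor may differ) looks like this:

// Unsigned varint: 7 data bits per byte, high bit set on every byte
// except the last, so small numbers cost a single byte.
fn encode_varint(mut value: usize) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
    out
}

// encode_varint(6)   == [0x06]        -> positions in small files cost 1 byte
// encode_varint(300) == [0xac, 0x02]  -> larger offsets grow gracefully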
Algorithm 2: Deletion Is Even Simpler
// Algorithm 2: store where the deletion starts and how many bytes go away.
fn encode_remove(start: usize, end: usize) -> Vec<u8> {
    let mut encoded = encode_varint(start);
    encoded.extend(encode_varint(end - start));
    encoded
}
Two numbers: where the deletion starts and how many bytes to remove. Typical overhead is ~3 bytes.
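As a quick sanity check of that claim (using the varint sketch above), deleting “ world” from “hello world” costs two one-byte varints:

fn main() {
    // "hello world" -> "hello": drop 6 bytes starting at offset 5
    let delta = encode_remove(5, 11);
    assert_eq!(delta, vec![5, 6]); // 2 bytes total
}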
Algorithm 3: Wait, What If We Tokenize This?
Text isn’t random; it’s made of words. By tokenizing common words we can replace them with small IDs.
- Token list: 2¹⁴ = 16 384 tokens; IDs are stored as varints, so the most frequent tokens cost a single byte.
- Tokens are sorted by frequency so common words get the smallest IDs.
- Trained on a large corpus (e.g., Wikipedia) to build a vocabulary (a rough sketch of that step follows this list).
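The training pipeline isn’t shown in the article, and the actual tokenizer works byte-by-byte through the trie below rather than splitting on whitespace, so treat this build_vocab sketch as purely illustrative:

use std::collections::HashMap;

// Count word frequencies over a corpus, keep the top 2^14,
// and hand out IDs in descending frequency order.
fn build_vocab(corpus: &str) -> HashMap<String, u16> {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    for word in corpus.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    let mut by_freq: Vec<(&str, u64)> = counts.into_iter().collect();
    by_freq.sort_by(|a, b| b.1.cmp(&a.1));
    by_freq
        .into_iter()
        .take(1 << 14) // 16 384-token budget
        .enumerate()
        .map(|(id, (word, _))| (word.to_string(), id as u16))
        .collect()
}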
The tokenizer uses a trie with array‑indexed children for O(1) per‑byte lookup:
struct TrieNode {
    // Set when the path from the root to this node spells a complete token.
    token_id: Option<u16>,
    // One slot per possible byte value, so each step is a single array index.
    children: [Option<Box<TrieNode>>; 256],
}

impl TrieNode {
    // Greedy longest match: walk the trie from `start`, remembering the last
    // position where a complete token ended. Returns (end_offset, token_id).
    fn find_longest_match(&self, text: &[u8], start: usize) -> Option<(usize, u16)> {
        let mut node = self;
        let mut last_match = None;
        let mut pos = start;
        while pos < text.len() {
            let byte = text[pos];
            match &node.children[byte as usize] {
                Some(child) => {
                    node = child.as_ref();
                    if let Some(id) = node.token_id {
                        last_match = Some((pos + 1, id));
                    }
                    pos += 1;
                }
                None => break,
            }
        }
        last_match
    }
}
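The encode_tokens call used by the dispatcher below isn’t spelled out in the article. Here is a plausible sketch, assuming greedy longest-match over the trie, a varint ID stream, and a hypothetical escape marker for bytes the vocabulary doesn’t cover (the article’s version presumably closes over a preloaded vocabulary, hence its simpler signature):

// Sketch only: greedily tokenize `data` and emit varint token IDs.
const LITERAL_ESCAPE: usize = 1 << 14; // hypothetical out-of-vocab marker

fn encode_tokens(vocab: &TrieNode, position: usize, data: &[u8]) -> Result<Vec<u8>, ()> {
    let mut out = encode_varint(position);
    let mut pos = 0;
    while pos < data.len() {
        match vocab.find_longest_match(data, pos) {
            Some((end, id)) => {
                out.extend(encode_varint(id as usize));
                pos = end;
            }
            None => {
                // No token starts here: escape marker + the raw byte.
                out.extend(encode_varint(LITERAL_ESCAPE));
                out.push(data[pos]);
                pos += 1;
            }
        }
    }
    Ok(out)
}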
Choosing the Best Algorithm
enum Algorithm {
    Chars,
    Tokens,
    GDelta,
}

fn encode_delta(base: &[u8], new: &[u8]) -> (Algorithm, Vec<u8>) {
    // Start with simple character encoding
    let mut best_algo = Algorithm::Chars;
    let mut best_data = encode_add(0, new); // placeholder for actual diff
    // Try tokenization – might be smaller
    if let Ok(token_data) = encode_tokens(0, new) {
        if token_data.len() < best_data.len() {
            best_algo = Algorithm::Tokens;
            best_data = token_data;
        }
    }
    // Fallback to gdelta for anything not covered above
    let gdelta_data = gdelta_encode(base, new);
    if gdelta_data.len() < best_data.len() {
        best_algo = Algorithm::GDelta;
        best_data = gdelta_data;
    }
    (best_algo, best_data)
}
The encoder returns the smallest delta among the candidates it tried: the specialized encodings when they win, and the general-purpose gdelta path as a fallback for everything else.
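To turn that (Algorithm, Vec&lt;u8&gt;) pair into a single stored delta, something has to record which decoder to run. A minimal sketch, assuming a one-byte tag prefix (the real xpatch header does more, e.g. it can point at an earlier version outright, which is where the 2-byte case from the TL;DR comes from):

// Illustrative container format: one tag byte, then the payload produced
// by that algorithm. Tag values here are arbitrary.
fn serialize_delta(algo: Algorithm, data: Vec<u8>) -> Vec<u8> {
    let tag: u8 = match algo {
        Algorithm::Chars => 1,
        Algorithm::Tokens => 2,
        Algorithm::GDelta => 3,
    };
    let mut out = Vec::with_capacity(1 + data.len());
    out.push(tag);
    out.extend(data);
    out
}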