I Got Obsessed With 2-Byte Deltas and Built a Delta Compressor That Actually Hits Them
TL;DR
Needed tiny deltas for an encrypted document format. xdelta3 uses 40+ bytes for simple edits. Built xpatch with 7 specialized algorithms and a tag system that references old versions. Result: median delta of 2 bytes on real code repositories (tested on 1.2 million real‑world changes). It’s fast and actually works.
The median delta is 2 bytes. Not a typo: that’s a single uint16 header and literally nothing else, because the files being compared are identical and the delta just says “hey, go look at version 5, it’s the same.”
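As a rough illustration (this is my sketch, not the actual xpatch header layout), a “this version is identical to an earlier one” delta can be nothing more than a tag byte plus a back-reference, which is how you land on two bytes:

// Illustrative only: a hypothetical "same as an earlier version" delta.
// The real xpatch header encoding may differ.
fn encode_same_as(versions_back: u8) -> [u8; 2] {
    const TAG_SAME_AS: u8 = 0x00; // hypothetical tag value
    [TAG_SAME_AS, versions_back]
}

// encode_same_as(5) == [0x00, 0x05]: "reuse version 5 unchanged"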
The Problem That Broke Me
I’m building a document format with a wish‑list of requirements: end‑to‑end encryption, realtime collaboration, full version history, offline‑first sync, and all without a dedicated server farm.
The catch: you can’t compress after encryption. Once a delta is encrypted, there are no more compression opportunities, so every delta must be tiny or storage costs will kill the product. I tried the obvious candidates first:
- Myers diff – classic, well‑tested, but far too verbose for micro‑edits captured every 500 ms. Storing “user added 5 characters” took dozens of bytes.
- xdelta3 – better for larger changes, but still burned 40+ bytes just to encode “hello” → “hello world” (8× overhead on a 6‑byte change).
- gdelta – promising research paper (2021) with fast compression on small files, but the only implementation was three‑year‑old research code with Python bindings that barely compiled.
I rewrote gdelta in Rust (see the repo). It’s fast and efficient, but simple edits still produced ~20 bytes. I kept asking myself: Why do I need 20 bytes to store the fact that someone typed “hello”?
The Realization That Changed Everything
Most document edits follow predictable patterns:
- Adding text – “hello” → “hello world”
- Deleting text – “hello world” → “hello”
- Reverting changes – “hello” → “hello world” → “hello”
- Repeating patterns – “test test test”, indentation, decorators
- Complex structural changes – everything else
General‑purpose algorithms like xdelta3 or gdelta try to handle all cases with a single approach, optimizing for the worst case (arbitrary binary changes).
What if we didn’t need one algorithm for everything? What if we tried several specialized algorithms for each pattern and picked the smallest result? That idea led to xpatch.
Building It: The Simple Stuff First
Algorithm 1: Just Store The Damn Characters
The most common case is adding text, so we simply store the characters.
// Algorithm 1: store where the text went in and the raw bytes themselves.
fn encode_add(position: usize, data: &[u8]) -> Vec<u8> {
    let mut encoded = encode_varint(position);
    encoded.extend_from_slice(data);
    encoded
}
Position (variable‑length integer) + the raw characters. Adding “world” at position 6 costs about 6 bytes, already beating gdelta’s 20+ bytes.
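These snippets lean on an encode_varint helper that isn’t shown here. A minimal LEB128-style sketch (my assumption; xpatch’s exact varint flavor may differ) looks like this:

// Unsigned varint: 7 data bits per byte, high bit set on every byte
// except the last, so small numbers cost a single byte.
fn encode_varint(mut value: usize) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
    out
}

// encode_varint(6)   == [0x06]        -> positions in small files cost 1 byte
// encode_varint(300) == [0xac, 0x02]  -> larger offsets grow gracefully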
Algorithm 2: Deletion Is Even Simpler
// Algorithm 2: store where the deletion starts and how many bytes go away.
fn encode_remove(start: usize, end: usize) -> Vec<u8> {
    let mut encoded = encode_varint(start);
    encoded.extend(encode_varint(end - start));
    encoded
}
Two numbers: where the deletion starts and how many bytes to remove. Typical overhead is ~3 bytes.
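As a quick sanity check of that claim (using the varint sketch above), deleting “ world” from “hello world” costs two one-byte varints:

fn main() {
    // "hello world" -> "hello": drop 6 bytes starting at offset 5
    let delta = encode_remove(5, 11);
    assert_eq!(delta, vec![5, 6]); // 2 bytes total
}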
Algorithm 3: Wait, What If We Tokenize This?
Text isn’t random; it’s made of words. By tokenizing common words we can replace them with small IDs.
- Token list: 2¹⁴ = 16 384 tokens; IDs are stored as varints, so the most frequent tokens cost a single byte.
- Tokens are sorted by frequency so common words get the smallest IDs.
- Trained on a large corpus (e.g., Wikipedia) to build a vocabulary (a rough sketch of that step follows this list).
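The training pipeline isn’t shown in the article, and the actual tokenizer works byte-by-byte through the trie below rather than splitting on whitespace, so treat this build_vocab sketch as purely illustrative:

use std::collections::HashMap;

// Count word frequencies over a corpus, keep the top 2^14,
// and hand out IDs in descending frequency order.
fn build_vocab(corpus: &str) -> HashMap<String, u16> {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    for word in corpus.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    let mut by_freq: Vec<(&str, u64)> = counts.into_iter().collect();
    by_freq.sort_by(|a, b| b.1.cmp(&a.1));
    by_freq
        .into_iter()
        .take(1 << 14) // 16 384-token budget
        .enumerate()
        .map(|(id, (word, _))| (word.to_string(), id as u16))
        .collect()
}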
The tokenizer uses a trie with array‑indexed children for O(1) per‑byte lookup:
struct TrieNode {
    // Set when the path from the root to this node spells a complete token.
    token_id: Option<u16>,
    // One slot per possible byte value, so each step is a single array index.
    children: [Option<Box<TrieNode>>; 256],
}

impl TrieNode {
    // Greedy longest match: walk the trie from `start`, remembering the last
    // position where a complete token ended. Returns (end_offset, token_id).
    fn find_longest_match(&self, text: &[u8], start: usize) -> Option<(usize, u16)> {
        let mut node = self;
        let mut last_match = None;
        let mut pos = start;
        while pos < text.len() {
            let byte = text[pos];
            match &node.children[byte as usize] {
                Some(child) => {
                    node = child.as_ref();
                    if let Some(id) = node.token_id {
                        last_match = Some((pos + 1, id));
                    }
                    pos += 1;
                }
                None => break,
            }
        }
        last_match
    }
}
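The encode_tokens call used by the dispatcher below isn’t spelled out in the article. Here is a plausible sketch, assuming greedy longest-match over the trie, a varint ID stream, and a hypothetical escape marker for bytes the vocabulary doesn’t cover (the article’s version presumably closes over a preloaded vocabulary, hence its simpler signature):

// Sketch only: greedily tokenize `data` and emit varint token IDs.
const LITERAL_ESCAPE: usize = 1 << 14; // hypothetical out-of-vocab marker

fn encode_tokens(vocab: &TrieNode, position: usize, data: &[u8]) -> Result<Vec<u8>, ()> {
    let mut out = encode_varint(position);
    let mut pos = 0;
    while pos < data.len() {
        match vocab.find_longest_match(data, pos) {
            Some((end, id)) => {
                out.extend(encode_varint(id as usize));
                pos = end;
            }
            None => {
                // No token starts here: escape marker + the raw byte.
                out.extend(encode_varint(LITERAL_ESCAPE));
                out.push(data[pos]);
                pos += 1;
            }
        }
    }
    Ok(out)
}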
Choosing the Best Algorithm
enum Algorithm {
    Chars,
    Tokens,
    GDelta,
}

fn encode_delta(base: &[u8], new: &[u8]) -> (Algorithm, Vec<u8>) {
    // Start with simple character encoding
    let mut best_algo = Algorithm::Chars;
    let mut best_data = encode_add(0, new); // placeholder for actual diff
    // Try tokenization – might be smaller
    if let Ok(token_data) = encode_tokens(0, new) {
        if token_data.len() < best_data.len() {
            best_algo = Algorithm::Tokens;
            best_data = token_data;
        }
    }
    // Fallback to gdelta for anything not covered above
    let gdelta_data = gdelta_encode(base, new);
    if gdelta_data.len() < best_data.len() {
        best_algo = Algorithm::GDelta;
        best_data = gdelta_data;
    }
    (best_algo, best_data)
}
The encoder returns the smallest delta among the candidates it tried: the specialized encodings when they win, and the general-purpose gdelta path as a fallback for everything else.
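To turn that (Algorithm, Vec&lt;u8&gt;) pair into a single stored delta, something has to record which decoder to run. A minimal sketch, assuming a one-byte tag prefix (the real xpatch header does more, e.g. it can point at an earlier version outright, which is where the 2-byte case from the TL;DR comes from):

// Illustrative container format: one tag byte, then the payload produced
// by that algorithm. Tag values here are arbitrary.
fn serialize_delta(algo: Algorithm, data: Vec<u8>) -> Vec<u8> {
    let tag: u8 = match algo {
        Algorithm::Chars => 1,
        Algorithm::Tokens => 2,
        Algorithm::GDelta => 3,
    };
    let mut out = Vec::with_capacity(1 + data.len());
    out.push(tag);
    out.extend(data);
    out
}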