Searchable JSON compression: page-level random access + ms lookups (and smaller than Zstd on our dataset)

Published: February 19, 2026 at 02:12 PM EST
Source: Dev.to

Why this matters: the hidden “decompress+parse tax”

If you store NDJSON compressed with Zstd, most queries still pay for every one of these steps:

  • read large chunks
  • decompress everything
  • parse JSON
  • scan for the field/value you need

Even when the data is modest in size, this CPU + I/O cost is paid again on every query, so it becomes brutal at scale.
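
The four steps above can be sketched in a few lines. This is a minimal illustration rather than a benchmark: it uses the standard-library `zlib` as a stand-in for Zstd so that it runs without extra packages, and the record layout and the helper names `make_ndjson` and `scan_compressed` are hypothetical.

```python
import json
import zlib  # stdlib stand-in for Zstd (assumption, chosen so the sketch has no dependencies)


def make_ndjson(n):
    """Build a small NDJSON payload of hypothetical log records."""
    lines = [json.dumps({"id": i, "user": f"u{i % 50}", "status": "ok"}) for i in range(n)]
    return ("\n".join(lines) + "\n").encode("utf-8")


def scan_compressed(blob, field, value):
    """Answer an eq-style query the expensive way: decompress, parse, scan."""
    raw = zlib.decompress(blob)            # decompress everything
    hits, parsed = [], 0
    for line in raw.splitlines():
        record = json.loads(line)          # parse every record
        parsed += 1
        if record.get(field) == value:     # scan for the field/value
            hits.append(record)
    return hits, parsed


blob = zlib.compress(make_ndjson(10_000))
hits, parsed = scan_compressed(blob, "id", 4242)
print(f"matches={len(hits)} records_parsed={parsed}")
```

Finding a single record here means decompressing and parsing all 10,000 of them, which is the tax the rest of this post is about.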

SEE targets workloads where you repeatedly need:

  • exists / pos / eq‑style queries
  • random access
  • low latency without full decompression

What SEE is (in 60 seconds)

SEE is a page‑based, schema‑aware format:

  • page‑level layout for random access
  • Bloom filters + page skipping to avoid touching irrelevant pages (high skip rate)
  • schema‑aware encoding (structure + deltas + dictionary where useful)

It is designed to reduce both:

  • data tax (storage/egress)
  • CPU tax (decompress/parse)

The trade‑off is that SEE optimizes for low I/O and low latency, not always the absolute smallest size (though it can win on size too, depending on the dataset).
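
To make the page + Bloom idea concrete, here is a toy sketch of the general technique. It is not SEE's on-disk format or API: the page size, filter size, hash scheme, and the names `build_pages` and `lookup` are all assumptions made for illustration, and `zlib` again stands in for the real page encoding.

```python
import hashlib
import json
import zlib

PAGE_SIZE = 100     # records per page (hypothetical)
BLOOM_BITS = 4096   # bits per page filter (hypothetical)
NUM_HASHES = 3      # bit positions set per key (hypothetical)


def bloom_positions(key):
    """Derive NUM_HASHES bit positions for a key from one SHA-256 digest."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % BLOOM_BITS
            for i in range(NUM_HASHES)]


def build_pages(records):
    """Split records into independently compressed pages, each with its own Bloom filter."""
    pages = []
    for start in range(0, len(records), PAGE_SIZE):
        chunk = records[start:start + PAGE_SIZE]
        bloom = 0  # a Python int used as a bitset
        for rec in chunk:
            for pos in bloom_positions(f"id={rec['id']}"):
                bloom |= 1 << pos
        payload = zlib.compress(json.dumps(chunk).encode("utf-8"))
        pages.append((bloom, payload))
    return pages


def lookup(pages, record_id):
    """Return (matches, pages_decoded); pages ruled out by their filter are never decoded."""
    positions = bloom_positions(f"id={record_id}")
    matches, decoded = [], 0
    for bloom, payload in pages:
        if not all((bloom >> pos) & 1 for pos in positions):
            continue  # key is definitely absent from this page: no decompress, no parse
        decoded += 1  # possible hit (or a Bloom false positive)
        for rec in json.loads(zlib.decompress(payload)):
            if rec["id"] == record_id:
                matches.append(rec)
    return matches, decoded


records = [{"id": i, "user": f"u{i % 50}"} for i in range(10_000)]
pages = build_pages(records)
matches, decoded = lookup(pages, 4242)
print(f"pages={len(pages)} decoded={decoded} skip_ratio={1 - decoded / len(pages):.3f}")
```

A Bloom filter can return false positives but never false negatives, so a skipped page is always safe to skip; the cost of a false positive is one unnecessary page decode.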

KPI snapshot (public demo)

These are the numbers published from the demo pack:

  • Combined size ratio: ≈ 19.5 % of raw
  • Lookup latency (present): p50 ≈ 0.18 ms / p95 ≈ 0.28 ms / p99 ≈ 0.34 ms
  • Skip ratio: present ≈ 0.99 / absent ≈ 0.992
  • Bloom density: ≈ 0.30

“Combined” is the total footprint for the SEE artifact on the benchmarked dataset.
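
For readers who want to sanity-check figures like these on their own data, the sketch below shows one plausible way to compute them. The definitions are assumptions, not the demo pack's code: size ratio as artifact bytes over raw bytes, skip ratio as pages skipped over total pages, and nearest-rank percentiles. The inputs are synthetic and chosen only to echo the published ratios so the formulas are easy to follow.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction(part, whole):
    """Plain ratio, used here for both size ratio and skip ratio."""
    return part / whole


# Synthetic inputs for illustration only; these are not the demo's measurements.
latencies_ms = [0.10 + 0.002 * i for i in range(200)]
raw_bytes, artifact_bytes = 1_000_000, 195_000
pages_total, pages_skipped = 1_000, 990

size_ratio = fraction(artifact_bytes, raw_bytes)
skip_ratio = fraction(pages_skipped, pages_total)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"size ratio: {size_ratio:.1%} of raw")
print(f"skip ratio: {skip_ratio:.3f}")
print(f"latency ms: p50={p50:.3f} p95={p95:.3f} p99={p99:.3f}")
```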

KPI chart

Proof‑first distribution (so you can verify without meetings)

I intentionally ship reproducible packs:

  1. Demo ZIP (≈10 min)

    • prebuilt wheel + sample .see artifacts
    • demo scripts that print KPIs (ratio/skip/bloom/p50–p99)
    • one‑pager PDF

  2. DD Pack (audit / repro artifacts)

    • run summaries + run_metrics.json
    • verification checklist (pack_verify.txt)
    • designed for technical diligence

Recent robustness milestone: strict decode‑mismatch checks across multiple datasets found zero mismatches (decode_mismatch_count=0, decode_extended_mismatch_count=0, audit PASS).
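
A decode-mismatch count is essentially a round-trip test: encode, decode, and count the records that come back different. The sketch below shows that general idea with a stand-in codec (JSON + `zlib`); how the DD Pack actually computes decode_mismatch_count is not shown here, and the function names are hypothetical.

```python
import json
import zlib


def encode(records):
    """Stand-in encoder (assumption): JSON serialized, then zlib-compressed."""
    return zlib.compress(json.dumps(records).encode("utf-8"))


def decode(blob):
    """Stand-in decoder matching encode()."""
    return json.loads(zlib.decompress(blob))


def count_decode_mismatches(records, encode=encode, decode=decode):
    """Round-trip the records and count how many differ, including missing or extra ones."""
    decoded = decode(encode(records))
    mismatches = sum(1 for original, restored in zip(records, decoded) if original != restored)
    mismatches += abs(len(records) - len(decoded))
    return mismatches


sample = [{"id": i, "score": i / 7, "tags": ["a", None], "nested": {"ok": True}}
          for i in range(1_000)]
print(f"decode_mismatch_count={count_decode_mismatches(sample)}")
```

A strict check of this kind compares full record contents rather than checksums of the compressed bytes, so it catches a decoder that silently drops or alters a field.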

Quick start (demo)

pip install see_proto
python samples/quick_demo.py

The script prints:

  • compression ratio
  • skip/bloom statistics
  • lookup latency (p50/p95/p99)

Links:

  • GitHub repo:
  • Release (v0.1.1):

For a formal evaluation under NDA (DD pack / deeper materials):

Note: company email is preferred, but DMs are welcome too (no confidential data needed at first contact).

What I’m looking for

SEE is not a SaaS product. I’m exploring strategic acquisition or an exclusive license with teams that have a clear integration path.

To keep evaluations high‑signal, I run only a small number of NDA evaluations per month. If you’re on a data platform / infra / storage team and can see where this fits, I’d love to hear from you.
