GXD: Rethinking File Compression for Modern Computing

Published: December 19, 2025 at 04:53 AM EST
4 min read
Source: Dev.to

GXD – A Modern, Parallel‑First Compression Utility

Community‑driven, open‑source (GPL‑3.0), currently in alpha (v0.0.0a2)

The Problem

  • Single‑core legacy design – Traditional tools compress/decompress data sequentially.
  • Bottlenecks – Extracting a 1 MiB slice from a 10 GiB archive forces a full decompression.
  • Under‑utilised hardware – Even on a 16‑core workstation most compressors use only one core.

GXD’s Core Idea

Instead of treating a file as one monolithic stream, GXD splits it into independent blocks.
Each block can be compressed and decompressed in parallel, turning compression into a truly scalable operation.
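The idea can be sketched in a few lines of Python. This is an illustrative sketch, not GXD's actual code: it uses the standard-library zlib as a stand-in codec (GXD defaults to Zstandard), but the same ProcessPoolExecutor pattern the implementation relies on.

```python
import zlib
from concurrent.futures import ProcessPoolExecutor

BLOCK_SIZE = 1 << 20  # 1 MiB blocks (illustrative choice)

def compress_block(block: bytes) -> bytes:
    # Blocks are independent, so each one can be compressed on a
    # separate core and later decompressed in any order.
    return zlib.compress(block)

def compress_parallel(data: bytes, workers: int = 4) -> list[bytes]:
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_block, blocks))
```

Because each block is self-contained, the reverse direction parallelises just as well, and any single block can be decompressed without its neighbours.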

Architecture & Advantages

| Feature | What it means | Benefit |
| --- | --- | --- |
| Block-based design | Files are divided into configurable blocks before compression. | Enables parallel processing and random-access extraction. |
| True parallelism | Work is automatically distributed across all available CPU cores (via ProcessPoolExecutor). | Near-linear speed-up vs. single-threaded tools. |
| Random-access extraction | Only the blocks that contain the requested byte range are decompressed. | Partial extraction is orders of magnitude faster. |
| Per-block integrity | SHA-256 checksum stored for every block. | Verify data integrity without full extraction; optional for speed-critical runs. |

Supported Compression Algorithms

| Algorithm | Typical use-case | Trade-off |
| --- | --- | --- |
| Zstandard | Balanced speed & compression (default) | General purpose |
| LZ4 | Maximum speed, low latency | Slightly lower compression ratio |
| Brotli | Highest compression ratio | Slower than Zstd/LZ4 |
| None | Pure block storage & integrity verification | No compression, fastest I/O |

Block Sizing

  • Small blocks – Optimise random access at the cost of compression ratio.
  • Medium blocks – Good balance for most workflows (default).
  • Large blocks – Maximise compression ratio at the cost of parallelism and random‑access speed.
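The trade-off above is easy to demonstrate: compressing the same repetitive data at different block sizes shows small blocks paying a per-block overhead and missing redundancy that spans block boundaries. A toy measurement, again using stdlib zlib as a stand-in codec:

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog. " * 2000  # ~90 KiB

def compressed_ratio(block_size: int) -> float:
    # Total compressed size of all blocks, relative to the input size.
    total = sum(
        len(zlib.compress(data[i:i + block_size]))
        for i in range(0, len(data), block_size)
    )
    return total / len(data)

# compressed_ratio(256) is noticeably worse than compressed_ratio(64 * 1024):
# small blocks cannot exploit redundancy across block boundaries.
```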

Real‑World Use Cases

Log‑file analysis

System admins often need only the most recent entries from massive compressed logs.

  • Traditional: Decompress gigabytes of history.
  • GXD: Seek directly to the last hour; decompress ~100 MiB instead of 10 GiB.
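The seek path reduces to simple index arithmetic: given a byte range, only the blocks overlapping it need to be read. A sketch of that mapping (the function name is illustrative, not GXD's API):

```python
def blocks_for_range(offset: int, length: int, block_size: int) -> range:
    """Return the indices of the blocks that cover [offset, offset + length)."""
    first = offset // block_size
    last = (offset + length - 1) // block_size  # inclusive
    return range(first, last + 1)

# Extracting the last 100 MiB of a 10 GiB archive with 4 MiB blocks:
needed = blocks_for_range(10 * 1024**3 - 100 * 1024**2,
                          100 * 1024**2,
                          4 * 1024**2)
# Only 25 of the archive's 2,560 blocks have to be decompressed.
```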

Research datasets

Scientists compress terabyte‑scale data (e.g., genome sequences).

  • Compress: Use 16 threads for speed.
  • Extract: Pull specific chromosome ranges in seconds, not hours.

Backup verification

Verifying multi‑terabyte backups is impractical with classic tools.

  • GXD: Block‑level SHA‑256 checksums let you confirm integrity without extracting any data.
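Because each checksum is recorded for a stored block, verification is just hashing the archive's blocks against the saved digests; nothing is decompressed or written out. A minimal sketch with hypothetical names:

```python
import hashlib

def verify_blocks(blocks, expected_digests):
    """Compare each stored block against its recorded SHA-256 digest.

    Returns the indices of corrupted blocks; an empty list means the
    archive verified cleanly without any extraction.
    """
    bad = []
    for i, (block, digest) in enumerate(zip(blocks, expected_digests)):
        if hashlib.sha256(block).hexdigest() != digest:
            bad.append(i)
    return bad
```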

Archive Format

[Magic number] ──> [Compressed block 0] ──> … ──> [Compressed block N] ──> [JSON metadata]
  • Magic number – Quick file identification.
  • Compressed blocks – Stored sequentially.
  • JSON metadata – Archive version, algorithm per block, block offsets & sizes, SHA‑256 checksums.

The self‑describing format enables efficient seeking and future‑proof compatibility.

Implementation Details

  • Language: Python 3.x
  • Parallelism: concurrent.futures.ProcessPoolExecutor
  • Compression libraries: zstandard, lz4.frame, brotli (via PyPI wheels)
  • Progress UI: tqdm (graceful fallback if unavailable)

Development Philosophy

  • Community‑driven – Roadmap shaped by users and contributors.
  • Open source – GPL‑3.0 guarantees freedom to use, modify, and redistribute.

Project Status

  • Version: 0.0.0a2 (alpha)
  • Core functionality – Stable compression/decompression, block‑level verification, CLI interface.
  • Feedback welcome – API design, feature set, documentation, and bug reports are actively solicited.

Testing

  • Comprehensive test suite covering full compression‑decompression cycles, corruption detection, edge‑case handling (empty files, very large blocks, etc.).
  • Caution – As an alpha release, run extensive validation before production use.

Roadmap (Ideas & Potential Directions)

  • Additional algorithms: LZMA, Zlib.
  • Encryption: Secure archives for sensitive data.
  • Multi‑file archives: Replace tar‑style preprocessing.
  • Incremental compression: Efficient backup workflows.
  • GUI: Friendly interface for non‑technical users.
  • Language bindings: Rust, Go, C/C++ wrappers.

Which of these features should be prioritised?

Get Involved

  • Report bugs / suggest features: Open an issue on the GitHub repository.
  • Contribute code: Fork, implement, and submit a pull request.
  • Improve docs: Help make the project more approachable for newcomers.

Together we can reshape how data is compressed, verified, and accessed in the multi‑core era.

Overview

GXD is an alpha‑stage, Python‑based file‑compression tool that emphasizes parallel processing, random‑access reads, and built‑in integrity verification. It is designed for modern hardware and workflows rather than the constraints of legacy utilities.

Requirements

  • Python: 3.6 or later
  • Supported OS: Linux, macOS, Windows, BSD

Optional dependencies

| Package | Purpose |
| --- | --- |
| zstandard | Zstandard compression algorithm |
| lz4 | LZ4 compression algorithm |
| brotli | Brotli compression algorithm |
| tqdm | Progress-bar display |

Install only the algorithms you plan to use.

Installation

# Core installation (Python 3.6+ required)
pip install gxd

# Install optional compression algorithms as needed
pip install zstandard lz4 brotli tqdm

Basic Usage

| Action | Command |
| --- | --- |
| Compress | gxd compress <input> <output> |
| Decompress | gxd decompress <archive> <output> |
| Seek (random access) | gxd seek <archive> <byte-range> |

Advanced options let you control algorithm selection, block size, thread count, and verification behavior.

Performance Tuning

  1. Choose the right algorithm – match your speed vs. compression‑ratio needs.
  2. Adjust block size – smaller blocks improve random access, larger blocks improve compression ratio.
  3. Set thread count – --threads N to limit or expand parallelism based on CPU cores.
  4. Enable/disable verification – --verify for integrity checks; omit for fastest throughput.