GXD: Rethinking File Compression for Modern Computing
Source: Dev.to
GXD – A Modern, Parallel‑First Compression Utility
Community‑driven, open‑source (GPL‑3.0), currently in alpha (v0.0.0a2)
The Problem
- Single‑core legacy design – Traditional tools compress/decompress data sequentially.
- Bottlenecks – Extracting a 1 MiB slice from a 10 GiB archive forces a full decompression.
- Under‑utilised hardware – Even on a 16‑core workstation most compressors use only one core.
GXD’s Core Idea
Instead of treating a file as one monolithic stream, GXD splits it into independent blocks.
Each block can be compressed and decompressed in parallel, turning compression into a truly scalable operation.
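To make the idea concrete, here is a minimal sketch of block‑wise parallel compression in Python. It is not GXD's actual code: the 4 MiB block size and the helper names are illustrative, and only the Zstandard backend is shown.

```python
from concurrent.futures import ProcessPoolExecutor

import zstandard  # one of the backends GXD supports

BLOCK_SIZE = 4 * 1024 * 1024  # illustrative; GXD makes block size configurable


def compress_block(block):
    """Compress one block independently of all others."""
    return zstandard.ZstdCompressor().compress(block)


def compress_file(path):
    """Split a file into fixed-size blocks and compress them in parallel."""
    with open(path, "rb") as f:
        blocks = list(iter(lambda: f.read(BLOCK_SIZE), b""))
    # Every block is self-contained, so the pool can fan the work out
    # across all available CPU cores.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(compress_block, blocks))
```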
Architecture & Advantages
| Feature | What it means | Benefit |
|---|---|---|
| Block‑based design | Files are divided into configurable blocks before compression. | Enables parallel processing and random‑access extraction. |
| True parallelism | Work is automatically distributed across all available CPU cores (via ProcessPoolExecutor). | Near‑linear speed‑up vs. single‑threaded tools. |
| Random‑access extraction | Only the blocks that contain the requested byte range are decompressed. | Partial extraction is orders of magnitude faster. |
| Per‑block integrity | SHA‑256 checksum stored for every block. | Verify data integrity without full extraction; optional for speed‑critical runs. |
Supported Compression Algorithms
| Algorithm | Typical use‑case | Trade‑off |
|---|---|---|
| Zstandard | General‑purpose default | Balanced speed & compression |
| LZ4 | Maximum speed, low latency | Slightly lower compression ratio |
| Brotli | Highest compression ratio | Slower than Zstd/LZ4 |
| None | Pure block storage & integrity verification | No compression, fastest I/O |
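For orientation, the one‑shot calls these backend libraries expose look like this (the sample data and tuning levels are illustrative; GXD wraps such calls rather than exposing them directly):

```python
import zstandard
import lz4.frame
import brotli

data = b"example payload " * 1000

zst = zstandard.ZstdCompressor(level=3).compress(data)   # balanced default
fast = lz4.frame.compress(data)                          # speed-oriented
small = brotli.compress(data, quality=11)                # ratio-oriented, slower

print(len(data), len(zst), len(fast), len(small))
```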
Block Sizing
- Small blocks – Faster random access, but a lower compression ratio.
- Medium blocks – Good balance for most workflows (default).
- Large blocks – Higher compression ratio, but less parallelism and slower random access.
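A rough back‑of‑the‑envelope calculation makes the trade‑off concrete (block sizes here are illustrative, not GXD defaults):

```python
file_size = 10 * 1024**3  # a 10 GiB input

for block_size in (256 * 1024, 4 * 1024**2, 64 * 1024**2):
    blocks = -(-file_size // block_size)  # ceiling division
    print(f"{block_size // 1024:>6} KiB blocks -> {blocks:>7} blocks")

# More blocks mean finer-grained random access and more parallel work items,
# but each block gives the compressor less context, hurting the ratio.
```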
Real‑World Use Cases
Log‑file analysis
System admins often need only the most recent entries from massive compressed logs.
- Traditional: Decompress gigabytes of history.
- GXD: Seek directly to the last hour; decompress ~100 MiB instead of 10 GiB.
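The seek path might look roughly like the sketch below: only the blocks that overlap the requested byte range are read and decompressed. The metadata field names are assumptions for illustration, not GXD's actual schema.

```python
import zstandard


def blocks_for_range(metadata, start, end):
    """Select only the blocks whose uncompressed ranges overlap [start, end)."""
    # metadata["blocks"] is assumed to list each block's uncompressed offset
    # and size, in file order -- field names are illustrative.
    for block in metadata["blocks"]:
        b_start = block["uncompressed_offset"]
        b_end = b_start + block["uncompressed_size"]
        if b_start < end and b_end > start:
            yield block


def read_range(archive, metadata, start, end):
    """Decompress only the overlapping blocks, then slice out the range."""
    out = bytearray()
    first = None
    for block in blocks_for_range(metadata, start, end):
        if first is None:
            first = block["uncompressed_offset"]
        archive.seek(block["offset"])
        raw = archive.read(block["compressed_size"])
        out += zstandard.ZstdDecompressor().decompress(raw)
    return bytes(out[start - first:end - first])
```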
Research datasets
Scientists compress terabyte‑scale data (e.g., genome sequences).
- Compress: Use 16 threads for speed.
- Extract: Pull specific chromosome ranges in seconds, not hours.
Backup verification
Verifying multi‑terabyte backups is impractical with classic tools.
- GXD: Block‑level SHA‑256 checksums let you confirm integrity without extracting any data.
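A sketch of what block‑level verification can look like, assuming the digest covers the stored (compressed) bytes; again, the metadata field names are illustrative:

```python
import hashlib


def verify_archive(archive, metadata):
    """Recompute each stored block's SHA-256 and compare it to the recorded digest.

    No block is ever decompressed -- only read and hashed -- so verification
    is bounded by sequential read speed, not decompression speed.
    """
    bad = []
    for i, block in enumerate(metadata["blocks"]):
        archive.seek(block["offset"])
        raw = archive.read(block["compressed_size"])
        if hashlib.sha256(raw).hexdigest() != block["sha256"]:
            bad.append(i)
    return bad  # an empty list means every block checked out
```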
Archive Format
[Magic number] ──> [Compressed block 0] ──> … ──> [Compressed block N]
└───────────────────────► [JSON metadata] ◄───────────────────────┘
- Magic number – Quick file identification.
- Compressed blocks – Stored sequentially.
- JSON metadata – Archive version, algorithm per block, block offsets & sizes, SHA‑256 checksums.
The self‑describing format enables efficient seeking and future‑proof compatibility.
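As a hypothetical example (the real schema may differ), the JSON metadata could look something like this:

```python
# A hypothetical metadata payload -- field names are illustrative,
# but the self-describing idea looks roughly like this:
metadata = {
    "version": "0.0.0a2",
    "blocks": [
        {
            "index": 0,
            "algorithm": "zstd",
            "offset": 8,                 # byte position of the block in the archive
            "compressed_size": 1048576,
            "uncompressed_offset": 0,
            "uncompressed_size": 4194304,
            "sha256": "<hex digest>",
        },
        # ... one entry per block ...
    ],
}
```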
Implementation Details
- Language: Python 3.x
- Parallelism: concurrent.futures.ProcessPoolExecutor
- Compression libraries: zstandard, lz4.frame, brotli (via PyPI wheels)
- Progress UI: tqdm (graceful fallback if unavailable)
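The "graceful fallback" for tqdm is a standard optional‑dependency pattern; here is a sketch under the assumption that GXD does something similar:

```python
from concurrent.futures import ProcessPoolExecutor

# Optional dependency: use tqdm when installed, otherwise a no-op wrapper.
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        return iterable


def compress_all(blocks, compress_block):
    """Compress blocks in parallel, showing a progress bar when tqdm is present."""
    with ProcessPoolExecutor() as pool:
        # pool.map preserves block order, which matters when writing the archive.
        return list(tqdm(pool.map(compress_block, blocks), total=len(blocks)))
```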
Development Philosophy
- Community‑driven – Roadmap shaped by users and contributors.
- Open source – GPL‑3.0 guarantees freedom to use, modify, and redistribute.
Project Status
- Version: 0.0.0a2 (alpha)
- Core functionality – Stable compression/decompression, block‑level verification, CLI interface.
- Feedback welcome – API design, feature set, documentation, and bug reports are actively solicited.
Testing
- Comprehensive test suite covering full compression‑decompression cycles, corruption detection, edge‑case handling (empty files, very large blocks, etc.).
- Caution – As an alpha release, run extensive validation before production use.
Roadmap (Ideas & Potential Directions)
- Additional algorithms: LZMA, Zlib.
- Encryption: Secure archives for sensitive data.
- Multi‑file archives: Replace tar‑style preprocessing.
- Incremental compression: Efficient backup workflows.
- GUI: Friendly interface for non‑technical users.
- Language bindings: Rust, Go, C/C++ wrappers.
Which of these features should be prioritised?
Get Involved
- Report bugs / suggest features: Open an issue on the GitHub repository.
- Contribute code: Fork, implement, and submit a pull request.
- Improve docs: Help make the project more approachable for newcomers.
Together we can reshape how data is compressed, verified, and accessed in the multi‑core era.
Overview
GXD is an alpha‑stage, Python‑based file‑compression tool that emphasizes parallel processing, random‑access reads, and built‑in integrity verification. It is designed for modern hardware and workflows rather than the constraints of legacy utilities.
Requirements
- Python: 3.6 or later
- Supported OS: Linux, macOS, Windows, BSD
Optional dependencies
| Package | Purpose |
|---|---|
| zstandard | Zstandard compression algorithm |
| lz4 | LZ4 compression algorithm |
| brotli | Brotli compression algorithm |
| tqdm | Progress‑bar display |
Install only the algorithms you plan to use.
Installation
# Core installation (Python 3.6+ required)
pip install gxd
# Install optional compression algorithms as needed
pip install zstandard lz4 brotli tqdm
Basic Usage
| Action | Command |
|---|---|
| Compress | gxd compress <input> <output> |
| Decompress | gxd decompress <archive> <output> |
| Seek (random access) | gxd seek <archive> <byte-range> |
Advanced options let you control algorithm selection, block size, thread count, and verification behavior.
Performance Tuning
- Choose the right algorithm – match your speed vs. compression‑ratio needs.
- Adjust block size – smaller blocks improve random access, larger blocks improve compression ratio.
- Set thread count – --threads N to limit or expand parallelism based on CPU cores.
- Enable/disable verification – --verify for integrity checks; omit for fastest throughput.