GXD: Rethinking File Compression for Modern Computing
Source: Dev.to
GXD – A Modern, Parallel‑First Compression Utility
Community‑driven, open‑source (GPL‑3.0), currently in alpha (v0.0.0a2)
The Problem
- Single‑core legacy design – Traditional tools compress/decompress data sequentially.
- Bottlenecks – Extracting a 1 MiB slice from a 10 GiB archive forces a full decompression.
- Under‑utilised hardware – Even on a 16‑core workstation most compressors use only one core.
GXD’s Core Idea
Instead of treating a file as one monolithic stream, GXD splits it into independent blocks.
Each block can be compressed and decompressed in parallel, turning compression into a truly scalable operation.
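To make the idea concrete, here is a minimal sketch of block‑wise parallel compression in Python. It is not GXD's actual code: the 4 MiB block size and the helper names are illustrative, and only the Zstandard backend is shown.

```python
from concurrent.futures import ProcessPoolExecutor

import zstandard  # one of the backends GXD supports

BLOCK_SIZE = 4 * 1024 * 1024  # illustrative; GXD makes block size configurable


def compress_block(block):
    """Compress one block independently of all others."""
    return zstandard.ZstdCompressor().compress(block)


def compress_file(path):
    """Split a file into fixed-size blocks and compress them in parallel."""
    with open(path, "rb") as f:
        blocks = list(iter(lambda: f.read(BLOCK_SIZE), b""))
    # Every block is self-contained, so the pool can fan the work out
    # across all available CPU cores.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(compress_block, blocks))
```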
Architecture & Advantages
| Feature | What it means | Benefit |
|---|---|---|
| Block‑based design | Files are divided into configurable blocks before compression. | Enables parallel processing and random‑access extraction. |
| True parallelism | Work is automatically distributed across all available CPU cores (via ProcessPoolExecutor). | Near‑linear speed‑up vs. single‑threaded tools. |
| Random‑access extraction | Only the blocks that contain the requested byte range are decompressed. | Partial extraction is orders of magnitude faster. |
| Per‑block integrity | SHA‑256 checksum stored for every block. | Verify data integrity without full extraction; optional for speed‑critical runs. |
Supported Compression Algorithms
| Algorithm | Typical use‑case | Trade‑off |
|---|---|---|
| Zstandard | General‑purpose default | Balanced speed & compression |
| LZ4 | Maximum speed, low latency | Slightly lower compression ratio |
| Brotli | Highest compression ratio | Slower than Zstd/LZ4 |
| None | Pure block storage & integrity verification | No compression, fastest I/O |
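For orientation, the one‑shot calls these backend libraries expose look like this (the sample data and tuning levels are illustrative; GXD wraps such calls rather than exposing them directly):

```python
import zstandard
import lz4.frame
import brotli

data = b"example payload " * 1000

zst = zstandard.ZstdCompressor(level=3).compress(data)   # balanced default
fast = lz4.frame.compress(data)                          # speed-oriented
small = brotli.compress(data, quality=11)                # ratio-oriented, slower

print(len(data), len(zst), len(fast), len(small))
```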
Block Sizing
- Small blocks – Faster random access, but a lower compression ratio.
- Medium blocks – Good balance for most workflows (default).
- Large blocks – Higher compression ratio, but less parallelism and slower random access.
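A rough back‑of‑the‑envelope calculation makes the trade‑off concrete (block sizes here are illustrative, not GXD defaults):

```python
file_size = 10 * 1024**3  # a 10 GiB input

for block_size in (256 * 1024, 4 * 1024**2, 64 * 1024**2):
    blocks = -(-file_size // block_size)  # ceiling division
    print(f"{block_size // 1024:>6} KiB blocks -> {blocks:>7} blocks")

# More blocks mean finer-grained random access and more parallel work items,
# but each block gives the compressor less context, hurting the ratio.
```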
Real‑World Use Cases
Log‑file analysis
System admins often need only the most recent entries from massive compressed logs.
- Traditional: Decompress gigabytes of history.
- GXD: Seek directly to the last hour; decompress ~100 MiB instead of 10 GiB.
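The seek path might look roughly like the sketch below: only the blocks that overlap the requested byte range are read and decompressed. The metadata field names are assumptions for illustration, not GXD's actual schema.

```python
import zstandard


def blocks_for_range(metadata, start, end):
    """Select only the blocks whose uncompressed ranges overlap [start, end)."""
    # metadata["blocks"] is assumed to list each block's uncompressed offset
    # and size, in file order -- field names are illustrative.
    for block in metadata["blocks"]:
        b_start = block["uncompressed_offset"]
        b_end = b_start + block["uncompressed_size"]
        if b_start < end and b_end > start:
            yield block


def read_range(archive, metadata, start, end):
    """Decompress only the overlapping blocks, then slice out the range."""
    out = bytearray()
    first = None
    for block in blocks_for_range(metadata, start, end):
        if first is None:
            first = block["uncompressed_offset"]
        archive.seek(block["offset"])
        raw = archive.read(block["compressed_size"])
        out += zstandard.ZstdDecompressor().decompress(raw)
    return bytes(out[start - first:end - first])
```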
Research datasets
Scientists compress terabyte‑scale data (e.g., genome sequences).
- Compress: Use 16 threads for speed.
- Extract: Pull specific chromosome ranges in seconds, not hours.
Backup verification
Verifying multi‑terabyte backups is impractical with classic tools.
- GXD: Block‑level SHA‑256 checksums let you confirm integrity without extracting any data.
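A sketch of what block‑level verification can look like, assuming the digest covers the stored (compressed) bytes; again, the metadata field names are illustrative:

```python
import hashlib


def verify_archive(archive, metadata):
    """Recompute each stored block's SHA-256 and compare it to the recorded digest.

    No block is ever decompressed -- only read and hashed -- so verification
    is bounded by sequential read speed, not decompression speed.
    """
    bad = []
    for i, block in enumerate(metadata["blocks"]):
        archive.seek(block["offset"])
        raw = archive.read(block["compressed_size"])
        if hashlib.sha256(raw).hexdigest() != block["sha256"]:
            bad.append(i)
    return bad  # an empty list means every block checked out
```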
Archive Format
[Magic number] ──> [Compressed block 0] ──> … ──> [Compressed block N]
└───────────────────────► [JSON metadata] ◄───────────────────────┘
- Magic number – Quick file identification.
- Compressed blocks – Stored sequentially.
- JSON metadata – Archive version, algorithm per block, block offsets & sizes, SHA‑256 checksums.
The self‑describing format enables efficient seeking and future‑proof compatibility.
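As a hypothetical example (the real schema may differ), the JSON metadata could look something like this:

```python
# A hypothetical metadata payload -- field names are illustrative,
# but the self-describing idea looks roughly like this:
metadata = {
    "version": "0.0.0a2",
    "blocks": [
        {
            "index": 0,
            "algorithm": "zstd",
            "offset": 8,                 # byte position of the block in the archive
            "compressed_size": 1048576,
            "uncompressed_offset": 0,
            "uncompressed_size": 4194304,
            "sha256": "<hex digest>",
        },
        # ... one entry per block ...
    ],
}
```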
Implementation Details
- Language: Python 3.x
- Parallelism: concurrent.futures.ProcessPoolExecutor
- Compression libraries: zstandard, lz4.frame, brotli (via PyPI wheels)
- Progress UI: tqdm (graceful fallback if unavailable)
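The "graceful fallback" for tqdm is a standard optional‑dependency pattern; here is a sketch under the assumption that GXD does something similar:

```python
from concurrent.futures import ProcessPoolExecutor

# Optional dependency: use tqdm when installed, otherwise a no-op wrapper.
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        return iterable


def compress_all(blocks, compress_block):
    """Compress blocks in parallel, showing a progress bar when tqdm is present."""
    with ProcessPoolExecutor() as pool:
        # pool.map preserves block order, which matters when writing the archive.
        return list(tqdm(pool.map(compress_block, blocks), total=len(blocks)))
```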
Development Philosophy
- Community‑driven – Roadmap shaped by users and contributors.
- Open source – GPL‑3.0 guarantees freedom to use, modify, and redistribute.
Project Status
- Version: 0.0.0a2 (alpha)
- Core functionality – Stable compression/decompression, block‑level verification, CLI interface.
- Feedback welcome – API design, feature set, documentation, and bug reports are actively solicited.
Testing
- Comprehensive test suite covering full compression‑decompression cycles, corruption detection, edge‑case handling (empty files, very large blocks, etc.).
- Caution – As an alpha release, run extensive validation before production use.
Roadmap (Ideas & Potential Directions)
- Additional algorithms: LZMA, Zlib.
- Encryption: Secure archives for sensitive data.
- Multi‑file archives: Replace tar‑style preprocessing.
- Incremental compression: Efficient backup workflows.
- GUI: Friendly interface for non‑technical users.
- Language bindings: Rust, Go, C/C++ wrappers.
Which of these features should be prioritised?
Get Involved
- Report bugs / suggest features: Open an issue on the GitHub repository.
- Contribute code: Fork, implement, and submit a pull request.
- Improve docs: Help make the project more approachable for newcomers.
Together we can reshape how data is compressed, verified, and accessed in the multi‑core era.
Overview
GXD is an alpha‑stage, Python‑based file‑compression tool that emphasizes parallel processing, random‑access reads, and built‑in integrity verification. It is designed for modern hardware and workflows rather than the constraints of legacy utilities.
Requirements
- Python: 3.6 or later
- Supported OS: Linux, macOS, Windows, BSD
Optional dependencies
| Package | Purpose |
|---|---|
| zstandard | Zstandard compression algorithm |
| lz4 | LZ4 compression algorithm |
| brotli | Brotli compression algorithm |
| tqdm | Progress‑bar display |
Install only the algorithms you plan to use.
Installation
# Core installation (Python 3.6+ required)
pip install gxd
# Install optional compression algorithms as needed
pip install zstandard lz4 brotli tqdm
Basic Usage
| Action | Command |
|---|---|
| Compress | gxd compress <input> <output> |
| Decompress | gxd decompress <archive> <output> |
| Seek (random access) | gxd seek <archive> <byte-range> |
Advanced options let you control algorithm selection, block size, thread count, and verification behavior.
Performance Tuning
- Choose the right algorithm – match your speed vs. compression‑ratio needs.
- Adjust block size – smaller blocks improve random access, larger blocks improve compression ratio.
- Set thread count – --threads N to limit or expand parallelism based on CPU cores.
- Enable/disable verification – --verify for integrity checks; omit for fastest throughput.