[Paper] ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution

Published: 1 week ago (June 2, 2026 at 06:44 PM EDT)

5 min read

Source: arXiv

Source: arXiv - 2606.04268v1

Overview

The paper introduces ACEAPEX, a new approach to LZ77‑based decompression that eliminates the classic sequential bottleneck by turning every back‑reference into an absolute offset in the final output. By packing data into self‑contained 1 MiB blocks, ACEAPEX can decode each block completely independently, unlocking massive parallelism on CPUs and GPUs. The authors demonstrate multi‑core speed‑ups of over 3× on modern AMD EPYC CPUs and record‑breaking GPU decode rates that retain standard‑grade compression ratios.

Key Contributions

Absolute‑offset encoding: Stores back‑references as positions in the decompressed stream rather than relative to the current decode pointer.
Self‑contained 1 MiB blocks: Each block contains all information needed for decoding, enabling embarrassingly parallel processing.
CPU implementation: Integrated into the popular lzbench suite, achieving up to 10.9 GB/s on 8‑core EPYC CPUs for FASTQ genomic data.
GPU wavefront decoder: A novel wavefront‑style algorithm for NVIDIA H100 GPUs that reaches 44 GB/s (single‑GPU) and 77 GB/s (depth‑limited encoder) on large text corpora, with a verified bit‑perfect output.
Compression‑ratio parity: Maintains compression ratios comparable to state‑of‑the‑art zstd -3, with only a 1.5 % penalty in the depth‑limited variant.

Methodology

Encode‑time absolute offset resolution – During compression, the encoder computes the exact position each back‑reference will occupy in the final decompressed file and stores that absolute address instead of a relative distance.
Block partitioning – The input stream is split into fixed‑size 1 MiB blocks. Each block’s header records the absolute offsets of its own back‑references, guaranteeing that no reference points outside the block.
Parallel decoding on CPUs – Because blocks are independent, a thread pool can assign each block to a core. Decoding proceeds with a simple copy‑loop; no synchronization or look‑ahead is required.
GPU wavefront decoder – The GPU version maps each block to a warp (32 threads). A “wavefront match” phase resolves matches in parallel, while a subsequent copy phase writes the literal and match data to the output buffer. The algorithm respects data dependencies by using a depth‑first ordering that fits the GPU’s massive thread parallelism.
Verification – The authors run a byte‑for‑byte comparison against a reference decoder to guarantee lossless correctness (BIT‑PERFECT).

Results & Findings

Platform	Dataset	Throughput	Compression Ratio (vs. zstd -3)
AMD EPYC 4344P (8 cores)	FASTQ (genomics)	10,160 MB/s	≈ same
AMD EPYC 9575F (8 cores)	FASTQ	10,869 MB/s	≈ same
NVIDIA H100 (single)	enwik9 (text)	44.0 GB/s (decode)	–
NVIDIA H100 (single, depth‑limited)	enwik9	77.2 GB/s	‑1.5 % ratio loss
2 × H100 (NVLink)	enwik9	249.9 GB/s	–
NVIDIA H100 (single)	FASTQ	20.3 GB/s (wavefront match)	–

Speedup vs. zstd -3: Up to 3.1× faster on CPUs while keeping comparable compression.
GPU scaling: Near‑linear scaling when adding a second H100 via NVLink, demonstrating the algorithm’s suitability for high‑throughput data pipelines.
Correctness: All outputs match a reference decoder byte‑for‑byte, confirming that the aggressive parallelism does not compromise losslessness.

Practical Implications

High‑throughput genomics pipelines – FASTQ files often dominate I/O; ACEAPEX can decompress them >10 GB/s on a single server, dramatically reducing bottlenecks in variant‑calling or alignment stages.
Big‑data analytics – Textual logs (e.g., web crawls, telemetry) can be streamed from storage to compute nodes at >40 GB/s per GPU, enabling real‑time analytics on petabyte‑scale archives.
Edge‑to‑cloud data transfer – The block‑based format works well with distributed storage systems (e.g., object stores, HDFS) because each block is self‑contained, allowing parallel fetches without coordination.
GPU‑accelerated workflows – Existing GPU‑centric pipelines (deep‑learning preprocessing, video transcoding) can now offload LZ77 decompression to the same accelerator, freeing CPU cycles.
Library integration – Since ACEAPEX plugs into lzbench, developers can adopt it with minimal code changes, and the approach could be ported to other compression libraries (e.g., libdeflate, zstd).

Limitations & Future Work

Block size trade‑off – Fixed 1 MiB blocks simplify independence but may slightly inflate the compressed size for highly repetitive data where longer matches cross block boundaries.
Encoder overhead – Computing absolute offsets adds modest CPU work during compression; the paper focuses on decode speed, so encoder performance could be further optimized.
GPU memory pressure – The wavefront decoder requires enough GPU memory to hold multiple blocks and auxiliary buffers; extremely large datasets may need careful chunking.
Broader codec support – The technique is demonstrated for LZ77; extending absolute‑offset encoding to newer dictionary‑based codecs (e.g., Zstandard’s sequences) is an open research direction.

Bottom line: ACEAPEX shows that by rethinking how back‑references are stored, we can finally break the long‑standing sequential wall of LZ77 decoding, delivering order‑of‑magnitude speedups on both CPUs and GPUs without sacrificing compression quality. This opens the door for faster data‑intensive pipelines across genomics, big‑data analytics, and any domain that still relies on LZ77‑based formats.

Authors

Yakiv Shavidze

Paper Information

arXiv ID: 2606.04268v1
Categories: cs.DC
Published: June 2, 2026
PDF: Download PDF

[Paper] ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

[Paper] Predictive Autoscaling in Cloud-Native and Federated Cloud-Edge Computing Environments: A Taxonomy and Future Directions

[Paper] PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer

[Paper] Mission-Level Runtime Assurance Framework for Autonomous Driving