[Paper] ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution

Published: (June 2, 2026 at 06:44 PM EDT)
5 min read
Source: arXiv

Source: arXiv - 2606.04268v1

Overview

The paper introduces ACEAPEX, a new approach to LZ77‑based decompression that eliminates the classic sequential bottleneck by turning every back‑reference into an absolute offset in the final output. By packing data into self‑contained 1 MiB blocks, ACEAPEX can decode each block completely independently, unlocking massive parallelism on CPUs and GPUs. The authors demonstrate multi‑core speed‑ups of over 3× on modern AMD EPYC CPUs and record‑breaking GPU decode rates that retain standard‑grade compression ratios.

Key Contributions

  • Absolute‑offset encoding: Stores back‑references as positions in the decompressed stream rather than relative to the current decode pointer.
  • Self‑contained 1 MiB blocks: Each block contains all information needed for decoding, enabling embarrassingly parallel processing.
  • CPU implementation: Integrated into the popular lzbench suite, achieving up to 10.9 GB/s on 8‑core EPYC CPUs for FASTQ genomic data.
  • GPU wavefront decoder: A novel wavefront‑style algorithm for NVIDIA H100 GPUs that reaches 44 GB/s (single‑GPU) and 77 GB/s (depth‑limited encoder) on large text corpora, with a verified bit‑perfect output.
  • Compression‑ratio parity: Maintains compression ratios comparable to state‑of‑the‑art zstd -3, with only a 1.5 % penalty in the depth‑limited variant.

Methodology

  1. Encode‑time absolute offset resolution – During compression, the encoder computes the exact position each back‑reference will occupy in the final decompressed file and stores that absolute address instead of a relative distance.
  2. Block partitioning – The input stream is split into fixed‑size 1 MiB blocks. Each block’s header records the absolute offsets of its own back‑references, guaranteeing that no reference points outside the block.
  3. Parallel decoding on CPUs – Because blocks are independent, a thread pool can assign each block to a core. Decoding proceeds with a simple copy‑loop; no synchronization or look‑ahead is required.
  4. GPU wavefront decoder – The GPU version maps each block to a warp (32 threads). A “wavefront match” phase resolves matches in parallel, while a subsequent copy phase writes the literal and match data to the output buffer. The algorithm respects data dependencies by using a depth‑first ordering that fits the GPU’s massive thread parallelism.
  5. Verification – The authors run a byte‑for‑byte comparison against a reference decoder to guarantee lossless correctness (BIT‑PERFECT).

Results & Findings

PlatformDatasetThroughputCompression Ratio (vs. zstd -3)
AMD EPYC 4344P (8 cores)FASTQ (genomics)10,160 MB/s≈ same
AMD EPYC 9575F (8 cores)FASTQ10,869 MB/s≈ same
NVIDIA H100 (single)enwik9 (text)44.0 GB/s (decode)
NVIDIA H100 (single, depth‑limited)enwik977.2 GB/s‑1.5 % ratio loss
2 × H100 (NVLink)enwik9249.9 GB/s
NVIDIA H100 (single)FASTQ20.3 GB/s (wavefront match)
  • Speedup vs. zstd -3: Up to 3.1× faster on CPUs while keeping comparable compression.
  • GPU scaling: Near‑linear scaling when adding a second H100 via NVLink, demonstrating the algorithm’s suitability for high‑throughput data pipelines.
  • Correctness: All outputs match a reference decoder byte‑for‑byte, confirming that the aggressive parallelism does not compromise losslessness.

Practical Implications

  • High‑throughput genomics pipelines – FASTQ files often dominate I/O; ACEAPEX can decompress them >10 GB/s on a single server, dramatically reducing bottlenecks in variant‑calling or alignment stages.
  • Big‑data analytics – Textual logs (e.g., web crawls, telemetry) can be streamed from storage to compute nodes at >40 GB/s per GPU, enabling real‑time analytics on petabyte‑scale archives.
  • Edge‑to‑cloud data transfer – The block‑based format works well with distributed storage systems (e.g., object stores, HDFS) because each block is self‑contained, allowing parallel fetches without coordination.
  • GPU‑accelerated workflows – Existing GPU‑centric pipelines (deep‑learning preprocessing, video transcoding) can now offload LZ77 decompression to the same accelerator, freeing CPU cycles.
  • Library integration – Since ACEAPEX plugs into lzbench, developers can adopt it with minimal code changes, and the approach could be ported to other compression libraries (e.g., libdeflate, zstd).

Limitations & Future Work

  • Block size trade‑off – Fixed 1 MiB blocks simplify independence but may slightly inflate the compressed size for highly repetitive data where longer matches cross block boundaries.
  • Encoder overhead – Computing absolute offsets adds modest CPU work during compression; the paper focuses on decode speed, so encoder performance could be further optimized.
  • GPU memory pressure – The wavefront decoder requires enough GPU memory to hold multiple blocks and auxiliary buffers; extremely large datasets may need careful chunking.
  • Broader codec support – The technique is demonstrated for LZ77; extending absolute‑offset encoding to newer dictionary‑based codecs (e.g., Zstandard’s sequences) is an open research direction.

Bottom line: ACEAPEX shows that by rethinking how back‑references are stored, we can finally break the long‑standing sequential wall of LZ77 decoding, delivering order‑of‑magnitude speedups on both CPUs and GPUs without sacrificing compression quality. This opens the door for faster data‑intensive pipelines across genomics, big‑data analytics, and any domain that still relies on LZ77‑based formats.

Authors

  • Yakiv Shavidze

Paper Information

  • arXiv ID: 2606.04268v1
  • Categories: cs.DC
  • Published: June 2, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »