[Paper] ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
Source: arXiv - 2603.17435v1
Overview
The paper introduces ZipServ, a lossless compression system that makes serving large language models (LLMs) on GPUs both smaller and faster. By redesigning the compression format and the GPU kernels that consume it, the authors achieve up to a 30 % reduction in model size and measurable inference speed‑ups—something that most prior “bit‑exact” compressors could not deliver.
Key Contributions
- Tensor‑Core‑Aware Triple Bitmap Encoding (TCA‑TBE) – a fixed‑length, bitmap‑based representation that can be decoded in constant time and maps cleanly onto NVIDIA Tensor Cores.
- ZipGEMM kernel – a fused “decompress‑and‑multiply” kernel that streams compressed weights directly into Tensor‑Core registers, eliminating intermediate buffers.
- Hardware‑aware co‑design – the compression format and the compute kernel are built together, preserving SIMT parallelism and avoiding extra memory traffic.
- Empirical gains – up to 30 % model size reduction, 2.21× kernel‑level speedup over cuBLAS, and 1.22× end‑to‑end inference acceleration on the popular vLLM serving stack.
- First lossless system that simultaneously delivers storage savings and inference acceleration for LLMs on GPUs.
Methodology
- Encoding design – Traditional entropy coders (e.g., Huffman, arithmetic coding) emit variable‑length bitstreams, which break the lock‑step execution model of GPU warps. ZipServ replaces these with a triple‑bitmap layout: three parallel bitmaps encode the sign, exponent, and mantissa bits of each weight in a fixed‑size block. Because each bitmap is a regular, word‑aligned array, every thread can read its slice independently, preserving SIMT execution.
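The idea behind the triple‑bitmap split can be illustrated with plain FP16 bit manipulation. The sketch below separates one FP16 value into its three fixed‑width fields and reassembles it bit‑exactly; the paper's actual TCA‑TBE format additionally packs these planes into word‑aligned blocks, which this minimal example does not attempt to reproduce.

```python
import struct

def split_bits(x: float):
    """Split one FP16 value into its sign, exponent, and mantissa fields.

    Each field forms a regular, fixed-width plane, so GPU threads can
    fetch their slice without variable-length decoding. Illustrative
    only: TCA-TBE packs these planes into word-aligned blocks."""
    (bits,) = struct.unpack('<H', struct.pack('<e', x))  # FP16 -> raw uint16
    sign = (bits >> 15) & 0x1       # 1-bit sign plane
    exponent = (bits >> 10) & 0x1F  # 5-bit exponent plane
    mantissa = bits & 0x3FF         # 10-bit mantissa plane
    return sign, exponent, mantissa

def merge_bits(sign: int, exponent: int, mantissa: int) -> float:
    """Reassemble the FP16 value bit-exactly (lossless round trip)."""
    bits = (sign << 15) | (exponent << 10) | mantissa
    (x,) = struct.unpack('<e', struct.pack('<H', bits))
    return x
```

Because the split is a pure bit-field permutation, the round trip is exact: `merge_bits(*split_bits(w)) == w` for every representable FP16 weight, which is what makes the scheme lossless.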
- Tensor‑Core integration – The three bitmaps are streamed directly into the Tensor Core matrix‑multiply unit. The authors built a custom ZipGEMM kernel that:
- Loads compressed bitmap blocks from global memory.
- Performs on‑the‑fly decompression inside registers (no extra global‑memory writes).
- Feeds the resulting FP16/FP32 values to the Tensor Core’s GEMM operation.
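The fused pattern above can be sketched in scalar Python. In this sketch, `blocks`, `decode`, and `block_cols` are placeholders standing in for the paper's TCA‑TBE blocks, its on‑the‑fly decoder, and the Tensor Core tile width; the point is only that each tile is decoded and consumed immediately, so the dense weight matrix is never materialized in memory.

```python
def fused_decompress_matvec(blocks, decode, x, block_cols):
    """Fused decompress-and-multiply sketch (matrix-vector case).

    Decodes one compressed weight block at a time and multiply-
    accumulates immediately, mirroring how ZipGEMM keeps decompressed
    tiles in registers instead of writing them back to global memory."""
    y = None
    for i, blk in enumerate(blocks):
        tile = decode(blk)                          # rows x block_cols tile ("registers")
        xs = x[i * block_cols:(i + 1) * block_cols]  # matching input slice
        if y is None:
            y = [0.0] * len(tile)
        for r, row in enumerate(tile):              # multiply-accumulate per tile row
            y[r] += sum(w * v for w, v in zip(row, xs))
    return y
```

With an identity `decode` and a 2x4 weight matrix split into two column tiles, `fused_decompress_matvec([[[1, 2], [5, 6]], [[3, 4], [7, 8]]], lambda b: b, [1, 1, 1, 1], 2)` returns `[10.0, 26.0]`, matching the dense matrix-vector product.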
- System‑level fusion – In a typical serving pipeline, a model weight is first decompressed into a dense buffer, then a separate GEMM kernel reads that buffer. ZipServ collapses these two steps, cutting the memory round‑trip in half and reducing cache pressure.
- Evaluation – The authors benchmarked ZipServ on several state‑of‑the‑art LLMs (e.g., LLaMA‑7B, LLaMA‑13B) using NVIDIA A100 GPUs. They compared against:
- Uncompressed baseline (cuBLAS).
- Existing lossless compressors (e.g., DeepCompress).
- A popular serving framework (vLLM) for end‑to‑end latency.
Results & Findings
| Model | Compression Ratio | Kernel Speedup vs. cuBLAS | End‑to‑End Speedup vs. vLLM |
|---|---|---|---|
| LLaMA‑7B | 28 % smaller | 1.9× | 1.18× |
| LLaMA‑13B | 30 % smaller | 2.21× | 1.22× |
| GPT‑NeoX‑20B | 26 % smaller | 1.7× | 1.15× |
- Memory footprint drops by up to 30 %, allowing larger models to fit on a single GPU or freeing space for batch‑level parallelism.
- Kernel‑level throughput improves because the fused ZipGEMM eliminates the extra memory copy and leverages the full compute density of Tensor Cores.
- Overall latency sees a modest but consistent gain (≈ 1.2×) when integrated into a full serving stack, confirming that the compression overhead does not outweigh the compute benefits.
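A quick back‑of‑envelope check makes the footprint claim concrete. The snippet below assumes FP16 weights (2 bytes per parameter) and takes parameter counts from the model names rather than from the paper's exact measurements.

```python
def footprint_gb(params: float, reduction: float, bytes_per_param: int = 2) -> float:
    """Compressed model size in GB after a fractional size reduction.

    Assumes FP16 storage by default; `reduction` is the fraction saved
    (e.g., 0.30 for the 30 % figure reported for LLaMA-13B)."""
    return params * bytes_per_param * (1.0 - reduction) / 1e9

llama13b_dense = footprint_gb(13e9, 0.0)   # ~26.0 GB uncompressed
llama13b_zip = footprint_gb(13e9, 0.30)    # ~18.2 GB at 30 % reduction
```

At roughly 18 GB, a compressed 13B model fits within, for example, a 24 GB memory budget that its 26 GB dense counterpart would exceed, which is the mechanism behind the "larger models on a single GPU" claim.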
Practical Implications
- Cost‑effective scaling – Cloud providers can host bigger LLMs on the same GPU fleet, reducing hardware spend or enabling higher request concurrency.
- Edge‑oriented inference – The reduced memory requirement opens the door for deploying 7‑13 B‑parameter models on high‑end edge devices equipped with NVIDIA Jetson or similar GPUs.
- Simplified pipelines – Developers can replace the “load‑decompress‑compute” sequence with a single ZipGEMM call, decreasing code complexity and potential bugs.
- Compatibility – Because ZipServ works at the GEMM level, it can be dropped into existing frameworks (e.g., PyTorch, TensorFlow) via a custom CUDA kernel wrapper, without needing to retrain or fine‑tune the model.
- Future‑proofing – As newer GPUs expose larger Tensor‑Core matrices (e.g., Hopper’s FP8 support), the bitmap‑based encoding can be extended to match the native data formats, preserving the same speed‑up pattern.
Limitations & Future Work
- Hardware specificity – The current design is tightly coupled to NVIDIA Tensor Cores; porting to AMD or CPU‑based accelerators would require a different encoding or kernel strategy.
- Compression ceiling – Being lossless, ZipServ cannot match the dramatic size reductions of quantization or pruning; the ≈30 % reduction is the practical upper bound observed in the evaluation.
- Kernel complexity – The fused kernel is more intricate than a standard GEMM, which may increase maintenance burden and limit immediate adoption in high‑level libraries.
- Future directions suggested by the authors include: extending TCA‑TBE to support mixed‑precision (e.g., FP8/FP16) pipelines, exploring adaptive bitmap granularity for different layers, and integrating the approach into multi‑GPU model parallelism frameworks.
Authors
- Ruibo Fan
- Xiangrui Yu
- Xinglin Pan
- Zeyu Li
- Weile Luo
- Qiang Wang
- Wei Wang
- Xiaowen Chu
Paper Information
- arXiv ID: 2603.17435v1
- Categories: cs.DC, cs.AR, cs.LG, cs.PF
- Published: March 18, 2026