[Paper] SPQ: An Ensemble Technique for Large Language Model Compression

Published: February 20, 2026 at 01:44 PM EST
4 min read
Source: arXiv - 2602.18420v1

Overview

The paper introduces SPQ, a three‑step ensemble method that compresses large language models (LLMs) without sacrificing accuracy. By chaining Singular Value Decomposition (SVD), activation‑based pruning, and 8‑bit post‑training quantization, the authors show that LLaMA‑2‑7B can be shrunk by up to 75 % while keeping (or even improving) perplexity and downstream task performance.

Key Contributions

  • Unified compression pipeline that combines three complementary techniques (SVD + pruning + quantization).
  • Layer‑aware SVD that factorizes attention projection matrices into low‑rank components while preserving variance.
  • Activation‑driven pruning that removes redundant MLP neurons based on runtime statistics, not just static weight magnitude.
  • Memory‑efficient 8‑bit linear quantization applied after the first two steps, enabling a single‑pass post‑training compression.
  • Empirical validation on LLaMA‑2‑7B across language modeling (WikiText‑2, C4) and reasoning benchmarks (TruthfulQA, GSM8K), outperforming single‑method baselines and matching strong competitors like GPTQ and SparseGPT.
  • Up to 1.9× higher inference throughput than GPTQ, with a lower peak memory footprint (6.86 GB vs. 7.16 GB).

Methodology

  1. SVD Compression – Each attention head’s projection matrix W is decomposed as W = UΣVᵀ. By keeping only the top‑k singular values needed to retain a target share of the variance (e.g., 99 %), the matrix is replaced by two smaller factors, reducing FLOPs and memory.
  2. Activation‑Based Pruning – During a short calibration run on a representative dataset, the average activation magnitude of every MLP neuron is recorded. Neurons whose activations fall below a percentile threshold are pruned, and the surrounding weight matrices are re‑wired accordingly. This removes “dead” capacity that does not contribute to the model’s output.
  3. 8‑Bit Linear Quantization – After SVD and pruning, all remaining linear layers are quantized to 8‑bit integers using a standard post‑training quantizer (e.g., per‑channel min‑max scaling). No fine‑tuning is required, keeping the pipeline fast and hardware‑friendly.
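The SVD step above can be sketched in a few lines of NumPy. This is a minimal illustration of truncating a weight matrix at a variance threshold, not the paper’s implementation; the function name, matrix shapes, and threshold handling are assumptions for the example.

```python
import numpy as np

def svd_compress(W, variance=0.99):
    """Factor W (d_out x d_in) into low-rank A (d_out x k) and B (k x d_in),
    keeping the smallest k whose squared singular values retain the
    requested share of the total variance (energy)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    k = min(int(np.searchsorted(energy, variance)) + 1, len(S))
    A = U[:, :k] * S[:k]   # fold the singular values into the left factor
    B = Vt[:k, :]
    return A, B, k

# A synthetic low-rank matrix standing in for an attention projection.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 32)) @ rng.standard_normal((32, 512))
A, B, k = svd_compress(W, variance=0.99)
```

Replacing the product `W @ x` with `A @ (B @ x)` costs `k * (d_out + d_in)` multiply‑adds instead of `d_out * d_in`, which is a saving whenever `k` is below `d_out * d_in / (d_out + d_in)`.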

The three steps are applied sequentially but are designed to be orthogonal: SVD tackles low‑rank redundancy in attention, pruning eliminates unnecessary MLP neurons, and quantization compresses everything uniformly. The authors also provide a simple hyper‑parameter sweep (rank retention, pruning percentile, quantization scheme) that can be automated for any target compression ratio.
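The remaining two steps can be sketched similarly. The snippet below shows activation‑percentile pruning of MLP neurons followed by per‑channel min‑max 8‑bit quantization; the function names, shapes, and the synthetic calibration statistics are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def prune_neurons(W_in, W_out, mean_act, percentile=20.0):
    """Drop MLP neurons whose mean calibration activation falls below the
    given percentile, rewiring both surrounding weight matrices.
    W_in (hidden, d_model) produces the neurons; W_out (d_model, hidden)
    consumes them; mean_act (hidden,) is the average |activation| per neuron."""
    keep = mean_act >= np.percentile(mean_act, percentile)
    return W_in[keep, :], W_out[:, keep]

def quantize_int8(W):
    """Symmetric per-channel (per-output-row) 8-bit quantization."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # guard against all-zero rows
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale          # dequantize with q * scale

rng = np.random.default_rng(1)
W_in = rng.standard_normal((1024, 256))
W_out = rng.standard_normal((256, 1024))
mean_act = np.abs(rng.standard_normal(1024))  # stand-in calibration stats
W_in_p, W_out_p = prune_neurons(W_in, W_out, mean_act)
q, scale = quantize_int8(W_in_p)
```

Because both operations are post‑training and need only calibration statistics, they compose with the SVD step without any gradient updates, which is what keeps the pipeline a single pass.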

Results & Findings

| Model / Dataset | Baseline | SPQ (75 % compression) | GPTQ (similar memory) |
|---|---|---|---|
| LLaMA‑2‑7B (WikiText‑2, perplexity) | 5.47 | 4.91 (improved) | 5.12 |
| LLaMA‑2‑7B (C4, perplexity) | 7.31 | 7.05 | 7.08 |
| TruthfulQA (accuracy) | 71.2 % | 71.0 % | 70.8 % |
| GSM8K (score) | 71.5 | 71.3 | 71.1 |
  • Memory reduction: up to 75 % (peak RAM from ~27 GB to ~6.8 GB).
  • Throughput: 1.3–1.9× faster than GPTQ on a single A100 GPU.
  • Compression trade‑off: At lower compression ratios (e.g., 50 %), SPQ matches the perplexity of the original model while still halving memory usage.

The experiments confirm that the ensemble approach consistently beats any single technique applied in isolation, highlighting the complementary nature of the three methods.

Practical Implications

  • Edge & on‑premise deployment: Developers can now run 7‑billion‑parameter LLMs on commodity GPUs or even high‑end CPUs with modest RAM, opening up private‑cloud or on‑device inference scenarios.
  • Cost‑effective serving: Lower memory footprints translate to smaller VM instances or higher model density per GPU, reducing cloud‑hosting expenses.
  • Faster response times: The observed inference speedup means lower latency for chat‑bot or code‑completion services, improving user experience.
  • Simplified pipeline: Because SPQ is a post‑training process that does not require expensive fine‑tuning, teams can integrate it into existing CI/CD workflows with minimal engineering overhead.
  • Compatibility: The final 8‑bit model can be loaded by standard inference runtimes (e.g., Hugging Face Transformers, vLLM) without custom kernels, easing adoption.

Limitations & Future Work

  • Calibration data dependence: Pruning decisions rely on a small calibration set; if this set is not representative, some useful neurons might be removed.
  • Fixed rank selection: The current SVD step uses a global variance threshold; adaptive per‑layer rank selection could yield better trade‑offs.
  • Quantization granularity: Only uniform 8‑bit quantization is explored; mixed‑precision or newer integer formats (e.g., 4‑bit) might push compression further.
  • Scalability to >30B models: Experiments focus on a 7B model; extending SPQ to truly massive LLMs may require additional memory‑efficient SVD algorithms or distributed pruning.

The authors suggest exploring automated hyper‑parameter search, integrating knowledge‑distillation fine‑tuning after compression, and testing SPQ on multimodal models as promising next steps.

Authors

  • Jiamin Yao
  • Eren Gultepe

Paper Information

  • arXiv ID: 2602.18420v1
  • Categories: cs.CL
  • Published: February 20, 2026