[Paper] From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Published: June 1, 2026 at 01:52 PM EDT
Source: arXiv

Overview

The paper “From Layers to Submodules: Rethinking Granularity in Replacement‑Based LLM Compression” challenges the prevailing assumption that large language models (LLMs) must be pruned or replaced at whole‑layer granularity. By moving the compression granularity down to individual submodules (the attention and feed‑forward blocks inside each transformer layer), the authors show that a compressed model can retain more of its predictive power while still gaining speed and memory savings.

Key Contributions

  • SubFit framework – a novel post‑training compression pipeline that selects non‑contiguous attention and feed‑forward submodules for replacement and equips each with its own lightweight fitted residual bypass.
  • Granularity shift – demonstrates that redundancy in pretrained transformers is distributed irregularly across submodules, not confined to contiguous layers, and that different submodule types benefit from tailored replacement strategies.
  • Comprehensive evaluation – experiments on ten LLMs (five base models and five instruction‑tuned variants) across five sparsity levels (12.5 %–37.5 %) and against four state‑of‑the‑art replacement‑based baselines.
  • Strong empirical gains – at 25 % sparsity SubFit retains 84.6 % of dense downstream accuracy with only 2.42× perplexity degradation, outperforming the best baseline (81.6 % accuracy, 4.34× perplexity).
  • Practical speedups – measurable inference latency reductions and KV‑cache memory savings, making the method attractive for real‑world deployment.
  • Open‑source release – code and calibration scripts are publicly available, enabling reproducibility and easy integration into existing pipelines.

Methodology

  1. Calibration‑only post‑training – SubFit does not require any further pre‑training; it only needs a modest calibration dataset (e.g., a few thousand unlabeled tokens).
  2. Submodule selection – each transformer layer is broken into its constituent attention block and feed‑forward block. A sparsity budget is allocated, and a scoring function (based on activation statistics and sensitivity analysis) ranks submodules for removal. Importantly, the selected submodules can be scattered throughout the network rather than forming a contiguous block.
  3. Fitted residual bypass – for every removed submodule, a tiny neural “bypass” module is trained to predict the residual output that the original submodule would have produced. This bypass is lightweight (often a single linear layer or a shallow MLP) and is trained on the calibration data to minimize the reconstruction error.
  4. Integration – the bypass is inserted in place of the original submodule, preserving the model’s overall architecture while reducing the number of heavy transformer components.
  5. Evaluation – the compressed model is tested on standard language modeling perplexity benchmarks and downstream task accuracy (e.g., classification, QA) to assess the trade‑off between compression and performance.
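
The selection and bypass‑fitting steps above can be sketched on toy data. This is a minimal illustration, not the authors' implementation: the scoring function here is simply the reconstruction error of a linear bypass, used as a stand‑in for the activation‑statistics and sensitivity scores the paper describes, and every name (fit_linear_bypass, relative_error, select_submodules, the "L0.attn"‑style labels) is hypothetical. For brevity all toy submodules share one input matrix, whereas in a real model each submodule sees its own hidden states.

```python
import numpy as np

def fit_linear_bypass(X, Y, ridge=1e-3):
    """Fit Y ≈ X @ W + b by ridge-regularized least squares.

    X: (n_tokens, d) calibration inputs to a submodule.
    Y: (n_tokens, d) outputs that the submodule adds to the residual stream.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    gram = Xc.T @ Xc + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b

def relative_error(X, Y, W, b):
    """Reconstruction error of the bypass, relative to the output norm."""
    residual = X @ W + b - Y
    return float(np.linalg.norm(residual) / (np.linalg.norm(Y) + 1e-12))

def select_submodules(errors, budget):
    """Assumed scoring rule: replace the `budget` submodules whose
    bypass reconstructs them best (lowest error)."""
    ranked = sorted(errors, key=errors.get)
    return sorted(ranked[:budget])

# Toy calibration data: 512 "tokens" with hidden size 8.
rng = np.random.default_rng(0)
d, n_tokens = 8, 512
X = rng.standard_normal((n_tokens, d))

def random_matrix():
    return rng.standard_normal((d, d)) / np.sqrt(d)

A0, A1, A2, A3 = (random_matrix() for _ in range(4))

# Two nearly linear submodules (easy to replace) and two strongly
# nonlinear ones (hard to replace), deliberately non-contiguous.
toy_outputs = {
    "L0.attn": X @ A0 + 0.01 * rng.standard_normal((n_tokens, d)),
    "L0.ffn": np.tanh(3.0 * X @ A1),
    "L1.attn": np.sin(4.0 * X @ A2),
    "L1.ffn": X @ A3 + 0.01 * rng.standard_normal((n_tokens, d)),
}

errors, bypasses = {}, {}
for name, Y in toy_outputs.items():
    W, b = fit_linear_bypass(X, Y)
    bypasses[name] = (W, b)
    errors[name] = relative_error(X, Y, W, b)

selected = select_submodules(errors, budget=2)
print(selected)
```

On this toy setup the two nearly linear submodules are selected even though they sit in different layers, which mirrors the non‑contiguous selection described in step 2. A closed‑form least‑squares fit is only one way to train a single‑linear‑layer bypass; a shallow MLP bypass would need gradient‑based fitting instead.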

Results & Findings

Sparsity   Accuracy (downstream)        Perplexity Δ (×)   Speedup (inference)   KV‑cache reduction
12.5 %     92.3 % (vs. 94.1 % dense)    1.68×              +12 %                 –8 %
25 %       84.6 % (vs. 86.9 % dense)    2.42×              +22 %                 –15 %
37.5 %     78.1 % (vs. 81.2 % dense)    3.71×              +35 %                 –23 %

  • Across all ten models, SubFit consistently outperformed the four baselines on the aggregate perplexity‑accuracy trade‑off.
  • The advantage grew larger as compression became more aggressive (≥ 30 % sparsity), confirming the hypothesis that fine‑grained submodule removal better preserves critical information.
  • KV‑cache savings arise because a removed attention submodule no longer has to store keys and values for previously processed tokens, which directly lowers the memory footprint of long‑context generation.
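
To make the KV‑cache point concrete, the cache size can be estimated with simple arithmetic. The sketch below assumes a hypothetical configuration (32 attention blocks, 8 key/value heads of dimension 128, 16‑bit cache entries, an 8,192‑token context) that is not taken from the paper, and it assumes a bypass keeps no per‑token cache of its own.

```python
def kv_cache_bytes(n_attn_blocks, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector per
    token, per KV head, per attention block that is still present."""
    keys_and_values = 2
    return (keys_and_values * n_attn_blocks * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_value)

# Hypothetical dense model: 32 attention blocks.
dense = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=8192)

# Same model after replacing 4 attention blocks with bypasses
# (an arbitrary illustrative count).
compressed = kv_cache_bytes(32 - 4, n_kv_heads=8, head_dim=128, seq_len=8192)

saving = 1 - compressed / dense
print(f"dense: {dense / 2**30:.2f} GiB, saving: {saving:.1%}")
```

Under these assumptions the saving is proportional to the number of attention submodules removed; replacing feed‑forward submodules does not change the cache size at all.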

Practical Implications

  • Deployments on edge or low‑resource servers – SubFit can shrink a 7B‑parameter model to roughly 5B parameters while still delivering > 80 % of its original task performance, enabling cheaper inference on CPUs or smaller GPUs.
  • Dynamic model scaling – Because submodule removal is non‑contiguous, developers can tailor the sparsity pattern to specific hardware constraints (e.g., fitting a model into a given KV‑cache budget).
  • Compatibility with existing toolchains – The method works post‑training and only needs calibration data, so it can be slotted into CI pipelines after a model is released, without retraining from scratch.
  • Potential for mixed‑precision pipelines – The lightweight bypass modules are amenable to aggressive quantization (e.g., 4‑bit), further reducing latency and memory while keeping the heavy transformer blocks at higher precision.
  • Open‑source code – The provided repository includes scripts for automated submodule ranking, bypass training, and evaluation, lowering the barrier for teams to experiment on their own LLMs.
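
As a rough illustration of the KV‑cache budget example in the list above, the following sketch computes how many attention submodules would have to be replaced for the cache to fit a memory budget. The helper name min_attention_removals and the model configuration are hypothetical, and the calculation says nothing about which submodules to pick or whether accuracy remains acceptable; that would still depend on the ranking step.

```python
def min_attention_removals(n_attn_blocks, bytes_per_block, budget_bytes):
    """Smallest number of attention blocks to replace so that the
    KV cache of the remaining blocks fits within budget_bytes."""
    blocks_that_fit = budget_bytes // bytes_per_block
    keep = min(n_attn_blocks, blocks_that_fit)
    return n_attn_blocks - keep

MIB = 2**20

# Hypothetical per-block cache cost at an 8,192-token context:
# 2 (keys and values) * 8 KV heads * 128 dims * 8192 tokens * 2 bytes.
per_block = 2 * 8 * 128 * 8192 * 2

removals_800 = min_attention_removals(32, per_block, 800 * MIB)
removals_1024 = min_attention_removals(32, per_block, 1024 * MIB)
print(per_block // MIB, removals_800, removals_1024)
```

With these illustrative numbers each attention block costs 32 MiB of cache, so an 800 MiB budget requires replacing 7 of the 32 blocks, while a 1 GiB budget requires none.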

Limitations & Future Work

  • Calibration data dependency – While only a small dataset is required, the quality of the bypass training can be sensitive to the representativeness of that data; highly domain‑specific models may need careful calibration set selection.
  • Bypass overhead – Although lightweight, the added bypass modules introduce extra parameters and a small compute cost; the net speedup depends on the hardware’s ability to parallelize these small operations.
  • Scope of evaluation – Experiments focused on language modeling and a handful of downstream tasks; broader benchmarks (e.g., multi‑modal LLMs, code generation) remain to be tested.
  • Theoretical understanding – The paper provides empirical evidence for submodule redundancy but leaves a formal analysis of why certain submodules are more replaceable than others for future research.
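
The bypass‑overhead point above can be put in rough numbers. The sketch below counts parameters for a hypothetical transformer with hidden size 4096 and a gated feed‑forward block of inner size 11008, ignoring biases and normalization weights. The full‑rank linear bypass corresponds to the "single linear layer" case mentioned in the methodology; the low‑rank variant is an assumption added here for comparison, not something the paper is stated to use.

```python
def attention_params(d_model):
    """Q, K, V and output projections, assuming standard multi-head attention."""
    return 4 * d_model * d_model

def gated_ffn_params(d_model, d_ff):
    """Gate, up and down projections of a gated feed-forward block."""
    return 3 * d_model * d_ff

def linear_bypass_params(d_model):
    """One d_model x d_model weight matrix plus a bias vector."""
    return d_model * d_model + d_model

def low_rank_bypass_params(d_model, rank):
    """Two thin matrices (d_model x rank, rank x d_model) plus a bias."""
    return 2 * d_model * rank + d_model

d_model, d_ff = 4096, 11008  # hypothetical sizes

attn = attention_params(d_model)
ffn = gated_ffn_params(d_model, d_ff)
full = linear_bypass_params(d_model)
low = low_rank_bypass_params(d_model, rank=256)

print(f"attention: {attn:,}  ffn: {ffn:,}")
print(f"full-rank bypass: {full:,} ({full / attn:.1%} of attention)")
print(f"rank-256 bypass:  {low:,} ({low / attn:.1%} of attention)")
```

Under these assumptions a full‑rank linear bypass costs about a quarter of the attention block it replaces, so the net parameter reduction per replaced submodule is smaller than the nominal sparsity suggests.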

Overall, SubFit opens a promising avenue for more nuanced LLM compression, offering developers a practical tool to balance model size, speed, and performance in production environments.

Authors

  • Elia Cunegatti
  • Marcus Vukojevic
  • Erik Nielsen
  • Giovanni Iacca

Paper Information

  • arXiv ID: 2606.02559v1
  • Categories: cs.CL, cs.AI
  • Published: June 1, 2026