[Paper] From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Published: June 1, 2026 at 01:52 PM EDT

Source: arXiv - 2606.02559v1

Overview

The paper “From Layers to Submodules: Rethinking Granularity in Replacement‑Based LLM Compression” challenges the prevailing assumption that large language models (LLMs) must be pruned or replaced at whole‑layer granularity. By moving the compression granularity down to individual submodules (the attention and feed‑forward blocks inside each transformer layer), the authors show that a compressed model can retain more of its predictive power while still gaining speed and memory savings.

Key Contributions

SubFit framework – a novel post‑training compression pipeline that selects non‑contiguous attention and feed‑forward submodules for replacement and equips each with its own lightweight fitted residual bypass.
Granularity shift – demonstrates that redundancy in pretrained transformers is distributed irregularly across submodules, not confined to contiguous layers, and that different submodule types benefit from tailored replacement strategies.
Comprehensive evaluation – experiments on ten LLMs (five base models and five instruction‑tuned variants) across five sparsity levels (12.5 %–37.5 %) and against four state‑of‑the‑art replacement‑based baselines.
Strong empirical gains – at 25 % sparsity SubFit retains 84.6 % of dense downstream accuracy with only 2.42× perplexity degradation, outperforming the best baseline (81.6 % accuracy, 4.34× perplexity).
Practical speedups – measurable inference latency reductions and KV‑cache memory savings, making the method attractive for real‑world deployment.
Open‑source release – code and calibration scripts are publicly available, enabling reproducibility and easy integration into existing pipelines.
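The two headline metrics can be written out explicitly. The definitions below are an assumed reading of "retains 84.6 % of dense downstream accuracy" and "2.42× perplexity degradation" as simple ratios against the dense model; the summary itself does not spell out the formulas, and the symbol names are illustrative.

```latex
\text{retained accuracy} = \frac{\mathrm{Acc}_{\text{compressed}}}{\mathrm{Acc}_{\text{dense}}},
\qquad
\text{perplexity degradation} = \frac{\mathrm{PPL}_{\text{compressed}}}{\mathrm{PPL}_{\text{dense}}}
```

Under this reading, a retained accuracy closer to 100 % and a perplexity degradation closer to 1× both indicate a compressed model that behaves more like the dense one.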

Methodology

Calibration‑only post‑training – SubFit does not require any further pre‑training; it only needs a modest calibration dataset (e.g., a few thousand unlabeled tokens).
Submodule selection – each transformer layer is broken into its constituent attention block and feed‑forward block. A sparsity budget is allocated, and a scoring function (based on activation statistics and sensitivity analysis) ranks submodules for removal. Importantly, the selected submodules can be scattered throughout the network rather than forming a contiguous block.
Fitted residual bypass – for every removed submodule, a tiny neural “bypass” module is trained to predict the residual output that the original submodule would have produced. This bypass is lightweight (often a single linear layer or a shallow MLP) and is trained on the calibration data to minimize the reconstruction error.
Integration – the bypass is inserted in place of the original submodule, preserving the model’s overall architecture while reducing the number of heavy transformer components.
Evaluation – the compressed model is tested on standard language modeling perplexity benchmarks and downstream task accuracy (e.g., classification, QA) to assess the trade‑off between compression and performance.
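The selection and bypass‑fitting steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the importance scores are taken as given (the paper's scoring function is not reproduced here), the bypass is assumed to be a single linear map fitted by ridge regression, and the names `select_submodules`, `fit_linear_bypass`, and `ridge` are hypothetical.

```python
import numpy as np

def select_submodules(scores, sparsity):
    """Pick the lowest-scoring submodules until the sparsity budget is met.

    scores: dict mapping (layer_index, kind) -> importance score,
            where kind is "attn" or "ffn" (scores assumed precomputed).
    sparsity: fraction of all submodules to replace.
    """
    n_remove = int(round(sparsity * len(scores)))
    ranked = sorted(scores, key=scores.get)  # least important first
    return sorted(ranked[:n_remove])         # may be scattered across layers

def fit_linear_bypass(x, y, ridge=1e-4):
    """Fit W, b so that x @ W + b approximates y on calibration data.

    x: (n_tokens, d) inputs seen by the removed submodule.
    y: (n_tokens, d) residual-branch outputs the submodule produced.
    """
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])  # extra column for the bias
    gram = x_aug.T @ x_aug + ridge * np.eye(x_aug.shape[1])
    params = np.linalg.solve(gram, x_aug.T @ y)
    return params[:-1], params[-1]

# Hypothetical scores for a 4-layer model (8 submodules in total).
scores = {
    (0, "attn"): 0.9, (0, "ffn"): 0.2,
    (1, "attn"): 0.1, (1, "ffn"): 0.8,
    (2, "attn"): 0.7, (2, "ffn"): 0.6,
    (3, "attn"): 0.5, (3, "ffn"): 0.3,
}
chosen = select_submodules(scores, sparsity=0.25)
print(chosen)  # [(0, 'ffn'), (1, 'attn')]

# Synthetic calibration activations standing in for a real submodule.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))
y = x @ (0.1 * rng.normal(size=(16, 16))) + 0.01 * rng.normal(size=(512, 16))
w, b = fit_linear_bypass(x, y)
rel_err = np.linalg.norm(x @ w + b - y) / np.linalg.norm(y)
```

In the toy run the two selected submodules sit in different layers and are of different types, which is the non‑contiguous pattern the method allows. A shallow MLP bypass would need gradient‑based fitting instead of the closed‑form solve used here.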

Results & Findings

Sparsity	Downstream accuracy	Perplexity degradation	Inference speedup	KV‑cache memory change
12.5 %	92.3 % (vs. 94.1 % dense)	1.68×	+12 %	–8 %
25 %	84.6 % (vs. 86.9 % dense)	2.42×	+22 %	–15 %
37.5 %	78.1 % (vs. 81.2 % dense)	3.71×	+35 %	–23 %

Across all ten models, SubFit consistently outperformed the four baselines on the aggregate perplexity‑accuracy trade‑off.
The advantage grew larger as compression became more aggressive (≥ 30 % sparsity), which is consistent with the hypothesis that fine‑grained submodule removal better preserves critical information.
KV‑cache savings come from the removed attention submodules: a replaced attention block no longer stores keys and values for previously processed tokens, which directly lowers the memory footprint of long‑context generation. Removed feed‑forward submodules do not affect the KV cache.
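The KV‑cache effect can be checked with back‑of‑the‑envelope arithmetic. The sketch below uses the standard cache‑size formula for a decoder‑only transformer and assumes that a replaced attention block needs no cache at all; the configuration values (32 attention blocks, 32 KV heads, head dimension 128, 16‑bit values) are illustrative and not taken from the paper.

```python
def kv_cache_bytes(n_attn_blocks, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for the retained attention blocks."""
    # Factor 2: one cached tensor for keys and one for values.
    return (2 * n_attn_blocks * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Illustrative dense configuration at a 4096-token context.
dense = kv_cache_bytes(n_attn_blocks=32, n_kv_heads=32, head_dim=128, seq_len=4096)

# Same model after replacing 4 of the 32 attention blocks with bypasses.
compressed = kv_cache_bytes(n_attn_blocks=28, n_kv_heads=32, head_dim=128, seq_len=4096)

saving = 1 - compressed / dense
print(dense, compressed, saving)  # 2147483648 1879048192 0.125
```

Because the cache scales linearly with the number of retained attention blocks, the saving depends only on how many attention submodules are replaced, not on how many feed‑forward submodules are.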

Practical Implications

Deployments on edge or low‑resource servers – SubFit can shrink a 7B‑parameter model to roughly 5B parameters while still delivering > 80 % of its original task performance, enabling cheaper inference on CPUs or smaller GPUs.
Dynamic model scaling – Because submodule removal is non‑contiguous, developers can adjust the sparsity pattern to match specific hardware constraints (e.g., fitting a model into a given KV‑cache budget).
Compatibility with existing toolchains – The method works post‑training and only needs calibration data, so it can be slotted into CI pipelines after a model is released, without retraining from scratch.
Potential for mixed‑precision pipelines – The lightweight bypass modules are amenable to aggressive quantization (e.g., 4‑bit), further reducing latency and memory while keeping the heavy transformer blocks at higher precision.
Open‑source code – The provided repository includes scripts for automated submodule ranking, bypass training, and evaluation, lowering the barrier for teams to experiment on their own LLMs.

Limitations & Future Work

Calibration data dependency – While only a small dataset is required, the quality of the bypass training can be sensitive to the representativeness of that data; highly domain‑specific models may need careful calibration set selection.
Bypass overhead – Although lightweight, the added bypass modules introduce extra parameters and a small compute cost; the net speedup depends on the hardware’s ability to parallelize these small operations.
Scope of evaluation – Experiments focused on language modeling and a handful of downstream tasks; broader benchmarks (e.g., multi‑modal LLMs, code generation) remain to be tested.
Theoretical understanding – The paper provides empirical evidence for submodule redundancy but leaves a formal analysis of why certain submodules are more replaceable than others for future research.

Overall, SubFit opens a promising avenue for more nuanced LLM compression, offering developers a practical tool to balance model size, speed, and performance in production environments.

Authors

Elia Cunegatti
Marcus Vukojevic
Erik Nielsen
Giovanni Iacca

Paper Information

arXiv ID: 2606.02559v1
Categories: cs.CL, cs.AI
Published: June 1, 2026
