[Paper] QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
Source: arXiv - 2601.00679v1
Overview
The paper introduces QSLM, an automated quantization framework that compresses spike‑driven language models (SLMs) with minimal loss of accuracy. By jointly optimizing for performance and memory, QSLM makes it feasible to run such models on low‑power, resource‑constrained edge devices.
Key Contributions
- Automated, tiered quantization (global → block → module level) that adapts to the hierarchical structure of any pre‑trained SLM.
- Multi‑objective trade‑off function that balances inference latency, power draw, and memory usage while preserving task‑level accuracy.
- Sensitivity‑aware layer analysis that quickly identifies which parts of the network can be aggressively quantized and which need higher precision.
- Empirical validation on sentiment classification (SST‑2) and language generation (WikiText‑2), showing up to 86.5 % memory reduction and ≈20 % power savings at the cost of roughly a 2‑point accuracy drop and a 0.7‑point perplexity increase.
Methodology
- Architecture profiling – QSLM parses the SLM to build a hierarchy (layers → blocks → modules) and measures each component’s sensitivity to quantization on a lightweight calibration set (a sensitivity‑scoring sketch follows this list).
- Tiered search strategy (sketched in code after this list):
  - Global level: applies a coarse‑grained bit‑width (e.g., 8‑bit) across the whole model.
  - Block level: refines the bit‑width of individual transformer blocks based on their sensitivity scores.
  - Module level: restores critical sub‑modules (e.g., attention heads, feed‑forward networks) to higher precision where needed.
- Multi‑objective optimization – a weighted cost function evaluates candidate quantization schemes against user‑defined constraints (maximum memory, target latency, acceptable accuracy loss), and the optimizer selects the configuration that best satisfies all of them (a cost‑function sketch also follows this list).
- Post‑training quantization – the chosen scheme is applied without re‑training, keeping the deployment pipeline fast and lightweight.
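The profiling step above is described only at a high level; below is a minimal sensitivity‑scoring sketch, assuming a PyTorch model, a user‑supplied `evaluate` metric function, and a small `calibration_loader`. These names, and the uniform quantizer, are illustrative rather than taken from the paper.

```python
import copy
import torch

def quantize_tensor(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric integer quantization (illustrative stand-in for QSLM's quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    if scale == 0:
        return w.clone()
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def layer_sensitivity(model, layer_name, bits, evaluate, calibration_loader):
    """Score drop on the calibration set when only `layer_name` is quantized to `bits` bits.

    `evaluate` is assumed to return a higher-is-better metric (e.g., accuracy).
    """
    baseline = evaluate(model, calibration_loader)
    probe = copy.deepcopy(model)                     # leave the original model untouched
    layer = dict(probe.named_modules())[layer_name]
    with torch.no_grad():
        layer.weight.copy_(quantize_tensor(layer.weight, bits))
    return baseline - evaluate(probe, calibration_loader)   # larger drop = more sensitive
```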
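Under the same assumptions, and reusing the `layer_sensitivity` helper above, the global → block → module refinement could be organised as a coarse‑to‑fine pass. The candidate bit‑widths and the `threshold` value are placeholders, not settings reported in the paper.

```python
def tiered_bitwidths(model, blocks, evaluate, calibration_loader,
                     global_bits=8, low_bits=4, threshold=0.01):
    """Coarse-to-fine bit-width assignment: global default, refined per block, then per module.

    `blocks` maps each transformer block to the names of its sub-modules
    (e.g., attention and feed-forward layers); `threshold` is the tolerated
    metric drop on the calibration set.
    """
    config = {}
    for module_names in blocks.values():
        # Block level: try pushing the whole block below the global bit-width.
        block_drop = sum(
            layer_sensitivity(model, name, low_bits, evaluate, calibration_loader)
            for name in module_names
        )
        block_bits = low_bits if block_drop < threshold else global_bits
        for name in module_names:
            # Module level: keep sensitive sub-modules at the higher global precision.
            drop = layer_sensitivity(model, name, block_bits, evaluate, calibration_loader)
            config[name] = global_bits if drop >= threshold else block_bits
    return config
```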
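The weighted cost and the final post‑training step could look like the sketch below, reusing `quantize_tensor` and the imports from the first snippet; the weight values, constraint parameters, and the layout of the `stats` dictionary are assumptions made for illustration.

```python
def config_cost(stats, weights=(0.4, 0.3, 0.3),
                max_memory_mb=None, max_accuracy_drop=None):
    """Weighted cost of one candidate configuration.

    `stats` is assumed to hold measured or estimated latency (ms), power (mW),
    memory (MB), and accuracy drop for that configuration.
    """
    # Hard constraints: reject configurations that violate user-defined limits.
    if max_memory_mb is not None and stats["memory_mb"] > max_memory_mb:
        return float("inf")
    if max_accuracy_drop is not None and stats["acc_drop"] > max_accuracy_drop:
        return float("inf")
    w_lat, w_pow, w_mem = weights
    return (w_lat * stats["latency_ms"]
            + w_pow * stats["power_mw"]
            + w_mem * stats["memory_mb"])

def apply_config(model, config):
    """Post-training quantization: overwrite weights in place, with no re-training."""
    modules = dict(model.named_modules())
    with torch.no_grad():
        for name, bits in config.items():
            modules[name].weight.copy_(quantize_tensor(modules[name].weight, bits))
    return model
```

In this reading, the optimizer would score each configuration produced by the tiered search with `config_cost` and pass the cheapest feasible one to `apply_config`.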
Results & Findings
| Metric | Baseline (non‑quantized) | QSLM‑quantized | Change |
|---|---|---|---|
| Memory footprint | 100 % | 13.5 % | ‑86.5 % |
| Power consumption (inference) | 100 % | ≈80 % | ≈‑20 % |
| SST‑2 accuracy | 86.4 % | 84.4 % | ‑2.0 points |
| WikiText‑2 perplexity | 22.5 | 23.2 | +0.7 |
The results demonstrate that QSLM can dramatically shrink model size and energy use while keeping task performance within a few percentage points of the original model—well within typical tolerances for edge applications.
Practical Implications
- Edge AI deployment – Developers can now fit SLMs onto microcontrollers, wearables, or IoT gateways that previously lacked the RAM to host even a tiny LLM.
- Reduced cloud reliance – On‑device inference cuts latency and data‑privacy concerns, enabling real‑time language understanding (e.g., voice assistants, on‑device summarization).
- Fast design cycles – Because QSLM works post‑training, teams can quantize new SLM releases automatically, avoiding the manual, trial‑and‑error tuning that traditionally bottlenecks model compression pipelines.
- Energy‑aware scheduling – The framework’s power‑aware objective lets system integrators trade a small accuracy dip for measurable battery life extensions in battery‑operated products.
Limitations & Future Work
- Calibration data dependence – Sensitivity analysis relies on a representative dataset; mismatches can lead to sub‑optimal bit‑width choices for unseen inputs.
- Fixed quantization scheme – QSLM currently supports uniform integer quantization; exploring mixed‑precision or non‑uniform schemes could yield further gains.
- Scalability to massive LLMs – While effective on spike‑driven models, applying the same tiered search to full‑scale transformer LLMs may require additional heuristics to keep search time tractable.
The authors suggest extending QSLM to support dynamic runtime quantization and integrating hardware‑aware cost models for emerging neuromorphic accelerators.
Authors
- Rachmad Vidya Wicaksana Putra
- Pasindu Wickramasinghe
- Muhammad Shafique
Paper Information
- arXiv ID: 2601.00679v1
- Categories: cs.NE, cs.AI, cs.LG
- Published: January 2, 2026