[Paper] QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
Source: arXiv - 2601.00679v1
Overview
The paper introduces QSLM, an automated quantization framework that compresses spike‑driven language models (SLMs) with minimal loss of accuracy. By jointly optimizing for performance and memory, QSLM makes it feasible to run such models on low‑power, resource‑constrained edge devices.
Key Contributions
- Automated, tiered quantization (global → block → module level) that adapts to the hierarchical structure of any pre‑trained SLM.
- Multi‑objective trade‑off function that balances inference latency, power draw, and memory usage while preserving task‑level accuracy.
- Sensitivity‑aware layer analysis that quickly identifies which parts of the network can be aggressively quantized and which need higher precision.
- Empirical validation on sentiment classification (SST‑2) and language generation (WikiText‑2), showing up to 86.5 % memory reduction and ≈20 % power savings at the cost of roughly a 2‑point accuracy drop and a 0.7‑point perplexity increase.
Methodology
- Architecture profiling – QSLM parses the SLM to build a hierarchy (layers → blocks → modules) and measures each component’s sensitivity to quantization on a lightweight calibration set (a sensitivity‑scoring sketch follows this list).
- Tiered search strategy (sketched in code after this list):
  - Global level: applies a coarse‑grained bit‑width (e.g., 8‑bit) across the whole model.
  - Block level: refines the bit‑width of individual transformer blocks based on their sensitivity scores.
  - Module level: restores critical sub‑modules (e.g., attention heads, feed‑forward networks) to higher precision where needed.
- Multi‑objective optimization – a weighted cost function evaluates candidate quantization schemes against user‑defined constraints (maximum memory, target latency, acceptable accuracy loss), and the optimizer selects the configuration that best satisfies all of them (a cost‑function sketch also follows this list).
- Post‑training quantization – the chosen scheme is applied without re‑training, keeping the deployment pipeline fast and lightweight.
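The profiling step above is described only at a high level; below is a minimal sensitivity‑scoring sketch, assuming a PyTorch model, a user‑supplied `evaluate` metric function, and a small `calibration_loader`. These names, and the uniform quantizer, are illustrative rather than taken from the paper.

```python
import copy
import torch

def quantize_tensor(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric integer quantization (illustrative stand-in for QSLM's quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    if scale == 0:
        return w.clone()
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def layer_sensitivity(model, layer_name, bits, evaluate, calibration_loader):
    """Score drop on the calibration set when only `layer_name` is quantized to `bits` bits.

    `evaluate` is assumed to return a higher-is-better metric (e.g., accuracy).
    """
    baseline = evaluate(model, calibration_loader)
    probe = copy.deepcopy(model)                     # leave the original model untouched
    layer = dict(probe.named_modules())[layer_name]
    with torch.no_grad():
        layer.weight.copy_(quantize_tensor(layer.weight, bits))
    return baseline - evaluate(probe, calibration_loader)   # larger drop = more sensitive
```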
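Under the same assumptions, and reusing the `layer_sensitivity` helper above, the global → block → module refinement could be organised as a coarse‑to‑fine pass. The candidate bit‑widths and the `threshold` value are placeholders, not settings reported in the paper.

```python
def tiered_bitwidths(model, blocks, evaluate, calibration_loader,
                     global_bits=8, low_bits=4, threshold=0.01):
    """Coarse-to-fine bit-width assignment: global default, refined per block, then per module.

    `blocks` maps each transformer block to the names of its sub-modules
    (e.g., attention and feed-forward layers); `threshold` is the tolerated
    metric drop on the calibration set.
    """
    config = {}
    for module_names in blocks.values():
        # Block level: try pushing the whole block below the global bit-width.
        block_drop = sum(
            layer_sensitivity(model, name, low_bits, evaluate, calibration_loader)
            for name in module_names
        )
        block_bits = low_bits if block_drop < threshold else global_bits
        for name in module_names:
            # Module level: keep sensitive sub-modules at the higher global precision.
            drop = layer_sensitivity(model, name, block_bits, evaluate, calibration_loader)
            config[name] = global_bits if drop >= threshold else block_bits
    return config
```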
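The weighted cost and the final post‑training step could look like the sketch below, reusing `quantize_tensor` and the imports from the first snippet; the weight values, constraint parameters, and the layout of the `stats` dictionary are assumptions made for illustration.

```python
def config_cost(stats, weights=(0.4, 0.3, 0.3),
                max_memory_mb=None, max_accuracy_drop=None):
    """Weighted cost of one candidate configuration.

    `stats` is assumed to hold measured or estimated latency (ms), power (mW),
    memory (MB), and accuracy drop for that configuration.
    """
    # Hard constraints: reject configurations that violate user-defined limits.
    if max_memory_mb is not None and stats["memory_mb"] > max_memory_mb:
        return float("inf")
    if max_accuracy_drop is not None and stats["acc_drop"] > max_accuracy_drop:
        return float("inf")
    w_lat, w_pow, w_mem = weights
    return (w_lat * stats["latency_ms"]
            + w_pow * stats["power_mw"]
            + w_mem * stats["memory_mb"])

def apply_config(model, config):
    """Post-training quantization: overwrite weights in place, with no re-training."""
    modules = dict(model.named_modules())
    with torch.no_grad():
        for name, bits in config.items():
            modules[name].weight.copy_(quantize_tensor(modules[name].weight, bits))
    return model
```

In this reading, the optimizer would score each configuration produced by the tiered search with `config_cost` and pass the cheapest feasible one to `apply_config`.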
Results & Findings
| Metric | Baseline (non‑quantized) | QSLM‑quantized | Change |
|---|---|---|---|
| Memory footprint | 100 % | 13.5 % | ‑86.5 % |
| Power consumption (inference) | 100 % | ≈80 % | ≈‑20 % |
| SST‑2 accuracy | 86.4 % | 84.4 % | ‑2.0 points |
| WikiText‑2 perplexity | 22.5 | 23.2 | +0.7 |
The results demonstrate that QSLM can dramatically shrink model size and energy use while keeping task performance within a few percentage points of the original model—well within typical tolerances for edge applications.
Practical Implications
- Edge AI deployment – Developers can now fit SLMs onto microcontrollers, wearables, or IoT gateways that previously lacked the RAM to host even a tiny LLM.
- Reduced cloud reliance – On‑device inference cuts latency and data‑privacy concerns, enabling real‑time language understanding (e.g., voice assistants, on‑device summarization).
- Fast design cycles – Because QSLM works post‑training, teams can quantize new SLM releases automatically, avoiding the manual, trial‑and‑error tuning that traditionally bottlenecks model compression pipelines.
- Energy‑aware scheduling – The framework’s power‑aware objective lets system integrators trade a small accuracy dip for measurable battery life extensions in battery‑operated products.
Limitations & Future Work
- Calibration data dependence – Sensitivity analysis relies on a representative dataset; mismatches can lead to sub‑optimal bit‑width choices for unseen inputs.
- Fixed quantization scheme – QSLM currently supports uniform integer quantization; exploring mixed‑precision or non‑uniform schemes could yield further gains.
- Scalability to massive LLMs – While effective on spike‑driven models, applying the same tiered search to full‑scale transformer LLMs may require additional heuristics to keep search time tractable.
The authors suggest extending QSLM to support dynamic runtime quantization and integrating hardware‑aware cost models for emerging neuromorphic accelerators.
Authors
- Rachmad Vidya Wicaksana Putra
- Pasindu Wickramasinghe
- Muhammad Shafique
Paper Information
- arXiv ID: 2601.00679v1
- Categories: cs.NE, cs.AI, cs.LG
- Published: January 2, 2026