[Paper] QL-LSTM: A Parameter-Efficient LSTM for Stable Long-Sequence Modeling
Source: arXiv - 2512.06582v1
Overview
The paper introduces QL‑LSTM, a re‑engineered Long Short‑Term Memory network that slashes the number of trainable parameters by almost half while preserving the full expressive power of the classic gating mechanisms. By tackling two long‑standing pain points—parameter redundancy and fading memory over very long sequences—QL‑LSTM promises a leaner, more stable recurrent model for real‑world NLP and time‑series tasks.
Key Contributions
- Parameter‑Shared Unified Gating (PSUG): Replaces the four separate gate weight matrices (input, forget, output, candidate) with a single shared matrix, cutting LSTM parameters by ~48 % without sacrificing gating flexibility.
- Hierarchical Gated Recurrence with Additive Skip Connections (HGR‑ASC): Introduces a multiplication‑free skip pathway that carries raw hidden states forward, mitigating forget‑gate decay and improving long‑range information flow.
- Empirical validation on extended‑length IMDB sentiment classification: Shows competitive accuracy to standard LSTM/GRU/BiLSTM baselines despite the reduced parameter budget.
- Analysis of per‑step computational efficiency: Demonstrates that PSUG and HGR‑ASC are cheaper per time step, laying groundwork for future speed‑up optimizations.
Methodology
- Unified Gating Layer – Instead of learning distinct weight matrices for each gate, QL‑LSTM learns a single matrix $W$ that is applied to the concatenated input‑hidden vector. The resulting vector is split and passed through the usual sigmoid/tanh activations to produce the four gate signals. This sharing forces the model to reuse representations across gates, dramatically reducing the parameter count (a code sketch of the full cell follows this list).
- Additive Skip Path – Alongside the standard recurrent update (which multiplies the previous hidden state by the forget gate), QL‑LSTM adds an unmodulated copy of the previous hidden state to the new candidate. The update equation becomes $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot \tilde{h}_t + \alpha \, h_{t-1}$, where $\alpha$ is a small learned scalar (or fixed constant). This “skip” term bypasses the forget gate, preserving information that would otherwise be attenuated over many steps.
- Training & Evaluation – The authors train the model on the IMDB movie‑review dataset, artificially extending document lengths to stress long‑range dependencies. Hyper‑parameters (hidden size, learning rate, dropout) are kept comparable across all baselines to ensure a fair comparison.
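The cell below is a minimal PyTorch‑style sketch of the mechanisms described above, not the authors' implementation. The class name QLLSTMCell, the retention of a conventional cell state $c_t$, the initial value of $\alpha$, and the specific sharing scheme (one projection reused for all four gates, distinguished only by per‑gate biases and activations) are all assumptions: this summary does not spell out exactly how PSUG splits or shares $W$ to reach the ~48 % reduction, nor how $\tilde{h}_t$ relates to the cell state, so treat this as one plausible reading rather than the paper's exact construction.

```python
import torch
import torch.nn as nn

class QLLSTMCell(nn.Module):
    """Illustrative QL-LSTM-style cell (a sketch, not the authors' reference code).

    PSUG (assumed reading): a single projection of [x_t; h_{t-1}] is reused for
    all four gate signals, which differ only via per-gate biases and the usual
    sigmoid/tanh activations.
    HGR-ASC: an additive, multiplication-free skip term alpha * h_{t-1} is added
    to the gated hidden update, bypassing the forget gate.
    """

    def __init__(self, input_size: int, hidden_size: int, alpha: float = 0.1):
        super().__init__()
        self.hidden_size = hidden_size
        # One shared weight matrix instead of four per-gate matrices.
        self.shared = nn.Linear(input_size + hidden_size, hidden_size, bias=False)
        # Per-gate biases keep the gate signals from collapsing onto one another.
        self.bias = nn.Parameter(torch.zeros(4, hidden_size))
        # Learned scalar for the additive skip path (could also be a fixed constant).
        self.alpha = nn.Parameter(torch.tensor(alpha))

    def forward(self, x, state):
        h_prev, c_prev = state
        z = self.shared(torch.cat([x, h_prev], dim=-1))  # shared pre-activation
        i = torch.sigmoid(z + self.bias[0])              # input gate
        f = torch.sigmoid(z + self.bias[1])              # forget gate
        o = torch.sigmoid(z + self.bias[2])              # output gate
        g = torch.tanh(z + self.bias[3])                 # candidate
        c = f * c_prev + i * g                           # standard cell-state update
        h_tilde = o * torch.tanh(c)                      # candidate hidden state (assumed form)
        # h_t = f ⊙ h_{t-1} + (1 - f) ⊙ h̃_t + α · h_{t-1}: the last term bypasses the forget gate.
        h = f * h_prev + (1 - f) * h_tilde + self.alpha * h_prev
        return h, c
```

The additive term `self.alpha * h_prev` implements the multiplication‑free skip path from the update equation above, while the single `nn.Linear` stands in for the four per‑gate weight matrices of a standard LSTM cell.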
Results & Findings
| Model | Params (M) | Test Accuracy | Relative Params ↓ |
|---|---|---|---|
| Standard LSTM | 2.1 | 88.3 % | – |
| GRU | 1.9 | 87.9 % | – |
| BiLSTM | 4.2 | 89.0 % | – |
| QL‑LSTM | 1.1 | 88.1 % | ≈48 % |
- Accuracy: QL‑LSTM matches or slightly trails the best baseline (BiLSTM) while using roughly half the parameters of a vanilla LSTM (a ≈48 % reduction).
- Memory retention: Ablation studies show that the additive skip connection reduces the degradation of the forget gate’s influence, leading to higher hidden‑state similarity across distant time steps.
- Compute per step: The unified gating reduces matrix‑multiplication count, and the skip path eliminates one multiplication, yielding a modest per‑step FLOP reduction. However, wall‑clock speed gains were not observed without low‑level kernel optimizations.
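As a quick, self‑contained check of the “Relative Params ↓” column, the rounded parameter counts from the table above reproduce the reported figure:

```python
# Rounded parameter counts from the results table (in millions).
lstm_params, ql_lstm_params = 2.1, 1.1
reduction = 1 - ql_lstm_params / lstm_params
print(f"QL-LSTM parameter reduction vs. standard LSTM ≈ {reduction:.0%}")  # ≈ 48%
```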
Practical Implications
- Deployments on edge devices: Halving the parameter footprint translates directly into smaller model binaries and lower RAM usage—critical for mobile or IoT applications that still need recurrent modeling (e.g., on‑device speech recognition, sensor‑fusion).
- Faster training cycles: Fewer parameters mean quicker gradient updates and less GPU memory pressure, allowing larger batch sizes or longer sequences during experimentation.
- Improved long‑sequence handling: The additive skip connection can be a drop‑in replacement for standard LSTM cells in any pipeline that suffers from vanishing memory (e.g., document‑level sentiment, legal‑text analysis, or financial time‑series).
- Compatibility: Because QL‑LSTM retains the classic LSTM interface (same input/output signatures), existing codebases can swap in the new cell with minimal refactoring.
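A hypothetical drop‑in usage, assuming the QLLSTMCell sketch from the Methodology section above: because that sketch keeps the familiar (input, (h, c)) → (h, c) calling convention of torch.nn.LSTMCell, an existing unrolled loop needs no other changes (the batch, sequence, and layer sizes below are illustrative).

```python
import torch

cell = QLLSTMCell(input_size=128, hidden_size=256)  # sketch from above; sizes are illustrative
x = torch.randn(32, 50, 128)                        # (batch, time, features)
h = torch.zeros(32, 256)
c = torch.zeros(32, 256)
for t in range(x.size(1)):                          # same unrolled loop as with nn.LSTMCell
    h, c = cell(x[:, t, :], (h, c))
logits = torch.nn.Linear(256, 2)(h)                 # e.g., a binary sentiment head
```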
Limitations & Future Work
- Sequential bottleneck remains: Despite per‑step efficiency gains, QL‑LSTM still inherits the inherently sequential execution of RNNs, so raw inference latency does not improve without custom CUDA kernels or hardware‑level parallelism.
- Evaluation scope: The study focuses on a single NLP benchmark (IMDB) with artificially lengthened inputs; broader testing on speech, video, or multivariate sensor streams is needed to confirm generality.
- Hyper‑parameter sensitivity: The scalar (\alpha) governing the skip connection may require careful tuning for different domains, and the paper does not explore adaptive schemes.
- Future directions: The authors suggest integrating QL‑LSTM into transformer‑style hybrid models, exploring mixed‑precision kernels, and extending the unified gating concept to other gated architectures (e.g., GRU, SimpleRNN).
Bottom line: QL‑LSTM demonstrates that we can retain the expressive gating dynamics of LSTMs while dramatically trimming the parameter budget and bolstering long‑range memory. For developers building resource‑constrained, sequence‑heavy applications, it offers a pragmatic upgrade path—provided the underlying execution engine can exploit its per‑step efficiencies.
Authors
- Isaac Kofi Nti
Paper Information
- arXiv ID: 2512.06582v1
- Categories: cs.LG, cs.AI, cs.NE
- Published: December 6, 2025