[Paper] MMA: A Momentum Mamba Architecture for Human Activity Recognition with Inertial Sensors
Source: arXiv - 2511.21550v1
Overview
The paper presents Momentum Mamba (MMA), a new neural architecture that builds on the recent Mamba state‑space model (SSM) to tackle human activity recognition (HAR) from inertial sensor streams. By injecting a momentum term—essentially a second‑order dynamic—MMA stabilizes information flow over long sequences, delivering higher accuracy and faster convergence while keeping computational costs modest.
Key Contributions
- Momentum‑augmented SSM: Introduces a second‑order “momentum” component to the classic first‑order Mamba, improving long‑range memory and gradient stability.
- Complex Momentum Mamba: Extends the idea to the complex domain, enabling frequency‑selective scaling of memory for richer temporal representations.
- Comprehensive HAR evaluation: Benchmarks MMA on several public inertial‑sensor datasets (e.g., UCI HAR, PAMAP2, HHAR) and shows consistent gains over vanilla Mamba, CNN/RNN baselines, and Transformers.
- Efficiency‑focused design: Achieves the accuracy boost with only a modest increase in FLOPs and training time, preserving the linear‑time complexity of SSMs.
- Robustness analysis: Demonstrates improved resilience to sensor noise and domain shifts, a common pain point in real‑world wearable deployments.
Methodology
- Base Model – Mamba SSM:
  - Mamba treats a sequence as the output of a linear state‑space system with a diagonal transition matrix A, input and output projections B and C, and a skip (feedthrough) term D.
  - This yields O(N) time complexity (N = sequence length) and captures long‑range dependencies without the quadratic cost of self‑attention. (A minimal scan sketch appears after this list.)
- Adding Momentum:
  - The authors augment the state update equation with a velocity term, turning the first‑order recurrence h_t = A·h_{t−1} + … into a second‑order one (see the momentum sketch after this list):
    v_t = μ·v_{t−1} + (1 − μ)·(A·h_{t−1} + …)
    h_t = h_{t−1} + v_t
  - Here, μ is a learnable momentum coefficient (0 ≤ μ < 1). This mirrors physical momentum, smoothing rapid changes and preserving information over many steps.
- Complex Momentum Variant:
  - By allowing μ and the transition parameters to be complex numbers, the model can selectively amplify or dampen specific frequency bands, akin to a learnable filter bank (illustrated in the complex‑momentum sketch below).
- Training Pipeline:
  - Raw tri‑axial accelerometer and gyroscope streams are segmented into fixed‑length windows (e.g., 2 s at 50 Hz).
  - Standard data augmentations (jitter, scaling, rotation) are applied; see the preprocessing sketch after this list.
  - The model is trained with cross‑entropy loss, the Adam optimizer, and a cosine‑annealing learning‑rate schedule.
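To make the methodology concrete, the sketches below restate each step in NumPy. First, the first‑order diagonal scan underlying the base model: a minimal sketch only, with hypothetical names (`ssm_scan`, `A_diag`) and shapes of our choosing; the actual Mamba block additionally makes B, C, and the discretization step input‑dependent.

```python
# Minimal sketch of a first-order diagonal SSM scan (illustration only; the
# real Mamba block also makes B, C, and the step size input-dependent).
import numpy as np

def ssm_scan(x, A_diag, B, C, D):
    """h_t = A*h_{t-1} + B@x_t;  y_t = C@h_t + D@x_t, scanned over t."""
    h = np.zeros(A_diag.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A_diag * h + B @ x[t]      # O(1) state update per step -> O(N) total
        ys.append(C @ h + D @ x[t])    # readout plus skip term
    return np.stack(ys)                # (N, d_out)
```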
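Next, the momentum update, with the elided input term written as B·x_t for concreteness. In the paper μ is learnable; it is fixed here for illustration.

```python
# Sketch of the second-order momentum recurrence described above; mu is a
# learnable coefficient in the paper but fixed here for illustration.
import numpy as np

def momentum_ssm_scan(x, A_diag, B, mu=0.9):
    """v_t = mu*v_{t-1} + (1-mu)*(A*h_{t-1} + B@x_t);  h_t = h_{t-1} + v_t."""
    h = np.zeros(A_diag.shape[0])
    v = np.zeros_like(h)
    hs = []
    for t in range(x.shape[0]):
        drive = A_diag * h + B @ x[t]    # the usual first-order update
        v = mu * v + (1.0 - mu) * drive  # velocity smooths rapid state changes
        h = h + v                        # heavy-ball-style second-order step
        hs.append(h.copy())
    return np.stack(hs)                  # (N, d_state) hidden trajectory
```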
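Third, why a complex μ acts as a frequency‑selective filter: writing μ = r·e^{iθ}, the velocity recurrence behaves like a one‑pole resonator that decays at rate r while oscillating at angular frequency θ, so its impulse‑response spectrum peaks near θ. The scalar setup and the specific r, θ values here are our assumptions, not the paper's parameters.

```python
# Illustration of frequency selectivity under a complex momentum coefficient
# mu = r*exp(i*theta): the recurrence acts as a resonant one-pole filter.
import numpy as np

r, theta = 0.98, 0.3                # hypothetical magnitude and phase
mu = r * np.exp(1j * theta)

v, trace = 0.0 + 0.0j, []
for u in [1.0] + [0.0] * 199:       # unit impulse input, 200 steps
    v = mu * v + (1.0 - mu) * u     # complex momentum recurrence
    trace.append(v)

spectrum = np.abs(np.fft.fft(np.asarray(trace)))
# Peak FFT bin ~ theta/(2*pi)*200 ~ 9-10: the channel "remembers" that band.
print(np.argmax(spectrum[:100]))
```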
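Last, a preprocessing sketch matching the stated windowing (2 s at 50 Hz = 100 samples) with jitter and scaling augmentation. The helper names and augmentation magnitudes are assumptions, and rotation is omitted for brevity.

```python
# Hedged sketch of the preprocessing step: fixed 2 s / 50 Hz windows with
# jitter and scaling augmentation (rotation omitted; all names hypothetical).
import numpy as np

def make_windows(stream, win=100, hop=50):
    """Segment a (T, 6) accel+gyro stream into overlapping windows."""
    return np.stack([stream[i:i + win]
                     for i in range(0, len(stream) - win + 1, hop)])

def augment(windows, rng):
    """Additive jitter plus a random per-window, per-channel scale."""
    jitter = rng.normal(0.0, 0.01, size=windows.shape)
    scale = rng.uniform(0.9, 1.1,
                        size=(windows.shape[0], 1, windows.shape[2]))
    return (windows + jitter) * scale

rng = np.random.default_rng(0)
stream = rng.standard_normal((5000, 6))          # stand-in for a real recording
print(augment(make_windows(stream), rng).shape)  # (99, 100, 6)
```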
Results & Findings
| Dataset | Baseline (Transformer) | Vanilla Mamba | MMA (Momentum) | MMA‑Complex |
|---|---|---|---|---|
| UCI HAR | 94.2 % | 94.7 % | 95.6 % | 95.4 % |
| PAMAP2 | 92.1 % | 92.8 % | 94.0 % | 93.8 % |
| HHAR | 88.5 % | 89.1 % | 90.3 % | 90.1 % |
- Accuracy: MMA consistently outperforms both Transformers and vanilla Mamba by 0.8–1.5 % absolute.
- Convergence: Reaches peak validation accuracy ~30 % faster (fewer epochs) thanks to smoother gradients from momentum.
- Robustness: Under synthetic additive Gaussian noise at 10 dB SNR, MMA's accuracy drop is ~0.4 %, versus ~1.2 % for Transformers (noise‑injection sketch below).
- Efficiency: Training FLOPs increase by ~12 % relative to vanilla Mamba, while inference latency remains linear and well under 5 ms on a mid‑range mobile CPU.
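As a reference point for the robustness numbers, below is the standard way to inject white Gaussian noise at a target SNR; we assume the paper's probe follows this construction, and the function name is hypothetical.

```python
# Assumed robustness probe: additive Gaussian noise scaled to a target SNR.
import numpy as np

def add_noise_at_snr(x, snr_db=10.0, rng=None):
    """Return x plus white Gaussian noise at the requested SNR (in dB)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
```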
Practical Implications
- Edge‑friendly HAR: The linear‑time, low‑memory footprint makes MMA a strong candidate for on‑device activity classification in wearables, smartphones, and IoT gateways.
- Faster Model Iteration: Faster convergence reduces cloud‑training costs and shortens the time‑to‑market for new activity‑based features.
- Noise‑Resilient Deployments: Improved robustness means fewer false detections in real‑world scenarios where sensor placement and signal quality vary.
- Transferable Architecture: Because momentum‑augmented SSMs are generic sequence models, developers can reuse MMA for other time‑series tasks—e.g., predictive maintenance, speech keyword spotting, or financial tick‑data analysis—without redesigning the core network.
- Simplified Pipeline: MMA eliminates the need for heavy attention‑based layers or deep RNN stacks, streamlining model‑serving stacks and reducing dependency on specialized hardware accelerators.
Limitations & Future Work
- Second‑Order Overhead: While modest, the added velocity state doubles the hidden‑state size, which may be noticeable on ultra‑low‑power microcontrollers.
- Complex Momentum Stability: Training with complex‑valued parameters requires careful initialization; the authors note occasional divergence on very long sequences (>10 s).
- Domain Generalization: Experiments focus on benchmark datasets; real‑world cross‑subject or cross‑device generalization still needs thorough validation.
- Future Directions: The authors suggest exploring adaptive momentum schedules, hybridizing MMA with lightweight attention for multimodal inputs, and extending the framework to unsupervised pre‑training for sensor data.
Authors
- Thai‑Khanh Nguyen
- Uyen Vo
- Tan M. Nguyen
- Thieu N. Vo
- Trung‑Hieu Le
- Cuong Pham
Paper Information
- arXiv ID: 2511.21550v1
- Categories: cs.HC, cs.LG
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21550v1