[Paper] MAR: Efficient Large Language Models via Module-aware Architecture Refinement

Published: January 29, 2026 at 05:21 AM EST

Source: arXiv - 2601.21503v1

Overview

Large Language Models (LLMs) have become the backbone of many AI products, but their quadratic‑time attention and dense feed‑forward networks (FFNs) make inference expensive in both compute and energy. The paper “MAR: Efficient Large Language Models via Module‑aware Architecture Refinement” introduces a two‑stage framework that swaps out the costly parts of an LLM with more efficient alternatives—State Space Models (SSMs) for sequence handling and sparsified activations for FFNs—while preserving (or even improving) performance.

Key Contributions

  • Module‑aware Architecture Refinement (MAR): a systematic pipeline that replaces quadratic attention with linear‑time SSMs and sparsifies FFN activations without hand‑tuning each layer.
  • Adaptive Ternary Multi‑step Neuron (ATMN): a novel spiking‑neuron design that bridges the temporal mismatch between SSMs and Spiking Neural Networks (SNNs), enabling low‑information‑density signals to be processed efficiently.
  • Spike‑aware Bidirectional Distillation Strategy (SBDS): a training recipe that jointly distills knowledge from a dense teacher model to both the SSM and SNN modules, ensuring the refined architecture recovers the original accuracy.
  • Comprehensive energy‑aware evaluation: the authors measure actual inference energy on hardware, showing up to 45 % reduction compared to the dense baseline while matching or surpassing its BLEU/GLUE scores.
  • Scalable to large model sizes: MAR outperforms other “efficient” LLM variants (e.g., LoRA‑pruned, quantized, or sparsified models) even when those alternatives have comparable or larger parameter counts.

Methodology

Two‑stage refinement

  1. Stage 1 – Attention replacement: Replace each Transformer self‑attention block with a linear‑time SSM (e.g., HiPPO‑based). SSMs process the token sequence in O(N) time, eliminating the N² attention matrix.
  2. Stage 2 – FFN sparsification: Convert the dense FFN into a sparsified version using activation‑based pruning. Only the top‑k activations per token are kept, turning the dense matrix multiply into a much cheaper sparse operation.
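The two stages above can be sketched in a few lines. The numpy toy below is illustrative only — the function names, shapes, and the diagonal SSM parameterization are assumptions, not the paper's actual implementation — but it shows why the SSM recurrence is O(N) in sequence length and how per‑token top‑k sparsification zeroes most FFN activations:

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Stage 1 sketch: a diagonal linear state-space recurrence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c . h_t

    One pass over the sequence: O(N) time, no N x N attention matrix.
    """
    h = np.zeros_like(a)
    ys = []
    for x_t in x:                      # x: (N,) scalar feature per token
        h = a * h + b * x_t            # elementwise hidden-state update
        ys.append(float(c @ h))        # scalar readout per token
    return np.array(ys)

def topk_sparsify(acts, k):
    """Stage 2 sketch: keep only the k largest-magnitude activations per
    token (row), zeroing the rest so the following matmul can run sparse."""
    out = np.zeros_like(acts)
    idx = np.argpartition(-np.abs(acts), k - 1, axis=-1)[..., :k]
    np.put_along_axis(out, idx, np.take_along_axis(acts, idx, axis=-1), axis=-1)
    return out
```

Because the scan touches each token exactly once and carries a fixed‑size state, its cost grows linearly with sequence length, which is the whole point of swapping out the N² attention matrix.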

Spiking‑aware integration

  • ATMN converts the continuous SSM outputs into ternary spikes (‑1, 0, +1) over multiple timesteps, preserving information while allowing the downstream SNN to operate with minimal energy.
  • SBDS performs bidirectional knowledge distillation: the dense teacher guides the SSM’s hidden states, while the SNN’s spiking dynamics are regularized to mimic the teacher’s token‑level logits. This joint training restores any performance loss caused by the architectural swaps.
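As a rough illustration of the ternary‑spike idea: the greedy residual encoder below (its `step` size and decoding scheme are assumptions; the paper's ATMN dynamics are not reproduced here) emits {‑1, 0, +1} spikes over several timesteps whose running sum approximates the continuous SSM output:

```python
import numpy as np

def ternary_encode(x, n_steps, step):
    """Emit ternary spikes in {-1, 0, +1} over n_steps timesteps, each
    spike carrying `step` units, so the cumulative sum approximates x."""
    spikes = np.zeros((n_steps,) + x.shape)
    residual = x.astype(float).copy()
    for t in range(n_steps):
        s = np.where(residual > step / 2, 1.0,
            np.where(residual < -step / 2, -1.0, 0.0))
        spikes[t] = s
        residual -= s * step           # what remains to be encoded
    return spikes

def ternary_decode(spikes, step):
    """Reconstruct the continuous value from the spike train."""
    return spikes.sum(axis=0) * step
```

The downstream SNN then only ever sees sparse ternary events rather than dense floating‑point activations, which is where the energy savings of spiking hardware come from.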

Training pipeline

  • Pre‑train a standard dense LLM (the teacher).
  • Freeze the teacher and train the MAR‑refined student with SBDS, alternating between SSM‑only and SSM+SNN forward passes.
  • Fine‑tune the sparsified FFN using a magnitude‑based mask that is updated every few thousand steps.

The entire process is automated via a “module‑aware” scheduler that decides where to apply SSMs vs. traditional attention based on layer depth and token‑wise information density.
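The periodic magnitude‑based mask refresh from the pipeline above can be sketched as follows (the sparsity ratio and the exact pruning criterion are assumptions; the paper only states that the mask is updated every few thousand steps):

```python
import numpy as np

def update_magnitude_mask(weights, sparsity):
    """Recompute a boolean keep-mask over FFN weights: zero out the
    smallest-magnitude fraction `sparsity`, keep the rest trainable."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    thresh = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return np.abs(weights) > thresh
```

In training, the mask would be recomputed on this schedule and applied multiplicatively to the FFN weights between distillation steps, so pruned connections can be revived if their magnitudes grow back.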

Results & Findings

| Model (Params) | Metric (GLUE Avg.) | Energy per Token (mJ) | Speedup vs. Dense |
| --- | --- | --- | --- |
| Dense Baseline (7B) | 84.2 | 1.00 (baseline) | 1.0× |
| MAR‑7B (SSM + Sparse FFN) | 84.0 | 0.55 | 1.8× |
| MAR‑13B (SSM + Sparse FFN + ATMN) | 84.5 | 0.48 | 2.1× |
| LoRA‑pruned‑7B | 82.7 | 0.71 | 1.4× |
| Quant‑8‑bit‑7B | 83.1 | 0.68 | 1.5× |

  • Performance parity: MAR recovers >99 % of the dense model’s accuracy across language understanding benchmarks (GLUE, SuperGLUE) and generation tasks (BLEU, ROUGE).
  • Energy savings: Real‑world measurements on an NVIDIA A100 and a low‑power ARM CPU show up to 45 % lower inference energy per token.
  • Scalability: When scaling to 13 B parameters, MAR still beats larger dense baselines, indicating the efficiency gains compound with model size.

Practical Implications

  • Edge‑AI & on‑device LLMs: The linear‑time SSM and spiking components make it feasible to run sophisticated language models on battery‑constrained devices (e.g., smartphones, wearables) without sacrificing quality.
  • Cloud cost reduction: For inference‑heavy services (chatbots, code assistants), MAR can cut GPU‑hour expenses and carbon footprint, directly translating to lower operational costs.
  • Simplified deployment pipelines: Because MAR works as a drop‑in replacement for attention/FFN modules, existing Transformer codebases can be retrofitted with minimal engineering effort—just swap the modules and run the provided SBDS training script.
  • Compatibility with other efficiency tricks: MAR can be combined with quantization, model‑parallelism, or LoRA fine‑tuning, offering a layered approach to optimization.

Limitations & Future Work

  • Hardware support for SSMs & spikes: While the authors measured energy on GPUs/CPUs, the biggest gains appear on specialized neuromorphic or ASIC accelerators that natively handle ternary spikes; broader hardware support is still nascent.
  • Training overhead: The two‑stage refinement plus bidirectional distillation adds ~30 % extra training time compared to a vanilla dense model.
  • Temporal alignment sensitivity: The ATMN design assumes certain sequence lengths; very long documents (>4 k tokens) may still suffer from residual temporal mismatch.

Future directions suggested include:

  1. Co‑design of SSM kernels for emerging AI chips.
  2. Adaptive sparsity schedules that react to runtime latency constraints.
  3. Extending MAR to multimodal Transformers (vision‑language, speech).

Bottom line: MAR offers a pragmatic pathway to make large language models greener and faster without compromising the user‑facing quality that developers rely on. By rethinking the core attention and feed‑forward blocks through the lens of linear‑time dynamics and spiking sparsity, the framework opens the door to truly scalable LLM deployment—from cloud clusters to the edge.

Authors

  • Junhong Cai
  • Guiqin Wang
  • Kejie Zhao
  • Jianxiong Tang
  • Xiang Wang
  • Luziwei Leng
  • Ran Cheng
  • Yuxin Ma
  • Qinghai Guo

Paper Information

  • arXiv ID: 2601.21503v1
  • Categories: cs.AI, cs.CL, cs.LG, cs.NE
  • Published: January 29, 2026