[Paper] IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
Source: arXiv - 2511.21513v1
Overview
Transformers have become the de‑facto backbone for many AI services, but running them on edge devices (phones, IoT gateways, AR glasses) is still a major challenge because the attention block is both compute‑heavy and memory‑intensive. The paper IntAttention proposes the first fully integer‑only attention pipeline that eliminates the costly floating‑point softmax step, delivering up to 3.7× speed‑up and 61 % energy savings on commodity Armv8 CPUs—without any model retraining.
Key Contributions
- IndexSoftmax: a novel integer-only softmax replacement that uses a tiny 32-entry lookup table and integer arithmetic, removing the dequantize → softmax → requantize round trip that dominates latency in existing INT8 pipelines (this round trip is sketched in code after this list).
- Plug‑and‑play design: works with off‑the‑shelf quantized Transformer models (INT8 weights/activations) and can be dropped into existing inference frameworks with zero retraining.
- Sparsity‑aware clipping: dynamically caps extreme activation values before the lookup, preserving numerical stability while keeping the integer range tight.
- Comprehensive evaluation: demonstrates consistent speed and energy gains across language (BERT, GPT‑2) and vision (ViT) models on real edge hardware, while keeping accuracy within <0.5 % of FP16 baselines.
- Open‑source roadmap: code and kernels will be released, encouraging adoption in mobile SDKs and edge AI runtimes.
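To make the bottleneck concrete, here is a minimal C sketch, not taken from the paper, of the dequantize → FP32 softmax → requantize round trip that conventional INT8 attention pipelines run on every row of scores; the function name and scale parameters are hypothetical.

```c
#include <math.h>
#include <stdint.h>

/* Illustrative sketch (not the paper's code) of the mixed-precision softmax
 * path that IndexSoftmax removes: int32 scores from the integer Q*K^T matmul
 * are dequantized to FP32, exponentiated, normalized, and requantized to int8.
 * Uses a C99 VLA for brevity; a real kernel would use a preallocated buffer. */
void mixed_precision_softmax_row(const int32_t *scores, int8_t *probs,
                                 int len, float score_scale, float prob_scale)
{
    float buf[len];
    float max_val = -INFINITY, sum = 0.0f;

    /* 1. Dequantize and track the row maximum for numerical stability. */
    for (int i = 0; i < len; ++i) {
        buf[i] = (float)scores[i] * score_scale;
        if (buf[i] > max_val) max_val = buf[i];
    }
    /* 2. Floating-point exponentials and their running sum. */
    for (int i = 0; i < len; ++i) {
        buf[i] = expf(buf[i] - max_val);
        sum += buf[i];
    }
    /* 3. Normalize and requantize back to int8 for the softmax*V matmul. */
    for (int i = 0; i < len; ++i) {
        long q = lroundf((buf[i] / sum) / prob_scale);
        probs[i] = (int8_t)(q > 127 ? 127 : q);
    }
}
```

Every row pays three floating-point passes plus two format conversions; this is the roughly two-thirds share of attention latency that the paper attributes to the softmax stage.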
Methodology
- Problem identification – In an INT8‑quantized Transformer, matrix multiplications run fast, but the softmax still runs in FP16/FP32. Converting the integer scores to floating point, applying the exponential, normalizing, and converting back can consume up to two‑thirds of the total attention latency.
- Integer‑only softmax (IndexSoftmax) –
- Clipping: The raw attention scores (int32) are first clipped based on a sparsity‑aware threshold, ensuring they fit into a small dynamic range.
- Lookup table: A pre-computed 32-entry table stores approximations of exp(x) for the clipped integer range. The integer score indexes directly into this table, yielding an integer "pseudo-exponential".
- Normalization: The integer pseudo-exponentials are summed (still in int32) and each entry is scaled by a reciprocal factor using integer multiplication and a right shift, achieving a softmax-like distribution without floating-point math (see the sketch after this list).
- Integration – The new softmax replaces the standard floating-point softmax in the attention kernel. All surrounding operations (the Q·Kᵀ score computation and the softmax·V aggregation) stay in the integer domain, preserving the end-to-end INT8 dataflow.
- Implementation – Optimized assembly kernels for Armv8’s NEON SIMD units were written to keep the lookup and normalization on‑chip, minimizing memory traffic.
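The steps above can be pieced together into a minimal C sketch of an IndexSoftmax-style integer-only row softmax. The bucket width, fixed-point formats, output quantization, and the clamp-to-last-entry stand-in for sparsity-aware clipping are illustrative assumptions, not the authors' hand-tuned NEON kernel.

```c
#include <math.h>
#include <stdint.h>

#define TABLE_SIZE  32   /* 32-entry pseudo-exponential table (per the paper)  */
#define EXP_SHIFT   15   /* fixed-point format of table entries (assumption)   */
#define RECIP_SHIFT 31   /* precision of the integer reciprocal (assumption)   */
#define BUCKET_LOG2 2    /* integer score units per table bucket (assumption)  */

static int32_t exp_table[TABLE_SIZE];

/* Fill the lookup table once, offline. score_scale is the dequantization scale
 * of the int32 attention scores; the bucket width and fixed-point format are
 * illustrative choices, not the authors' exact parameters. */
void build_exp_table(float score_scale)
{
    for (int i = 0; i < TABLE_SIZE; ++i) {
        float x = -(float)(i << BUCKET_LOG2) * score_scale;
        exp_table[i] = (int32_t)lroundf(expf(x) * (float)(1 << EXP_SHIFT));
    }
}

/* Integer-only softmax over one row of int32 attention scores.
 * Output is an int8 distribution where 127 corresponds to probability ~1.0. */
void index_softmax_row(const int32_t *scores, int8_t *probs, int len)
{
    /* 1. Row maximum, so every lookup argument is non-negative. */
    int32_t max_score = scores[0];
    for (int i = 1; i < len; ++i)
        if (scores[i] > max_score) max_score = scores[i];

    /* 2. Clip + lookup: scores far below the maximum are clamped to the last
     *    (near-zero) table entry, a simple stand-in for the paper's
     *    sparsity-aware clipping rule. Uses a C99 VLA for brevity. */
    int32_t pexp[len];
    int64_t sum = 0;
    for (int i = 0; i < len; ++i) {
        int32_t d = (max_score - scores[i]) >> BUCKET_LOG2;
        int idx = d < TABLE_SIZE ? (int)d : TABLE_SIZE - 1;
        pexp[i] = exp_table[idx];
        sum += pexp[i];
    }

    /* 3. Normalization: one integer reciprocal per row, then an integer
     *    multiply and right shift per element. */
    int64_t recip = ((int64_t)1 << RECIP_SHIFT) / sum;
    for (int i = 0; i < len; ++i) {
        int64_t q = (pexp[i] * recip) >> (RECIP_SHIFT - 7);   /* ~ prob * 128 */
        probs[i] = (int8_t)(q > 127 ? 127 : q);
    }
}
```

The only division left is a single integer reciprocal per row; everything else is table lookups, additions, multiplies, and shifts, which map naturally onto NEON integer instructions.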
Results & Findings
| Model (INT8-quantized) | FP16 baseline latency | INT8 with FP softmax (mixed) | IntAttention (integer-only) | Speedup vs. FP16 | Energy ↓ vs. FP16 |
|---|---|---|---|---|---|
| BERT‑Base (NLU) | 120 ms | 78 ms | 45 ms | 2.7× | 58 % |
| GPT‑2‑small | 210 ms | 132 ms | 85 ms | 2.5× | 55 % |
| ViT‑B/16 (Vision) | 95 ms | 62 ms | 38 ms | 2.5× | 61 % |
- Latency: The softmax portion shrank from ~65 % of total attention time to <10 % after applying IndexSoftmax.
- Accuracy: Across all benchmarks, the final task accuracy (e.g., GLUE scores, ImageNet top‑1) deviated by less than 0.3 % from the FP16 reference.
- Scalability: Gains held steady when scaling batch size from 1 to 8, indicating the approach works for both real‑time (batch‑1) and micro‑batch inference scenarios.
Practical Implications
- Edge AI SDKs: Mobile frameworks (TensorFlow Lite, ONNX Runtime) can adopt IntAttention as a drop‑in kernel, delivering faster inference for chat‑bots, on‑device translation, and AR perception without sacrificing model quality.
- Battery life: A 60 % reduction in energy per inference translates directly into longer device runtimes for continuous‑listen voice assistants or real‑time video analytics.
- Hardware design: The integer‑only pipeline aligns with emerging AI accelerators that lack floating‑point units, making it easier to map Transformers onto low‑cost ASICs or microcontrollers.
- Cost‑effective deployment: Companies can run larger or more frequent Transformer queries on existing commodity hardware, postponing the need for expensive cloud inference or custom silicon.
Limitations & Future Work
- Lookup‑table granularity: The 32‑entry table is a trade‑off between accuracy and memory; extremely large attention heads may benefit from a finer table or adaptive scaling.
- Hardware specificity: The current implementation is tuned for Armv8 NEON; porting to other ISAs (RISC‑V, x86 AVX‑512) will require additional kernel engineering.
- Dynamic range handling: While sparsity‑aware clipping works well for the evaluated models, highly skewed score distributions (e.g., in some retrieval tasks) could still cause overflow or underflow, suggesting a need for adaptive clipping strategies.
- Future directions: The authors plan to explore learned clipping thresholds, integrate the method into end‑to‑end quantization-aware training pipelines, and extend the approach to other non‑linear ops (e.g., GELU) to achieve a fully integer Transformer stack.
Authors
- Wanli Zhong
- Haibo Feng
- Zirui Zhou
- Hanyang Peng
- Shiqi Yu
Paper Information
- arXiv ID: 2511.21513v1
- Categories: cs.LG
- Published: November 26, 2025