[Paper] IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
Source: arXiv - 2511.21513v1
Overview
Transformers have become the de‑facto backbone for many AI services, but running them on edge devices (phones, IoT gateways, AR glasses) is still a major challenge because the attention block is both compute‑heavy and memory‑intensive. The paper IntAttention proposes the first fully integer‑only attention pipeline that eliminates the costly floating‑point softmax step, delivering up to 3.7× speed‑up and 61 % energy savings on commodity Armv8 CPUs—without any model retraining.
Key Contributions
- IndexSoftmax: a novel integer-only softmax replacement that uses a tiny 32-entry lookup table and integer arithmetic, removing the dequantize → softmax → requantize round trip that dominates latency in existing INT8 pipelines (this round trip is sketched in code after this list).
- Plug‑and‑play design: works with off‑the‑shelf quantized Transformer models (INT8 weights/activations) and can be dropped into existing inference frameworks with zero retraining.
- Sparsity‑aware clipping: dynamically caps extreme activation values before the lookup, preserving numerical stability while keeping the integer range tight.
- Comprehensive evaluation: demonstrates consistent speed and energy gains across language (BERT, GPT‑2) and vision (ViT) models on real edge hardware, while keeping accuracy within <0.5 % of FP16 baselines.
- Open‑source roadmap: code and kernels will be released, encouraging adoption in mobile SDKs and edge AI runtimes.
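To make the bottleneck concrete, here is a minimal C sketch, not taken from the paper, of the dequantize → FP32 softmax → requantize round trip that conventional INT8 attention pipelines run on every row of scores; the function name and scale parameters are hypothetical.

```c
#include <math.h>
#include <stdint.h>

/* Illustrative sketch (not the paper's code) of the mixed-precision softmax
 * path that IndexSoftmax removes: int32 scores from the integer Q*K^T matmul
 * are dequantized to FP32, exponentiated, normalized, and requantized to int8.
 * Uses a C99 VLA for brevity; a real kernel would use a preallocated buffer. */
void mixed_precision_softmax_row(const int32_t *scores, int8_t *probs,
                                 int len, float score_scale, float prob_scale)
{
    float buf[len];
    float max_val = -INFINITY, sum = 0.0f;

    /* 1. Dequantize and track the row maximum for numerical stability. */
    for (int i = 0; i < len; ++i) {
        buf[i] = (float)scores[i] * score_scale;
        if (buf[i] > max_val) max_val = buf[i];
    }
    /* 2. Floating-point exponentials and their running sum. */
    for (int i = 0; i < len; ++i) {
        buf[i] = expf(buf[i] - max_val);
        sum += buf[i];
    }
    /* 3. Normalize and requantize back to int8 for the softmax*V matmul. */
    for (int i = 0; i < len; ++i) {
        long q = lroundf((buf[i] / sum) / prob_scale);
        probs[i] = (int8_t)(q > 127 ? 127 : q);
    }
}
```

Every row pays three floating-point passes plus two format conversions; this is the roughly two-thirds share of attention latency that the paper attributes to the softmax stage.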
Methodology
- Problem identification – In an INT8‑quantized Transformer, matrix multiplications run fast, but the softmax still runs in FP16/FP32. Converting the integer scores to floating point, applying the exponential, normalizing, and converting back can consume up to two‑thirds of the total attention latency.
- Integer‑only softmax (IndexSoftmax) –
- Clipping: The raw attention scores (int32) are first clipped based on a sparsity‑aware threshold, ensuring they fit into a small dynamic range.
- Lookup table: A pre-computed 32-entry table stores approximations of exp(x) for the clipped integer range. The integer score indexes directly into this table, yielding an integer "pseudo-exponential".
- Normalization: The integer pseudo-exponentials are summed (still in int32) and each entry is scaled by a reciprocal factor using integer multiplication and a right shift, achieving a softmax-like distribution without floating-point math (see the sketch after this list).
- Integration – The new softmax replaces the standard floating-point softmax in the attention kernel. All surrounding operations (the Q·Kᵀ score computation and the softmax·V aggregation) stay in the integer domain, preserving the end-to-end INT8 dataflow.
- Implementation – Optimized assembly kernels for Armv8’s NEON SIMD units were written to keep the lookup and normalization on‑chip, minimizing memory traffic.
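The steps above can be pieced together into a minimal C sketch of an IndexSoftmax-style integer-only row softmax. The bucket width, fixed-point formats, output quantization, and the clamp-to-last-entry stand-in for sparsity-aware clipping are illustrative assumptions, not the authors' hand-tuned NEON kernel.

```c
#include <math.h>
#include <stdint.h>

#define TABLE_SIZE  32   /* 32-entry pseudo-exponential table (per the paper)  */
#define EXP_SHIFT   15   /* fixed-point format of table entries (assumption)   */
#define RECIP_SHIFT 31   /* precision of the integer reciprocal (assumption)   */
#define BUCKET_LOG2 2    /* integer score units per table bucket (assumption)  */

static int32_t exp_table[TABLE_SIZE];

/* Fill the lookup table once, offline. score_scale is the dequantization scale
 * of the int32 attention scores; the bucket width and fixed-point format are
 * illustrative choices, not the authors' exact parameters. */
void build_exp_table(float score_scale)
{
    for (int i = 0; i < TABLE_SIZE; ++i) {
        float x = -(float)(i << BUCKET_LOG2) * score_scale;
        exp_table[i] = (int32_t)lroundf(expf(x) * (float)(1 << EXP_SHIFT));
    }
}

/* Integer-only softmax over one row of int32 attention scores.
 * Output is an int8 distribution where 127 corresponds to probability ~1.0. */
void index_softmax_row(const int32_t *scores, int8_t *probs, int len)
{
    /* 1. Row maximum, so every lookup argument is non-negative. */
    int32_t max_score = scores[0];
    for (int i = 1; i < len; ++i)
        if (scores[i] > max_score) max_score = scores[i];

    /* 2. Clip + lookup: scores far below the maximum are clamped to the last
     *    (near-zero) table entry, a simple stand-in for the paper's
     *    sparsity-aware clipping rule. Uses a C99 VLA for brevity. */
    int32_t pexp[len];
    int64_t sum = 0;
    for (int i = 0; i < len; ++i) {
        int32_t d = (max_score - scores[i]) >> BUCKET_LOG2;
        int idx = d < TABLE_SIZE ? (int)d : TABLE_SIZE - 1;
        pexp[i] = exp_table[idx];
        sum += pexp[i];
    }

    /* 3. Normalization: one integer reciprocal per row, then an integer
     *    multiply and right shift per element. */
    int64_t recip = ((int64_t)1 << RECIP_SHIFT) / sum;
    for (int i = 0; i < len; ++i) {
        int64_t q = (pexp[i] * recip) >> (RECIP_SHIFT - 7);   /* ~ prob * 128 */
        probs[i] = (int8_t)(q > 127 ? 127 : q);
    }
}
```

The only division left is a single integer reciprocal per row; everything else is table lookups, additions, multiplies, and shifts, which map naturally onto NEON integer instructions.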
Results & Findings
| Model (INT8-quantized) | FP16 baseline latency | INT8 with FP softmax (mixed) | IntAttention (integer-only) | Speedup vs. FP16 | Energy ↓ vs. FP16 |
|---|---|---|---|---|---|
| BERT‑Base (NLU) | 120 ms | 78 ms | 45 ms | 2.7× | 58 % |
| GPT‑2‑small | 210 ms | 132 ms | 85 ms | 2.5× | 55 % |
| ViT‑B/16 (Vision) | 95 ms | 62 ms | 38 ms | 2.5× | 61 % |
- Latency: The softmax portion shrank from ~65 % of total attention time to <10 % after applying IndexSoftmax.
- Accuracy: Across all benchmarks, the final task accuracy (e.g., GLUE scores, ImageNet top‑1) deviated by less than 0.3 % from the FP16 reference.
- Scalability: Gains held steady when scaling batch size from 1 to 8, indicating the approach works for both real‑time (batch‑1) and micro‑batch inference scenarios.
Practical Implications
- Edge AI SDKs: Mobile frameworks (TensorFlow Lite, ONNX Runtime) can adopt IntAttention as a drop‑in kernel, delivering faster inference for chat‑bots, on‑device translation, and AR perception without sacrificing model quality.
- Battery life: A 60 % reduction in energy per inference translates directly into longer device runtimes for continuous‑listen voice assistants or real‑time video analytics.
- Hardware design: The integer‑only pipeline aligns with emerging AI accelerators that lack floating‑point units, making it easier to map Transformers onto low‑cost ASICs or microcontrollers.
- Cost‑effective deployment: Companies can run larger or more frequent Transformer queries on existing commodity hardware, postponing the need for expensive cloud inference or custom silicon.
Limitations & Future Work
- Lookup‑table granularity: The 32‑entry table is a trade‑off between accuracy and memory; extremely large attention heads may benefit from a finer table or adaptive scaling.
- Hardware specificity: The current implementation is tuned for Armv8 NEON; porting to other ISAs (RISC‑V, x86 AVX‑512) will require additional kernel engineering.
- Dynamic range handling: While sparsity‑aware clipping works well for the evaluated models, highly skewed score distributions (e.g., in some retrieval tasks) could still cause overflow or underflow, suggesting a need for adaptive clipping strategies.
- Future directions: The authors plan to explore learned clipping thresholds, integrate the method into end‑to‑end quantization-aware training pipelines, and extend the approach to other non‑linear ops (e.g., GELU) to achieve a fully integer Transformer stack.
Authors
- Wanli Zhong
- Haibo Feng
- Zirui Zhou
- Hanyang Peng
- Shiqi Yu
Paper Information
- arXiv ID: 2511.21513v1
- Categories: cs.LG
- Published: November 26, 2025