[Paper] Avey-B
Source: arXiv - 2602.15814v1
Overview
The paper “Avey‑B” revisits the recently proposed Avey model—originally an autoregressive, attention‑free architecture—and adapts it for the encoder‑only setting that powers most modern NLP services (think BERT‑style models). By redesigning the core building blocks, the authors demonstrate that an attention‑free encoder can match or beat traditional Transformers on token‑level tasks while using less compute and handling longer sequences more gracefully.
Key Contributions
- Encoder‑only reformulation of Avey – a novel way to turn the autoregressive Avey into a bidirectional encoder without sacrificing its attention‑free nature.
- Decoupled static & dynamic parameterizations – separates parameters that stay constant across inputs from those that adapt per‑token, improving efficiency and scalability.
- Stability‑oriented normalization – introduces a new normalization scheme that mitigates training instabilities common in deep, non‑attention models.
- Neural compression module – a lightweight compression layer that reduces memory footprint while preserving representational power.
- Comprehensive empirical evaluation – Avey‑B outperforms four popular Transformer encoders (BERT‑base, RoBERTa, DistilBERT, and ALBERT) on token‑classification (NER, POS) and information‑retrieval benchmarks, with the gap widening as context length grows.
Methodology
- Architecture redesign – The original Avey uses a causal, feed‑forward stack. The authors convert it into a bidirectional encoder by processing the input sequence in both forward and backward passes and then merging the representations.
- Static vs. dynamic weights – Static weights are shared across all positions (similar to convolution kernels), while dynamic weights are generated on‑the‑fly from a lightweight gating network, allowing the model to adapt to each token without the quadratic cost of self‑attention.
- Normalization – Instead of LayerNorm, they apply a scaled cosine normalization that keeps activation magnitudes stable across deep stacks, which is crucial for training attention‑free networks.
- Neural compression – A bottleneck projection (learned linear compression) reduces the hidden dimension before the final classification head, cutting memory usage by up to 30 % with negligible accuracy loss.
- Training regime – Standard masked language modeling (MLM) objectives are used, plus a small auxiliary next‑sentence prediction task to encourage cross‑sentence context awareness.
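The static/dynamic split can be sketched in plain Python. This is illustrative only: the summary does not reproduce the paper's gating network, so `token_mix` below assumes a hypothetical single‑layer sigmoid gate that modulates a position‑shared weight per token, and it shows only the parameterization, not Avey's full context mixing.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    # Plain matrix-vector product; W is a list of rows.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def token_mix(tokens, W_static, W_gate):
    """Apply a static (position-shared) weight, modulated per token by a
    lightweight dynamic gate -- per-token work, so linear in sequence length."""
    out = []
    for x in tokens:
        g = [sigmoid(v) for v in matvec(W_gate, x)]  # dynamic, per-token
        h = matvec(W_static, x)                      # static, shared
        out.append([gi * hi for gi, hi in zip(g, h)])
    return out

# Tiny example: 3 tokens of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_static = [[1.0, 0.5], [0.5, 1.0]]
W_gate = [[0.0, 0.0], [0.0, 0.0]]  # zero gate -> sigmoid(0) = 0.5 everywhere
mixed = token_mix(tokens, W_static, W_gate)
# mixed[0] == [0.5, 0.25]
```

Because the gate is produced by a small network rather than by comparing every token pair, the cost per token is constant, which is where the linear scaling comes from.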
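The summary does not give the exact normalization formula, so here is a plausible minimal form, assuming "scaled cosine normalization" means projecting each activation vector onto the unit sphere and rescaling by a learned gain `g`:

```python
import math

def cosine_norm(x, g=1.0, eps=1e-8):
    """Rescale x to unit L2 norm, then multiply by a learned gain g.
    Activation magnitude stays fixed at ~g regardless of stack depth."""
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / (norm + eps) for v in x]

h = cosine_norm([3.0, 4.0], g=2.0)  # ||[3, 4]|| = 5, so h is approx [1.2, 1.6]
```

Unlike LayerNorm, this form has no mean subtraction and no per‑feature scale; whether the paper's variant keeps those is not stated in the summary.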
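The compression module is described only as a learned linear bottleneck before the classification head. A minimal sketch, with hypothetical dimensions (a down‑projection from hidden size 8 to 2; the paper's sizes are not given here):

```python
def project(W, x):
    # Learned linear down-projection; W has shape (d_out, d_in).
    return [sum(w * v for w, v in zip(row, x)) for row in W]

d_in, d_out = 8, 2                        # hypothetical bottleneck sizes
W = [[0.1] * d_in for _ in range(d_out)]  # stand-in for learned weights
x = [1.0] * d_in
z = project(W, x)  # len(z) == 2: the stored representation is 4x smaller
```

Shrinking the vectors that must be kept in memory for every token is what yields the reported ~30 % reduction in peak GPU memory.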
Results & Findings
| Benchmark | Avey‑B | BERT‑base | RoBERTa‑base | DistilBERT | ALBERT‑base |
|---|---|---|---|---|---|
| CoNLL‑2003 NER (F1) | 92.1 | 90.8 | 91.2 | 89.5 | 90.5 |
| POS‑Tagging (Acc) | 98.3 | 97.9 | 98.0 | 97.5 | 97.8 |
| MS‑MARCO Retrieval (MRR@10) | 0.376 | 0.352 | 0.361 | 0.340 | 0.355 |
| Latency @ 512 tokens (relative to BERT‑base) | 0.68× | 1.00× | 1.00× | 0.85× | 0.92× |
- Accuracy: Avey‑B consistently matches or exceeds Transformer baselines on token‑level tasks.
- Efficiency: Because it avoids quadratic self‑attention, inference time grows linearly with sequence length, giving a ~30 % speedup at 512 tokens and even larger gains at >1k tokens.
- Memory: The compression layer reduces peak GPU memory by ~30 %, enabling deployment on edge devices or low‑cost cloud instances.
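The linear‑vs‑quadratic claim can be made concrete with a back‑of‑the‑envelope FLOP count. The constants below are assumptions (one layer, hidden size 768, only the dominant matrix‑multiply terms); this illustrates why the gains widen with length, it is not the paper's measurement.

```python
D = 768  # assumed hidden size (BERT-base uses 768)

def transformer_layer_flops(n, d=D):
    # Quadratic attention mixing (n^2 * d) plus the feed-forward term (n * d^2).
    return n * n * d + n * d * d

def avey_style_flops(n, d=D):
    # No attention: per-token cost only, so the total is linear in n.
    return n * d * d

for n in (512, 2048, 8192):
    speedup = transformer_layer_flops(n) / avey_style_flops(n)
    print(f"n={n:5d}  attention-free is ~{speedup:.2f}x cheaper")
```

Under this crude model the ratio is `1 + n/d`, so the advantage is modest at 512 tokens and grows without bound as sequences lengthen, consistent with the trend the authors report.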
Practical Implications
- Long‑document processing – Industries that need to analyze contracts, research papers, or logs (often >1k tokens) can now use a compact encoder without the prohibitive memory cost of Transformers.
- Edge & mobile NLP – The reduced parameter count and memory footprint make Avey‑B a strong candidate for on‑device inference (e.g., smart assistants, AR translation).
- Cost‑effective serving – Cloud providers can serve more concurrent requests per GPU, lowering operational expenses for services like semantic search or real‑time NER.
- Simplified scaling – Linear scaling with sequence length simplifies hardware provisioning; developers no longer need to shard or truncate inputs aggressively.
Limitations & Future Work
- Pre‑training data – The authors pre‑trained Avey‑B on a relatively modest corpus (≈16 B tokens). Scaling up to the massive datasets used for BERT‑large may reveal hidden bottlenecks.
- Task diversity – Evaluation focused on token‑classification and retrieval; generative or sequence‑to‑sequence tasks (e.g., translation, summarization) remain untested.
- Interpretability – Without explicit attention maps, diagnosing model behavior can be harder; future work could explore visualization tools for the dynamic gating mechanisms.
- Hybrid designs – The paper hints at combining attention‑free layers with occasional attention heads to capture global dependencies—an avenue worth exploring for even richer representations.
Authors
- Devang Acharya
- Mohammad Hammoud
Paper Information
- arXiv ID: 2602.15814v1
- Categories: cs.CL, cs.AI
- Published: February 17, 2026