[Paper] Avey-B
Source: arXiv - 2602.15814v1
Overview
The paper “Avey‑B” revisits the recently proposed Avey model—originally an autoregressive, attention‑free architecture—and adapts it for the encoder‑only setting that powers most modern NLP services (think BERT‑style models). By redesigning the core building blocks, the authors demonstrate that an attention‑free encoder can match or beat traditional Transformers on token‑level tasks while using less compute and handling longer sequences more gracefully.
Key Contributions
- Encoder‑only reformulation of Avey – a novel way to turn the autoregressive Avey into a bidirectional encoder without sacrificing its attention‑free nature.
- Decoupled static & dynamic parameterizations – separates parameters that stay constant across inputs from those that adapt per‑token, improving efficiency and scalability.
- Stability‑oriented normalization – introduces a new normalization scheme that mitigates training instabilities common in deep, non‑attention models.
- Neural compression module – a lightweight compression layer that reduces memory footprint while preserving representational power.
- Comprehensive empirical evaluation – Avey‑B outperforms four popular Transformer encoders (BERT‑base, RoBERTa, DistilBERT, and ALBERT) on token‑classification (NER, POS) and information‑retrieval benchmarks, with the gap widening as context length grows.
Methodology
- Architecture redesign – The original Avey uses a causal, feed‑forward stack. The authors convert it into a bidirectional encoder by processing the input sequence in both forward and backward passes and then merging the representations.
- Static vs. dynamic weights – Static weights are shared across all positions (similar to convolution kernels), while dynamic weights are generated on‑the‑fly from a lightweight gating network, allowing the model to adapt to each token without the quadratic cost of self‑attention.
- Normalization – Instead of LayerNorm, they apply a scaled cosine normalization that keeps activation magnitudes stable across deep stacks, which is crucial for training attention‑free networks.
- Neural compression – A bottleneck projection (learned linear compression) reduces the hidden dimension before the final classification head, cutting memory usage by up to 30 % with negligible accuracy loss.
- Training regime – Standard masked language modeling (MLM) objectives are used, plus a small auxiliary next‑sentence prediction task to encourage cross‑sentence context awareness.
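The static/dynamic split can be sketched in plain Python. This is illustrative only: the summary does not reproduce the paper's gating network, so `token_mix` below assumes a hypothetical single‑layer sigmoid gate that modulates a position‑shared weight per token, and it shows only the parameterization, not Avey's full context mixing.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    # Plain matrix-vector product; W is a list of rows.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def token_mix(tokens, W_static, W_gate):
    """Apply a static (position-shared) weight, modulated per token by a
    lightweight dynamic gate -- per-token work, so linear in sequence length."""
    out = []
    for x in tokens:
        g = [sigmoid(v) for v in matvec(W_gate, x)]  # dynamic, per-token
        h = matvec(W_static, x)                      # static, shared
        out.append([gi * hi for gi, hi in zip(g, h)])
    return out

# Tiny example: 3 tokens of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_static = [[1.0, 0.5], [0.5, 1.0]]
W_gate = [[0.0, 0.0], [0.0, 0.0]]  # zero gate -> sigmoid(0) = 0.5 everywhere
mixed = token_mix(tokens, W_static, W_gate)
# mixed[0] == [0.5, 0.25]
```

Because the gate is produced by a small network rather than by comparing every token pair, the cost per token is constant, which is where the linear scaling comes from.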
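The summary does not give the exact normalization formula, so here is a plausible minimal form, assuming "scaled cosine normalization" means projecting each activation vector onto the unit sphere and rescaling by a learned gain `g`:

```python
import math

def cosine_norm(x, g=1.0, eps=1e-8):
    """Rescale x to unit L2 norm, then multiply by a learned gain g.
    Activation magnitude stays fixed at ~g regardless of stack depth."""
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / (norm + eps) for v in x]

h = cosine_norm([3.0, 4.0], g=2.0)  # ||[3, 4]|| = 5, so h is approx [1.2, 1.6]
```

Unlike LayerNorm, this form has no mean subtraction and no per‑feature scale; whether the paper's variant keeps those is not stated in the summary.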
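The compression module is described only as a learned linear bottleneck before the classification head. A minimal sketch, with hypothetical dimensions (a down‑projection from hidden size 8 to 2; the paper's sizes are not given here):

```python
def project(W, x):
    # Learned linear down-projection; W has shape (d_out, d_in).
    return [sum(w * v for w, v in zip(row, x)) for row in W]

d_in, d_out = 8, 2                        # hypothetical bottleneck sizes
W = [[0.1] * d_in for _ in range(d_out)]  # stand-in for learned weights
x = [1.0] * d_in
z = project(W, x)  # len(z) == 2: the stored representation is 4x smaller
```

Shrinking the vectors that must be kept in memory for every token is what yields the reported ~30 % reduction in peak GPU memory.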
Results & Findings
| Benchmark | Avey‑B | BERT‑base | RoBERTa‑base | DistilBERT | ALBERT‑base |
|---|---|---|---|---|---|
| CoNLL‑2003 NER (F1) | 92.1 | 90.8 | 91.2 | 89.5 | 90.5 |
| POS‑Tagging (Acc) | 98.3 | 97.9 | 98.0 | 97.5 | 97.8 |
| MS‑MARCO Retrieval (MRR@10) | 0.376 | 0.352 | 0.361 | 0.340 | 0.355 |
| Latency @ 512 tokens (relative to BERT‑base) | 0.68× | 1.00× | 1.00× | 0.85× | 0.92× |
- Accuracy: Avey‑B consistently matches or exceeds Transformer baselines on token‑level tasks.
- Efficiency: Because it avoids quadratic self‑attention, inference time grows linearly with sequence length, giving a ~30 % speedup at 512 tokens and even larger gains at >1k tokens.
- Memory: The compression layer reduces peak GPU memory by ~30 %, enabling deployment on edge devices or low‑cost cloud instances.
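The linear‑vs‑quadratic claim can be made concrete with a back‑of‑the‑envelope FLOP count. The constants below are assumptions (one layer, hidden size 768, only the dominant matrix‑multiply terms); this illustrates why the gains widen with length, it is not the paper's measurement.

```python
D = 768  # assumed hidden size (BERT-base uses 768)

def transformer_layer_flops(n, d=D):
    # Quadratic attention mixing (n^2 * d) plus the feed-forward term (n * d^2).
    return n * n * d + n * d * d

def avey_style_flops(n, d=D):
    # No attention: per-token cost only, so the total is linear in n.
    return n * d * d

for n in (512, 2048, 8192):
    speedup = transformer_layer_flops(n) / avey_style_flops(n)
    print(f"n={n:5d}  attention-free is ~{speedup:.2f}x cheaper")
```

Under this crude model the ratio is `1 + n/d`, so the advantage is modest at 512 tokens and grows without bound as sequences lengthen, consistent with the trend the authors report.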
Practical Implications
- Long‑document processing – Industries that need to analyze contracts, research papers, or logs (often >1k tokens) can now use a compact encoder without the prohibitive memory cost of Transformers.
- Edge & mobile NLP – The reduced parameter count and memory footprint make Avey‑B a strong candidate for on‑device inference (e.g., smart assistants, AR translation).
- Cost‑effective serving – Cloud providers can serve more concurrent requests per GPU, lowering operational expenses for services like semantic search or real‑time NER.
- Simplified scaling – Linear scaling with sequence length simplifies hardware provisioning; developers no longer need to shard or truncate inputs aggressively.
Limitations & Future Work
- Pre‑training data – The authors pre‑trained Avey‑B on a relatively modest corpus (≈16 B tokens). Scaling up to the massive datasets used for BERT‑large may reveal hidden bottlenecks.
- Task diversity – Evaluation focused on token‑classification and retrieval; generative or sequence‑to‑sequence tasks (e.g., translation, summarization) remain untested.
- Interpretability – Without explicit attention maps, diagnosing model behavior can be harder; future work could explore visualization tools for the dynamic gating mechanisms.
- Hybrid designs – The paper hints at combining attention‑free layers with occasional attention heads to capture global dependencies—an avenue worth exploring for even richer representations.
Authors
- Devang Acharya
- Mohammad Hammoud
Paper Information
- arXiv ID: 2602.15814v1
- Categories: cs.CL, cs.AI
- Published: February 17, 2026