[Paper] Self-attention vector output similarities reveal how machines pay attention
Source: arXiv - 2512.21956v1
Overview
This paper digs into the “black box” of self‑attention in transformer models, specifically the 12‑layer BERT‑Base, to uncover how attention heads actually process language. By turning the attention heads' output vectors into a similarity matrix, the authors show that different heads specialize in distinct linguistic cues (e.g., token repetitions, sentence boundaries) and that this specialization evolves layer by layer. The findings give developers a concrete way to interpret and even exploit attention patterns for downstream tasks such as text segmentation or token‑level diagnostics.
Key Contributions
- Vector‑based similarity analysis: Introduces a scalar‑product similarity matrix computed from the output vectors of each self‑attention head, enabling quantitative comparison of token representations.
- Head‑level linguistic specialization: Demonstrates that individual heads consistently focus on different linguistic phenomena (sentence separators, repeated tokens, context‑common tokens).
- Layer‑wise evolution of similarity: Shows a clear shift from long‑range similarities in early layers to short‑range, intra‑sentence similarities in deeper layers.
- Token‑centric clustering: Finds that each head tends to build high‑similarity pairs around a unique “anchor” token, effectively creating token‑specific neighborhoods in vector space.
- Practical insight for segmentation: Observes that final‑layer attention maps concentrate on sentence separator tokens, suggesting a lightweight, attention‑driven method for text segmentation.
Methodology
- Model selection: The authors work with the pretrained BERT‑Base (12‑layer) model, extracting the output vectors of every self‑attention head for a large corpus of English sentences.
- Similarity matrix construction: For each head and layer, they compute the pairwise scalar product (dot product) between token vectors, yielding a context similarity matrix that quantifies how closely two tokens are represented in that head’s space (a minimal code sketch follows this list).
- Statistical probing:
- Distribution analysis: Histograms of similarity scores are plotted per layer to track the transition from long‑range to short‑range focus.
- Token‑frequency profiling: The most frequent tokens among the top‑similar pairs are identified per head, revealing each head’s “anchor” token.
- Qualitative case studies: Specific sentences are examined to illustrate how heads capture repetitions, common context tokens, and sentence delimiters.
- Visualization: Heatmaps of attention maps and similarity matrices are used to illustrate the spatial patterns that emerge across layers and heads.
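To make the matrix construction concrete, here is a minimal sketch (not the authors’ released code), assuming the Hugging Face transformers library: it collects per‑head self‑attention output vectors from BERT‑Base with forward hooks and builds a pairwise scalar‑product similarity matrix for one head. The layer/head indices and the example sentence are illustrative choices.

```python
# Minimal sketch: collect per-head self-attention output vectors from
# BERT-Base via forward hooks, then compute a scalar-product similarity
# matrix for one head. Layer/head indices and the sentence are examples.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

num_heads = model.config.num_attention_heads        # 12 for BERT-Base
head_dim = model.config.hidden_size // num_heads    # 64 for BERT-Base

# layer index -> tensor of shape (seq_len, num_heads, head_dim)
head_outputs = {}

def make_hook(layer_idx):
    def hook(module, inputs, outputs):
        # outputs[0]: (batch, seq_len, hidden_size); the heads are
        # concatenated along the hidden dimension, so split them back out.
        ctx = outputs[0].detach()[0]
        head_outputs[layer_idx] = ctx.reshape(ctx.size(0), num_heads, head_dim)
    return hook

for i, layer in enumerate(model.encoder.layer):
    layer.attention.self.register_forward_hook(make_hook(i))

text = "The bank approved the loan. The bank closed early."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**enc)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
layer_idx, head_idx = 11, 3                          # arbitrary example head
vecs = head_outputs[layer_idx][:, head_idx, :]       # (seq_len, head_dim)
similarity = vecs @ vecs.T                           # (seq_len, seq_len)
print(similarity.shape, tokens)
```

High off‑diagonal entries point at token pairs this head represents similarly, e.g. the two occurrences of “bank” for a repetition‑oriented head.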
Results & Findings
- Sentence separator focus: In the top (final) layers, attention heads assign high similarity scores to [SEP] tokens, effectively marking sentence boundaries.
- Head specialization:
- Some heads highlight repeated words (e.g., “the … the”), acting like a duplication detector.
- Others cluster tokens that appear frequently together in a local context (e.g., “bank” with “account”).
- Layer dynamics: Early layers exhibit broad, long‑range similarity peaks, suggesting a global view of the input. As depth increases, similarity becomes sharply peaked within the same sentence, indicating a shift toward fine‑grained, local processing.
- Unique anchor tokens: Each head tends to have a distinct most‑common token among its high‑similarity pairs, forming a token‑centric “neighborhood” that remains stable across inputs.
- Quantitative shift: The average distance between highly similar token pairs drops by ~30 % from layer 1 to layer 12, confirming the move toward tighter, sentence‑level cohesion.
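As a rough illustration of how such statistics can be gathered, the sketch below reuses the head_outputs and tokens collected in the earlier Methodology snippet (so it is not standalone): for one head it takes the top‑k most similar token pairs, then reports their mean positional distance and the most frequent token among them (a candidate “anchor”). The value of k and the head index are arbitrary assumptions, not the paper’s exact protocol.

```python
# Rough sketch, reusing `head_outputs` and `tokens` from the previous
# snippet: for one head, take the top-k most similar token pairs, report
# their mean positional distance and the most frequent token among them.
from collections import Counter
import torch

def head_statistics(vecs, tokens, k=20):
    sim = vecs @ vecs.T                        # pairwise scalar products
    sim.fill_diagonal_(float("-inf"))          # drop token-with-itself pairs
    seq_len = sim.size(0)
    top = torch.topk(sim.flatten(), min(k, sim.numel())).indices
    rows, cols = top // seq_len, top % seq_len
    mean_distance = (rows - cols).abs().float().mean().item()
    counts = Counter(tokens[i] for i in rows.tolist() + cols.tolist())
    anchor = counts.most_common(1)[0][0]
    return mean_distance, anchor

# Compare an early and a late layer for the same (arbitrarily chosen) head.
for layer_idx in (0, 11):
    vecs = head_outputs[layer_idx][:, 3, :]
    dist, anchor = head_statistics(vecs, tokens)
    print(f"layer {layer_idx}: mean pair distance {dist:.1f}, anchor {anchor!r}")
```

Under these assumptions, the layer‑11 distance should come out markedly smaller than the layer‑0 one, in line with the roughly 30 % drop reported above.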
Practical Implications
- Lightweight sentence segmentation: Since final‑layer heads naturally attend to [SEP] tokens, developers can extract these attention scores to split long documents without training a separate model (see the sketch after this list).
- Debugging & interpretability tools: The similarity matrix offers a new diagnostic view: developers can pinpoint which head is responsible for a particular linguistic pattern (e.g., detecting repeated entities) and use that insight to fine‑tune or prune models.
- Head‑pruning strategies: Knowing that certain heads specialize in redundant or niche patterns enables smarter pruning (e.g., drop heads focused on rare repetitions to reduce compute without harming core performance).
- Feature engineering for downstream tasks: The token‑anchor neighborhoods can be harvested as additional features for tasks like coreference resolution, keyword extraction, or domain‑specific entity linking.
- Curriculum design for fine‑tuning: When adapting BERT to a new domain, practitioners might freeze early layers (which capture long‑range structure) and only fine‑tune later layers that handle sentence‑level nuances, aligning with the observed similarity shift.
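Below is a hedged sketch of the attention‑driven segmentation idea: it scores each token by the total final‑layer attention it receives (averaged over heads) and treats the top‑scoring tokens as boundary candidates. The example text and the top‑k cutoff are illustrative choices, not the authors’ benchmarked procedure.

```python
# Hedged sketch of attention-driven boundary detection with Hugging Face
# transformers: rank tokens by the total final-layer attention they receive
# and treat the top-scoring ones as candidate segment boundaries.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "It was raining. The match was cancelled. Fans went home early."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.attentions: tuple of 12 tensors, each (batch, num_heads, seq_len, seq_len).
final_attn = out.attentions[-1][0].mean(dim=0)    # average over heads
received = final_attn.sum(dim=0)                  # attention each token receives
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])

# Tokens drawing the most attention are candidate segment boundaries.
top = torch.topk(received, k=5).indices.sort().values.tolist()
print([(i, tokens[i], round(received[i].item(), 2)) for i in top])
```

If the paper’s observation holds, [SEP] and sentence‑final punctuation should dominate this ranking, which is exactly what an attention‑driven splitter would key on.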
Limitations & Future Work
- Model scope: The study is limited to BERT‑Base; it remains unclear whether the same head‑level specializations hold for larger models (e.g., BERT‑Large, RoBERTa) or encoder‑decoder architectures like T5.
- Language diversity: Experiments were conducted on English corpora only; cross‑lingual behavior of attention vectors may differ.
- Static analysis: The similarity matrices are computed on frozen pretrained weights; investigating how these patterns evolve during fine‑tuning would deepen the practical relevance.
- Application testing: While the paper proposes segmentation and debugging uses, systematic benchmarks (e.g., segmentation accuracy vs. dedicated models) are left for future work.
Bottom line: By turning attention vectors into a similarity landscape, the authors provide a tangible, quantitative lens on how transformers “pay attention.” This opens up new avenues for model interpretability, efficient engineering, and task‑specific exploitation of attention dynamics.
Authors
- Tal Halevi
- Yarden Tzach
- Ronit D. Gross
- Shalom Rosner
- Ido Kanter
Paper Information
- arXiv ID: 2512.21956v1
- Categories: cs.CL
- Published: December 26, 2025