[Paper] Self-attention vector output similarities reveal how machines pay attention
Source: arXiv - 2512.21956v1
Overview
This paper digs into the “black box” of self‑attention in transformer models, specifically the 12‑layer BERT‑Base, to uncover how attention heads actually process language. By turning the attention heads' output vectors into a similarity matrix, the authors show that different heads specialize in distinct linguistic cues (e.g., token repetitions, sentence boundaries) and that this specialization evolves layer by layer. The findings give developers a concrete way to interpret and even exploit attention patterns for downstream tasks such as text segmentation or token‑level diagnostics.
Key Contributions
- Vector‑based similarity analysis: Introduces a scalar‑product similarity matrix computed from the output vectors of each self‑attention head, enabling quantitative comparison of token representations.
- Head‑level linguistic specialization: Demonstrates that individual heads consistently focus on different linguistic phenomena (sentence separators, repeated tokens, context‑common tokens).
- Layer‑wise evolution of similarity: Shows a clear shift from long‑range similarities in early layers to short‑range, intra‑sentence similarities in deeper layers.
- Token‑centric clustering: Finds that each head tends to build high‑similarity pairs around a unique “anchor” token, effectively creating token‑specific neighborhoods in vector space.
- Practical insight for segmentation: Observes that final‑layer attention maps concentrate on sentence separator tokens, suggesting a lightweight, attention‑driven method for text segmentation.
Methodology
- Model selection: The authors work with the pretrained BERT‑Base (12‑layer) model, extracting the output vectors of every self‑attention head for a large corpus of English sentences.
- Similarity matrix construction: For each head and layer, they compute the pairwise scalar product (dot product) between token vectors, yielding a context similarity matrix that quantifies how closely two tokens are represented in that head’s space (a minimal code sketch follows this list).
- Statistical probing:
- Distribution analysis: Histograms of similarity scores are plotted per layer to track the transition from long‑range to short‑range focus.
- Token‑frequency profiling: The most frequent tokens among the top‑similar pairs are identified per head, revealing each head’s “anchor” token.
- Qualitative case studies: Specific sentences are examined to illustrate how heads capture repetitions, common context tokens, and sentence delimiters.
- Visualization: Heatmaps of attention maps and similarity matrices are used to illustrate the spatial patterns that emerge across layers and heads.
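To make the matrix construction concrete, here is a minimal sketch (not the authors’ released code), assuming the Hugging Face transformers library: it collects per‑head self‑attention output vectors from BERT‑Base with forward hooks and builds a pairwise scalar‑product similarity matrix for one head. The layer/head indices and the example sentence are illustrative choices.

```python
# Minimal sketch: collect per-head self-attention output vectors from
# BERT-Base via forward hooks, then compute a scalar-product similarity
# matrix for one head. Layer/head indices and the sentence are examples.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

num_heads = model.config.num_attention_heads        # 12 for BERT-Base
head_dim = model.config.hidden_size // num_heads    # 64 for BERT-Base

# layer index -> tensor of shape (seq_len, num_heads, head_dim)
head_outputs = {}

def make_hook(layer_idx):
    def hook(module, inputs, outputs):
        # outputs[0]: (batch, seq_len, hidden_size); the heads are
        # concatenated along the hidden dimension, so split them back out.
        ctx = outputs[0].detach()[0]
        head_outputs[layer_idx] = ctx.reshape(ctx.size(0), num_heads, head_dim)
    return hook

for i, layer in enumerate(model.encoder.layer):
    layer.attention.self.register_forward_hook(make_hook(i))

text = "The bank approved the loan. The bank closed early."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**enc)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
layer_idx, head_idx = 11, 3                          # arbitrary example head
vecs = head_outputs[layer_idx][:, head_idx, :]       # (seq_len, head_dim)
similarity = vecs @ vecs.T                           # (seq_len, seq_len)
print(similarity.shape, tokens)
```

High off‑diagonal entries point at token pairs this head represents similarly, e.g. the two occurrences of “bank” for a repetition‑oriented head.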
Results & Findings
- Sentence separator focus: In the top (final) layers, attention heads assign high similarity scores to [SEP] tokens, effectively marking sentence boundaries.
- Head specialization:
- Some heads highlight repeated words (e.g., “the … the”), acting like a duplication detector.
- Others cluster tokens that appear frequently together in a local context (e.g., “bank” with “account”).
- Layer dynamics: Early layers exhibit broad, long‑range similarity peaks, suggesting a global view of the input. As depth increases, similarity becomes sharply peaked within the same sentence, indicating a shift toward fine‑grained, local processing.
- Unique anchor tokens: Each head tends to have a distinct most‑common token among its high‑similarity pairs, forming a token‑centric “neighborhood” that remains stable across inputs.
- Quantitative shift: The average distance between highly similar token pairs drops by ~30 % from layer 1 to layer 12, confirming the move toward tighter, sentence‑level cohesion.
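As a rough illustration of how such statistics can be gathered, the sketch below reuses the head_outputs and tokens collected in the earlier Methodology snippet (so it is not standalone): for one head it takes the top‑k most similar token pairs, then reports their mean positional distance and the most frequent token among them (a candidate “anchor”). The value of k and the head index are arbitrary assumptions, not the paper’s exact protocol.

```python
# Rough sketch, reusing `head_outputs` and `tokens` from the previous
# snippet: for one head, take the top-k most similar token pairs, report
# their mean positional distance and the most frequent token among them.
from collections import Counter
import torch

def head_statistics(vecs, tokens, k=20):
    sim = vecs @ vecs.T                        # pairwise scalar products
    sim.fill_diagonal_(float("-inf"))          # drop token-with-itself pairs
    seq_len = sim.size(0)
    top = torch.topk(sim.flatten(), min(k, sim.numel())).indices
    rows, cols = top // seq_len, top % seq_len
    mean_distance = (rows - cols).abs().float().mean().item()
    counts = Counter(tokens[i] for i in rows.tolist() + cols.tolist())
    anchor = counts.most_common(1)[0][0]
    return mean_distance, anchor

# Compare an early and a late layer for the same (arbitrarily chosen) head.
for layer_idx in (0, 11):
    vecs = head_outputs[layer_idx][:, 3, :]
    dist, anchor = head_statistics(vecs, tokens)
    print(f"layer {layer_idx}: mean pair distance {dist:.1f}, anchor {anchor!r}")
```

Under these assumptions, the layer‑11 distance should come out markedly smaller than the layer‑0 one, in line with the roughly 30 % drop reported above.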
Practical Implications
- Lightweight sentence segmentation: Since final‑layer heads naturally attend to [SEP] tokens, developers can extract these attention scores to split long documents without training a separate model (see the sketch after this list).
- Debugging & interpretability tools: The similarity matrix offers a new diagnostic view: developers can pinpoint which head is responsible for a particular linguistic pattern (e.g., detecting repeated entities) and use that insight to fine‑tune or prune models.
- Head‑pruning strategies: Knowing that certain heads specialize in redundant or niche patterns enables smarter pruning (e.g., drop heads focused on rare repetitions to reduce compute without harming core performance).
- Feature engineering for downstream tasks: The token‑anchor neighborhoods can be harvested as additional features for tasks like coreference resolution, keyword extraction, or domain‑specific entity linking.
- Curriculum design for fine‑tuning: When adapting BERT to a new domain, practitioners might freeze early layers (which capture long‑range structure) and only fine‑tune later layers that handle sentence‑level nuances, aligning with the observed similarity shift.
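Below is a hedged sketch of the attention‑driven segmentation idea: it scores each token by the total final‑layer attention it receives (averaged over heads) and treats the top‑scoring tokens as boundary candidates. The example text and the top‑k cutoff are illustrative choices, not the authors’ benchmarked procedure.

```python
# Hedged sketch of attention-driven boundary detection with Hugging Face
# transformers: rank tokens by the total final-layer attention they receive
# and treat the top-scoring ones as candidate segment boundaries.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "It was raining. The match was cancelled. Fans went home early."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.attentions: tuple of 12 tensors, each (batch, num_heads, seq_len, seq_len).
final_attn = out.attentions[-1][0].mean(dim=0)    # average over heads
received = final_attn.sum(dim=0)                  # attention each token receives
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])

# Tokens drawing the most attention are candidate segment boundaries.
top = torch.topk(received, k=5).indices.sort().values.tolist()
print([(i, tokens[i], round(received[i].item(), 2)) for i in top])
```

If the paper’s observation holds, [SEP] and sentence‑final punctuation should dominate this ranking, which is exactly what an attention‑driven splitter would key on.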
Limitations & Future Work
- Model scope: The study is limited to BERT‑Base; it remains unclear whether the same head‑level specializations hold for larger models (e.g., BERT‑Large, RoBERTa) or encoder‑decoder architectures like T5.
- Language diversity: Experiments were conducted on English corpora only; cross‑lingual behavior of attention vectors may differ.
- Static analysis: The similarity matrices are computed on frozen pretrained weights; investigating how these patterns evolve during fine‑tuning would deepen the practical relevance.
- Application testing: While the paper proposes segmentation and debugging uses, systematic benchmarks (e.g., segmentation accuracy vs. dedicated models) are left for future work.
Bottom line: By turning attention vectors into a similarity landscape, the authors provide a tangible, quantitative lens on how transformers “pay attention.” This opens up new avenues for model interpretability, efficient engineering, and task‑specific exploitation of attention dynamics.
Authors
- Tal Halevi
- Yarden Tzach
- Ronit D. Gross
- Shalom Rosner
- Ido Kanter
Paper Information
- arXiv ID: 2512.21956v1
- Categories: cs.CL
- Published: December 26, 2025