[Paper] Self-attention vector output similarities reveal how machines pay attention

Published: December 26, 2025 at 05:03 AM EST
4 min read
Source: arXiv - 2512.21956v1

Overview

This paper digs into the “black box” of self‑attention in transformer models—specifically the 12‑layer BERT‑Base—to uncover how attention heads actually process language. By turning the raw attention vectors into a similarity matrix, the authors show that different heads specialize in distinct linguistic cues (e.g., token repetitions, sentence boundaries) and that this specialization evolves layer by layer. The findings give developers a concrete way to interpret and even exploit attention patterns for downstream tasks such as text segmentation or token‑level diagnostics.

Key Contributions

  • Vector‑based similarity analysis: Introduces a scalar‑product similarity matrix computed from the output vectors of each self‑attention head, enabling quantitative comparison of token representations.
  • Head‑level linguistic specialization: Demonstrates that individual heads consistently focus on different linguistic phenomena (sentence separators, repeated tokens, context‑common tokens).
  • Layer‑wise evolution of similarity: Shows a clear shift from long‑range similarities in early layers to short‑range, intra‑sentence similarities in deeper layers.
  • Token‑centric clustering: Finds that each head tends to build high‑similarity pairs around a unique “anchor” token, effectively creating token‑specific neighborhoods in vector space.
  • Practical insight for segmentation: Observes that final‑layer attention maps concentrate on sentence separator tokens, suggesting a lightweight, attention‑driven method for text segmentation.

Methodology

  1. Model selection: The authors work with the pretrained BERT‑Base (12‑layer) model, extracting the output vectors of every self‑attention head for a large corpus of English sentences.
  2. Similarity matrix construction: For each head and layer, they compute the pairwise scalar product (dot product) between token vectors, yielding a context similarity matrix that quantifies how closely two tokens are represented in that head’s space.
  3. Statistical probing:
    • Distribution analysis: Histograms of similarity scores are plotted per layer to track the transition from long‑range to short‑range focus.
    • Token‑frequency profiling: The most frequent tokens among the top‑similar pairs are identified per head, revealing each head’s “anchor” token.
    • Qualitative case studies: Specific sentences are examined to illustrate how heads capture repetitions, common context tokens, and sentence delimiters.
  4. Visualization: Heatmaps of attention maps and similarity matrices are used to illustrate the spatial patterns that emerge across layers and heads.
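The similarity‑matrix step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the per‑head output vectors have already been extracted (here they are simulated with random data), and simply computes all pairwise dot products, as described in step 2.

```python
import numpy as np

def head_similarity_matrix(head_outputs: np.ndarray) -> np.ndarray:
    """Pairwise scalar products between one head's token output vectors.

    head_outputs: (num_tokens, head_dim) array of a single attention
    head's output vectors for one input sequence.
    Returns a (num_tokens, num_tokens) context similarity matrix where
    entry [i, j] is the dot product of token i's and token j's vectors.
    """
    return head_outputs @ head_outputs.T

# Simulated head outputs for a 10-token sequence in a 64-dim head space.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 64))
sim = head_similarity_matrix(vectors)
assert sim.shape == (10, 10)
assert np.allclose(sim, sim.T)  # dot-product similarity is symmetric
```

In practice the `head_outputs` array would come from a forward pass of pretrained BERT‑Base with per‑head outputs exposed; the matrix computation itself is the same.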

Results & Findings

  • Sentence separator focus: In the top (final) layers, attention heads assign high similarity scores to [SEP] tokens, effectively marking sentence boundaries.
  • Head specialization:
    • Some heads highlight repeated words (e.g., “the … the”), acting like a duplication detector.
    • Others cluster tokens that appear frequently together in a local context (e.g., “bank” with “account”).
  • Layer dynamics: Early layers exhibit broad, long‑range similarity peaks, suggesting a global view of the input. As depth increases, similarity becomes sharply peaked within the same sentence, indicating a shift toward fine‑grained, local processing.
  • Unique anchor tokens: Each head tends to have a distinct most‑common token among its high‑similarity pairs, forming a token‑centric “neighborhood” that remains stable across inputs.
  • Quantitative shift: The average distance between highly similar token pairs drops by ~30 % from layer 1 to layer 12, confirming the move toward tighter, sentence‑level cohesion.
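The paper reports this shift numerically; the exact metric is theirs, but a plausible sketch of how to measure it (assuming "distance" means positional distance between the top‑scoring off‑diagonal pairs of a similarity matrix) might look like:

```python
import numpy as np

def mean_pair_distance(sim: np.ndarray, top_k: int = 5) -> float:
    """Average positional distance between the top-k most similar
    off-diagonal token pairs of a (num_tokens, num_tokens) matrix."""
    n = sim.shape[0]
    masked = sim.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)  # ignore self-similarity
    # Flat indices of the top-k similarity scores.
    flat = np.argsort(masked, axis=None)[-top_k:]
    rows, cols = np.unravel_index(flat, (n, n))
    return float(np.mean(np.abs(rows - cols)))
```

Comparing this statistic for layer‑1 versus layer‑12 similarity matrices would quantify the long‑range‑to‑local transition the authors describe.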

Practical Implications

  • Lightweight sentence segmentation: Since final‑layer heads naturally attend to [SEP] tokens, developers can extract these attention scores to split long documents without training a separate model.
  • Debugging & interpretability tools: The similarity matrix offers a new diagnostic view—developers can pinpoint which head is responsible for a particular linguistic pattern (e.g., detecting repeated entities) and use that insight to fine‑tune or prune models.
  • Head‑pruning strategies: Knowing that certain heads specialize in redundant or niche patterns enables smarter pruning (e.g., drop heads focused on rare repetitions to reduce compute without harming core performance).
  • Feature engineering for downstream tasks: The token‑anchor neighborhoods can be harvested as additional features for tasks like coreference resolution, keyword extraction, or domain‑specific entity linking.
  • Curriculum design for fine‑tuning: When adapting BERT to a new domain, practitioners might freeze early layers (which capture long‑range structure) and only fine‑tune later layers that handle sentence‑level nuances, aligning with the observed similarity shift.
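The segmentation idea in the first bullet can be prototyped with a simple threshold rule. This is a hypothetical sketch, not the paper's benchmarked method: it assumes a final‑layer attention map is available and that tokens receiving unusually high average attention (such as [SEP]) mark boundaries; the `threshold` value is an arbitrary illustration.

```python
import numpy as np

def attention_boundaries(attn: np.ndarray, tokens: list,
                         threshold: float = 0.5) -> list:
    """Return indices of tokens that look like segment boundaries.

    attn: (num_tokens, num_tokens) final-layer attention map, where
    attn[i, j] is how strongly token i attends to token j.
    A token counts as a boundary when the mean attention it receives
    from all tokens exceeds `threshold` (an assumed cutoff).
    """
    received = attn.mean(axis=0)  # mean attention each token receives
    return [i for i, score in enumerate(received) if score > threshold]
```

A dedicated segmenter would likely outperform this heuristic, but it requires no training at all, which is the practical appeal noted in the summary.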

Limitations & Future Work

  • Model scope: The study is limited to BERT‑Base; it remains unclear whether the same head‑level specializations hold for larger models (e.g., BERT‑Large, RoBERTa) or encoder‑decoder architectures like T5.
  • Language diversity: Experiments were conducted on English corpora only; cross‑lingual behavior of attention vectors may differ.
  • Static analysis: The similarity matrices are computed on frozen pretrained weights; investigating how these patterns evolve during fine‑tuning would deepen the practical relevance.
  • Application testing: While the paper proposes segmentation and debugging uses, systematic benchmarks (e.g., segmentation accuracy vs. dedicated models) are left for future work.

Bottom line: By turning attention vectors into a similarity landscape, the authors provide a tangible, quantitative lens on how transformers “pay attention.” This opens up new avenues for model interpretability, efficient engineering, and task‑specific exploitation of attention dynamics.

Authors

  • Tal Halevi
  • Yarden Tzach
  • Ronit D. Gross
  • Shalom Rosner
  • Ido Kanter

Paper Information

  • arXiv ID: 2512.21956v1
  • Categories: cs.CL
  • Published: December 26, 2025