Understanding Multi-Head Attention in Transformers
Source: Dev.to
Multi‑Head Attention Overview
Self‑attention lets a transformer capture relationships between words using Query, Key, and Value vectors. However, a single attention head produces just one set of attention weights per token, so it tends to capture only one type of relationship at a time, while natural language carries several layers of structure, meaning, and long‑range dependency at once.
Multi‑head attention addresses this by applying the attention mechanism multiple times in parallel. Each parallel run is called a head, and each head has its own learned weights for Query, Key, and Value. Consequently, every head examines the same sentence but from its own perspective.
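In the notation commonly used for transformers, this idea can be written compactly. The symbols here are assumptions added for illustration and do not come from this article: X is the matrix of input embeddings, W_i^Q, W_i^K and W_i^V are the learned projections of head i, d_k is the per‑head key dimension, and W^O is the final output projection.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_i = \mathrm{Attention}\!\left(X W_i^{Q},\; X W_i^{K},\; X W_i^{V}\right), \qquad i = 1, \dots, h

\mathrm{MultiHead}(X) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right) W^{O}
```

Every head receives the same X; only the projection matrices differ, which is what gives each head its own perspective.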
How It Works
- Prepare input embeddings – the token embeddings (plus positional encodings) are generated as usual.
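- Split into heads – learned linear projections map each embedding to Query, Key, and Value vectors in h separate sub‑spaces, one per head; in the common setup each sub‑space has d_model / h dimensions, so the total cost stays close to that of one full‑width head.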
- Self‑attention per head – each head independently computes attention scores and weighted sums.
- Head outputs – every head produces its own output representation.
- Concatenate – the outputs of all heads are concatenated along the feature dimension.
- Final linear layer – a final projection mixes the concatenated vectors into a single output for the next transformer block.
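The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `multi_head_attention`, the toy sizes, and the random matrices standing in for learned weights are all hypothetical, and batching, masking, and bias terms are left out. It also assumes the common layout in which one d_model × d_model projection is sliced into h heads of size d_model / h, which is equivalent to giving each head its own smaller projection.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Split into heads: project, then slice the feature dimension per head.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Self-attention per head: scaled dot-product scores and softmax weights.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, seq, seq)
    weights = softmax(scores, axis=-1)

    # Head outputs: each head's weighted sum of its own Value vectors.
    heads = weights @ v  # (h, seq, d_head)

    # Concatenate: merge the heads back along the feature dimension.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

    # Final linear layer: mix information across heads.
    return concat @ w_o, weights


# Toy input: random values stand in for embeddings and learned weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

The output keeps the input shape, so the block can be stacked, while `weights` holds one seq_len × seq_len attention map per head.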
What Different Heads Capture
No head is assigned a role in advance; any specialization emerges during training. Patterns commonly attributed to individual heads include:
- Word order and grammar – syntactic patterns and positional relationships.
- Nearby word relationships – local dependencies such as collocations.
- Long‑distance links – connections between words far apart in the sequence.
- Semantic/meaning‑based connections – contextual similarity and topic coherence.
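One rough way to tell a "nearby word" head from a "long‑distance" head is to measure how far, on average, each head looks. The sketch below is only an illustration: `mean_attention_distance` is a hypothetical helper, and the two attention maps are hand‑built toy examples rather than weights from a trained model.

```python
import numpy as np


def mean_attention_distance(weights):
    """weights: (num_heads, seq_len, seq_len), each row summing to 1.

    Returns one value per head: the average |query position - key position|,
    weighted by how much attention each query places on each key.
    """
    seq_len = weights.shape[-1]
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])  # (seq_len, seq_len)
    return (weights * distance).sum(axis=-1).mean(axis=-1)


seq_len = 6
pos = np.arange(seq_len)

# Toy "local" head: every token attends to the token just before it
# (the first token attends to itself).
local_head = np.zeros((seq_len, seq_len))
local_head[pos, np.maximum(pos - 1, 0)] = 1.0

# Toy "long-range" head: every token attends to the first token.
long_range_head = np.zeros((seq_len, seq_len))
long_range_head[:, 0] = 1.0

distances = mean_attention_distance(np.stack([local_head, long_range_head]))
print(distances)  # the second value is larger than the first
```

Applied to attention maps from a real model, a low value suggests a head that mostly tracks local dependencies, while a high value suggests one that links distant tokens. The measure says nothing about whether a head is syntactic or semantic.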
Analogy
Think of a single head as reading a sentence with one specific focus. Multi‑head attention is like several readers going over the same sentence at the same time, each noticing a different aspect, and then pooling those observations into a richer overall understanding.
This parallel processing enables the model to grasp language from multiple angles simultaneously, without forcing a single attention mechanism to handle every type of relationship.