Attention Is All You Need — Full Paper Breakdown

Published: March 7, 2026 at 05:57 PM EST
5 min read
Source: Dev.to

The 2017 paper “Attention Is All You Need”

Vaswani et al. introduced the Transformer – the architecture behind GPT, Claude, Gemini, and every major LLM today. It replaced recurrent models entirely with attention mechanisms, and the field has never looked back. This post walks through the key ideas.

The problem with RNNs

Before Transformers, sequence modeling meant RNNs and LSTMs. These process tokens one at a time, left to right, which creates two major issues:

  1. No parallelization – each step depends on the previous hidden state, so tokens cannot be processed simultaneously during training.
  2. Long‑range dependencies decay – by the time an RNN reaches token 500, the signal from token 1 has been compressed through hundreds of hidden states.

Attention mechanisms existed earlier (e.g., Bahdanau attention, 2014), but they were bolted onto RNNs. The Transformer's radical idea: what if attention is all you need? Drop recurrence entirely.

Encoder‑decoder structure

The Transformer follows the classic encoder‑decoder architecture used in machine translation:

| Component | Role | # of identical layers |
|---|---|---|
| Encoder (left side) | Takes the input sequence and produces a rich representation | 6 |
| Decoder (right side) | Takes the encoder’s output + previously generated tokens to produce the next token | 6 |

Each layer in both stacks contains the same building blocks:

  • Multi‑head attention
  • Feed‑forward network (FFN)
  • Residual connection
  • Layer normalization

Self‑attention

Self‑attention lets every token look at every other token and decide how much to “pay attention” to it.

For each token the model computes three vectors:

| Vector | Intuition |
|---|---|
| Query (Q) | “What am I looking for?” |
| Key (K) | “What do I contain?” |
| Value (V) | “What information do I provide?” |

These are obtained by multiplying the input embeddings by learned weight matrices (W_Q, W_K, W_V):

\[ Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V \]

The attention score between two tokens is the dot product of the query of one with the key of the other.
The scaled‑dot‑product attention formula is

\[ \text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \]

The scaling factor \(\sqrt{d_k}\) prevents the dot products from growing too large as dimensionality increases; without it the softmax would become overly peaked, saturating and shrinking gradients to near zero.
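The projection and attention formulas above can be sketched in a few lines of NumPy. This is a minimal single-sequence illustration, not the paper's implementation; the function and variable names are mine:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                    # 5 tokens, toy embedding dim 16
W_Q, W_K, W_V = [rng.normal(size=(16, 16)) for _ in range(3)]
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (5, 16): one refined 16-dim vector per token
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors.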

Multi‑head attention

Instead of computing attention once over the full dimensionality, the model splits \(Q\), \(K\), and \(V\) into multiple heads (8 in the original paper). Each head works in a sub‑space of size

\[ \frac{d_{\text{model}}}{h}= \frac{512}{8}=64 \]

Why multiple heads? Different heads can learn different types of relationships:

  • Head 1 – syntactic structure (e.g., subject‑verb agreement)
  • Head 2 – positional proximity
  • Head 3 – semantic similarity

The outputs of all heads are concatenated and projected back to the full dimension.
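The split/attend/concatenate/project pipeline can be sketched as follows. This is a simplified single-sequence NumPy version with hypothetical helper names, assuming the paper's \(d_{\text{model}}=512\) and \(h=8\):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    """Split d_model into h heads, attend in each sub-space, concat, project."""
    seq_len, d_model = X.shape
    d_k = d_model // h                                  # 512 / 8 = 64
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # (seq_len, d_model) -> (h, seq_len, d_k): each head works independently
    split = lambda M: M.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, seq_len, seq_len)
    heads = softmax(scores) @ Vh                        # (h, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O                                 # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))
weights = [rng.normal(size=(512, 512)) * 0.02 for _ in range(4)]
out = multi_head_attention(X, *weights)
print(out.shape)  # (10, 512)
```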

The paper uses multi‑head attention in three distinct ways:

  1. Encoder self‑attention – every input token attends to every other input token.
  2. Masked decoder self‑attention – each output token attends only to previous output tokens (the mask prevents looking ahead, preserving autoregressive generation).
  3. Cross‑attention – decoder tokens attend to encoder outputs, linking the input representation to the output generation.
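The mask in case 2 is implemented by setting the score for every "future" position to \(-\infty\) before the softmax, which zeroes its attention weight. A small NumPy demonstration (zero logits stand in for real \(QK^{\top}\) scores, so each allowed position gets uniform weight):

```python
import numpy as np

seq_len = 5
# Causal mask: True above the diagonal = future positions to hide.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.zeros((seq_len, seq_len))   # placeholder logits for illustration
scores[mask] = -np.inf                  # exp(-inf) = 0 after the softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # row i is uniform over positions 0..i, zero after
```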

Positional encodings

Self‑attention alone has no notion of order; it treats a sequence as a set. To inject order information, the Transformer adds positional encodings to the input embeddings (they are added, not concatenated).

The sinusoidal encodings are defined as

\[ \text{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

\[ \text{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

These functions let the model attend by relative position: for any fixed offset \(k\), \(\text{PE}_{pos+k}\) can be expressed as a linear function of \(\text{PE}_{pos}\). The authors also hoped the sinusoids would let the model generalize to sequence lengths longer than those seen during training.
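The two formulas can be vectorized directly; a minimal NumPy sketch (the helper name is mine, not from the paper):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # dimensions 2i
    pe[:, 1::2] = np.cos(angle)                    # dimensions 2i+1
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512)
print(pe[0, :4])  # position 0: sin(0)=0 and cos(0)=1 alternate -> [0. 1. 0. 1.]
```

The result is simply added to the (50, 512) embedding matrix before the first layer.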

Feed‑forward network (FFN)

Each attention sub‑layer is followed by a position‑wise FFN applied independently to each token:

\[ \text{FFN}(x)=\max(0,\ xW_1+b_1)W_2+b_2 \]

  • Two linear transformations with a ReLU in between.
  • Inner dimension expands to 2048 (4 × the model dimension of 512) and then projects back down.
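The expand-ReLU-contract structure is one line of NumPy. A minimal sketch with the paper's dimensions (512 → 2048 → 512); the weights here are random placeholders:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand with ReLU, then project back down."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # applied to each token independently

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 512))                     # 10 tokens
W1, b1 = rng.normal(size=(512, 2048)) * 0.02, np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)) * 0.02, np.zeros(512)
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (10, 512): same shape in and out, so layers can stack
```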

Residual connections & layer normalization

Every sub‑layer (attention or FFN) is wrapped as

\[ \text{LayerNorm}\bigl(x + \text{SubLayer}(x)\bigr) \]

The residual connection \(x + \text{SubLayer}(x)\) facilitates gradient flow through deep stacks, while layer normalization stabilizes activations.
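The wrapping can be sketched as below. This is simplified: the paper's LayerNorm also has learned gain and bias parameters, omitted here, and the helper names are mine:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    """Post-norm wrapping as in the paper: LayerNorm(x + SubLayer(x))."""
    return layer_norm(x + fn(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = sublayer(x, lambda v: v * 2)  # toy sub-layer: just scales the input
print(out.shape)  # (4, 8)
```

Because the sub-layer's output is added to `x` rather than replacing it, the identity path gives gradients a direct route through all 6 stacked layers.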

Training details

| Component | Setting |
|---|---|
| Optimizer | Adam with \(\beta_1=0.9,\ \beta_2=0.98\) |
| Learning‑rate schedule | Warm‑up for 4000 steps (linear increase) → decay proportional to \(\text{step}^{-0.5}\) |
| Regularization | Dropout 0.1 on attention weights and after each sub‑layer; label smoothing 0.1 |
| Training data | WMT English‑German (4.5 M sentence pairs) and English‑French (36 M pairs) |
| Hardware | 8 × NVIDIA P100 GPUs, ~3.5 days for the large model |
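The warm-up-then-decay schedule corresponds to the paper's formula \(\text{lrate} = d_{\text{model}}^{-0.5}\cdot\min(\text{step}^{-0.5},\ \text{step}\cdot\text{warmup}^{-1.5})\); a minimal sketch (the function name is mine):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Linear warm-up for `warmup` steps, then inverse-square-root decay."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rate rises linearly until step 4000, peaks, then decays as step^-0.5.
print(transformer_lr(2000) < transformer_lr(4000))   # True: still warming up
print(transformer_lr(16000) < transformer_lr(4000))  # True: decaying
```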

The Transformer achieved state‑of‑the‑art results on English‑to‑German and English‑to‑French translation, beating all previous models (including deep ensembles) while training significantly faster thanks to full parallelization.

Beyond translation

The architecture proved to be a foundation for many later models:

  • BERT – encoder‑only, bidirectional pre‑training.
  • GPT – decoder‑only, autoregressive language modeling.
  • Vision Transformers (ViT) – applying the same architecture to images.

…and countless other variants that dominate basically everything in modern NLP and multimodal AI.

The paper's core insight is elegant: you don't need recurrence or convolutions for sequence modeling.  
Attention alone — properly scaled, split into multiple heads, and stacked with residual connections — is sufficient.  

Because attention computes all pairwise relationships in parallel, it's dramatically faster to train.  

That's why, nine years later, every frontier model is still a **Transformer** at its core.