Understanding Transformers Part 12: Building the Decoder Layers
Source: Dev.to
Adding Positional Encoding in the Decoder
For the decoder, we add positional encoding using the same sine and cosine curves that were used for the encoder inputs.

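The article does not restate the formula for these curves, but assuming the standard sinusoidal encoding (even dimensions use sine, odd dimensions use cosine), they can be sketched as:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional values for one position: even embedding
    dimensions come from a sine curve, odd ones from a cosine curve."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

# Position 0 with two embedding dimensions: sin(0) = 0.0, cos(0) = 1.0
print(positional_encoding(0, 2))  # [0.0, 1.0]
```

These are the values 0 and 1 used in the worked example below.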
Applying Positional Values
The first output token occupies the first position and has two embedding dimensions. We take the corresponding positional values from the curves:
- For the first embedding dimension, the sine curve at position 0 gives a positional value of 0.
- For the second embedding dimension, the cosine curve at position 0 gives a positional value of 1.
Adding these positional values to the original embeddings yields:

The result is 2.70 and ‑0.34, representing the first output token after positional encoding.
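The addition itself is element-wise. A minimal sketch with the article's numbers (the pre-encoding embedding values 2.70 and ‑1.34 are back-computed from the stated result, so treat them as an assumption):

```python
# Assumed original embedding for the first output token, back-computed
# from the article's result of 2.70 and -0.34 after encoding.
embedding = [2.70, -1.34]
positional = [0.0, 1.0]   # sine and cosine values at position 0

# Element-wise addition of positional values to the embedding,
# rounded to two decimals to match the article's presentation.
encoded = [round(e + p, 2) for e, p in zip(embedding, positional)]
print(encoded)  # [2.7, -0.34]
```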
Adding Self‑Attention
Next we insert a self‑attention layer so the decoder can capture relationships between output tokens.

For the first output token, the self‑attention outputs are ‑2.8 and ‑2.3.
Note that the weights used for queries, keys, and values in the decoder’s self‑attention are distinct from those in the encoder.
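The mechanism itself is the same scaled dot-product self-attention as in the encoder; only the learned weight matrices differ. A minimal single-head sketch (the weight values here are illustrative identity matrices, not the model's learned weights, so the numeric output is not the article's ‑2.8 and ‑2.3):

```python
import math

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a list of
    token vectors x, with decoder-specific weight matrices wq, wk, wv."""
    def matmul(a, b):
        return [[sum(row[k] * b[k][j] for k in range(len(b)))
                 for j in range(len(b[0]))] for row in a]

    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    out = []
    for qi in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k)
                  for kj in k]
        # Softmax over the scores (shifted by the max for stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

# With a single token and identity weights, attention over one position
# simply returns that token's value vector.
x = [[2.70, -0.34]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(self_attention(x, identity, identity, identity))  # [[2.7, -0.34]]
```

In the real decoder, wq, wk, and wv are learned, and as noted above they are distinct from the encoder's weights.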
Adding Residual Connections
As with the encoder, we now add residual connections around the self‑attention sub‑layer.
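A residual connection just adds the sub-layer's input back to its output, element-wise. Using the article's running numbers (the positionally encoded token 2.70 and ‑0.34, and its self-attention outputs ‑2.8 and ‑2.3):

```python
x = [2.70, -0.34]          # input to the self-attention sub-layer
attn_out = [-2.8, -2.3]    # self-attention output for the same token

# Residual connection: element-wise sum of input and sub-layer output,
# rounded to two decimals for readability.
residual = [round(a + b, 2) for a, b in zip(x, attn_out)]
print(residual)  # [-0.1, -2.64]
```

This lets the positional and embedding information flow past the attention sub-layer unchanged, which also makes training deeper stacks easier.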

What’s Next?
So far we have seen how self‑attention enables the transformer to understand relationships within the output sentence.
For tasks such as translation, the model must also capture relationships between the input sentence and the output sentence. We will explore this cross‑attention mechanism in the next article.