Understanding Transformers Part 12: Building the Decoder Layers

Published: April 23, 2026 at 03:23 PM EDT
2 min read
Source: Dev.to


Adding Positional Encoding in the Decoder

For the decoder we add positional encoding using the same sine and cosine curves that were used for the encoder inputs.

Positional encoding curves

These are the same curves that were used earlier when encoding the input.

Applying Positional Values

The <EOS> token occupies the first position and has two embedding dimensions. We take the corresponding positional values from the curves:

  • For the first embedding dimension, the positional value from the sine curve is sin(0) = 0.
  • For the second embedding dimension, the positional value from the cosine curve is cos(0) = 1.

Adding these positional values to the original embeddings yields:

EOS after positional encoding

The result is 2.70 and ‑0.34, representing the <EOS> token after positional encoding.
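The steps above can be sketched in a few lines of NumPy. The positional-encoding function below uses the standard sine/cosine formulation; the raw embedding values (2.70 and ‑1.34) are an assumption, chosen so that adding the positional values 0 and 1 reproduces the encoded values quoted above.

```python
import numpy as np

def positional_encoding(position: int, d_model: int) -> np.ndarray:
    """Sine/cosine positional encoding for a single token position."""
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe[i] = np.sin(angle)          # even dimensions use the sine curve
        if i + 1 < d_model:
            pe[i + 1] = np.cos(angle)  # odd dimensions use the cosine curve
    return pe

# The <EOS> token sits at position 0 and has 2 embedding dimensions,
# so its positional values are sin(0) = 0 and cos(0) = 1.
pe = positional_encoding(position=0, d_model=2)

# Hypothetical raw embedding for <EOS> (not given in the article);
# adding the positional values yields the encoded values 2.70 and -0.34.
embedding = np.array([2.70, -1.34])
encoded = embedding + pe
print(encoded)  # ~ [2.70, -0.34]
```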

Adding Self‑Attention

Next we insert a self‑attention layer so the decoder can capture relationships between output tokens.

Self‑attention in decoder

For the <EOS> token the self‑attention outputs are ‑2.8 and ‑2.3.
Note that the weights used for queries, keys, and values in the decoder’s self‑attention are distinct from those in the encoder.
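As a minimal sketch of this step, the snippet below computes scaled dot‑product self‑attention for the single <EOS> token. The query, key, and value weight matrices are randomly initialized stand‑ins for the decoder's own (trained) weights, so the numbers will not match the ‑2.8 and ‑2.3 above; the point is the shape of the computation.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 2  # embedding dimension, as in the article's example
rng = np.random.default_rng(0)
# The decoder keeps its own Q/K/V weights, distinct from the encoder's.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

x = np.array([[2.70, -0.34]])  # <EOS> after positional encoding (1 token)

q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)   # scaled dot-product attention
weights = softmax(scores)       # with a single token this is just [[1.0]]
out = weights @ v
print(out.shape)  # (1, 2): one 2-dim self-attention output per token
```

With more output tokens, the decoder would also apply a causal mask here so each position attends only to itself and earlier positions; that detail is omitted since only one token is present.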

Adding Residual Connections

As with the encoder, we now add residual connections around the self‑attention sub‑layer.

Residual connections in decoder
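The residual connection itself is just an element-wise addition: the sub‑layer's input is added back to its output, so the positional‑encoded values bypass self‑attention directly. Using the article's numbers for <EOS>:

```python
import numpy as np

x = np.array([2.70, -0.34])        # <EOS> after positional encoding
attn_out = np.array([-2.8, -2.3])  # self-attention outputs from above

# Residual connection: the input skips around the sub-layer and is
# added to its output (no layer normalization in this simplified walkthrough).
residual = x + attn_out
print(residual)  # ~ [-0.10, -2.64]
```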

What’s Next?

So far we have seen how self‑attention enables the transformer to understand relationships within the output sentence.
For tasks such as translation, the model must also capture relationships between the input sentence and the output sentence. We will explore this cross‑attention mechanism in the next article.
