BIG STEPS TO TRANSFORMER (PART 2): BUILDING THE TRANSFORMER

Published: December 6, 2025 at 05:02 AM EST
4 min read
Source: Dev.to

The Naive Approach

Let’s be specific: at each timestep we want to see every character behind us in order to make our prediction.
A simple way is to aggregate the information from the previous characters with a weighted sum; in the simplest case we just take the mean.

import torch

B, T, C = 4, 8, 2                       # batch, time (sequence length), channels
x = torch.randn(B, T, C)
xbow = torch.zeros((B, T, C))           # "bag of words": running mean over past tokens
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]              # (t+1, C) -- all tokens up to and including t
        xbow[b, t] = torch.mean(xprev, 0)

All of those operations can be expressed with matrix multiplication. Below is a compact version using a lower‑triangular mask to enforce causality (no token can attend to future tokens) and normalizing each row to compute the mean.

torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)   # row‑wise mean
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

The lower‑triangular matrix (torch.tril) ensures that each token only looks at the past. Dividing each row by its sum gives the mean of the aggregated values.

Applying the same idea to our real input x:

wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)   # normalize rows
xbow2 = wei @ x                         # (T, T) @ (B, T, C) -> (B, T, C) via broadcasting
torch.allclose(xbow, xbow2)             # should be True

In practice we replace the explicit mean with a softmax over the masked scores:

import torch.nn.functional as F

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))   # mask future positions
wei = F.softmax(wei, dim=-1)                      # turn scores into probabilities
xbow3 = wei @ x
torch.allclose(xbow, xbow3)                       # verifies equivalence

We use -inf for masked entries so that softmax assigns them zero probability (since exp(-inf) = 0).
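
As a quick sanity check (an illustrative snippet, not from the original post), a row containing -inf entries gets exactly zero weight after softmax:

import torch
import torch.nn.functional as F

row = torch.tensor([0.0, 0.0, float('-inf')])
print(F.softmax(row, dim=-1))   # tensor([0.5000, 0.5000, 0.0000])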

Positional Embedding

Self‑attention alone is permutation‑invariant, so we need to inject information about token positions. A common approach is to add a positional embedding to the token embeddings:

# Example (pseudo-code)
pos_emb = posemb_matrix(torch.arange(T))   # (T, C) positional vectors
x = token_emb + pos_emb                    # (B, T, C); pos_emb broadcasts over the batch
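
For concreteness, here is a minimal runnable sketch using learned embedding tables; the sizes (vocab_size, block_size, n_embd) and table names are illustrative assumptions, not the article's final hyperparameters:

import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 65, 8, 32                      # illustrative sizes (assumed)

token_embedding_table    = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size, (4, block_size))             # (B, T) token ids
tok_emb = token_embedding_table(idx)                            # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(block_size))    # (T, n_embd)
x = tok_emb + pos_emb                                           # (B, T, n_embd); broadcast over batch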

The Crux of Self‑Attention

The naive approach treats all previous tokens as equal contributors. In reality, some tokens are more relevant than others, so we replace the uniform average with a data-dependent weighted sum.

Each token is projected into three vectors:

  • Query (Q) – what the token is looking for.
  • Key (K) – how each token can be matched against queries.
  • Value (V) – the information that will be aggregated.

The attention weight between a query and a key is computed by a dot product (similarity), followed by a softmax to obtain a probability distribution. The weighted sum of the value vectors yields the output for that token.

A helpful visual explanation can be found in 3Blue1Brown’s video: Attention in transformers.

Implementation

import torch.nn as nn

head_size = 16

# Linear projections (no bias)
key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# Project input x (shape: B, T, C)
k = key(x)    # (B, T, head_size)
q = query(x)  # (B, T, head_size)

# Compute raw attention scores
wei = q @ k.transpose(-2, -1)               # (B, T, T)

# Causal mask: prevent attending to future tokens
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))

# Softmax over the last dimension to get attention probabilities
wei = F.softmax(wei, dim=-1)

# Project to values and aggregate
v = value(x)                               # (B, T, head_size)
out = wei @ v                               # (B, T, head_size)

Key points

  • k.transpose(-2, -1) aligns the key vectors for dot‑product computation with queries.
  • The causal mask (tril) ensures each position only attends to earlier positions.
  • The softmax turns raw scores into a proper weighting distribution.
  • The final output out is a weighted sum of the value vectors.

Notes

  • The naive equal‑weight averaging is a special case of attention where all attention scores are identical.
  • In practice, multiple attention heads are used, each with its own Q, K, V projections, and their outputs are concatenated (a minimal sketch follows after these notes).
  • Layer normalization, residual connections, and feed‑forward networks are added around the attention block to form a full transformer layer (see Andrej Karpathy’s “nanoGPT” implementation for a minimal working example).
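
As a rough, minimal sketch of that multi-head idea (my own illustration in nanoGPT style, not the article's final code; the class names, sizes, and the 1/sqrt(head_size) scaling of the scores are assumptions here):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (same logic as the snippet above)."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                    # (B, T, head_size)
        q = self.query(x)                                  # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # scaled scores, (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)                                  # (B, T, head_size)
        return wei @ v                                     # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads in parallel; outputs are concatenated and projected back."""
    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        head_size = n_embd // num_heads
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, n_embd)
        return self.proj(out)

# Quick shape check with illustrative sizes
x = torch.randn(4, 8, 32)                                  # (B, T, n_embd)
mha = MultiHeadAttention(n_embd=32, num_heads=4, block_size=8)
print(mha(x).shape)                                        # torch.Size([4, 8, 32])

In the full transformer layer, this module is then wrapped with residual connections, layer normalization, and a feed-forward network, as the note above describes.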