LLM Study Diary #1: Transformer
About Me
I have been working as a software engineer for almost 8 years, mostly on backend and infrastructure: distributed systems, near‑line processing, batch processing, and so on. I have some basic ML knowledge from school but no experience with complex ML use cases. This series records what I learn about LLMs as a general software engineer. Feel free to comment if anything seems wrong, and leave your questions.
Transformer
A good source to understand each component in the transformer is the Hugging Face blog post Mastering Tensor Dimensions in Transformers.
- Decoder‑only models (GPT family, LLaMA, Claude) are used for generation.
- Encoder‑decoder models (BART, the original “Attention Is All You Need” transformer) handle translation and summarization.
- Encoder‑only models (BERT) are used for classification and embeddings.
Here we focus on decoder‑only LLMs. To summarize the architecture: each transformer block has two main components, Masked Multi‑Head Attention (MMHA) and a Feed‑Forward Network (FFN).

Masked Multi‑Head Attention (MMHA)
The attention formula contains query (Q), key (K) and value (V):
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
- (Q) (Query) – what the current token is looking for.
- (K) (Key) – what each token offers / represents.
- (V) (Value) – the actual content to retrieve.
In the attention calculation, the softmax over (QK^{\top}/\sqrt{d_k}) turns the query–key similarity scores into attention weights; the output is the weighted sum of the value vectors (V).
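To make the formula concrete, here is a minimal NumPy sketch of single‑head masked attention. This is my own toy illustration, not code from the post: no batching, no learned projections, and the shapes are made up for the example.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d_k); a toy, batch-free sketch.
    """
    seq_len, d_k = Q.shape
    # Similarity scores between every query and every key.
    scores = Q @ K.T / np.sqrt(d_k)             # (seq_len, seq_len)
    # Causal mask: position t may only attend to positions <= t.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: weighted sum of the value vectors.
    return weights @ V                          # (seq_len, d_k)

# Toy usage: 5 tokens, head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(masked_attention(Q, K, V).shape)  # (5, 8)
```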
Training intuition: every token asks a question (produces a (Q)) about the other tokens (using their (K/V)), with a causal mask restricting attention to earlier positions.
Inference intuition: only the newest token produces a (Q); it attends to the stored (K/V) from previous tokens.
Q/K/V in Training vs Inference
- Training: the full sequence is available, so all token positions are processed in parallel. For each transformer layer, (Q), (K) and (V) are computed from the same hidden‑state sequence, and a causal mask prevents attention to future tokens.
- Inference consists of two phases:
  - Prefill phase – the whole prompt is processed. (Q), (K) and (V) are computed for the prompt tokens; the model caches only the (K/V) pairs for future use. The (Q) vectors are discarded after the forward pass.
  - Decode / generation phase – for each newly generated token, the model computes a fresh (Q) (and its own (K/V)). The new (Q) attends to the cached (K/V) from the prompt and previously generated tokens. The new token’s (K/V) are then appended to the cache.
KV Caching in Inference
The same author explains KV caching in KV Caching Explained: Optimizing Transformer Inference Efficiency.
Without caching, (K) and (V) for every past token would have to be recomputed at each generation step, which is wasteful because they never change. KV caching stores these matrices so that each new step computes only the current token’s (Q) (and its own (K/V)) and reuses the cached values, dramatically speeding up inference.
The two inference phases, prefill (parallel processing of the prompt) and decode (autoregressive generation, one token at a time), drive latency characteristics, batching strategies, and how the KV cache is populated.
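As a rough illustration (again my own sketch, not code from the post), here is what prefill plus one decode step looks like with a KV cache for a single head; all the matrices and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden / head dimension for this toy example

# Frozen (already trained) projection matrices for one attention head.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Prefill: compute K/V for the whole prompt once and cache them.
prompt_hidden = rng.normal(size=(5, d))   # hidden states of 5 prompt tokens
K_cache = prompt_hidden @ W_k             # (5, d)
V_cache = prompt_hidden @ W_v             # (5, d)

def decode_step(h_new, K_cache, V_cache):
    """One generation step: fresh Q for the new token, cached K/V reused."""
    q = h_new @ W_q                        # query for the new token only
    k_new, v_new = h_new @ W_k, h_new @ W_v
    # Append the new token's K/V to the cache (they never change later).
    K_cache = np.vstack([K_cache, k_new])
    V_cache = np.vstack([V_cache, v_new])
    # Attend over all cached positions; no mask is needed because the
    # cache contains only past tokens plus the current one.
    scores = K_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache, K_cache, V_cache

h_new = rng.normal(size=(d,))
out, K_cache, V_cache = decode_step(h_new, K_cache, V_cache)
print(out.shape, K_cache.shape)  # (8,) (6, 8)
```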
Feed‑Forward Network (FFN)
The FFN performs an expand → nonlinearity → contract transformation:
\text{FFN}(x) = \sigma\!\bigl(xW_{1} + b_{1}\bigr)W_{2} + b_{2}
- (W_{1}) – expansion weights (typically 4× the hidden size).
- (W_{2}) – contraction weights (back to the hidden size).
Conceptually:
- Expand – generate many candidate features.
- Activate – apply a non‑linear function (\sigma) (e.g., GELU) to select useful features.
- Contract – compress back into the residual stream.
The expansion dimension is a hyper‑parameter; the common rule of thumb is a factor of 4, as used in GPT‑3.
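Here is a minimal NumPy sketch of the FFN (my illustration; the tanh‑approximated GELU and the 4× expansion factor are just the common defaults, not anything specific from the post):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, as used in GPT-2.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand -> nonlinearity -> contract."""
    h = gelu(x @ W1 + b1)    # (seq_len, d_ff): many candidate features
    return h @ W2 + b2       # (seq_len, d_model): back to the residual width

d_model, d_ff = 8, 32        # expansion factor 4, the usual rule of thumb
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(5, d_model))    # 5 token positions
print(ffn(x, W1, b1, W2, b2).shape)  # (5, 8)
```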
Weights vs Hyper‑parameters
The transformer learns (tunes) the following weights:
- Attention projections per head: (W_{Q}), (W_{K}), (W_{V})
- Attention output projection: (W_{O})
- FFN weights and biases: (W_{1}), (b_{1}), (W_{2}), (b_{2})
- Token + positional embeddings (positional embeddings are learned in models like GPT‑2; rotary embeddings have no learned parameters)
- LayerNorm scale and bias ((\gamma, \beta))
- Final output / unembedding matrix (often tied to the input embedding)
Hyper‑parameters are not learned by gradient descent; they are chosen up front and define the network’s shape and training setup, e.g., learning rate, batch size, embedding dimension, FFN expansion factor, number of layers, number of attention heads, etc.
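For instance, a hyper‑parameter set in the spirit of GPT‑2 small might look like the sketch below. The shape values match the published GPT‑2 small configuration; the dataclass itself, and the learning rate and batch size, are placeholders for illustration.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Shape hyper-parameters (fixed before training, never learned).
    n_layers: int = 12         # transformer blocks
    n_heads: int = 12          # attention heads per block
    d_model: int = 768         # embedding / hidden dimension
    ffn_factor: int = 4        # FFN expansion factor -> d_ff = 3072
    vocab_size: int = 50257
    context_length: int = 1024
    # Training hyper-parameters (placeholder values).
    learning_rate: float = 3e-4
    batch_size: int = 32

cfg = TransformerConfig()
print(cfg.ffn_factor * cfg.d_model)  # 3072
```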
Loss Function
The standard next‑token language‑model loss is
L = -\frac{1}{T}\sum_{t=1}^{T}\log P\bigl(x_{t+1}\mid x_{\le t}\bigr)
Back‑propagation pushes gradients from this loss through every layer, updating all learned weights jointly to reduce the error.
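In code, this loss is just the average cross‑entropy between the model’s next‑token distribution and the token that actually came next. The following NumPy sketch uses made‑up shapes and a toy vocabulary:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average negative log-likelihood of the true next tokens.

    logits:  (T, vocab_size) model outputs at positions 1..T
    targets: (T,) the token that actually followed each position
    """
    # Numerically stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick out log P(x_{t+1} | x_{<=t}) at each position and average.
    T = targets.shape[0]
    return -log_probs[np.arange(T), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))        # 4 positions, toy vocab of 10
targets = rng.integers(0, 10, size=4)    # the observed next tokens
print(next_token_loss(logits, targets))
```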
Visualization
To see each step with a concrete example, try the interactive tool Transformer‑Explainer.