Attention Is All You Need: Full Paper Breakdown

Published: 1 month ago (March 8, 2026, 07:57 AM GMT+9)

8 min read

Source: Dev.to


The 2017 paper “Attention Is All You Need” by Vaswani et al. introduced the Transformer, the architecture that now underlies GPT, Claude, Gemini, and every major LLM. It replaced recurrent models entirely with attention mechanisms, and the field never looked back. This post walks through the key ideas.

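The problem with RNNs

Before the Transformer, sequence modeling meant RNNs and LSTMs. They process tokens one at a time, left to right, which causes two major problems:

No parallelization – each step depends on the previous hidden state, so tokens cannot be processed simultaneously during training.
Degraded long‑range dependencies – by the time an RNN reaches token 500, the signal from token 1 has been compressed through hundreds of hidden states.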
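
To make the sequential bottleneck concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The function name `rnn_forward`, the tanh cell, and all weight shapes are illustrative assumptions, not code from the paper; the point is only that the loop over time steps cannot be removed.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN over x of shape (seq_len, d_in)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x:
        # Step t needs the hidden state from step t-1, so the time loop is inherently serial.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 500, 16, 32  # hypothetical sizes
x = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b_h = np.zeros(d_hidden)

states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)  # (500, 32)
```

Whatever token 1 contributed has to survive 499 applications of the same update before it can influence the state at token 500.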

Attention mechanisms already existed (e.g., Bahdanau attention, 2014), but they were bolted onto RNNs. The Transformer's radical idea: what if attention is all you need? Remove recurrence entirely.

Encoder‑decoder structure

The Transformer follows the classic encoder‑decoder architecture used for machine translation:

Component	Role	# of identical layers
Encoder (left)	Takes the input sequence and produces a rich representation	6
Decoder (right)	Takes the encoder output plus previously generated tokens and produces the next token	6

Each layer in both stacks contains the same building blocks:

Multi‑head attention
Feed‑forward network (FFN)
Residual connection
Layer normalization

Self‑attention

Self‑attention lets every token look at every other token and decide how much “attention” to pay to it.

For each token, the model computes three vectors:

Vector	Intuition
Query (Q)	“What am I looking for?”
Key (K)	“What do I contain?”
Value (V)	“What information do I provide?”

These are obtained by multiplying the input embeddings (X) by learned weight matrices (W_Q, W_K, W_V):

[ Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V ]

The attention score between two tokens is the dot product of one token's query and the other token's key.
The scaled dot‑product attention formula is:

[ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V ]

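Dividing by the scaling factor (\sqrt{d_k}) keeps the dot products from growing too large as the dimension increases; without it, the softmax becomes too sharply peaked and its gradients vanish.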
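
The projections and the attention formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions: the helper names, the random weights, and the tiny sizes (4 tokens, dimension 8) are made up for the example and are not the paper's configuration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_tokens, n_tokens) pairwise scores
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8  # hypothetical sizes
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4)
```

Row i of `weights` says how much token i draws from every token's value vector, and all rows are computed in one matrix product rather than one step at a time.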

Multi‑head attention

Instead of computing attention once over the full dimension, the model splits (Q, K, V) into multiple heads (8 in the original paper). Each head operates in a subspace of size:

[ \frac{d_{\text{model}}}{h}= \frac{512}{8}=64 ]

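Why multiple heads? Because different heads can learn different types of relationships, for example:

Head 1 – syntactic structure (e.g., subject‑verb agreement)
Head 2 – positional proximity
Head 3 – semantic similarity

The outputs of all heads are concatenated and then projected back to the full dimension.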
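
The split, per-head attention, concatenation, and final projection can be sketched as follows. It uses the sizes quoted above (512 and 8 heads); the function name, the output matrix name `W_O`, and the random initialization are assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_head = d_model // h  # 512 // 8 = 64

    def split(M):
        # (n, d_model) -> (h, n, d_head): one slice of the projection per head
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    heads = softmax(scores, axis=-1) @ V                   # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return concat @ W_O                                    # project back to d_model

rng = np.random.default_rng(0)
n, d_model, h = 5, 512, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (5, 512)
```

Because every head works in a 64-dimensional subspace, the total cost stays close to that of one full-width attention.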

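The paper uses multi‑head attention in three ways:

Encoder self‑attention – every input token attends to every other input token.
Masked decoder self‑attention – each output token attends only to itself and earlier output tokens (the mask blocks looking ahead, which preserves autoregressive generation).
Cross‑attention – decoder tokens attend to the encoder output, connecting the input representation to output generation.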
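
The masking used in decoder self-attention can be illustrated with a small NumPy sketch. Setting blocked scores to negative infinity before the softmax is one common way to implement the mask; the function name and sizes are assumptions for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(Q, K):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # later positions get zero weight
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # hypothetical: 4 output tokens, d_k = 8
K = rng.normal(size=(4, 8))

weights = causal_attention_weights(Q, K)
print(np.round(weights, 2))
```

The result is lower triangular: the first token can only attend to itself, the second to the first two, and so on.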

Positional encodings

Self‑attention by itself has no notion of order; it treats the sequence as a set. To inject order information, the Transformer adds positional encodings to the input embeddings (added, not concatenated).

The sinusoidal encoding is defined as:

[ \text{PE}_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) ]

[ \text{PE}_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) ]

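The authors chose these functions because, for any fixed offset (k), (\text{PE}_{pos+k}) can be written as a linear function of (\text{PE}_{pos}), which should make it easy for the model to attend by relative position. They also hypothesized that this may let the model extrapolate to sequence lengths longer than those seen during training.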
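
A minimal NumPy sketch of the sinusoidal encoding follows. The function name and `max_len` are illustrative, and the code assumes an even `d_model`. The last lines check the linear-function property for the first sin/cos pair, whose frequency is 1: shifting the position by a fixed offset amounts to a fixed rotation.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)

# Fixed offset k = 3 on the first pair: [sin(pos), cos(pos)] @ M = [sin(pos + k), cos(pos + k)]
k = 3
M = np.array([[np.cos(k), -np.sin(k)],
              [np.sin(k),  np.cos(k)]])
print(np.allclose(pe[10, 0:2] @ M, pe[13, 0:2]))  # True
```

The matrix `M` depends only on the offset, not on the position, which is what makes relative offsets easy to express.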

Feed‑forward network (FFN)

Each attention sub‑layer is followed by a position‑wise FFN applied independently to each token:

[ \text{FFN}(x)=\max(0, xW_1+b_1)W_2+b_2 ]

Two linear transformations with a ReLU in between.
Inner dimension expands to 2048 (4 × the model dimension of 512) and then projects back down.
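
As a rough sketch, the FFN is two matrix multiplications with a ReLU between them. The dimensions below follow the text (512 and 2048); the function name and the random weight initialization are assumptions for the example.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU, shape (n_tokens, d_ff)
    return hidden @ W2 + b2              # back to (n_tokens, d_model)

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 3, 512, 2048
x = rng.normal(size=(n_tokens, d_model))
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

out = position_wise_ffn(x, W1, b1, W2, b2)
print(out.shape)  # (3, 512)
```

Each row of `x` is transformed with the same weights and never mixes with other rows, which is what "position-wise" means here.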

Residual connections & layer normalization

Every sub‑layer (attention or FFN) is wrapped as

[ \text{LayerNorm}\bigl(x + \text{SubLayer}(x)\bigr) ]

The residual connection (x + \text{SubLayer}(x)) facilitates gradient flow through deep stacks, while layer normalization stabilizes activations.
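
The wrapper can be sketched as below. The `eps` value, the learnable scale and shift (`gamma`, `beta`), and the stand-in linear sub-layer are assumptions chosen for illustration; the source only specifies the overall LayerNorm(x + SubLayer(x)) form.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 512
x = rng.normal(size=(4, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)
W = rng.normal(scale=0.02, size=(d_model, d_model))  # hypothetical sub-layer weights

out = add_and_norm(x, lambda t: t @ W, gamma, beta)
print(out.shape)  # (4, 512)
```

With `gamma` at one and `beta` at zero, every output row has mean close to 0 and standard deviation close to 1, regardless of how large the sub-layer output was.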

Training details

Component	Setting
Optimizer	Adam with (\beta_1=0.9,\ \beta_2=0.98)
Learning‑rate schedule	Warm‑up for 4000 steps (linear increase) → decay proportional to (\text{step}^{-0.5})
Regularization	Dropout 0.1 (base model) on the output of each sub‑layer and on the sum of embeddings and positional encodings; label smoothing 0.1
Training data	WMT English‑German (4.5 M sentence pairs) and English‑French (36 M pairs)
Hardware	8 × NVIDIA P100 GPUs, ~3.5 days for the large model
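
The learning-rate schedule in the table corresponds to the paper's formula lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)). A small sketch, where the function name and the clamp at step 1 are illustrative assumptions:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given 1-indexed training step."""
    step = max(step, 1)  # assumption: clamp so step 0 does not divide by zero
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(step, f"{transformer_lr(step):.6f}")
```

The two branches meet at step 4000, where the rate peaks at roughly 7e-4 for a model dimension of 512; after that it falls off as the inverse square root of the step.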

The Transformer achieved state‑of‑the‑art results on both tasks: on English‑to‑German it beat all previously reported models, including ensembles, and on English‑to‑French it beat all previously published single models, while training significantly faster thanks to full parallelization.

Beyond translation

The architecture proved to be a foundation for many later models:

BERT – encoder‑only, bidirectional pre‑training.
GPT – decoder‑only, autoregressive language modeling.
Vision Transformers – the same architecture applied to images.

…and countless other variants: nearly everything in modern AI, from NLP to multimodal models.

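The paper's core insight is elegant: sequence modeling does not need recurrence or convolution.
Attention alone, properly scaled, split across multiple heads, and stacked with residual connections, is enough.

Because attention computes all pairwise relationships in parallel, training is much faster.

That is why, nine years later, every frontier model still has the Transformer at its core.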