Positional Encoding and Context Window Engineering: Why Token Order Matters

Published: December 2, 2025, 11:03 GMT+8

6 min read

Source: Dev.to

Acronyms & Technical Terms Reference

Acronyms

AI – Artificial Intelligence
ALiBi – Attention with Linear Biases
API – Application Programming Interface
BERT – Bidirectional Encoder Representations from Transformers
GPU – Graphics Processing Unit
GPT – Generative Pre‑trained Transformer
LLM – Large Language Model
QKV – Query, Key, Value
RAM – Random Access Memory
RoPE – Rotary Positional Embeddings
ROI – Return on Investment

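Technical Terms

Context Window – The maximum number of tokens a model can process in a single request.
Positional Encoding – A method for telling the model which position each token occupies.
Sinusoidal – Encoded using sine and cosine wave functions.
Extrapolation – The ability to handle sequences longer than those seen during training.
Sparse Attention – An attention mechanism that attends to only a subset of tokens rather than all of them.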

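Why Positional Information Matters

Transformer attention is permutation-invariant: every token attends to every other token simultaneously, but nothing in the raw attention mechanism indicates order. (Strictly speaking, self-attention is permutation-equivariant: reordering the input tokens reorders the outputs in the same way and changes nothing else.) Without explicit positional information, sentences containing the same set of tokens are indistinguishable: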

“The cat chased the dog”
“The dog chased the cat”
“Dog the cat chased the”
“Chased cat dog the the”

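They contain the same set of tokens, so each token sees the same attention scores over the same set of neighbors, yet the meanings are completely different. Positional encodings tell the transformer which token occupies position 1, position 2, …, position N.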
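The claim above can be checked with a small NumPy experiment. This is a minimal sketch, not a real model: the vocabulary, the random embedding table, and the random projection matrices `w_q`, `w_k`, `w_v` are all hypothetical stand-ins, and the position signal is the sinusoidal scheme described later in this article.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sinusoidal(n_positions, d_model):
    """Sinusoidal position matrix of shape (n_positions, d_model)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
d = 8
vocab = {"the": 0, "cat": 1, "chased": 2, "dog": 3}
embedding_table = rng.normal(size=(len(vocab), d))  # hypothetical random embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

def encode(sentence):
    return embedding_table[[vocab[w] for w in sentence.lower().split()]]

a = encode("The cat chased the dog")  # "cat" is at index 1
b = encode("The dog chased the cat")  # "cat" is at index 4

# Without positions, "cat" gets the same output vector in both sentences.
same_without_positions = np.allclose(
    self_attention(a, w_q, w_k, w_v)[1], self_attention(b, w_q, w_k, w_v)[4]
)

# With positions added to the inputs, the two outputs for "cat" differ.
pe = sinusoidal(5, d)
same_with_positions = np.allclose(
    self_attention(a + pe, w_q, w_k, w_v)[1], self_attention(b + pe, w_q, w_k, w_v)[4]
)
```

In this setup `same_without_positions` should be true and `same_with_positions` false: the model can only tell who chased whom once a position signal is mixed into the inputs.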

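Practical Impact for Data Engineers

Context-window limits stem from how positional information is represented.
Different encoding strategies affect how well a model handles long sequences.
Modern techniques already support context windows of 100K, 200K, and even 1M tokens.
At scale, the engineering trade-off between accuracy and efficiency becomes critical (it is, for example, why a document Q&A system fails on long PDFs or why a summary is cut off mid-document).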

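Real-World Analogies

1. The Shuffled Photo Album

A set of vacation photos with no dates or timestamps cannot tell the story of the trip, even though every picture is there. Similarly, a transformer sees all the tokens but cannot infer the narrative order without a positional signal.

2. Mystery Novel Pages Out of Order

A mystery novel whose pages are shuffled and unnumbered contains every clue, but the plot cannot be followed without the page order. A transformer faces the same problem with unordered tokens.

3. The Unlabeled Assembly Line

A car factory that receives parts with no part numbers or assembly sequence cannot build a vehicle. Order matters as much in language as in manufacturing.

These analogies show that language is inherently sequential; swapping word order changes the meaning: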

“The lawyer questioned the witness” ≠ “The witness questioned the lawyer”
“I didn’t say she stole the money” ≠ “She didn’t say I stole the money”
“Time flies like an arrow” ≠ “Arrow flies like a time”

Positional Encoding Strategies

Sinusoidal Positional Encodings

The original "Attention Is All You Need" paper proposed fixed sinusoidal encodings:

\[
\begin{aligned}
\text{PE}(pos, 2i) &= \sin\left(\frac{pos}{10000^{2i/d}}\right) \\
\text{PE}(pos, 2i+1) &= \cos\left(\frac{pos}{10000^{2i/d}}\right)
\end{aligned}
\]

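pos – token position (0, 1, 2, …)
i – dimension index (0 … d/2 − 1)
d – embedding dimension (e.g., 512)

Why sine and cosine?

Smoothness – Neighboring positions have similar encodings.
Boundedness – Values stay within [-1, 1], which helps numerical stability.
Uniqueness – Each position produces a distinct pattern.
Extrapolation – The formula is defined for every position, so it can generate encodings for positions never seen during training.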
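The four properties above can be spot-checked numerically. The following is a minimal sketch using only the standard library; the helper `distance`, the range of 1,000 positions, and the specific positions compared (50, 51, 60, 50,000) are arbitrary choices made for illustration.

```python
import math

def sinusoidal_encoding(position, d_model=128):
    """Sinusoidal encoding for one position, following the formula above."""
    encoding = []
    for i in range(d_model // 2):
        angle = position / (10000 ** (2 * i / d_model))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding

def distance(a, b):
    """Euclidean distance between two encodings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

encodings = [sinusoidal_encoding(p) for p in range(1000)]

# Boundedness: every value lies in [-1, 1].
bounded = all(-1.0 <= v <= 1.0 for enc in encodings for v in enc)

# Uniqueness: no two of the 1,000 positions share an encoding.
unique = len({tuple(round(v, 9) for v in enc) for enc in encodings}) == len(encodings)

# Smoothness: position 50 is closer to position 51 than to position 60.
near = distance(encodings[50], encodings[51])
far = distance(encodings[50], encodings[60])

# Extrapolation: the formula still yields a vector far outside the range above.
beyond = sinusoidal_encoding(50_000)
```

Note that this only shows the formula is well defined at unseen positions; whether a trained model makes good use of such positions is a separate question.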

Code Visualization (Python)

import numpy as np
import matplotlib.pyplot as plt

def sinusoidal_encoding(position: int, d_model: int = 128) -> np.ndarray:
    """Generate sinusoidal positional encoding for a single position."""
    encoding = np.zeros(d_model)
    for i in range(d_model // 2):
        denominator = 10000 ** (2 * i / d_model)
        encoding[2 * i]     = np.sin(position / denominator)
        encoding[2 * i + 1] = np.cos(position / denominator)
    return encoding

# Generate encodings for positions 0‑99
positions = range(100)
encodings = np.stack([sinusoidal_encoding(p) for p in positions])

# Visualize
plt.figure(figsize=(12, 6))
plt.imshow(encodings.T, aspect='auto', cmap='RdBu')
plt.xlabel('Position')
plt.ylabel('Encoding Dimension')
plt.title('Sinusoidal Positional Encodings')
plt.colorbar(label='Encoding Value')
plt.show()

The heatmap shows how each position receives a unique sine/cosine "fingerprint".

How It Is Applied

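# `embed` is a hypothetical stand-in for a real embedding lookup; it returns a
# random vector so that this snippet runs together with the code above.
def embed(token: str, d_model: int = 128) -> np.ndarray:
    return np.random.default_rng(0).normal(size=d_model)

token_embedding = embed("cat")               # e.g., [0.23, -0.45, 0.67, ...]
positional_encoding = sinusoidal_encoding(5, d_model=token_embedding.shape[0])

# Combine token and position information
input_to_transformer = token_embedding + positional_encoding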

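Adding the two vectors injects both the what (token) and the where (position) into the model input.

Learned Positional Embeddings

Instead of a fixed function, positions can be represented by a table of learnable vectors: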

import torch
import torch.nn as nn

position_embeddings = nn.Embedding(num_embeddings=512, embedding_dim=768)

# Example lookup
pos_vec = position_embeddings(torch.tensor([5]))  # shape: (1, 768)

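During training, the model adjusts these vectors to capture the positional information that best serves the downstream task. Because the table has a fixed number of rows (512 in the example), positions beyond it have no embedding at all.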
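A learned table also makes the context-window limit concrete. The sketch below uses a plain NumPy array as a stand-in for a trained embedding matrix (the values are random here, not learned), and the `lookup_positions` helper is hypothetical; the table size and dimension match the example above.

```python
import numpy as np

MAX_POSITIONS = 512   # number of rows in the position table
D_MODEL = 768         # embedding dimension

rng = np.random.default_rng(0)
# Stand-in for a trained position-embedding matrix (random here, learned in practice).
position_table = rng.normal(scale=0.02, size=(MAX_POSITIONS, D_MODEL))

def lookup_positions(seq_len):
    """Return one position vector per token, or fail if the table is too small."""
    if seq_len > MAX_POSITIONS:
        raise ValueError(
            f"sequence length {seq_len} exceeds the {MAX_POSITIONS} learned positions"
        )
    return position_table[np.arange(seq_len)]

short = lookup_positions(10)   # shape (10, 768)

try:
    lookup_positions(600)
    fits = True
except ValueError:
    fits = False               # position 512 and beyond were never learned
```

Unlike the sinusoidal formula, there is nothing to evaluate for position 512 and above, so a sequence longer than the table has to be truncated, chunked, or handled by a different positional scheme.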

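Analogy

Sinusoidal – Seats are assigned by a deterministic formula (e.g., row 1 = VIP).
Learned – Seats are arranged according to observed preferences (e.g., seat A7 is popular with couples).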

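Extending Context Windows

Sparse Attention – Attends to only a subset of tokens, reducing the quadratic compute cost.
Sliding / Chunked Windows – Processes very long sequences in overlapping chunks.
Modern Techniques – Relative-position methods such as ALiBi and RoPE make context windows of 100K+ tokens possible while keeping compute acceptable.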
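The first two ideas can be sketched in a few lines of plain Python. Both helpers below, `overlapping_chunks` and `sliding_window_mask`, are hypothetical names written for illustration and are not taken from any library; the window and overlap sizes are arbitrary.

```python
def overlapping_chunks(tokens, window, overlap):
    """Split tokens into chunks of at most `window` items that share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk already reaches the end of the sequence
    return chunks

def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (itself and the window - 1 tokens before it)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

tokens = list(range(10))
chunks = overlapping_chunks(tokens, window=4, overlap=2)
# chunks == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]

mask = sliding_window_mask(seq_len=6, window=3)
allowed_pairs = sum(sum(row) for row in mask)   # 15 of the 36 possible pairs
```

The overlap lets information near a chunk boundary appear in two chunks, at the cost of processing some tokens twice. The mask shows where the savings of sparse attention come from: the number of allowed pairs grows roughly as seq_len × window instead of seq_len².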

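With these fundamentals in hand, engineers can tell when they are hitting a hard limit and when it can be worked around with smart engineering.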