Chapter 6. Positional Encoding, Word Order Matters

Part 2. Text to Numbers, The Input Pipeline

In Chapter 5, you learned how the embedding table converts each token ID into a rich vector of thousands of numbers. But there is a critical piece of information missing from those vectors: position. The embedding for “cat” is exactly the same whether it appears as the first word in a sentence or the thousandth. Without some way to encode position, the sentence “The dog bit the man” would be indistinguishable from “The man bit the dog” to the model. Both sentences contain the same tokens, producing the same embedding vectors, just in a different order. Positional encoding solves this problem, and the choice of how to encode position turns out to be one of the most consequential architectural decisions in modern LLMs, directly determining how long a context window the model can support.

Why Transformers Have No Built-In Sense of Order

Before Transformers, the dominant architectures for language modeling were Recurrent Neural Networks (RNNs) and their variants (LSTMs, GRUs). These models processed tokens one at a time, in sequence: first token 1, then token 2, then token 3, and so on. The sequential processing meant that order was baked into the computation itself. The model’s internal state after processing “The dog bit” was fundamentally different from its state after processing “The bit dog,” because the tokens arrived in a different order.

Transformers, introduced by Vaswani et al. in the 2017 paper “Attention Is All You Need,” took a radically different approach. Instead of processing tokens one at a time, a Transformer processes all tokens in a sequence simultaneously using the self-attention mechanism (which we will cover in detail in Chapter 7). Self-attention computes a weighted sum over all tokens in the sequence, where the weights are determined by how relevant each token is to every other token.

Source: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017. arXiv:1706.03762.

The problem is that self-attention has no built-in notion of order. Strictly speaking, it is permutation-equivariant (it is often loosely called permutation-invariant): if you shuffle the order of the input tokens, the attention computation produces the same result, just shuffled in the same way. Mathematically, self-attention treats the input as a set, not a sequence. It has no way to distinguish “The dog bit the man” from “The man bit the dog” or even “dog The man the bit.”
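The shuffling property described above can be checked numerically. The following is a minimal sketch, assuming a single attention head with random stand-in weights, no causal mask, and no positional encoding; the function and variable names are illustrative and not taken from any particular library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention without masking or positional encoding."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))            # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)                    # shuffle the token order
out_original = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input rows only shuffles the output rows in the same way.
print(np.allclose(out_shuffled, out_original[perm]))
```

Each token ends up with exactly the same output vector regardless of where it sat in the input, which is why order has to be supplied from outside the attention computation.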

This is a fundamental limitation. Word order carries enormous meaning in language. “The dog bit the man” and “The man bit the dog” describe completely different events. “I never said she stole my money” has seven different meanings depending on which word you emphasize, and emphasis is partly conveyed through position and context. Without positional information, the Transformer would be blind to all of this.

The solution is to inject positional information into the input before the Transformer processes it. This is what positional encoding does: it adds information about each token’s position in the sequence so that the model can distinguish between the same token appearing at different positions.

The Three Families of Positional Encoding

Over the years, researchers have developed several approaches to positional encoding. They fall into three broad families:

Absolute position embeddings: Assign a unique vector to each position (position 0, position 1, position 2, …) and add it to the token embedding. The original Transformer used sinusoidal functions for this. GPT-2 and BERT used learned position embeddings.
Relative position encodings: Instead of encoding “this token is at position 47,” encode “this token is 3 positions to the right of that token.” This captures the intuition that in language, relative distance between words often matters more than absolute position. RoPE (Rotary Position Embeddings) and ALiBi (Attention with Linear Biases) fall into this category.
No positional encoding (NoPE): Some recent research has shown that certain layers can function without any explicit positional encoding at all, relying on other mechanisms (like causal masking) to implicitly learn positional information. LLaMA 4 uses this in its iRoPE architecture, interleaving layers with RoPE and layers with no positional encoding.

Let’s walk through each approach, starting with the simplest.

Absolute Position Embeddings: The Original Approach

Sinusoidal Positional Encoding (Vaswani et al., 2017)

The original Transformer paper proposed a clever mathematical solution: encode each position as a vector of sine and cosine values at different frequencies. The formula is:

PE(pos, 2i)     = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

Where:

pos is the position in the sequence (0, 1, 2, 3, …)
i is the dimension index (0, 1, 2, …, d_model/2 - 1)
d_model is the model’s embedding dimension (e.g., 512 in the original Transformer)

Source: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017, Section 3.5.

Each position gets a unique vector of d_model numbers. Even-indexed dimensions use sine, odd-indexed dimensions use cosine. The key insight is the 10000^(2i / d_model) term in the denominator: it creates a range of frequencies. Low-indexed dimensions oscillate rapidly (high frequency), while high-indexed dimensions oscillate slowly (low frequency). This is analogous to how binary numbers work: the least significant bit flips every step, the next bit flips every 2 steps, the next every 4 steps, and so on. The combination of all these frequencies creates a unique “fingerprint” for each position.

The positional encoding vector is then added to the token embedding vector, element by element:

input_to_transformer = token_embedding + positional_encoding

This means the model receives a single vector per token that combines both “what this token means” (from the embedding table) and “where this token is” (from the positional encoding).

Why Sine and Cosine?

Sinusoidal functions have two properties that suit this job (Vaswani et al. cite the second as their motivation):

Unique positions: Each position produces a unique combination of sine and cosine values across all dimensions, so the model can distinguish any two positions.
Relative position through linear transformation: For any fixed offset k, the positional encoding at position pos+k can be expressed as a linear transformation of the encoding at position pos. This means the model could, in theory, learn to attend to relative positions (e.g., “the token 3 positions back”) by learning the appropriate linear transformation. In practice, this theoretical property did not fully deliver on its promise, which motivated later approaches.
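The linear-transformation property can be verified directly. The sketch below builds, for a fixed offset k, a block-diagonal matrix of 2x2 rotations (one block per sine/cosine pair) and checks that it maps the encoding at every position to the encoding k steps later. The helper name offset_matrix is hypothetical, introduced only for this illustration.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len).reshape(-1, 1)
    div_term = 10000 ** (2 * np.arange(d_model // 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

def offset_matrix(k, d_model):
    """Block-diagonal matrix M_k such that PE(pos + k) = M_k @ PE(pos)."""
    freqs = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # Angle-addition identities for the (sin, cos) pair at frequency w.
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

d_model, k = 16, 3
pe = sinusoidal_encoding(64, d_model)
m_k = offset_matrix(k, d_model)

# One matrix, built only from k, shifts every position by k steps.
shifted = pe[:-k] @ m_k.T
print(np.allclose(shifted, pe[k:]))
```

The matrix depends on the offset k but not on the starting position, which is the precise sense in which the encoding makes relative offsets linearly accessible.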

Limitations of Sinusoidal Encoding

Sinusoidal positional encoding has a significant limitation: it does not extrapolate well beyond the training length. If a model is trained on sequences of up to 512 tokens, it learns how to interpret the encodings for positions 0 through 511, but position 512 or 1,000 produces a combination of encoding values the model never saw during training. Performance degrades rapidly outside the training range.

This is why the original Transformer was limited in practice to the sequence lengths it was trained on. The formula can produce an encoding for any position, but the model does not know how to interpret encodings for positions it never encountered.

Learned Position Embeddings (GPT-2, BERT)

An alternative to the fixed sinusoidal formula is to simply learn the positional encodings during training, just like token embeddings. This approach creates a position embedding table: a matrix of shape [max_sequence_length x d_model], where each row is a learnable vector for one position.

GPT-2 (OpenAI, 2019) used learned positional embeddings with a maximum sequence length of 1,024 tokens. Its position embedding table has shape [1,024 x 768], containing 786,432 learnable parameters. During training, the model learns what each position “means” through backpropagation, just as it learns what each token means.

Source: GPT-2 architecture from OpenAI (2019): 1,024 maximum positions, 768 embedding dimensions, learned positional embeddings.

BERT (Google, 2018) also used learned positional embeddings, with a maximum sequence length of 512 tokens and an embedding dimension of 768. Its position embedding table has shape [512 x 768].

Source: Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” 2018. 512 maximum positions, 768 hidden dimensions.

The lookup works identically to the token embedding lookup from Chapter 5. When the model processes a token at position 47, it retrieves row 47 from the position embedding table and adds it to the token embedding:

input_to_transformer[pos] = token_embedding[token_id] + position_embedding[pos]

Learned positional embeddings have an even stricter version of the limitation that sinusoidal encodings have: they cannot handle positions beyond the maximum sequence length at all. GPT-2 cannot process sequences longer than 1,024 tokens because its table covers only positions 0 through 1,023; there is no row for position 1,024. The model literally has no vector to add for that position.
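The lookup and its hard length limit can be sketched in a few lines. This is an illustration under assumed toy sizes, with random values standing in for trained tables; the embed function is hypothetical and is not GPT-2's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, d_model = 100, 16, 8    # toy sizes, not GPT-2's

# In a real model both tables are trained; random values stand in here.
token_embedding = rng.normal(size=(vocab_size, d_model))
position_embedding = rng.normal(size=(max_positions, d_model))

def embed(token_ids):
    """Return token embedding + position embedding for each input token."""
    token_ids = np.asarray(token_ids)
    if len(token_ids) > max_positions:
        raise ValueError(
            f"sequence length {len(token_ids)} exceeds the "
            f"{max_positions} rows in the position table"
        )
    positions = np.arange(len(token_ids))
    return token_embedding[token_ids] + position_embedding[positions]

# The same token at two positions now gets two different input vectors.
same_token_twice = embed([7, 7])
print(np.allclose(same_token_twice[0], same_token_twice[1]))

# One token too many: there is no row left in the position table.
try:
    embed(list(range(max_positions + 1)))
except ValueError as err:
    print("error:", err)
```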

Research by James Simon (2024) found that GPT-2’s learned positional embeddings form a roughly helical structure in the embedding space, with nearby positions having similar vectors and distant positions having dissimilar vectors. The positional embeddings also occupy a largely orthogonal subspace from the token embeddings, meaning the model can separate “what” information from “where” information.

Source: Simon, “Insights into GPT-2’s positional encodings,” 2024; LessWrong, “GPT-2’s positional embedding matrix is a helix,” 2023.

Hands-On: Visualizing Sinusoidal Positional Encodings

Let’s implement sinusoidal positional encoding and visualize the patterns:

import numpy as np
import matplotlib.pyplot as plt

def sinusoidal_encoding(max_len, d_model):
    """Generate sinusoidal positional encodings."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len).reshape(-1, 1)
    div_term = 10000 ** (2 * np.arange(d_model // 2) / d_model)

    pe[:, 0::2] = np.sin(position / div_term)  # even dimensions
    pe[:, 1::2] = np.cos(position / div_term)  # odd dimensions
    return pe

# Generate encodings for 128 positions with 64 dimensions
pe = sinusoidal_encoding(max_len=128, d_model=64)

# Plot the encoding matrix as a heatmap
plt.figure(figsize=(12, 6))
plt.imshow(pe, aspect="auto", cmap="RdBu", interpolation="nearest")
plt.colorbar(label="Encoding value")
plt.xlabel("Dimension index")
plt.ylabel("Position in sequence")
plt.title("Sinusoidal Positional Encoding (128 positions, 64 dimensions)")
plt.tight_layout()
plt.savefig("sinusoidal_pe.png", dpi=150)
plt.show()
print("Plot saved to sinusoidal_pe.png")

When you run this, you’ll see a heatmap where:

The leftmost columns (low dimension indices) oscillate rapidly, alternating between red and blue every few positions.
The rightmost columns (high dimension indices) change slowly, with broad bands of color spanning many positions.
Each row (position) has a unique pattern across all columns, giving it a distinct “fingerprint.”

This visualization makes the frequency structure concrete: low dimensions encode fine-grained position differences (nearby positions look different), while high dimensions encode coarse-grained position information (only distant positions look different). Together, they create a unique encoding for every position.

Let’s also verify that nearby positions have similar encodings while distant positions have different encodings:

import numpy as np

def sinusoidal_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len).reshape(-1, 1)
    div_term = 10000 ** (2 * np.arange(d_model // 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

pe = sinusoidal_encoding(max_len=512, d_model=512)

def cosine_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

pairs = [
    (0, 1, "adjacent positions"),
    (0, 5, "5 apart"),
    (0, 50, "50 apart"),
    (0, 200, "200 apart"),
    (100, 101, "adjacent (100,101)"),
    (100, 105, "5 apart (100,105)"),
    (100, 300, "200 apart (100,300)"),
]

print(f"{'Positions':<25s}  {'Distance':>8s}  {'Cosine Similarity':>18s}")
print("-" * 55)
for p1, p2, label in pairs:
    sim = cosine_sim(pe[p1], pe[p2])
    print(f"{label:<25s}  {p2-p1:>8d}  {sim:>18.4f}")

This will show that adjacent positions have high cosine similarity (around 0.9+), while positions far apart have much lower similarity. The similarity depends only on the distance between positions, not on the absolute positions themselves, which is the relative-position property that Vaswani et al. designed into the sinusoidal encoding.

Rotary Position Embeddings (RoPE): The Modern Standard

Absolute position embeddings (whether sinusoidal or learned) have a ceiling: a learned table has no entry past its maximum position, and sinusoidal encodings degrade quickly past the longest sequence seen during training. This limitation became increasingly painful as applications demanded longer context windows. Researchers needed a positional encoding that could capture relative positions, scale to longer sequences, and integrate cleanly with the attention mechanism.

Rotary Position Embeddings (RoPE), proposed by Jianlin Su et al. in 2021, solved these problems with an elegant mathematical idea: instead of adding positional information to the token embeddings, rotate the query and key vectors in the attention mechanism based on their position.

Source: Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding,” arXiv:2104.09864, April 2021. Published in Neurocomputing, 2024.

RoPE has become the dominant positional encoding method in modern open-weight LLMs. It is used by LLaMA 1, 2, 3, and 4 (Meta), Mistral 7B and Mixtral (Mistral AI), Qwen 3 (Alibaba), DeepSeek-V3 (DeepSeek), and many others.

One notable variation: DeepSeek-V3 uses Multi-head Latent Attention (MLA), which compresses keys and values into a low-rank latent space for efficient inference. In MLA, RoPE is applied only to a small “decoupled” portion of the key and query vectors (64 dimensions out of the full head dimension), not to the entire key and query. The rest of the key and query carry content information without positional encoding. This hybrid design lets DeepSeek-V3 benefit from RoPE’s relative position encoding while keeping its KV cache extremely compact. We will revisit MLA in Chapter 8 when we cover attention variants.

Source: DeepSeek-V3 technical report (arXiv:2412.19437), Section 2.1.1, Multi-Head Latent Attention.

The Core Idea: Rotation Encodes Position

To understand RoPE, let’s start with a simple 2D example. Imagine you have a 2-dimensional vector [x, y]. You can rotate this vector by an angle theta using a rotation matrix:

[x']   [cos(theta)  -sin(theta)] [x]
[y'] = [sin(theta)   cos(theta)] [y]

After rotation, the vector points in a different direction, but its length stays the same. The angle of rotation determines how much the vector turns.

RoPE’s insight is this: if you rotate the query vector at position m by angle m * theta, and rotate the key vector at position n by angle n * theta, then the dot product between the rotated query and key depends only on the relative distance (m - n), not on the absolute positions m and n.

This is because when you compute the dot product of two rotated vectors, the rotation angles subtract:

dot_product(rotate(q, m*theta), rotate(k, n*theta)) depends on (m - n)*theta

This means the attention score between two tokens depends on how far apart they are, not on where they are in absolute terms. The word “dog” attending to the word “the” three positions back produces the same attention pattern whether this happens at positions [3, 0] or positions [503, 500]. This is exactly the relative-position property that language needs.
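The subtraction of angles can be derived in three steps. As a short sketch, write R(alpha) for the 2D rotation matrix by angle alpha (notation introduced here for convenience), and use two standard facts about rotations: the transpose of R(alpha) is R(-alpha), and R(alpha) R(beta) = R(alpha + beta).

```latex
\begin{aligned}
\langle R(m\theta)\,q,\; R(n\theta)\,k \rangle
  &= q^{\top} R(m\theta)^{\top} R(n\theta)\, k \\
  &= q^{\top} R(-m\theta)\, R(n\theta)\, k \\
  &= q^{\top} R\big((n-m)\theta\big)\, k
\end{aligned}
```

Only the difference between n and m survives in the last line, so the query at position 3 paired with the key at position 0 gives the same score as the query at position 503 paired with the key at position 500.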

How RoPE Works in Practice

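In a real model, the vectors being rotated are not 2-dimensional. RoPE is applied to each attention head’s query and key vectors, which typically have on the order of 64 to 128 dimensions (the full embedding is much wider, e.g., 5,120 in LLaMA 4 Maverick, but it is divided among the heads). RoPE handles this by splitting each query and key vector into pairs of dimensions and applying a 2D rotation to each pair independently, with a different rotation frequency for each pair.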

For a d-dimensional vector, RoPE splits it into d/2 pairs: dimensions (0, 1), (2, 3), (4, 5), and so on. Each pair gets its own rotation frequency, defined by:

theta_i = 1 / (base^(2i / d))

Where:

i is the pair index (0, 1, 2, …, d/2 - 1)
d is the dimension of the query/key vectors
base is a hyperparameter (10,000 in the original RoPE paper)

For a token at position m, the rotation angle for pair i is:

angle_i = m * theta_i

The first pair (i=0) has the highest frequency: theta_0 = 1/base^0 = 1, so the rotation angle equals the position number. The last pair (i=d/2-1) has the lowest frequency: theta_{d/2-1} is very small, so the rotation angle changes very slowly with position. This creates the same multi-frequency structure as sinusoidal encodings, but applied as rotations to the query and key vectors rather than as additions to the embeddings.

The rotation is applied to the query and key vectors after the linear projections in the attention mechanism, but before the dot product that computes attention scores. The value vectors are not rotated. This means positional information affects which tokens attend to which (through Q and K), but not the content that gets passed forward (through V).

The Math (Simplified)

For each pair of dimensions (2i, 2i+1) in the query vector q at position m:

q'[2i]     = q[2i] * cos(m * theta_i) - q[2i+1] * sin(m * theta_i)
q'[2i + 1] = q[2i] * sin(m * theta_i) + q[2i+1] * cos(m * theta_i)

The same rotation is applied to the key vector k at position n, using n instead of m. When the dot product q’ . k’ is computed, the result depends on (m - n) * theta_i for each pair, giving the attention mechanism access to relative position information.

The Base Frequency and Context Length

The base parameter in RoPE (often called rope_theta in model configurations) controls the range of frequencies and, consequently, how well the model handles different sequence lengths. A larger base value produces lower frequencies, which means the rotation angles change more slowly with position. This allows the model to distinguish between positions that are farther apart.

Here are the base values used by real models:

Model	RoPE Base (rope_theta)	Context Window	Year
LLaMA 1	10,000	2,048 tokens	2023
LLaMA 2	10,000	4,096 tokens	2023
Mistral 7B	10,000	8,192 tokens (4,096 sliding window)	2023
Code Llama	1,000,000	100,000 tokens	2023
DeepSeek-V3	10,000 (with YaRN scaling)	128,000 tokens	2024
LLaMA 3.1	500,000	128,000 tokens	2024
LLaMA 4 Maverick	500,000	1,048,576 tokens (1M)	2025

Sources: LLaMA 1 and 2 from Meta (2023), rope_theta = 10,000; Code Llama from Meta (2023), rope_theta = 1,000,000; LLaMA 3.1 from Meta (July 2024), rope_theta = 500,000, 128K context; DeepSeek-V3 from DeepSeek technical report (arXiv:2412.19437, Section 4.3), rope_theta = 10,000 with YaRN context extension from 4K to 128K; LLaMA 4 Maverick from HuggingFace model config and Meta AI (April 2025), rope_theta = 500,000, 1M context window; Mistral 7B from Mistral AI (September 2023), rope_theta = 10,000, sliding window attention of 4,096 tokens within an 8,192-token context.

Notice the pattern: as context windows grew from 2K to 1M tokens, the base frequency increased from 10,000 to 500,000 or even 1,000,000. Increasing the base stretches out the rotation frequencies, allowing the model to represent positions across a much wider range without the rotation angles “wrapping around” and becoming ambiguous. However, increasing the base is not the only strategy. DeepSeek-V3 keeps the base at 10,000 but uses YaRN scaling (described later in this chapter) to extend its context to 128,000 tokens. Both approaches work; the choice depends on whether the model is trained from scratch with long context (where a high base is natural) or extended after initial training (where scaling methods are more practical).
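To make the “wrapping around” intuition concrete, the sketch below computes how many positions the slowest-rotating pair needs to complete one full turn, using the frequency formula given earlier. The head dimension of 128 is an assumption chosen for illustration, and the wavelength of the slowest pair is only a rough proxy for usable context length, not a guarantee.

```python
import numpy as np

def longest_wavelength(base, head_dim):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    slowest_theta = 1.0 / base ** ((head_dim - 2) / head_dim)
    return 2 * np.pi / slowest_theta

head_dim = 128   # assumed per-head dimension, for illustration only
for base in (10_000, 500_000, 1_000_000):
    wavelength = longest_wavelength(base, head_dim)
    print(f"base = {base:>9,}: slowest pair wraps after ~{wavelength:,.0f} positions")
```

Raising the base lengthens this wavelength roughly in proportion, which matches the pattern in the table of larger bases accompanying larger context windows.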

Hands-On: Implementing RoPE

Let’s implement RoPE from scratch to see exactly how it works:

import numpy as np

def compute_rope_frequencies(dim, base=10000.0):
    """Compute the rotation frequencies for each dimension pair."""
    i = np.arange(0, dim, 2, dtype=np.float64)
    theta = 1.0 / (base ** (i / dim))
    return theta

def apply_rope(x, position, theta):
    """Apply RoPE to a vector x at a given position.

    x: vector of shape (dim,)
    position: integer position in the sequence
    theta: rotation frequencies of shape (dim/2,)
    """
    dim = len(x)
    angles = position * theta  # rotation angle for each pair

    x_rotated = np.zeros_like(x)
    for i in range(dim // 2):
        cos_a = np.cos(angles[i])
        sin_a = np.sin(angles[i])
        x_rotated[2 * i]     = x[2 * i] * cos_a - x[2 * i + 1] * sin_a
        x_rotated[2 * i + 1] = x[2 * i] * sin_a + x[2 * i + 1] * cos_a
    return x_rotated

# Demonstrate with a small example
dim = 8  # 8 dimensions = 4 rotation pairs
theta = compute_rope_frequencies(dim, base=10000.0)

print("Rotation frequencies (theta):")
for i, t in enumerate(theta):
    print(f"  Pair {i}: theta = {t:.6f}")

# Create a sample query and key vector
np.random.seed(42)
q = np.random.randn(dim)
k = np.random.randn(dim)

# Apply RoPE at different positions
q_pos5 = apply_rope(q, position=5, theta=theta)
k_pos3 = apply_rope(k, position=3, theta=theta)
k_pos8 = apply_rope(k, position=8, theta=theta)

# Dot products
dot_5_3 = np.dot(q_pos5, k_pos3)  # relative distance = 2
dot_5_8 = np.dot(q_pos5, k_pos8)  # relative distance = -3

# Now shift both positions by 100 (same relative distances)
q_pos105 = apply_rope(q, position=105, theta=theta)
k_pos103 = apply_rope(k, position=103, theta=theta)
k_pos108 = apply_rope(k, position=108, theta=theta)

dot_105_103 = np.dot(q_pos105, k_pos103)  # relative distance = 2
dot_105_108 = np.dot(q_pos105, k_pos108)  # relative distance = -3

print(f"\nDot product (pos 5, pos 3), distance=2:   {dot_5_3:.6f}")
print(f"Dot product (pos 105, pos 103), distance=2: {dot_105_103:.6f}")
print(f"  Difference: {abs(dot_5_3 - dot_105_103):.10f}")

print(f"\nDot product (pos 5, pos 8), distance=-3:   {dot_5_8:.6f}")
print(f"Dot product (pos 105, pos 108), distance=-3: {dot_105_108:.6f}")
print(f"  Difference: {abs(dot_5_8 - dot_105_108):.10f}")

When you run this, you’ll see that the dot products are identical (or nearly identical, within floating-point precision) for the same relative distance, regardless of the absolute positions. The dot product between positions 5 and 3 (distance 2) is the same as between positions 105 and 103 (distance 2). This confirms that RoPE encodes relative position through the attention mechanism.
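The loop-based version above handles one vector at a time. A vectorized variant, sketched below under the same adjacent-pair convention, rotates every position of a sequence at once and makes the relative-position structure visible in the full score matrix. The function name rope_rotate is hypothetical, and some production implementations pair dimensions differently.

```python
import numpy as np

def rope_rotate(x, base=10000.0):
    """Apply RoPE to every row of x, where row m is the vector at position m.

    x: array of shape (seq_len, dim) with even dim.
    """
    seq_len, dim = x.shape
    theta = 1.0 / base ** (np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.arange(seq_len)[:, None] * theta[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
seq_len, dim = 6, 8
q_vec, k_vec = rng.normal(size=dim), rng.normal(size=dim)

# Place the same query content and the same key content at every position.
q = rope_rotate(np.tile(q_vec, (seq_len, 1)))
k = rope_rotate(np.tile(k_vec, (seq_len, 1)))
scores = q @ k.T          # scores[m, n] = rotated q at m . rotated k at n

# Each diagonal (fixed n - m) holds a single repeated value.
for offset in range(-2, 3):
    diag = np.diagonal(scores, offset=offset)
    print(offset, np.allclose(diag, diag[0]))
```

Because the query and key content are held fixed here, any variation in the scores comes from position alone, and the constant diagonals show that only the offset between positions matters.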

Why RoPE Won

RoPE became the dominant positional encoding for several reasons:

Relative position naturally: The dot product between rotated queries and keys depends only on relative distance, which is what language needs.
No extra parameters: Unlike learned positional embeddings, RoPE adds zero learnable parameters. The rotation angles are computed from a fixed formula.
No extra computation in the attention score: Unlike additive relative position methods (Shaw et al., 2018; Transformer-XL), RoPE does not add terms to the attention score matrix. It modifies the Q and K vectors before the dot product, which is computationally simpler.
Context extension is possible: By adjusting the base frequency or applying scaling techniques, RoPE-based models can be extended to much longer contexts than they were trained on. This is the key advantage that enabled the jump from 4K to 1M+ token context windows.

ALiBi: Attention with Linear Biases

While RoPE encodes position by rotating vectors, ALiBi (Attention with Linear Biases) takes an even simpler approach: it adds a distance-based penalty directly to the attention scores. No positional embeddings, no rotations, no extra parameters at all.

ALiBi was introduced by Press, Smith, and Lewis in 2022 in the paper “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.”

Source: Press, Smith, and Lewis, “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation,” ICLR 2022.

How ALiBi Works

The idea is remarkably simple. After computing the raw attention scores (the dot product of queries and keys), ALiBi subtracts a penalty proportional to the distance between the query and key positions:

attention_score(i, j) = q_i . k_j + m * (j - i)

Where:

i is the query position
j is the key position
m is a head-specific slope (a constant, not learned)
(j - i) is the signed distance between positions

In causal (left-to-right) language modeling, j is always less than or equal to i (the model can only look backward), so (j - i) is always zero or negative. This means ALiBi always subtracts from the attention score for non-zero distances, with larger distances receiving larger penalties. Tokens that are far apart get lower attention scores, encoding a bias toward local context.

The slope m is different for each attention head and is set by a fixed formula:

m_h = 2^(-8h / n_heads)

Where h ranges from 1 to n_heads. For a model with 8 attention heads, the slopes would be:

Head	Slope (m)	Effect
1	2^(-1) = 0.5	Moderate distance penalty
2	2^(-2) = 0.25	Lighter penalty
3	2^(-3) = 0.125	Even lighter
4	2^(-4) = 0.0625	Mild penalty
5	2^(-5) = 0.03125	Very mild
6	2^(-6) = 0.01563	Nearly flat
7	2^(-7) = 0.00781	Almost no penalty
8	2^(-8) = 0.00391	Minimal penalty

Head 1 has a steep slope, strongly penalizing distant tokens and focusing on very local context. Head 8 has a gentle slope, allowing attention to spread across much longer distances. Together, the heads create a spectrum of attention ranges, from very local to nearly global.
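The slopes and the resulting bias matrix are simple to compute. The sketch below follows the two formulas given above and assumes a head count that is a power of two, as in the table; the function names are illustrative rather than taken from any library.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Slopes m_h = 2^(-8h / n_heads) for h = 1..n_heads."""
    h = np.arange(1, n_heads + 1)
    return 2.0 ** (-8.0 * h / n_heads)

def alibi_bias(seq_len, slope):
    """Causal ALiBi bias: slope * (j - i) for j <= i, -inf for future keys."""
    i = np.arange(seq_len)[:, None]    # query positions
    j = np.arange(seq_len)[None, :]    # key positions
    bias = slope * (j - i).astype(np.float64)
    bias[j > i] = -np.inf              # causal mask: no attention to the future
    return bias

slopes = alibi_slopes(8)
print(slopes)

np.set_printoptions(precision=3, suppress=True)
print(alibi_bias(seq_len=5, slope=slopes[0]))   # head 1, the steepest slope
```

The bias matrix is added to the raw query-key scores before the softmax. Each row starts at zero on the diagonal and falls off linearly to the left, and nothing in it depends on a learned parameter or on a maximum sequence length.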

ALiBi’s Strength: Extrapolation

ALiBi’s main advantage is extrapolation: a model trained on short sequences can process much longer sequences at inference time without significant performance degradation. Press et al. showed that a 1.3 billion parameter model trained on sequences of 1,024 tokens could handle sequences of 2,048 tokens and beyond, matching or exceeding the performance of models trained directly on longer sequences with sinusoidal positional encodings.

Source: Press et al., ICLR 2022. A model trained on 1,024 tokens achieved comparable perplexity to a sinusoidal model trained on 2,048 tokens when evaluated on 2,048-token sequences.

This works because the linear bias is a simple, predictable function of distance. At position 2,000 (beyond the training range of 1,024), the bias is just m * (j - 2000), which follows the same linear pattern the model learned during training. There are no unseen positional values or out-of-range embeddings.

Models That Use ALiBi

ALiBi was adopted by several notable models:

BLOOM (BigScience, 2022): A 176 billion parameter multilingual model trained by a large research collaboration. BLOOM used ALiBi for positional encoding with a 2,048-token context window.
MPT (MosaicML, 2023): The MPT-7B and MPT-30B models used ALiBi, supporting context windows up to 65,536 tokens.

Source: BLOOM from BigScience (2022), 176B parameters, ALiBi positional encoding; MPT from MosaicML (2023), ALiBi positional encoding.

However, ALiBi has a significant limitation: it sacrifices true long-range modeling for extrapolation capability. The linear penalty means that very distant tokens always receive substantially lower attention, even when long-range dependencies are important. For tasks that require attending to information thousands of tokens back (like answering a question about something mentioned at the beginning of a long document), ALiBi’s distance penalty can be too aggressive.

Recent research has also identified a pathology in ALiBi-based models. A 2026 study found that in the BLOOM model family, ALiBi causes 31 to 44 percent of attention heads to attend almost entirely to the beginning-of-sequence token, a phenomenon called “attention collapse.” This suggests that the linear bias can interfere with the model’s ability to distribute attention effectively.

Source: arXiv:2603.09616, “Attention collapse pathology in the BLOOM family,” 2026. Found that ALiBi causes 31-44% of attention heads to collapse to attending to the beginning-of-sequence token.

This is one reason why RoPE, not ALiBi, became the dominant approach. RoPE provides relative position information without imposing a hard penalty on long-range attention, giving the model more flexibility to learn which distances matter for which tasks.

How Position Encoding Enables Context Extension

One of the most dramatic developments in LLMs between 2023 and 2026 has been the explosion of context window sizes. In 2022, most models supported 2,048 to 4,096 tokens. By March 2026, frontier models support context windows of 1 million tokens or more:

Model	Context Window	Year
GPT-2	1,024 tokens	2019
GPT-3	2,048 tokens	2020
GPT-4 Turbo	128,000 tokens	2023
LLaMA 3.1	128,000 tokens	2024
Gemini 2.5 Pro	1,000,000 tokens	2025
GPT-5	400,000 tokens (272K input + 128K output)	2025
LLaMA 4 Maverick (Instruct)	1,048,576 tokens (1M)	2025
LLaMA 4 Scout (Instruct)	10,000,000 tokens (10M)	2025
Gemini 3 Pro	1,000,000 tokens	2025
GPT-5.4	1,000,000 tokens	2026
Claude Opus 4.6	1,000,000 tokens	2026

Sources: GPT-2 from OpenAI (2019), 1,024 context; GPT-3 from OpenAI (2020), 2,048 context; GPT-4 Turbo from OpenAI (November 2023), 128K context; LLaMA 3.1 from Meta (July 2024), 128K context; Gemini 2.5 Pro from Google (March 25, 2025), 1M context; GPT-5 from OpenAI (August 7, 2025), 400K total context (272K input + 128K output); LLaMA 4 Maverick Instruct from Meta (April 2025), 1M context (pre-trained at 256K, fine-tuned to 1M); LLaMA 4 Scout Instruct from Meta (April 2025), 10M context (pre-trained at 256K, fine-tuned to 10M); Gemini 3 Pro from Google (November 18, 2025), 1M context; GPT-5.4 from OpenAI (March 5, 2026), 1M context (2x pricing beyond 272K tokens); Claude Opus 4.6 from Anthropic (February 5, 2026; 1M GA at standard pricing as of March 13, 2026).

This 10,000x increase in context window size (from 1K to 10M tokens) was made possible in large part by advances in positional encoding, particularly techniques for extending RoPE to longer sequences.

Position Interpolation

The simplest approach to extending a RoPE-based model’s context is position interpolation, proposed by Chen et al. in 2023. The idea is straightforward: instead of using position indices 0, 1, 2, …, L - 1 for a sequence of length L, scale them down so they fit within the original training range.

If a model was trained with a maximum context of 4,096 tokens and you want to extend it to 16,384 tokens, you divide all position indices by 4 (the scaling factor). Position 16,380 becomes position 4,095 in the model’s internal representation, and positions that are not multiples of 4 land on fractional values (16,381 becomes 4,095.25). Because RoPE’s rotation angle is a continuous function of position, fractional positions are well defined. The model sees the position range it was trained on, just more densely packed.

This requires a small amount of fine-tuning (typically a few hundred to a few thousand training steps on longer sequences) to adapt the model to the compressed position space, but it is far cheaper than training from scratch.
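The position scaling can be sketched as follows, using the 4,096-to-16,384 example from above. The rope_angles helper is hypothetical and the dimension is kept tiny for readability; the point is only that every scaled position falls inside the original range.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles of shape (len(positions), dim/2); positions may be fractional."""
    theta = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return np.asarray(positions, dtype=np.float64)[:, None] * theta[None, :]

train_len, target_len, dim = 4096, 16384, 8   # dim kept tiny for readability
scale = target_len / train_len                # 4.0

positions = np.arange(target_len)
interpolated = positions / scale              # 0, 0.25, 0.5, ..., 4095.75

print(interpolated[:5], interpolated[-1])
print("all inside trained range:", interpolated.max() < train_len)

# New position 8192 gets exactly the angles that position 2048 had originally.
print(np.allclose(rope_angles([8192 / scale], dim), rope_angles([2048], dim)))
```

The cost of this approach is visible in the first printed line: neighboring tokens are now only 0.25 position units apart, so every frequency, including the ones used for local detail, is compressed by the same factor.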

NTK-Aware Scaling

A more sophisticated approach is NTK-aware scaling (Neural Tangent Kernel-aware), which modifies the base frequency of RoPE rather than scaling the positions directly. Instead of uniformly compressing all frequencies, NTK-aware scaling increases the base parameter, which stretches out the low-frequency rotations (affecting long-range position encoding) while leaving the high-frequency rotations (affecting local position encoding) relatively unchanged.

The key insight is that high-frequency dimensions encode fine-grained local position information that should not be compressed (you still need to distinguish position 5 from position 6), while low-frequency dimensions encode coarse-grained global position information that can be stretched to cover a wider range.

Code Llama (2023) relied on the same base-raising idea: it used a base of 1,000,000 instead of LLaMA 2’s 10,000, and this 100x larger base, combined with fine-tuning on longer sequences, allowed the model to handle sequences up to 100,000 tokens, roughly 24x LLaMA 2’s 4,096-token context.
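The effect of raising the base on individual frequencies can be sketched numerically. The rule used below, base' = base * s^(d / (d - 2)), is an assumption taken from the community write-up of NTK-aware scaling rather than from this chapter, and the head dimension of 128 is likewise chosen only for illustration.

```python
import numpy as np

def rope_theta(dim, base):
    """Per-pair rotation frequencies for a given base."""
    return 1.0 / base ** (np.arange(0, dim, 2) / dim)

def ntk_scaled_base(base, scale, dim):
    """Assumed 'NTK-aware' rule: base' = base * scale^(dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))

dim, base, scale = 128, 10000.0, 4.0
original = rope_theta(dim, base)
ntk = rope_theta(dim, ntk_scaled_base(base, scale, dim))

ratio = ntk / original
print(f"fastest pair ratio: {ratio[0]:.4f}")    # 1.0000: local resolution kept
print(f"slowest pair ratio: {ratio[-1]:.4f}")   # 0.2500: stretched by 1/scale
```

Under this rule the fastest pair is untouched while the slowest pair is slowed by exactly the scaling factor, with a smooth gradient in between. Plain position interpolation, by contrast, would multiply every entry of this ratio by 0.25.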

YaRN: Yet Another RoPE Extension Method

YaRN (Peng et al., 2023) combined the best ideas from position interpolation and NTK-aware scaling into a unified framework. YaRN applies different scaling factors to different frequency dimensions: high-frequency dimensions (which encode local positions) are left unscaled, low-frequency dimensions (which encode global positions) are interpolated, and dimensions in between get a blend of both approaches. YaRN also includes an attention temperature correction to compensate for changes in the attention score distribution caused by the scaling.

Source: Peng et al., “YaRN: Efficient Context Window Extension of Large Language Models,” arXiv:2309.00071, 2023. YaRN requires 10x fewer tokens and 2.5x fewer training steps than previous context extension methods.

YaRN demonstrated that a LLaMA 2 7B model trained on 4,096 tokens could be extended to 128,000 tokens with minimal fine-tuning and negligible performance loss on the original context range. This was a breakthrough: it showed that context extension could be cheap and effective, not requiring full retraining. DeepSeek-V3 uses YaRN to extend its context from 4K to 32K and then to 128K, applying it in two phases of just 1,000 training steps each.

Source: DeepSeek-V3 technical report (arXiv:2412.19437), Section 4.3. YaRN applied to the decoupled shared key with scale s=40, extending context from 4K to 32K (first phase) and 32K to 128K (second phase), each phase requiring only 1,000 training steps.
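
YaRN’s per-dimension blend can be sketched in a few lines. This is a simplified illustration based on the description in the YaRN paper: the ramp thresholds alpha=1 and beta=32 and the 0.1·ln(s)+1 attention factor are the values the paper suggests for LLaMA-family models, while the function names and the example dimensions are assumptions made here.

```python
import numpy as np

def yarn_inv_freq(dim, base, scale, trained_len, alpha=1.0, beta=32.0):
    """Blend interpolated and original RoPE frequencies per dimension."""
    inv_freq = 1.0 / (base ** (2 * np.arange(dim // 2) / dim))
    wavelength = 2 * np.pi / inv_freq
    # How many full turns this pair completes inside the trained context.
    rotations = trained_len / wavelength
    # gamma = 1: leave unchanged (high frequency); gamma = 0: interpolate fully.
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return (1 - gamma) * inv_freq / scale + gamma * inv_freq

def yarn_attention_factor(scale):
    """Temperature correction: factor applied to the rotated queries and keys."""
    return 0.1 * np.log(scale) + 1.0

dim, base, scale, trained_len = 128, 10000.0, 16.0, 4096  # illustrative
plain = 1.0 / (base ** (2 * np.arange(dim // 2) / dim))
blended = yarn_inv_freq(dim, base, scale, trained_len)

print(blended[0] / plain[0])    # 1.0     -> highest frequency left unscaled
print(blended[-1] / plain[-1])  # 0.0625  -> lowest frequency divided by the scale
print(yarn_attention_factor(scale))
```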

Increasing the Base Frequency

The most direct approach, used by Meta for LLaMA 3.1, is to simply increase the RoPE base frequency during pre-training. LLaMA 3.1 used a base of 500,000 (compared to 10,000 for LLaMA 2) and was trained on sequences up to 128,000 tokens in the later stages of pre-training. This builds long-context support into the base model instead of adding it afterward, at the cost of requiring long training sequences as part of pre-training.

Source: LLaMA 3.1 from Meta (July 2024), rope_theta = 500,000, trained with 128K context. The base frequency increase from 10,000 to 500,000 enabled native support for 128K tokens.

The relationship between base frequency and context length is not one-to-one: doubling the base frequency does not double the usable context length. In the examples above, raising the base by a factor of 50 (10,000 to 500,000) went together with a 32x increase in context (4K to 128K), and the model still had to be trained on long sequences to make use of it.

LLaMA 4’s iRoPE: The Cutting Edge

LLaMA 4 (Meta, April 2025) introduced a novel positional encoding architecture called iRoPE (interleaved Rotary Position Embeddings). Instead of applying RoPE to every attention layer, iRoPE uses a specific interleaving pattern: every fourth layer is a NoPE layer, and the remaining three out of four layers use RoPE. The two layer types work differently:

RoPE layers (3 out of every 4 layers): These use standard Rotary Position Embeddings with chunked local attention. Each token attends only to other tokens within a fixed-size local window of 8,192 tokens. RoPE provides fine-grained relative position information within each chunk. Tokens cannot attend across chunk boundaries in these layers.
NoPE layers (every 4th layer): These use global attention with no positional encoding at all. Every token can attend to every other token in the entire sequence using the full causal mask. Without positional encoding, these layers treat the input as a set, relying on the positional information already injected by the RoPE layers and the causal mask to maintain order awareness. These layers also use attention temperature tuning (a scaled softmax) to prevent attention scores from fading toward zero in very long sequences.

Source: LLaMA 4 architecture from Meta AI (April 2025) and HuggingFace Transformers implementation. iRoPE interleaves RoPE-based local attention layers (chunk size 8,192) with NoPE-based global attention layers. NoPE layers appear every 4th layer. LLaMA 4 models were pre-trained with 256K context; instruct versions were fine-tuned to 1M (Maverick) and 10M (Scout).

This hybrid approach is a clever engineering tradeoff. The RoPE layers handle local context efficiently: with a chunk size of 8,192 tokens, each token only attends to nearby tokens within its chunk, keeping computation manageable. The NoPE layers handle global context: each token can attend to any other token in the sequence, enabling long-range information flow. By interleaving the two (three local layers for every one global layer), the model gets both fine-grained local position awareness and broad global context, without requiring every layer to compute attention over the full sequence length.

The NoPE layers are particularly interesting. Research has shown that causal masking (where each token can only attend to tokens that came before it, not after) implicitly provides some positional information even without explicit positional encoding. The causal mask creates an asymmetry: the token at position 5 can attend to tokens at positions 0 through 4, while the token at position 10 can attend to tokens at positions 0 through 9. This difference in the set of available tokens gives the model indirect information about position.
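
A toy calculation shows how a causal mask alone can leak position. The setup is a deliberate simplification and an assumption of this sketch: attention weights are uniform, and only the first token carries a nonzero value (standing in for a distinctive start-of-sequence token). Real models learn far richer mechanisms than this.

```python
import numpy as np

seq_len = 8

# Value "vectors" (scalars here): the first token carries 1.0, all others 0.0.
values = np.zeros(seq_len)
values[0] = 1.0

# Causal mask with uniform attention: each token averages over itself and
# every earlier token. No positional encoding is involved anywhere.
causal = np.tril(np.ones((seq_len, seq_len)))
weights = causal / causal.sum(axis=1, keepdims=True)
outputs = weights @ values

for t, out in enumerate(outputs):
    print(f"position {t}: output = {out:.3f}  (equals 1 / {t + 1})")
```

The output at position t is 1/(t+1), so a later layer could in principle recover the position from it, even though no position signal was ever added.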

To visualize how chunked attention works in the RoPE layers, consider a sequence of 6 tokens with a chunk size of 3:

Token:        0  1  2  3  4  5
Position:     0  1  2  3  4  5

RoPE layer (chunk_size=3):
  Token 0:    ■  .  .  .  .  .    (attends to chunk [0,1,2] only)
  Token 1:    ■  ■  .  .  .  .
  Token 2:    ■  ■  ■  .  .  .
  Token 3:    .  .  .  ■  .  .    (new chunk [3,4,5] starts)
  Token 4:    .  .  .  ■  ■  .
  Token 5:    .  .  .  ■  ■  ■

NoPE layer (global attention):
  Token 0:    ■  .  .  .  .  .    (full causal mask)
  Token 1:    ■  ■  .  .  .  .
  Token 2:    ■  ■  ■  .  .  .
  Token 3:    ■  ■  ■  ■  .  .
  Token 4:    ■  ■  ■  ■  ■  .
  Token 5:    ■  ■  ■  ■  ■  ■

In the real model, the chunk size is 8,192 tokens. Token 10,000 in a RoPE layer can only attend to tokens 8,192 through 10,000 (its chunk), but in a NoPE layer, it can attend to all tokens from 0 through 10,000.
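
The two mask patterns in the diagram can be generated with a few lines of NumPy. This is a minimal sketch of the masking logic only; the function names are illustrative and are not taken from the LLaMA 4 implementation.

```python
import numpy as np

def causal_mask(seq_len):
    """Global causal mask: token i may attend to tokens 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def chunked_causal_mask(seq_len, chunk_size):
    """Causal mask restricted to tokens in the same fixed-size chunk."""
    chunk_id = np.arange(seq_len) // chunk_size
    same_chunk = chunk_id[:, None] == chunk_id[None, :]
    return causal_mask(seq_len) & same_chunk

def show(mask):
    for i, row in enumerate(mask):
        print(f"  Token {i}:    " + "  ".join("■" if m else "." for m in row))

print("RoPE layer (chunk_size=3):")
show(chunked_causal_mask(6, 3))
print("NoPE layer (global attention):")
show(causal_mask(6))

def first_visible(position, chunk_size):
    """First token a given position can see in a chunked layer."""
    return (position // chunk_size) * chunk_size

print(first_visible(10_000, 8_192))  # 8192
```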

This architecture is what enables LLaMA 4 Scout’s claimed 10 million token context window. Processing 10 million tokens with full attention at every layer would be computationally impossible (the quadratic cost of attention would require astronomical amounts of memory and compute). By using chunked local attention in three out of four layers and reserving global attention for every fourth layer, the model can handle extremely long sequences while keeping the computational cost manageable.
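
As a rough back-of-the-envelope check, we can count how many query-key scores each layer type has to compute at 10 million tokens. This sketch counts attention score entries only and ignores everything else (KV cache memory, feed-forward layers, implementation tricks), so treat it as an order-of-magnitude illustration. The function names are made up for this example.

```python
def causal_pairs_full(n):
    """Query-key pairs under a full causal mask: 1 + 2 + ... + n."""
    return n * (n + 1) // 2

def causal_pairs_chunked(n, chunk):
    """Query-key pairs when attention is confined to fixed-size chunks."""
    full_chunks, rest = divmod(n, chunk)
    return full_chunks * causal_pairs_full(chunk) + causal_pairs_full(rest)

n, chunk = 10_000_000, 8_192

full = causal_pairs_full(n)             # one global (NoPE) layer
local = causal_pairs_chunked(n, chunk)  # one chunked (RoPE) layer

# Compare a group of four layers: all global vs. three chunked + one global.
all_global = 4 * full
interleaved = 3 * local + full

print(f"global layer:  {full:.3e} pairs")
print(f"chunked layer: {local:.3e} pairs")
print(f"4-layer group is {all_global / interleaved:.2f}x cheaper when interleaved")
```

Under this counting, each chunked layer computes on the order of a thousand times fewer scores than a global layer, so almost all of the remaining attention cost sits in the global layers.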

In needle-in-a-haystack retrieval tests, Meta claims perfect text retrieval performance across all 10 million tokens for Scout. Independent analysis by jangwook.net found that Scout maintains over 95% retrieval accuracy up to 8 million tokens, dropping to 89% at the full 10 million token limit. The discrepancy likely reflects differences in test methodology: simple single-needle retrieval (Meta’s test) versus more complex multi-step retrieval tasks. Early users reported that effective context for general tasks begins to degrade well before the theoretical maximum. Meta acknowledged that processing 1.4 million tokens of context requires eight NVIDIA H100 GPUs. The gap between theoretical context window and practical effective context is an active area of research.

Source: Meta AI blog, “Llama 4: Open, Multimodal Intelligence,” April 2025 (claims perfect NIAH retrieval). HuggingFace blog, “Welcome Llama 4 Maverick & Scout on Hugging Face,” April 2025. Independent needle-in-a-haystack analysis from jangwook.net (2025). DeepLearning.AI, “Meta Releases Llama 4 Models,” April 2025. Meta stated that 1.4M tokens requires 8x H100 GPUs.

Comparing Positional Encoding Approaches

Let’s summarize the key differences between the approaches we’ve covered:

Property	Sinusoidal	Learned	RoPE	ALiBi	iRoPE (LLaMA 4)
Type	Absolute	Absolute	Relative	Relative	Hybrid
Extra parameters	None	max_len x d	None	None	None
Where applied	Added to embeddings	Added to embeddings	Rotates Q, K	Bias on attention scores	Mixed per layer
Extrapolation	Poor	Poor	Moderate (with scaling)	Good	Good
Max context (practical)	~512-2K	~1K-4K	~128K-1M+ (with scaling)	~2-8x training length	~1M-10M
Used by	Original Transformer	GPT-2, BERT	LLaMA, Mistral, Qwen, DeepSeek	BLOOM, MPT	LLaMA 4

The trend is clear: the field has moved from absolute encodings (which have hard context limits) to relative encodings (which can be extended), and most recently to hybrid approaches that combine relative encoding with no-encoding layers for maximum flexibility.

Why Different Approaches Matter for Long-Context Performance

The choice of positional encoding has direct, measurable consequences for how well a model handles long contexts. Here are the key tradeoffs:

Local vs. Global Attention

Absolute position embeddings treat every position independently. Position 0 and position 100,000 are just two different entries in a lookup table, with no structural relationship between them. This makes it hard for the model to generalize: patterns learned at positions 0-4,096 during training don’t automatically transfer to positions 50,000-54,096 at inference time.

RoPE’s rotation-based approach encodes relative distance, so patterns learned at any position range transfer to any other position range (as long as the relative distances are similar). A model that learns “the word 3 positions back is often the subject of this verb” can apply that pattern at any absolute position.
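
This transfer property is easy to verify numerically. The sketch below uses random vectors as stand-ins for a query and a key, and the rope helper is a vectorized rotation written for this example. It checks that the attention score depends on the offset between the two positions, not on where the pair sits in the sequence.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles."""
    dim = x.shape[-1]
    theta = 1.0 / (base ** (2 * np.arange(dim // 2) / dim))
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same offset (key is 3 positions back), very different absolute positions.
near_start = rope(q, 5) @ rope(k, 2)
far_along = rope(q, 50_005) @ rope(k, 50_002)
# Different offset (key is 4 positions back).
other_gap = rope(q, 5) @ rope(k, 1)

print(near_start, far_along, other_gap)
```

The first two scores should agree up to floating-point error, while the third generally differs. That is the "3 positions back" pattern being recognized regardless of absolute position.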

ALiBi’s linear bias explicitly encodes a preference for local context, which works well for many language tasks but can hurt performance on tasks requiring long-range retrieval (like finding a specific fact mentioned thousands of tokens earlier).

The “Lost in the Middle” Problem

Research by Liu et al. (2023) documented a phenomenon called “Lost in the Middle”: language models tend to focus on information at the beginning and end of their context window, while information in the middle receives less attention. This is partly a consequence of how the attention mechanism interacts with positional information during training.

Source: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” TACL 2024 (originally arXiv:2307.03172, July 2023).

With absolute position embeddings, the model learns strong associations with the first few positions (which always contain the system prompt or beginning of the document) and the last few positions (which contain the most recent context). Middle positions get less specialized treatment.

RoPE mitigates this somewhat because it encodes relative distance rather than absolute position, but the problem persists in practice. The attention mechanism still tends to allocate more weight to nearby tokens (which are always relevant) and to the beginning of the sequence (which often contains important instructions).

This is an active area of research. Techniques like attention sinks (keeping the first few tokens always accessible), landmark attention (marking important positions for the model to attend to), and sparse attention patterns (Chapter 20) all aim to improve middle-context retrieval.
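
To give a feel for the attention-sink idea, here is a minimal mask sketch: every token can see a few initial "sink" tokens plus a sliding window of recent tokens. The function name and the parameters (num_sinks, window) are hypothetical, and the sketch is only loosely in the spirit of published attention-sink methods, not an implementation of any of them.

```python
import numpy as np

def sink_window_mask(seq_len, num_sinks, window):
    """Causal mask that keeps the first num_sinks tokens plus a recent window."""
    q = np.arange(seq_len)[:, None]   # query positions (rows)
    k = np.arange(seq_len)[None, :]   # key positions (columns)
    causal = k <= q                   # never look ahead
    recent = (q - k) < window         # the last `window` tokens, including self
    sink = k < num_sinks              # the always-visible initial tokens
    return causal & (recent | sink)

mask = sink_window_mask(seq_len=10, num_sinks=2, window=3)
print(mask[9].astype(int))  # [1 1 0 0 0 0 0 1 1 1]
```

The last token sees positions 0 and 1 (the sinks) and positions 7 through 9 (the window), while everything in the middle is masked out.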

Positional Encoding in the Full Pipeline

Let’s place positional encoding in the context of the full model pipeline we’ve been building across chapters. Here’s what happens when you send a prompt to LLaMA 4 Maverick:

Step 1: Tokenization (Chapter 4)
  "The weather in Tokyo is usually mild"
  --> [450, 9235, 304, 27856, 338, 6892, 24312]

Step 2: Embedding Lookup (Chapter 5)
  Each token ID indexes into the embedding table (202,048 x 5,120)
  --> 7 vectors, each with 5,120 dimensions
  --> Matrix of shape [7 x 5,120]

Step 3: Positional Encoding (This Chapter)
  In LLaMA 4 Maverick (iRoPE architecture):
  - RoPE layers (3 out of every 4): Q and K vectors are rotated based
    on position before computing attention. Attention is local,
    chunked to 8,192 tokens.
  - NoPE layers (every 4th layer): No positional modification.
    Attention is global across the full sequence, with temperature
    scaling to prevent attention score fading.
  The embedding vectors themselves are NOT modified by position.
  Position information enters through the attention mechanism.

Step 4: Transformer Layers (Chapters 7-10)
  48 layers of attention and feed-forward processing
  --> [7 x 5,120] (same shape, transformed values)

Step 5: Output Projection + Softmax
  --> 202,048 probabilities

Notice an important distinction: in the original Transformer and in GPT-2, positional encoding is added to the embeddings at Step 2, modifying the input vectors before they enter the Transformer layers. In RoPE-based models like LLaMA 4, the embeddings are not modified. Instead, positional information is injected at Step 3, inside each attention layer, by rotating the query and key vectors. This is a cleaner separation of concerns: the embedding table handles “what does this token mean?” and the positional encoding handles “where is this token?”

Hands-On: Comparing Positional Encoding Methods

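Let’s implement and compare three positional encoding approaches side by side. This code creates sinusoidal, RoPE, and ALiBi encodings and shows how they affect attention scores. To keep the example small, the random token embeddings stand in for both queries and keys (there are no learned projections):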

import numpy as np
import matplotlib.pyplot as plt

def sinusoidal_encoding(max_len, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017)."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len).reshape(-1, 1)
    div_term = 10000 ** (2 * np.arange(d_model // 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

def apply_rope_to_vector(x, position, base=10000.0):
    """Apply RoPE rotation to a vector at a given position."""
    dim = len(x)
    theta = 1.0 / (base ** (2 * np.arange(dim // 2) / dim))
    angles = position * theta
    x_rot = np.zeros_like(x)
    for i in range(dim // 2):
        c, s = np.cos(angles[i]), np.sin(angles[i])
        x_rot[2*i]   = x[2*i] * c - x[2*i+1] * s
        x_rot[2*i+1] = x[2*i] * s + x[2*i+1] * c
    return x_rot

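def alibi_bias(query_pos, key_pos, slope):
    """Compute ALiBi bias for a single query-key pair."""
    # The bias is negative and grows with distance, so distant pairs are penalized.
    # abs() keeps the penalty symmetric because this demo applies no causal mask.
    return -slope * abs(query_pos - key_pos)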

# Parameters
d_model = 64
seq_len = 32
np.random.seed(42)

# Generate random token embeddings (simulating a sequence)
token_embeddings = np.random.randn(seq_len, d_model) * 0.1

# --- Method 1: Sinusoidal ---
pe = sinusoidal_encoding(seq_len, d_model)
sinusoidal_inputs = token_embeddings + pe
sin_attn = sinusoidal_inputs @ sinusoidal_inputs.T  # raw attention scores

# --- Method 2: RoPE ---
rope_q = np.array([apply_rope_to_vector(token_embeddings[i], i) for i in range(seq_len)])
rope_k = np.array([apply_rope_to_vector(token_embeddings[i], i) for i in range(seq_len)])
rope_attn = rope_q @ rope_k.T

# --- Method 3: ALiBi ---
base_attn = token_embeddings @ token_embeddings.T  # no positional info
slope = 0.125  # one example head slope
alibi_matrix = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        alibi_matrix[i, j] = alibi_bias(i, j, slope)
alibi_attn = base_attn + alibi_matrix

# Plot all three
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, attn, title in zip(axes,
    [sin_attn, rope_attn, alibi_attn],
    ["Sinusoidal", "RoPE", "ALiBi (slope=0.125)"]):
    im = ax.imshow(attn, cmap="RdBu", aspect="auto")
    ax.set_title(f"{title} Attention Scores")
    ax.set_xlabel("Key position")
    ax.set_ylabel("Query position")
    plt.colorbar(im, ax=ax, fraction=0.046)

plt.tight_layout()
plt.savefig("positional_encoding_comparison.png", dpi=150)
plt.show()
print("Plot saved to positional_encoding_comparison.png")

This visualization shows how each method shapes the raw attention scores differently. The sinusoidal method adds position-dependent patterns to the embeddings before computing attention. RoPE rotates the vectors, creating position-dependent interference patterns in the dot products. ALiBi adds a visible diagonal gradient (the linear bias), clearly penalizing distant token pairs.

The Evolution of Context Windows: A Timeline

The history of positional encoding is inseparable from the history of context window expansion. Here’s how the two evolved together:

2017: The Original Transformer. Vaswani et al. introduced sinusoidal positional encoding. The model was trained and evaluated on sequences of a few hundred tokens. The encoding could theoretically handle any length, but in practice, performance degraded beyond the training range.

2018-2019: Learned Embeddings. BERT (512 tokens) and GPT-2 (1,024 tokens) used learned positional embeddings. These models had hard context limits defined by the size of their position embedding tables. GPT-2 could not process a single token beyond position 1,023.

2020: GPT-3. GPT-3 used learned positional embeddings with a 2,048-token context window. Still a hard limit, but double GPT-2’s capacity.

2021: RoPE Introduced. Su et al. published the RoPE paper, proposing rotation-based positional encoding. This was initially adopted by smaller models and research projects.

2022: ALiBi and BLOOM. Press et al. published ALiBi, demonstrating strong extrapolation. BLOOM (176B parameters) adopted ALiBi. Meanwhile, RoPE began gaining traction in the open-source community.

2023: RoPE Goes Mainstream. LLaMA 1 and 2 (Meta), Mistral 7B (Mistral AI), and many other models adopted RoPE. Context windows reached 4,096 to 8,192 tokens. Code Llama pushed to 100,000 tokens by increasing the RoPE base to 1,000,000. GPT-4 Turbo reached 128,000 tokens (OpenAI has not publicly disclosed its positional encoding method). YaRN and NTK-aware scaling demonstrated cheap context extension for RoPE-based models.

2024: 128K Becomes Standard. LLaMA 3.1 (Meta, July 2024) shipped with 128,000-token context using RoPE with a base of 500,000. Gemini 1.5 Pro (Google) reached 1 million tokens. DeepSeek-V3 (December 2024) used RoPE with a 128,000-token context. The 128K context window became the baseline for frontier models.

2025: The Million-Token Era. LLaMA 4 Maverick Instruct reached 1 million tokens using iRoPE (pre-trained at 256K, fine-tuned to 1M). LLaMA 4 Scout Instruct claimed 10 million tokens (pre-trained at 256K, fine-tuned to 10M). Gemini 2.5 Pro offered 1 million tokens. GPT-5 supported up to 400,000 tokens (272K input + 128K output). Gemini 3 Pro (November 2025) shipped with 1 million tokens. By the end of 2025, all three major frontier model families (GPT, Claude, Gemini) were converging on 1 million token context windows.

2026: 1M Becomes the Standard. GPT-5.4 (OpenAI, March 5, 2026) expanded to 1 million tokens, with 2x pricing for requests exceeding 272K tokens. Claude Opus 4.6 (Anthropic, February 5, 2026) launched with a 1 million token context window in beta, and on March 13, 2026, Anthropic made the 1M context generally available at standard pricing with no premium surcharge. As of March 2026, every major frontier model offers at least 1 million tokens of context.

Sources: Context window progression compiled from official announcements by OpenAI, Meta, Google, Anthropic, Mistral AI, and DeepSeek (2017-2026). GPT-5.4 1M context from OpenAI (March 5, 2026). Claude Opus 4.6 1M GA from Anthropic (March 13, 2026).

Key Takeaways

Transformers process all tokens simultaneously using self-attention, which is permutation-invariant: it has no built-in sense of token order. Without positional encoding, “The dog bit the man” and “The man bit the dog” would be indistinguishable to the model.
Sinusoidal positional encoding (Vaswani et al., 2017) uses sine and cosine functions at different frequencies to create a unique vector for each position. It is added to the token embeddings before the Transformer layers. It does not extrapolate well beyond the training sequence length.
Learned positional embeddings (used by GPT-2 and BERT) store a learnable vector for each position in a lookup table. GPT-2’s table has shape [1,024 x 768]. Unlike sinusoidal encoding, which can be computed for any position, learned embeddings have a hard maximum sequence length and cannot handle positions beyond the table size.
Rotary Position Embeddings (RoPE) (Su et al., 2021) encode position by rotating query and key vectors in the attention mechanism. The dot product between rotated vectors depends only on relative distance, not absolute position. RoPE adds no extra parameters and has become the dominant positional encoding in modern open-weight LLMs, used by LLaMA, Mistral, Qwen, and DeepSeek.
The RoPE base frequency (rope_theta) controls the range of rotation frequencies. Increasing the base from 10,000 (LLaMA 2) to 500,000 (LLaMA 3.1, LLaMA 4) enables much longer context windows. Context extension techniques like position interpolation, NTK-aware scaling, and YaRN (Peng et al., 2023) allow RoPE-based models to be extended to longer contexts with minimal fine-tuning.
ALiBi (Press et al., 2022) adds a linear distance-based penalty to attention scores. It requires no extra parameters and extrapolates well to longer sequences, but its distance penalty can limit long-range attention. It was used by BLOOM and MPT but has been largely superseded by RoPE.
iRoPE (LLaMA 4, Meta, April 2025) is a hybrid approach that interleaves RoPE-based local attention layers (with chunked attention of 8,192 tokens) with no-positional-encoding (NoPE) global attention layers. NoPE layers appear every fourth layer. This enables context windows of 1 million tokens (Maverick) to 10 million tokens (Scout), though effective context utilization may be shorter than the theoretical maximum.
Context windows have grown from 1,024 tokens (GPT-2, 2019) to 10 million tokens (LLaMA 4 Scout, 2025), a 10,000x increase in six years. As of March 2026, every major frontier model (GPT-5.4, Claude Opus 4.6, Gemini 3 Pro) offers at least 1 million tokens of context. This was enabled by the shift from absolute to relative positional encodings and by techniques for extending RoPE to longer sequences.
In RoPE-based models, positional information is injected inside the attention mechanism (by rotating Q and K vectors), not by modifying the input embeddings. This is a cleaner separation: the embedding table handles token meaning, and the positional encoding handles token position.

What’s Next

You now know how tokens get their meaning (Chapter 5, embeddings) and how the model knows their order (this chapter, positional encoding). With these two pieces in place, the input is ready for the core computation that makes Transformers powerful: attention. In Chapter 7, we’ll dive into the self-attention mechanism itself, walking through the full computation of queries, keys, and values with real numbers, and showing how every token in a sequence learns to attend to every other token.
