Chapter 7. Attention, The Core Innovation

Every modern language model, from the smallest chatbot to the largest frontier system, is built on a single mechanism: attention. It is the operation that lets each token in a sequence look at every other token and decide which ones are relevant. Without attention, a Transformer would process each token in isolation, unable to connect “it” to the noun it refers to, or “Paris” to the “France” mentioned three sentences earlier. Attention is what makes Transformers work, and understanding it in detail is the key to understanding everything that follows in this book.

What Problem Does Attention Solve?

In Chapters 5 and 6, you learned how each token gets a rich embedding vector (Chapter 5) and how positional information is injected so the model knows token order (Chapter 6). After those two steps, the model has a matrix of shape [sequence_length x hidden_size], where each row is a vector representing one token’s meaning and position.

But there is a fundamental problem: each token’s vector only contains information about that one token. The vector for “bank” is the same whether the sentence is “I sat by the river bank” or “I went to the bank to deposit money.” The embedding table gives “bank” a single, context-free representation. To understand what “bank” means in context, the model needs a way to combine information from surrounding tokens.

This is exactly what attention does. It lets each token “look at” every other token in the sequence and pull in relevant information. After one attention step, the vector for “bank” in “river bank” will be different from the vector for “bank” in “bank to deposit money,” because the attention mechanism mixed in different contextual information from the surrounding words.

Before Transformers, Recurrent Neural Networks (RNNs) solved this problem by processing tokens one at a time, passing information forward through a hidden state. But this sequential processing was slow (you could not parallelize it across tokens) and the hidden state acted as a bottleneck: information from early tokens had to survive through many sequential steps to reach later tokens, and it often degraded along the way.

Attention solves both problems. It processes all tokens simultaneously (enabling parallelization on GPUs), and it creates direct connections between any two tokens in the sequence, regardless of how far apart they are. Token 500 can directly attend to token 1 in a single step, with no information bottleneck.

The Three Ingredients: Queries, Keys, and Values

The attention mechanism operates on three vectors for each token, called the query (Q), the key (K), and the value (V). These names come from an analogy to information retrieval:

The query represents “what am I looking for?” Each token generates a query vector that encodes what kind of information it needs from other tokens.
The key represents “what do I contain?” Each token generates a key vector that advertises what kind of information it offers.
The value represents “what information do I actually provide?” Each token generates a value vector that contains the actual content to be passed along if this token is deemed relevant.

The attention mechanism works by comparing each token’s query against every token’s key, including its own. When a query and a key are similar (their dot product is high), it means the querying token finds the key token relevant. The output for each token is then a weighted sum of all tokens’ value vectors, where the weights come from these query-key comparisons.

Where Q, K, V Come From

The query, key, and value vectors are not stored in separate tables. They are computed from the input embeddings using three learned weight matrices: W_Q, W_K, and W_V.

For each token’s input vector x (which comes from the embedding + positional encoding from previous chapters, or from the output of the previous Transformer layer), the model computes:

q = x * W_Q    (query vector)
k = x * W_K    (key vector)
v = x * W_V    (value vector)

Where:

x has shape [hidden_size] (e.g., 5,120 for LLaMA 4 Maverick)
W_Q has shape [hidden_size x d_k]
W_K has shape [hidden_size x d_k]
W_V has shape [hidden_size x d_v]
d_k is the dimension of query and key vectors
d_v is the dimension of value vectors

In the original Transformer (Vaswani et al., 2017), d_model was 512, and with 8 attention heads, each head used d_k = d_v = 64. In modern models, the head dimension is typically 128. LLaMA 4 Maverick uses a head dimension of 128 with 40 query heads.

Source: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017, Section 3.2.2. d_model = 512, h = 8 heads, d_k = d_v = 64. LLaMA 4 Maverick config from Ollama model metadata and HuggingFace: head_dim = 128, 40 query attention heads, 8 KV heads, hidden_size = 5,120, 48 layers.

The key insight is that W_Q and W_K are different matrices. This means the query vector for a token is different from its key vector. A token might be “looking for” something very different from what it “advertises.” For example, a pronoun like “it” might generate a query that looks for nouns (because it needs to find its referent), while a noun like “cat” might generate a key that advertises “I am a noun, I am an animal.” The asymmetry between queries and keys is what makes attention so flexible.
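A small NumPy sketch can make this asymmetry concrete. The sizes, the seed, and the random matrices below are illustrative assumptions rather than values from any real model. The only point is that separate W_Q and W_K matrices generally give a score matrix that is not symmetric, while reusing one matrix for both (a hypothetical variant) forces symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility
n, hidden_size, d_k = 4, 8, 3       # toy sizes chosen for illustration

X = rng.normal(size=(n, hidden_size))
W_Q = rng.normal(size=(hidden_size, d_k))
W_K = rng.normal(size=(hidden_size, d_k))

# Separate projections: scores[i, j] generally differs from scores[j, i]
scores = (X @ W_Q) @ (X @ W_K).T

# Shared projection (hypothetical variant): scores become symmetric
shared = (X @ W_Q) @ (X @ W_Q).T

print("separate W_Q / W_K symmetric?", np.allclose(scores, scores.T))
print("shared projection symmetric? ", np.allclose(shared, shared.T))
```

With a symmetric score matrix, token i would always find token j exactly as relevant as token j finds token i, which is the restriction that separate query and key projections remove.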

The Full Attention Computation: Step by Step

Let’s walk through the complete attention computation. We will use a concrete example with real numbers to make every step tangible.

Step 1: Compute Q, K, V for All Tokens

For a sequence of n tokens, each with a hidden_size-dimensional input vector, we compute queries, keys, and values for all tokens at once using matrix multiplication:

Q = X * W_Q    shape: [n x d_k]
K = X * W_K    shape: [n x d_k]
V = X * W_V    shape: [n x d_v]

Where X is the input matrix of shape [n x hidden_size], with one row per token.

Step 2: Compute Attention Scores (Q * K^T)

Next, we compute how relevant each token is to every other token by taking the dot product of each query with every key:

scores = Q * K^T    shape: [n x n]

The entry at position (i, j) in this matrix is the dot product of token i’s query with token j’s key. A high value means token i finds token j relevant; a low value means token j is not relevant to token i.

This is the step that creates the quadratic cost of attention: for n tokens, we compute n x n dot products. For a sequence of 1,000 tokens, that is 1,000,000 dot products. For 100,000 tokens, it is 10 billion dot products. This is why long context windows are expensive.

Step 3: Scale the Scores

The raw dot products can be very large, especially when the dimension d_k is large. Large values cause the softmax function (next step) to produce very peaked distributions, where almost all the weight goes to a single token. This makes gradients very small during training, slowing down learning.

The solution, proposed in the original Transformer paper, is to divide the scores by the square root of d_k:

scaled_scores = scores / sqrt(d_k)

For d_k = 128 (as in LLaMA 4 Maverick), the scores are divided by sqrt(128), which is approximately 11.31. This keeps the scores in a range where softmax produces useful, non-degenerate distributions.

Source: Vaswani et al., 2017, Section 3.2.1: “We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/sqrt(d_k).”

Step 4: Apply Causal Mask (for Decoder Models)

In a decoder-only model (which is what GPT, LLaMA, Claude, and most modern LLMs are), each token should only attend to tokens that came before it in the sequence, not to future tokens. This is because during generation, future tokens have not been produced yet, so the model cannot look at them.

This is enforced by a causal mask: we set all entries in the score matrix where j > i (key position is after query position) to negative infinity. When these negative-infinity values pass through softmax, they become zero, effectively preventing the model from attending to future tokens.

For each position (i, j) in the score matrix:
  if j > i:  set score to -infinity

The resulting mask looks like this for a 5-token sequence:

Token:    0     1     2     3     4
    0: [ ok   -inf  -inf  -inf  -inf ]
    1: [ ok    ok   -inf  -inf  -inf ]
    2: [ ok    ok    ok   -inf  -inf ]
    3: [ ok    ok    ok    ok   -inf ]
    4: [ ok    ok    ok    ok    ok  ]

Token 0 can only attend to itself. Token 1 can attend to tokens 0 and 1. Token 4 can attend to all five tokens. This is called causal masking because it enforces a causal relationship: the output for each token depends only on tokens that came before it (and itself), never on future tokens.

Step 5: Apply Softmax

After scaling and masking, we apply the softmax function to each row of the score matrix. Softmax converts the raw scores into a probability distribution: all values become non-negative (masked positions become exactly zero) and each row sums to 1.

attention_weights = softmax(scaled_scores)    shape: [n x n]

Each row i of the attention_weights matrix is a probability distribution over all tokens, telling us how much token i should attend to each other token. If token i’s query is very similar to token j’s key, the weight at position (i, j) will be high. If they are dissimilar, the weight will be near zero.
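As a minimal sketch of this step, the snippet below applies a row-wise softmax to a small masked score matrix. The scores are made-up numbers and `softmax_rows` is a helper name introduced here for illustration. Subtracting the row maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

# Toy 3-token scaled scores (made-up numbers)
scores = np.array([[0.2, 0.9, 0.1],
                   [0.5, 0.3, 0.8],
                   [0.1, 0.4, 0.7]])

# Causal mask: True where the key position j is after the query position i
blocked = np.triu(np.ones((3, 3), dtype=bool), k=1)
masked = np.where(blocked, -np.inf, scores)

weights = softmax_rows(masked)
print(weights.round(3))
print("row sums:", weights.sum(axis=-1))
```

Because exp(-infinity) is exactly 0, the masked positions receive zero weight and the remaining entries in each row are renormalized to sum to 1.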

Step 6: Compute the Output (Weights * V)

Finally, we use the attention weights to compute a weighted sum of the value vectors:

output = attention_weights * V    shape: [n x d_v]

For each token i, the output vector is:

output_i = sum over all j of (attention_weight[i][j] * v_j)

This is the core of attention: each token’s output is a blend of the value vectors of all tokens it can attend to (itself included), weighted by how relevant each token is (as determined by the query-key comparison). Tokens that are highly relevant contribute more to the output; irrelevant tokens contribute almost nothing.
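To connect the per-token sum with the matrix form, here is a small sketch that computes the output both ways and checks that they agree. The attention weights and value vectors are made-up numbers chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up causal attention weights for 3 tokens (each row sums to 1)
attention_weights = np.array([[1.0, 0.0, 0.0],
                              [0.4, 0.6, 0.0],
                              [0.2, 0.3, 0.5]])
# Made-up value vectors with d_v = 2
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

# Explicit form: output_i = sum over j of attention_weight[i][j] * v_j
n, d_v = V.shape
output_loop = np.zeros((n, d_v))
for i in range(n):
    for j in range(n):
        output_loop[i] += attention_weights[i, j] * V[j]

# Matrix form: the same computation as a single product
output = attention_weights @ V

print(output)
print("loop and matrix forms agree:", np.allclose(output, output_loop))
```

For the last token, the result is 0.2 * [1, 0] + 0.3 * [0, 1] + 0.5 * [2, 2] = [1.2, 1.3], which is what both forms produce.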

The Complete Formula

Putting it all together, the scaled dot-product attention formula from Vaswani et al. (2017) is:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

This single formula is the heart of every Transformer model. Everything else in the architecture (multi-head attention, feed-forward networks, layer normalization, residual connections) is built around this operation.

Source: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017, Equation 1.

Why It’s Called “Self-Attention”

The attention mechanism described above is called self-attention (or “intra-attention”) because the queries, keys, and values all come from the same sequence. Each token in the input attends to every other token in the same input. The “self” means the sequence is attending to itself.

This is in contrast to cross-attention, which is used in encoder-decoder models (like the original Transformer for machine translation). In cross-attention, the queries come from one sequence (e.g., the partially generated translation) and the keys and values come from a different sequence (e.g., the source sentence). Cross-attention lets the decoder “look at” the encoder’s output to decide what to translate next.
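The difference shows up directly in the shapes. The sketch below assumes a 5-token source sequence and a 3-token target sequence with toy dimensions. All sizes, the seed, and the names `encoder_out` and `decoder_in` are illustrative, not taken from a real model.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed
n_src, n_tgt, d_model, d_k = 5, 3, 8, 4   # toy sizes, not from a real model

encoder_out = rng.normal(size=(n_src, d_model))  # stands in for the encoder output
decoder_in = rng.normal(size=(n_tgt, d_model))   # stands in for the decoder states

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = decoder_in @ W_Q      # queries come from the target side
K = encoder_out @ W_K     # keys and values come from the source side
V = encoder_out @ W_V

scores = Q @ K.T / np.sqrt(d_k)                          # [n_tgt, n_src]
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
output = weights @ V                                     # [n_tgt, d_k]

print(weights.shape, output.shape)   # → (3, 5) (3, 4)
```

In self-attention the weight matrix is always square ([n x n]); here it is [3 x 5], one row per target token and one column per source token. No causal mask is applied in this sketch, since the whole source sentence is already available.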

Modern decoder-only LLMs (GPT, LLaMA, Claude, DeepSeek) use only self-attention. The model attends to its own input: the concatenation of the system prompt, user message, and any tokens generated so far. Every token can attend to every previous token in this combined sequence (subject to the causal mask).

A Worked Example with Real Numbers

Let’s trace the full attention computation through a concrete example. We will use a tiny 4-token sequence with 4-dimensional embeddings to keep the numbers manageable, but the math is identical to what happens in a real model with thousands of dimensions.

Setup

Suppose our 4-token sequence is: “The cat sat down”

After embedding and positional encoding, we have an input matrix X of shape [4 x 4]:

X = [[ 1.0,  0.5, -0.3,  0.8],   # "The"
     [ 0.2,  1.2,  0.7, -0.1],   # "cat"
     [-0.5,  0.3,  1.1,  0.4],   # "sat"
     [ 0.6, -0.2,  0.5,  1.3]]   # "down"

And suppose our learned weight matrices (for one attention head) are:

W_Q = [[ 0.1,  0.3],
       [ 0.4, -0.2],
       [-0.1,  0.5],
       [ 0.2,  0.1]]

W_K = [[ 0.3, -0.1],
       [ 0.2,  0.4],
       [ 0.1,  0.3],
       [-0.2,  0.2]]

W_V = [[ 0.5,  0.1],
       [-0.3,  0.6],
       [ 0.2, -0.1],
       [ 0.4,  0.3]]

Here d_k = d_v = 2 (tiny, for illustration; real models use 128).

Step 1: Compute Q, K, V

Q = X * W_Q

"The":  [1.0*0.1 + 0.5*0.4 + (-0.3)*(-0.1) + 0.8*0.2,
          1.0*0.3 + 0.5*(-0.2) + (-0.3)*0.5 + 0.8*0.1]
      = [0.1 + 0.2 + 0.03 + 0.16,  0.3 - 0.1 - 0.15 + 0.08]
      = [0.49, 0.13]

"cat":  [0.2*0.1 + 1.2*0.4 + 0.7*(-0.1) + (-0.1)*0.2,
          0.2*0.3 + 1.2*(-0.2) + 0.7*0.5 + (-0.1)*0.1]
      = [0.02 + 0.48 - 0.07 - 0.02,  0.06 - 0.24 + 0.35 - 0.01]
      = [0.41, 0.16]

"sat":  [(-0.5)*0.1 + 0.3*0.4 + 1.1*(-0.1) + 0.4*0.2,
          (-0.5)*0.3 + 0.3*(-0.2) + 1.1*0.5 + 0.4*0.1]
      = [-0.05 + 0.12 - 0.11 + 0.08,  -0.15 - 0.06 + 0.55 + 0.04]
      = [0.04, 0.38]

"down": [0.6*0.1 + (-0.2)*0.4 + 0.5*(-0.1) + 1.3*0.2,
          0.6*0.3 + (-0.2)*(-0.2) + 0.5*0.5 + 1.3*0.1]
      = [0.06 - 0.08 - 0.05 + 0.26,  0.18 + 0.04 + 0.25 + 0.13]
      = [0.19, 0.60]

So:

Q = [[ 0.49,  0.13],    # "The"
     [ 0.41,  0.16],    # "cat"
     [ 0.04,  0.38],    # "sat"
     [ 0.19,  0.60]]    # "down"

Computing K = X * W_K and V = X * W_V by the same process (only the results are shown, to keep the example moving):

K = [[ 0.21,  0.17],    # "The"
     [ 0.39,  0.65],    # "cat"
     [-0.06,  0.58],    # "sat"
     [-0.07,  0.27]]    # "down"

V = [[ 0.61,  0.67],    # "The"
     [-0.16,  0.64],    # "cat"
     [ 0.04,  0.14],    # "sat"
     [ 0.98,  0.28]]    # "down"

Step 2: Compute Q * K^T

scores[i][j] = dot(Q[i], K[j])

scores[0][0] = 0.49*0.21 + 0.13*0.17 = 0.103 + 0.022 = 0.125
scores[0][1] = 0.49*0.39 + 0.13*0.65 = 0.191 + 0.085 = 0.276
scores[0][2] = 0.49*(-0.06) + 0.13*0.58 = -0.029 + 0.075 = 0.046
scores[0][3] = 0.49*(-0.07) + 0.13*0.27 = -0.034 + 0.035 = 0.001

(and so on for rows 1, 2, 3)

Full score matrix:

scores = [[ 0.125,  0.276,  0.046,  0.001],
          [ 0.113,  0.264,  0.068,  0.014],
          [ 0.073,  0.263,  0.218,  0.100],
          [ 0.142,  0.464,  0.337,  0.149]]

Step 3: Scale by sqrt(d_k)

With d_k = 2, sqrt(d_k) = 1.414:

scaled = scores / 1.414

scaled = [[ 0.088,  0.195,  0.033,  0.001],
          [ 0.080,  0.187,  0.048,  0.010],
          [ 0.052,  0.186,  0.154,  0.071],
          [ 0.100,  0.328,  0.238,  0.105]]

Step 4: Apply Causal Mask

Set future positions to -infinity:

masked = [[ 0.088,  -inf,   -inf,   -inf ],
          [ 0.080,  0.187,  -inf,   -inf ],
          [ 0.052,  0.186,  0.154,  -inf ],
          [ 0.100,  0.328,  0.238,  0.105]]

Step 5: Softmax (Row-wise)

For row 0 (only one valid entry): softmax([0.088]) = [1.000]
For row 1: softmax([0.080, 0.187]) = [0.473, 0.527]
For row 2: softmax([0.052, 0.186, 0.154]) = [0.308, 0.352, 0.341]
For row 3: softmax([0.100, 0.328, 0.238, 0.105]) = [0.227, 0.285, 0.260, 0.228]

attention_weights = [[ 1.000,  0.000,  0.000,  0.000],
                     [ 0.473,  0.527,  0.000,  0.000],
                     [ 0.308,  0.352,  0.341,  0.000],
                     [ 0.227,  0.285,  0.260,  0.228]]

Step 6: Compute Output (Weights * V)

output[0] = 1.000 * V[0]
          = [0.610, 0.670]

output[1] = 0.473 * V[0] + 0.527 * V[1]
          = 0.473*[0.61, 0.67] + 0.527*[-0.16, 0.64]
          = [0.289, 0.317] + [-0.084, 0.337]
          = [0.205, 0.654]

output[2] = 0.308 * V[0] + 0.352 * V[1] + 0.341 * V[2]
          = [0.188, 0.206] + [-0.056, 0.225] + [0.014, 0.048]
          = [0.145, 0.479]

output[3] = 0.227 * V[0] + 0.285 * V[1] + 0.260 * V[2] + 0.228 * V[3]
          = [0.139, 0.152] + [-0.046, 0.182] + [0.010, 0.036] + [0.223, 0.064]
          = [0.327, 0.435]
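The hand calculation can be checked with a few lines of NumPy. This sketch uses exactly the X, W_Q, W_K, and W_V given in the Setup. Because the hand calculation rounds intermediate values to three decimals, the last digit may differ slightly from what the code prints.

```python
import numpy as np

X = np.array([[ 1.0,  0.5, -0.3,  0.8],    # "The"
              [ 0.2,  1.2,  0.7, -0.1],    # "cat"
              [-0.5,  0.3,  1.1,  0.4],    # "sat"
              [ 0.6, -0.2,  0.5,  1.3]])   # "down"
W_Q = np.array([[0.1, 0.3], [0.4, -0.2], [-0.1, 0.5], [0.2, 0.1]])
W_K = np.array([[0.3, -0.1], [0.2, 0.4], [0.1, 0.3], [-0.2, 0.2]])
W_V = np.array([[0.5, 0.1], [-0.3, 0.6], [0.2, -0.1], [0.4, 0.3]])

# Step 1: project to queries, keys, values
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Steps 2-3: scores, scaled by sqrt(d_k) with d_k = 2
scaled = Q @ K.T / np.sqrt(Q.shape[-1])

# Step 4: causal mask (True above the diagonal means "blocked")
future = np.triu(np.ones(scaled.shape, dtype=bool), k=1)
masked = np.where(future, -np.inf, scaled)

# Step 5: row-wise softmax
exp_scores = np.exp(masked - masked.max(axis=-1, keepdims=True))
attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Step 6: weighted sum of values
output = attention_weights @ V

np.set_printoptions(precision=3, suppress=True)
print(attention_weights)
print(output)
```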

What Just Happened?

Look at the attention weights for each token:

“The” (token 0): Attends 100% to itself. It is the first token, so there is nothing else to look at.
“cat” (token 1): Attends 47% to “The” and 53% to itself. It is pulling in some information from the preceding article.
“sat” (token 2): Attends roughly equally to the three tokens it can see, “The,” “cat,” and itself (31%, 35%, 34%). It is gathering context from the preceding phrase “The cat,” with a slight preference for “cat.”
“down” (token 3): Attends to all four tokens, with the strongest weight on “cat” (28.5%) and roughly equal weight on the others (23%, 26%, 23%). “cat” gets the most weight because its key has the highest dot product with the query for “down” (0.464 before scaling).

The output vectors are now context-dependent. The output for “cat” is no longer just the embedding of “cat”; it is a blend of “The” and “cat” information. The output for “sat” incorporates information from “The,” “cat,” and “sat.” This is how attention creates contextual representations.

In a real model, the attention patterns would be much more selective. With 128-dimensional queries and keys (instead of our toy 2-dimensional ones), the model can learn very precise patterns: a verb might attend strongly to its subject, a pronoun might attend to the noun it refers to, and a closing bracket might attend to the matching opening bracket.

Hands-On: Implementing Scaled Dot-Product Attention

Let’s implement the full attention computation in Python, matching the formula from the original paper:

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute scaled dot-product attention.

    Q: query matrix, shape [n_queries, d_k]
    K: key matrix, shape [n_keys, d_k]
    V: value matrix, shape [n_keys, d_v]
    mask: optional boolean mask, shape [n_queries, n_keys]
          True means "block this position" (the score is replaced by
          -1e9, a large negative stand-in for -inf)

    Returns:
        output: shape [n_queries, d_v]
        attention_weights: shape [n_queries, n_keys]
    """
    d_k = Q.shape[-1]

    # Step 1: Q * K^T
    scores = Q @ K.T  # [n_queries, n_keys]

    # Step 2: Scale
    scores = scores / np.sqrt(d_k)

    # Step 3: Apply mask (if provided)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)

    # Step 4: Softmax (row-wise)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    # Step 5: Weighted sum of values
    output = attention_weights @ V  # [n_queries, d_v]

    return output, attention_weights


def causal_mask(seq_len):
    """Create a causal mask: True where attention should be blocked."""
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return mask


# Example: 6-token sequence with d_k = d_v = 8
np.random.seed(42)
seq_len = 6
d_model = 16
d_k = d_v = 8

# Simulate input embeddings
X = np.random.randn(seq_len, d_model) * 0.5

# Learned projection matrices
W_Q = np.random.randn(d_model, d_k) * 0.3
W_K = np.random.randn(d_model, d_k) * 0.3
W_V = np.random.randn(d_model, d_v) * 0.3

# Compute Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention with causal mask
mask = causal_mask(seq_len)
output, weights = scaled_dot_product_attention(Q, K, V, mask=mask)

print("Input shape:", X.shape)
print("Q shape:", Q.shape)
print("K shape:", K.shape)
print("V shape:", V.shape)
print("Output shape:", output.shape)
print()
print("Attention weights (each row sums to 1):")
for i in range(seq_len):
    row = "  Token {}: [".format(i)
    row += ", ".join("{:.3f}".format(w) for w in weights[i])
    row += "]  sum={:.3f}".format(np.sum(weights[i]))
    print(row)

When you run this, you will see that each row of the attention weights sums to 1.0, and all weights for future positions (above the diagonal) are 0.000 due to the causal mask. The output has the same number of rows as the input (one per token) but with dimension d_v instead of d_model.

Causal Masking: Why the Model Can Only Look Backward

Causal masking is not just a technical detail; it is fundamental to how language models generate text. Let’s understand why it exists and what would go wrong without it.

The Generation Problem

When a language model generates text, it produces one token at a time. To generate the sentence “The cat sat down,” it first predicts “The,” then uses “The” to predict “cat,” then uses “The cat” to predict “sat,” and so on. At each step, the model can only see the tokens it has already generated. It cannot see future tokens because they do not exist yet.

During training, the model processes entire sequences at once for efficiency. But it must still learn to predict each token using only the tokens that came before it. If the model could see future tokens during training, it would learn to “cheat” by simply copying the next token from the future, and it would fail completely during generation when future tokens are unavailable.

The causal mask enforces this constraint during training. By setting future positions to negative infinity before softmax, the model is forced to make predictions using only past context, exactly as it will need to during generation.
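One way to see this constraint in action is to change a later token and confirm that the outputs for earlier positions do not move. The sketch below does this for a single attention head. The function name `causal_attention`, the toy sizes, and the random inputs are assumptions made for illustration.

```python
import numpy as np

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head causal self-attention (a minimal sketch)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(3)            # arbitrary seed
n, d_model, d_k = 5, 8, 4                 # toy sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out_before = causal_attention(X, W_Q, W_K, W_V)

# Replace the last token with a different random vector
X_changed = X.copy()
X_changed[-1] = rng.normal(size=d_model)
out_after = causal_attention(X_changed, W_Q, W_K, W_V)

# Positions 0..3 never see the last token, so their outputs are unchanged
print(np.allclose(out_before[:-1], out_after[:-1]))   # → True
print(np.allclose(out_before[-1], out_after[-1]))     # → False
```

If the mask were removed, the first check would fail: every position would depend on the last token, which is exactly the leak that causal masking prevents.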

What the Mask Looks Like in Practice

For a sequence of 5 tokens, the causal mask creates this attention pattern:

         Key positions
         0    1    2    3    4
Q pos 0: [CAN  ---  ---  ---  ---]
Q pos 1: [CAN  CAN  ---  ---  ---]
Q pos 2: [CAN  CAN  CAN  ---  ---]
Q pos 3: [CAN  CAN  CAN  CAN  ---]
Q pos 4: [CAN  CAN  CAN  CAN  CAN]

“CAN” means the query token can attend to the key token. “---” means the attention is blocked (the weight is zero after softmax).

This is a lower-triangular pattern. Each token can attend to itself and all tokens before it, but not to any token after it. The first token can only attend to itself. The last token can attend to everything.

Bidirectional vs. Causal Attention

Not all models use causal masking. BERT (Google, 2018) uses bidirectional attention, where every token can attend to every other token in both directions. This is possible because BERT is not a generative model; it is trained to fill in masked tokens within a sequence, not to generate text left-to-right. BERT can look at both the left and right context of a masked token.

However, all modern generative LLMs (GPT, LLaMA, Claude, DeepSeek, Gemini, Mistral) use causal masking because they generate text autoregressively, one token at a time from left to right.

Computational Cost: Why Long Contexts Are Expensive

The attention mechanism has a well-known computational bottleneck: its cost grows quadratically with the sequence length. Let’s understand exactly why and what the real numbers look like.

The Quadratic Problem

The core operation in attention is the matrix multiplication Q * K^T, which produces the [n x n] score matrix. For a sequence of n tokens with head dimension d_k:

Compute cost: Computing Q * K^T requires O(n^2 * d_k) floating-point operations. The subsequent softmax and multiplication by V add another O(n^2 * d_v) operations. Total: O(n^2 * d) per attention head, where d is the head dimension.
Memory cost: The attention score matrix has n^2 entries. For n = 1,000 tokens, that is 1 million entries. For n = 100,000 tokens, that is 10 billion entries. Each entry is a floating-point number (2 or 4 bytes), so the memory grows quadratically.

Source: The quadratic complexity of self-attention is O(n^2 * d_head) for the Q*K^T matrix multiplication, as described in Vaswani et al. (2017) and extensively analyzed in subsequent work on efficient attention.

Real Numbers

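Let’s compute the attention cost for different sequence lengths, assuming a single attention head with d_k = 128. Each of the n^2 score entries is a dot product of length d_k, and each multiply-add is counted as 2 FLOPs, so the FLOP count for Q * K^T is 2 * n^2 * d_k: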

Sequence Length     Score Matrix Size       FLOPs for Q*K^T   Memory for Scores (float16)
1,000 tokens        1,000 x 1,000 = 1M      256M              2 MB
10,000 tokens       10K x 10K = 100M        25.6B             200 MB
100,000 tokens      100K x 100K = 10B       2.56T             20 GB
1,000,000 tokens    1M x 1M = 1T            256T              2 TB

The jump from 1,000 to 1,000,000 tokens is a 1,000-fold increase in length, but it increases the cost by a factor of 1,000,000 (one million), because the cost grows with the square of the length. This is why processing a 1-million-token context is fundamentally more expensive than processing a 1,000-token context, and why techniques like FlashAttention, chunked attention, and sparse attention (covered in Chapter 20) are essential for long-context models.
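The figures in the table can be reproduced with a short calculation. This sketch assumes 2 FLOPs per multiply-add and 2 bytes per score (float16). The function name `attention_score_cost` is introduced here for illustration.

```python
def attention_score_cost(n, d_k=128, bytes_per_entry=2):
    """Cost of one head's Q * K^T for n tokens.

    Assumes 2 FLOPs per multiply-add and float16 (2-byte) scores.
    """
    entries = n * n                  # size of the [n x n] score matrix
    flops = 2 * entries * d_k        # one d_k-long dot product per entry
    memory_bytes = entries * bytes_per_entry
    return entries, flops, memory_bytes

for n in (1_000, 10_000, 100_000, 1_000_000):
    entries, flops, memory_bytes = attention_score_cost(n)
    print(f"{n:>9,} tokens: {entries:.2e} entries, "
          f"{flops:.2e} FLOPs, {memory_bytes / 1e6:,.0f} MB")
```

These numbers cover a single head in a single layer and only the score matrix itself. A full model multiplies them by the number of heads and layers.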

A real model like LLaMA 4 Maverick has 40 query attention heads across 48 layers. The total attention computation for a 1-million-token sequence would be astronomical if every layer used full global attention. This is exactly why LLaMA 4 uses the iRoPE architecture described in Chapter 6: only every 4th layer uses global attention, while the other layers use chunked local attention with a window of 8,192 tokens, dramatically reducing the quadratic cost.
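A rough back-of-the-envelope sketch shows the size of the saving. It assumes the simplest possible model of chunked attention: the sequence is split into non-overlapping 8,192-token chunks and each chunk forms its own score matrix. It counts score-matrix entries only and ignores the causal mask and everything else in the layer, so treat the ratios as order-of-magnitude estimates rather than measurements of LLaMA 4.

```python
def score_entries_global(n):
    """Score-matrix entries for full attention over n tokens."""
    return n * n

def score_entries_chunked(n, chunk=8_192):
    """Entries when attention is restricted to non-overlapping chunks.

    Assumption: each full chunk forms its own [chunk x chunk] score
    matrix, and a shorter final chunk forms a smaller square matrix.
    """
    full_chunks, remainder = divmod(n, chunk)
    return full_chunks * chunk * chunk + remainder * remainder

n = 1_000_000
layers = 48
global_layers = layers // 4            # every 4th layer, per the text
local_layers = layers - global_layers

all_global = layers * score_entries_global(n)
mixed = (global_layers * score_entries_global(n)
         + local_layers * score_entries_chunked(n))

print(f"reduction for one local layer: "
      f"{score_entries_global(n) / score_entries_chunked(n):.0f}x")
print(f"reduction for the whole stack: {all_global / mixed:.2f}x")
```

Under these assumptions a local layer is cheaper by roughly the ratio n / 8,192, but the whole-stack saving stays just under 4x, because one layer in four still pays the full quadratic cost.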

Why sqrt(d_k) Scaling Matters

The scaling factor 1/sqrt(d_k) is not just a mathematical convenience. Without it, the dot products between query and key vectors grow in magnitude as d_k increases. If each element of q and k is drawn independently from a distribution with mean 0 and variance 1, then the dot product q . k is a sum of d_k independent terms, each with mean 0 and variance 1, so it has mean 0 and variance d_k. For d_k = 128, the standard deviation of the dot product is sqrt(128) = 11.31.

Without scaling, the softmax would receive inputs with a standard deviation of 11.31. The softmax function is very sensitive to the magnitude of its inputs: large inputs produce very peaked distributions (one value close to 1, all others close to 0), which means the model would attend almost entirely to a single token, ignoring all others. This saturation, sometimes called “attention collapse,” leaves the softmax with very small gradients and makes training unstable.

Dividing by sqrt(d_k) normalizes the variance of the dot products back to approximately 1, keeping the softmax in a regime where it produces useful, spread-out distributions. This is a simple but critical design choice.
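The variance argument is easy to check empirically. The sketch below samples random query and key vectors under the stated assumption (independent entries with mean 0 and variance 1) and measures the spread of their dot products before and after scaling. The sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed
d_k = 128
num_pairs = 20_000                 # number of random (q, k) pairs to sample

# Independent entries with mean 0 and variance 1
q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

dots = np.sum(q * k, axis=1)       # one dot product per (q, k) pair
scaled = dots / np.sqrt(d_k)

print(f"std of raw dot products:    {dots.std():.2f}")    # close to sqrt(128)
print(f"std of scaled dot products: {scaled.std():.2f}")  # close to 1
```

Real query and key vectors in a trained model are not guaranteed to follow this distribution, so the sketch illustrates the motivation for the scaling rather than the exact statistics inside a model.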

Tracing Attention Through a Real Sentence

Let’s trace how attention works on a more meaningful sentence to build intuition about what the mechanism actually learns. Consider the sentence:

“The cat that I saw yesterday was sleeping on the mat”

After tokenization and embedding, each token has a vector. Let’s focus on what happens at the attention layer for a few interesting tokens:

“was” (token 6)

The token “was” is a verb that needs to agree with its subject. But the subject is “cat” (token 1), separated by the relative clause “that I saw yesterday.” The attention mechanism lets “was” directly attend to “cat” despite the intervening tokens.

In a trained model, the query vector for “was” would be similar to the key vector for “cat” (both encode subject-verb agreement information), producing a high attention score. The query for “was” would have low similarity with the keys for “that,” “I,” “saw,” and “yesterday,” because those tokens are not the subject.

The attention weight pattern for “was” might look something like:

"was" attends to:
  "The"       → 0.05  (low, just a determiner)
  "cat"       → 0.35  (high! this is the subject)
  "that"      → 0.05  (low)
  "I"         → 0.10  (moderate, another potential subject)
  "saw"       → 0.05  (low)
  "yesterday" → 0.05  (low)
  "was"       → 0.35  (high, self-attention is common)

By attending strongly to “cat,” the output vector for “was” incorporates information about the subject, which helps the model predict that the next token should be a present participle like “sleeping” (matching the past tense “was” + subject “cat”).

“mat” (token 10)

The token “mat” appears at the end of the sentence. Its attention pattern might look very different (tokens not listed receive negligible weight):

"mat" attends to:
  "The"       → 0.15  (the sentence-initial determiner, attached to "cat")
  "cat"       → 0.10  (the main subject)
  "sleeping"  → 0.10  (the action)
  "on"        → 0.20  (the preposition, very relevant)
  "the"       → 0.25  (the determiner immediately before "mat")
  "mat"       → 0.20  (self-attention)

Here, “mat” attends most strongly to the nearby tokens “the” and “on,” which form the prepositional phrase “on the mat.” This local context is most relevant for understanding what “mat” means in this sentence.

Key Insight: Different Tokens Attend to Different Things

This is the power of attention: each token can independently decide which other tokens are relevant to it. A verb looks for its subject. A pronoun looks for its antecedent. A preposition looks for its object. An adjective looks for the noun it modifies. These patterns are not programmed in; they emerge from training on billions of sentences.

In a real model with multiple attention heads (Chapter 8), different heads learn different types of relationships. One head might specialize in subject-verb agreement, another in coreference resolution (connecting pronouns to nouns), another in syntactic structure (connecting opening and closing brackets), and another in semantic similarity (connecting related concepts). Together, the heads capture a rich, multi-faceted understanding of the relationships between tokens.

Attention in the Full Model Pipeline

Let’s place attention in the context of the full model architecture. Here is what happens when you send a prompt to LLaMA 4 Maverick, focusing on the attention step:

Step 1: Tokenization (Chapter 4)
  "The weather in Tokyo is usually mild"
  --> [450, 9235, 304, 27856, 338, 6892, 24312]
  --> 7 tokens

Step 2: Embedding Lookup (Chapter 5)
  Each token ID --> row in embedding table (202,048 x 5,120)
  --> Matrix of shape [7 x 5,120]

Step 3: For each of the 48 Transformer layers:

  a) Layer Normalization (Chapter 10)
     Normalize each token's vector

  b) Attention (THIS CHAPTER)
     - Project input to Q, K, V using learned weight matrices
     - LLaMA 4 Maverick: 40 query heads, 8 KV heads, head_dim = 128
     - Q shape per head: [7 x 128]
     - K shape per head: [7 x 128]  (shared across 5 query heads)
     - V shape per head: [7 x 128]  (shared across 5 query heads)
     - Apply RoPE to Q and K before the dot product (Chapter 6)
     - Compute scores Q * K^T / sqrt(128) and apply the causal mask
     - Apply softmax and multiply by V
     - Concatenate all 40 heads: [7 x 5,120]
     - Project back to hidden_size with output matrix W_O

  c) Residual Connection (Chapter 10)
     Add the attention output to the input (skip connection)

  d) Feed-Forward Network (Chapter 9)
     Process each token independently

  e) Residual Connection (Chapter 10)
     Add the FFN output to the input

Step 4: Output Projection + Softmax
  --> 202,048 probabilities for the next token

The attention step (3b) is where tokens share information with each other. It is the only step in the Transformer where different tokens interact. The feed-forward network (3d) processes each token independently, without any cross-token communication. This means all of the model’s ability to understand context, resolve references, and maintain coherence comes from the attention mechanism.

Attention Dimensions in LLaMA 4 Maverick

Let’s compute the exact sizes of the weight matrices involved in attention for LLaMA 4 Maverick:

hidden_size: 5,120
num_attention_heads (query heads): 40
num_key_value_heads: 8
head_dim: 128

The model uses Grouped Query Attention (GQA), which we will cover in detail in Chapter 8. For now, the key point is that there are 40 query heads but only 8 key/value heads. Each KV head is shared by 5 query heads (40 / 8 = 5).
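As a preview, the sharing can be sketched as an index mapping from query heads to KV heads. The head counts and head_dim come from the configuration above. The contiguous grouping (query head h reads KV head h // 5), the 7-token sequence, and the random tensors are assumptions made for illustration.

```python
import numpy as np

num_query_heads, num_kv_heads, head_dim = 40, 8, 128   # values from the text
group_size = num_query_heads // num_kv_heads           # 5 query heads per KV head
n = 7                                                  # toy sequence length

rng = np.random.default_rng(0)                         # arbitrary seed
Q = rng.normal(size=(num_query_heads, n, head_dim))
K = rng.normal(size=(num_kv_heads, n, head_dim))

# Assumed grouping: query head h reads KV head h // group_size
kv_index = np.arange(num_query_heads) // group_size
K_expanded = K[kv_index]                               # [40, n, head_dim]

# Batched scores: one [n x n] matrix per query head
scores = Q @ K_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)

print(kv_index[:10])      # → [0 0 0 0 0 1 1 1 1 1]
print(scores.shape)       # → (40, 7, 7)
```

Only 8 distinct key tensors are stored. The expansion to 40 is just an indexing step, which is where the memory saving of GQA comes from.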

Weight matrix sizes:

W_Q: [5,120 x (40 * 128)] = [5,120 x 5,120] = 26,214,400 parameters
W_K: [5,120 x (8 * 128)] = [5,120 x 1,024] = 5,242,880 parameters
W_V: [5,120 x (8 * 128)] = [5,120 x 1,024] = 5,242,880 parameters
W_O: [(40 * 128) x 5,120] = [5,120 x 5,120] = 26,214,400 parameters

Total attention parameters per layer: 26.2M + 5.2M + 5.2M + 26.2M = 62.9 million parameters.

Across all 48 layers: 62.9M * 48 = approximately 3.0 billion parameters just for attention. This is a significant fraction of the 17 billion active parameters in the model.
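The same arithmetic, written out as a short script so the numbers can be rechecked or adapted to another configuration. It counts only the four projection matrices and assumes no bias terms.

```python
hidden_size = 5_120
num_query_heads = 40
num_kv_heads = 8
head_dim = 128
num_layers = 48

# Parameter counts for the four attention projections (no bias terms assumed)
w_q = hidden_size * num_query_heads * head_dim
w_k = hidden_size * num_kv_heads * head_dim
w_v = hidden_size * num_kv_heads * head_dim
w_o = num_query_heads * head_dim * hidden_size

per_layer = w_q + w_k + w_v + w_o
total = per_layer * num_layers

print(f"per layer:  {per_layer:,}")   # → per layer:  62,914,560
print(f"all layers: {total:,}")       # → all layers: 3,019,898,880
print(f"share of 17B active parameters: {total / 17e9:.1%}")
```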

Source: LLaMA 4 Maverick architecture from HuggingFace Transformers Llama4TextConfig and Ollama model metadata: vocab_size = 202,048, hidden_size = 5,120, num_attention_heads = 40, num_key_value_heads = 8, head_dim = 128, num_hidden_layers = 48.

Visualizing Attention Patterns

One of the most powerful ways to understand attention is to visualize the attention weight matrices. Let’s implement a visualization that shows which tokens attend to which:

import numpy as np
import matplotlib.pyplot as plt

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    output = weights @ V
    return output, weights

# Simulate a sentence with meaningful structure
tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
n = len(tokens)
d_k = d_v = 32

np.random.seed(7)
# Create embeddings with some structure
X = np.random.randn(n, 64) * 0.3
# Make "cat" and "sat" related (subject-verb) by blending their vectors
X[2] = 0.7 * X[1] + 0.3 * X[2]
# Make the two "the" tokens similar to each other
X[4] = 0.9 * X[0] + 0.1 * X[4]

# Simulate learned projections
W_Q = np.random.randn(64, d_k) * 0.2
W_K = np.random.randn(64, d_k) * 0.2
W_V = np.random.randn(64, d_v) * 0.2

Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Causal mask
mask = np.triu(np.ones((n, n), dtype=bool), k=1)

output, weights = scaled_dot_product_attention(Q, K, V, mask=mask)

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(weights, cmap="Blues", vmin=0, vmax=1)

ax.set_xticks(range(n))
ax.set_xticklabels(tokens, fontsize=11)
ax.set_yticks(range(n))
ax.set_yticklabels(tokens, fontsize=11)
ax.set_xlabel("Key (attending to)", fontsize=12)
ax.set_ylabel("Query (attending from)", fontsize=12)
ax.set_title("Attention Weights (Causal Masked)", fontsize=14)

# Add text annotations
for i in range(n):
    for j in range(n):
        val = weights[i, j]
        color = "white" if val > 0.5 else "black"
        ax.text(j, i, f"{val:.2f}", ha="center", va="center",
                fontsize=9, color=color)

plt.colorbar(im, ax=ax, label="Attention Weight")
plt.tight_layout()
plt.savefig("attention_weights.png", dpi=150)
plt.show()
print("Plot saved to attention_weights.png")

This visualization produces a heatmap where each cell (i, j) shows how much token i attends to token j. Every cell above the diagonal is zero because of the causal mask, and each row sums to 1. The first row is always 1.0 on the diagonal, since the first token can attend only to itself. Because the projection matrices here are random rather than learned, the remaining pattern is mostly noise and should not be read as linguistic structure.
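Those structural properties of a causal attention matrix can be verified numerically. The sketch below is self-contained and uses random queries and keys (an assumption for illustration, just like the listing above), so it checks only the shape of the pattern and says nothing about its meaning.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d_k = 7, 32
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

# Scaled scores with future positions masked out
scores = Q @ K.T / np.sqrt(d_k)
future = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(future, -1e9, scores)

# Numerically stable softmax over each row
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

rows_sum_to_one = np.allclose(weights.sum(axis=-1), 1.0)
future_is_zero = bool(np.all(weights[future] == 0.0))
first_row_is_self = bool(np.isclose(weights[0, 0], 1.0))

print("Each row sums to 1:           ", rows_sum_to_one)
print("No weight on future tokens:   ", future_is_zero)
print("First token attends to itself:", first_row_is_self)
```

All three checks should print True. The masked entries come out as exactly zero because exp of a number near -1e9 underflows to 0.0 in floating point.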

In a trained model, you would see much more structured patterns: verbs attending to subjects, pronouns attending to their antecedents, and closing punctuation attending to the beginning of the sentence.

Attention Is Not Understanding

It is important to be precise about what attention does and does not do. Attention is a mechanism for mixing information between tokens. It computes a weighted average of value vectors, where the weights are determined by query-key similarity. That is all it does mathematically.

Attention does not “understand” language in any human sense. It does not know that “cat” is an animal or that “sat” is a verb. Those semantic properties are encoded in the embedding vectors (Chapter 5) and refined by the feed-forward networks (Chapter 9). Attention’s role is to route information: it decides which tokens should share information with which other tokens, and how much.

The remarkable thing is that this simple routing mechanism, when combined with learned projections (W_Q, W_K, W_V) and stacked across many layers, is sufficient to capture extremely complex linguistic relationships. The model learns, through training on trillions of tokens, which query-key patterns are useful for predicting the next token. Subject-verb agreement, coreference resolution, long-range dependencies, syntactic structure: all of these emerge from the model learning the right projection matrices.

This is a recurring theme in deep learning: simple mechanisms, applied at scale with learned parameters, can produce behavior that appears intelligent. Attention is perhaps the clearest example of this principle.

The History of Attention

The attention mechanism did not appear out of nowhere. It has a rich history that is worth understanding.

2014: Attention for Machine Translation

Bahdanau, Cho, and Bengio introduced the attention mechanism for sequence-to-sequence models in their paper “Neural Machine Translation by Jointly Learning to Align and Translate.” Their model used an RNN encoder-decoder architecture, but added an attention mechanism that let the decoder look at all encoder hidden states when generating each output token, rather than relying on a single fixed-size context vector. This dramatically improved translation quality, especially for long sentences.

Source: Bahdanau, Cho, and Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv:1409.0473, September 2014. Published at ICLR 2015.

2015: Luong Attention

Luong, Pham, and Manning proposed simplified attention variants (global and local attention) that were easier to implement and slightly more effective. Their work established attention as a standard component in neural machine translation.

Source: Luong, Pham, and Manning, “Effective Approaches to Attention-based Neural Machine Translation,” arXiv:1508.04025, August 2015. Published at EMNLP 2015.

2016: Decomposable Attention

Parikh, Täckström, Das, and Uszkoreit proposed a simple neural architecture for natural language inference that used attention to decompose the problem into parallelizable subproblems, achieving state-of-the-art results without relying on any recurrence or word-order information. This was a key precursor to the Transformer. One of the authors, Jakob Uszkoreit, went on to propose that recurrence could be replaced entirely with self-attention, which became the central idea of the Transformer paper. (The Transformer paper’s own footnotes confirm: “Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea.”)

Source: Parikh et al., “A Decomposable Attention Model for Natural Language Inference,” arXiv:1606.01933, 2016. Published at EMNLP 2016.

2017: The Transformer

Vaswani et al. published “Attention Is All You Need,” introducing the Transformer architecture. The key innovation was replacing recurrence entirely with self-attention, processing all tokens in parallel. The paper introduced scaled dot-product attention, multi-head attention, and the encoder-decoder Transformer architecture. The title was deliberately provocative: it claimed that attention alone, without any recurrence or convolution, was sufficient for state-of-the-art sequence modeling.

Source: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017. arXiv:1706.03762.

The eight authors of the Transformer paper (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin) are listed as equal contributors, with the listing order randomized. The paper’s title was suggested by Llion Jones, inspired by the Beatles song “All You Need Is Love.” (“I’m British,” Jones later said. “It literally took five seconds of thought.”) The name “Transformer” was chosen by Jakob Uszkoreit; the idea was that the mechanism would “transform” the information it took in, and Uszkoreit also had fond childhood memories of the Hasbro Transformer toys. As of 2025, the paper has been cited over 173,000 times according to Google Scholar, placing it among the top ten most-cited papers of the 21st century. The count continues to grow rapidly.

Source for title origin and Transformer naming: Levy, S., “8 Google Employees Invented Modern AI. Here’s the Inside Story,” Wired, March 2024. Source for citation count: Wikipedia, “Attention Is All You Need,” as of 2025.

The Transformer architecture introduced in “Attention Is All You Need” is the foundation of every major LLM in use today.

Key Takeaways

Attention is the mechanism that lets each token in a sequence look at every other token and decide which ones are relevant. It is the core operation that makes Transformers work, and it is the only step in the Transformer where different tokens share information with each other.
Attention operates on three vectors per token: the query (what am I looking for?), the key (what do I contain?), and the value (what information do I provide?). These are computed from the input using learned weight matrices W_Q, W_K, and W_V.
The full attention formula is: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V. This computes a weighted sum of value vectors, where the weights come from the similarity between queries and keys, scaled by 1/sqrt(d_k) to prevent softmax saturation.
Causal masking prevents each token from attending to future tokens. This is essential for autoregressive generation: during text generation, future tokens do not exist yet, so the model must learn to predict using only past context. The mask sets future positions to negative infinity before softmax, making their attention weights zero.
The computational cost of attention is O(n^2 * d) per head, where n is the sequence length and d is the head dimension. This quadratic scaling is why long context windows are expensive: doubling the sequence length quadruples the attention cost. For a 1-million-token sequence, the score matrix alone has 1 trillion entries.
In LLaMA 4 Maverick, each attention layer has 40 query heads and 8 key/value heads, each with dimension 128. The total attention parameters per layer are approximately 62.9 million, and across all 48 layers, attention accounts for about 3 billion of the model’s 17 billion active parameters.
Attention does not “understand” language. It is a routing mechanism that decides which tokens should share information. The remarkable linguistic capabilities of LLMs emerge from learning the right projection matrices (W_Q, W_K, W_V) through training on trillions of tokens.
The attention mechanism was introduced for machine translation by Bahdanau et al. (2014). Vaswani et al. then made self-attention the sole mechanism for mixing tokens in the 2017 paper “Attention Is All You Need,” which introduced the Transformer architecture.
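The quadratic-cost takeaway above can be made concrete with a little arithmetic. This sketch counts the entries of a single head's score matrix and estimates its size if it were stored in full. The 2 bytes per entry is an assumption (a 16-bit float format), and the function names are made up for this example.

```python
def score_matrix_entries(n):
    """Number of query-key scores for one head on a sequence of n tokens."""
    return n * n

def score_matrix_gib(n, bytes_per_entry=2):
    """Size in GiB if the full matrix were stored (assumes 16-bit entries)."""
    return score_matrix_entries(n) * bytes_per_entry / 2**30

for n in (1_000, 2_000, 1_000_000):
    print(f"n = {n:>9,}: {score_matrix_entries(n):>21,} entries, "
          f"{score_matrix_gib(n):,.3f} GiB per head")

# Doubling the sequence length quadruples the number of scores
ratio = score_matrix_entries(2_000) / score_matrix_entries(1_000)
print("Growth factor when n doubles:", ratio)  # 4.0
```

The figures are per head and per layer, so a real model multiplies them further. This is only a back-of-the-envelope estimate of the naive cost, not a description of how any particular implementation stores its scores.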

What’s Next

You now understand how a single attention head works: computing queries, keys, and values, comparing them with scaled dot products, applying causal masking, and producing context-aware output vectors. But a single attention head can only capture one type of relationship at a time. In Chapter 8, we will see how multi-head attention runs many attention heads in parallel, each learning different types of relationships (syntax, semantics, coreference, and more), and how modern models use Grouped Query Attention (GQA) to share key/value heads across multiple query heads for efficiency.

Chapter 8. Multi-Head Attention, Parallel Understanding