Chapter 10. Layer Normalization, Residual Connections, and the Full Transformer Block

In Chapters 8 and 9, you learned about the two core operations inside a Transformer layer: multi-head attention (which gathers information from other tokens) and the feed-forward network (which processes each token independently). But if you simply stacked these operations one after another in a deep network, training would fail. The numbers flowing through the network would either explode to enormous values or collapse to near-zero, and the gradients used for learning would vanish or blow up. Two critical mechanisms prevent this: layer normalization, which keeps the numbers in a stable range, and residual connections, which create shortcuts that let information and gradients flow freely through dozens or hundreds of layers. Together with attention and the FFN, these components form the complete Transformer block, the repeating unit that is stacked to build every modern language model.


Why Deep Networks Need Help

In Chapter 3, you learned that neural networks learn by backpropagation: computing gradients of the loss function with respect to each weight, then nudging the weights in the direction that reduces the loss. This works well for shallow networks with a few layers. But as networks get deeper, two problems emerge that make training increasingly difficult.

The Exploding and Vanishing Gradient Problem

During backpropagation, gradients are computed by multiplying chains of partial derivatives, one for each layer the signal passes through. If each layer multiplies the gradient by a factor slightly greater than 1 (say, 1.1), then after 96 layers (the depth of GPT-3), the gradient is multiplied by 1.1^96, which is approximately 9,412. The gradient has exploded. Conversely, if each layer multiplies by a factor slightly less than 1 (say, 0.9), then after 96 layers the gradient is multiplied by 0.9^96, which is approximately 0.00004. The gradient has effectively vanished.
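The compounding is easy to check numerically. The sketch below assumes, as the paragraph does, that every layer scales the gradient by the same constant factor; real layers do not behave this uniformly, and the function name is only illustrative.

```python
def compounded_gradient(per_layer_factor: float, num_layers: int) -> float:
    """Gradient scale after num_layers layers, assuming each layer
    multiplies the gradient by the same factor (a simplification)."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

# 96 layers, the depth of GPT-3
exploding = compounded_gradient(1.1, 96)
vanishing = compounded_gradient(0.9, 96)

print(f"1.1^96 = {exploding:,.0f}")
print(f"0.9^96 = {vanishing:.2e}")
```

A 10% deviation per layer in either direction is enough to move the gradient by four orders of magnitude over 96 layers.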

This is not a hypothetical problem. It is the central challenge of training deep neural networks, and it was the primary reason that networks deeper than about 20 layers were essentially untrainable before 2015.

The Activation Magnitude Problem

Even in the forward pass (before backpropagation), deep networks face instability. Each layer transforms its input through matrix multiplications and nonlinear activations. If the output of each layer is slightly larger than its input, the values grow exponentially as they pass through dozens of layers. If the output is slightly smaller, the values shrink toward zero. Either way, the numbers eventually leave the range where floating-point arithmetic is numerically stable, and the network produces garbage.

For a concrete example: in LLaMA 3 8B, which has 32 Transformer layers, each token’s vector passes through 32 attention operations and 32 FFN operations. If each layer increased the magnitude of the vector by just 5%, the vector’s magnitude after all 32 layers would be 1.05^32, which is approximately 4.8x larger than the input. That might seem manageable, but in a model with 96 layers (like GPT-3), the same 5% growth per layer would produce 1.05^96, which is approximately 108x. And in practice, the growth can be much more than 5% per layer without normalization.
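The same growth can be simulated with an actual vector. This is a toy sketch, not a real Transformer layer: each "layer" is assumed to be a random orthogonal matrix (which preserves length) scaled by 1.05, so the magnitude grows by exactly 5% per step.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 64

# A random orthogonal matrix preserves vector length, so scaling it by 1.05
# gives a toy layer that grows the magnitude by exactly 5%.
orthogonal, _ = np.linalg.qr(rng.standard_normal((hidden_size, hidden_size)))
layer = 1.05 * orthogonal

def magnitude_growth(num_layers: int) -> float:
    """Ratio of output norm to input norm after num_layers toy layers."""
    x = rng.standard_normal(hidden_size)
    start = np.linalg.norm(x)
    for _ in range(num_layers):
        x = layer @ x
    return np.linalg.norm(x) / start

growth_32 = magnitude_growth(32)
growth_96 = magnitude_growth(96)
print(f"32 layers: {growth_32:.1f}x   96 layers: {growth_96:.1f}x")
```

Nothing in the forward pass corrects the drift, which is the job normalization takes on in the next section.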

The solution to both problems involves two complementary techniques: normalization (to keep values in a stable range) and residual connections (to provide gradient shortcuts that bypass the chain-multiplication problem).


Layer Normalization: Keeping Numbers Stable

What Normalization Does

The core idea of normalization is simple: after each major computation, rescale the numbers so they have a consistent magnitude. Instead of letting values drift higher and higher (or lower and lower) as they pass through layers, normalization resets them to a standard range after each step.

The most common form of normalization in deep learning is batch normalization, introduced by Ioffe and Szegedy (2015, arXiv:1502.03167). Batch normalization computes the mean and variance of activations across a batch of training examples, then normalizes each activation to have zero mean and unit variance. It was a breakthrough for training deep convolutional networks (like those used for image recognition), but it has a fundamental limitation for language models: it depends on the batch dimension.

In language modeling, different sequences in a batch have different lengths and different content. Computing statistics across the batch mixes information from unrelated sequences, which is problematic. More importantly, during inference (when generating text), the model often processes a single sequence at a time, so there is no meaningful batch to compute statistics over. Batch normalization then has to fall back on running averages collected during training, which may not match the sequence at hand.

Source: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” ICML 2015 (arXiv:1502.03167).

Layer Normalization: Normalizing Within Each Token

Layer normalization (LayerNorm), introduced by Ba, Kiros, and Hinton (2016, arXiv:1607.06450), solves this by computing statistics within each individual token’s vector, rather than across the batch. For a single token with a vector x of dimension d:

mean = (1/d) * sum(x_i for i in 1..d)
variance = (1/d) * sum((x_i - mean)^2 for i in 1..d)
LayerNorm(x) = gamma * (x - mean) / sqrt(variance + epsilon) + beta

Where:

  • mean and variance are computed across the d dimensions of a single token’s vector
  • gamma (gain) and beta (bias) are learnable parameters of shape [d], one per dimension
  • epsilon is a small constant (typically 1e-5 or 1e-6) to prevent division by zero

The key property: LayerNorm operates on each token independently. It does not look at other tokens in the sequence or other examples in the batch. This makes it well-suited for autoregressive language models, where each token must be processed independently during generation.

After LayerNorm, each token’s vector has approximately zero mean and unit variance across its dimensions. The learnable parameters gamma and beta allow the model to shift and scale the normalized values, so the normalization does not permanently constrain the representation. The model can learn to undo the normalization if that is useful, but the default state is a well-behaved, stable range of values.
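The three formulas above translate almost line for line into NumPy. This is a minimal sketch with gamma set to ones and beta to zeros (their usual initial values); the second half checks the independence property by changing the other tokens and confirming that the first token's output does not move.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis: one mean and variance per token."""
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)  # population variance (1/d)
    return gamma * (x - mean) / np.sqrt(variance + eps) + beta

rng = np.random.default_rng(0)
d = 8
tokens = rng.standard_normal((3, d)) * 10.0 + 5.0  # badly scaled input
gamma, beta = np.ones(d), np.zeros(d)

normed = layer_norm(tokens, gamma, beta)
print("per-token mean:    ", normed.mean(axis=-1).round(6))
print("per-token variance:", normed.var(axis=-1).round(6))

# Replace the other tokens entirely: token 0 is normalized exactly as before.
altered = tokens.copy()
altered[1:] = 1000.0 * rng.standard_normal((2, d))
normed_altered = layer_norm(altered, gamma, beta)
print("token 0 unchanged: ", np.allclose(normed[0], normed_altered[0]))
```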

Source: Ba, Kiros, and Hinton, “Layer Normalization,” arXiv:1607.06450, July 2016.

RMSNorm: The Modern Standard

While LayerNorm was used in the original Transformer and in models like BERT and GPT-2, modern LLMs have largely switched to a simpler variant called RMSNorm (Root Mean Square Layer Normalization), introduced by Zhang and Sennrich (2019).

RMSNorm removes the mean-centering step from LayerNorm. Instead of subtracting the mean and dividing by the standard deviation, it simply divides by the root mean square (RMS) of the values:

RMS(x) = sqrt((1/d) * sum(x_i^2 for i in 1..d))
RMSNorm(x) = gamma * x / RMS(x)

Notice two simplifications compared to LayerNorm:

  1. No mean subtraction: RMSNorm does not compute or subtract the mean. It only normalizes by the magnitude (RMS) of the vector.
  2. No beta (bias) parameter: RMSNorm typically uses only the gain parameter gamma, not the bias beta.

Zhang and Sennrich (2019) showed that the mean-centering step in LayerNorm is often unnecessary. The re-scaling (dividing by the magnitude) is what provides the training stability benefits. Removing the mean computation makes RMSNorm computationally cheaper, because it avoids one pass over the data to compute the mean and another to subtract it.
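To see what dropping the mean subtraction changes, the sketch below compares RMSNorm with a bias-free LayerNorm on a vector that has a clearly nonzero mean. The helper names and the epsilon inside the square root are choices made for this illustration, not taken from the paper.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    """Divide by the root mean square; no mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def layer_norm_no_bias(x, gamma, eps=1e-5):
    """LayerNorm with beta fixed at zero, for a like-for-like comparison."""
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(variance + eps)

rng = np.random.default_rng(0)
d = 16
gamma = np.ones(d)
x = rng.standard_normal(d) + 3.0  # vector with a nonzero mean

out_rms = rms_norm(x, gamma)
out_ln = layer_norm_no_bias(x, gamma)
print("RMSNorm   mean:", out_rms.mean().round(3), " RMS:", np.sqrt(np.mean(out_rms ** 2)).round(3))
print("LayerNorm mean:", out_ln.mean().round(3), " RMS:", np.sqrt(np.mean(out_ln ** 2)).round(3))

# On input that is already centered, the two operations coincide.
x_centered = x - x.mean()
same_when_centered = np.allclose(rms_norm(x_centered, gamma),
                                 layer_norm_no_bias(x_centered, gamma))
print("identical on centered input:", same_when_centered)
```

Both outputs have unit RMS; only LayerNorm also forces the mean to zero.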

The computational savings are modest for a single normalization operation, but they add up. In LLaMA 3 8B, there are 2 RMSNorm operations per layer (one before attention, one before the FFN) across 32 layers, plus a final RMSNorm before the output projection. That is 65 RMSNorm operations per forward pass. For LLaMA 4 Maverick with 48 layers, it is 97 RMSNorm operations. Across billions of tokens during training, the savings from using RMSNorm instead of LayerNorm are significant.

Source: Zhang and Sennrich, “Root Mean Square Layer Normalization,” NeurIPS 2019 (arXiv:1910.07467). RMSNorm achieves comparable performance to LayerNorm while being computationally simpler.

Which Models Use What

Model                  Normalization           Year
Original Transformer   LayerNorm (post-norm)   2017
BERT                   LayerNorm (post-norm)   2018
GPT-2                  LayerNorm (pre-norm)    2019
GPT-3                  LayerNorm (pre-norm)    2020
LLaMA 2                RMSNorm (pre-norm)      2023
Mistral 7B             RMSNorm (pre-norm)      2023
LLaMA 3 8B             RMSNorm (pre-norm)      2024
DeepSeek-V3            RMSNorm (pre-norm)      2024
LLaMA 4 Maverick       RMSNorm (pre-norm)      2025
LLaMA 4 Scout          RMSNorm (pre-norm)      2025

The trend is clear: every model in this table released since 2023 uses RMSNorm, as do most other major open-weight LLMs from the same period. The “(pre-norm)” and “(post-norm)” labels refer to where the normalization is placed relative to the attention and FFN operations, which we will cover next.

Sources: Original Transformer from Vaswani et al. (2017). GPT-2 switched to pre-norm (Radford et al., 2019). LLaMA models use RMSNorm per Meta’s technical reports and HuggingFace Transformers Llama4TextConfig. Mistral 7B uses RMSNorm per its configuration. DeepSeek-V3 uses RMSNorm per its technical report (arXiv:2412.19437). LLaMA 4 uses RMSNorm with rms_norm_eps = 1e-5 per HuggingFace Transformers Llama4TextConfig.


Pre-Norm vs. Post-Norm: Where You Normalize Matters

The placement of normalization relative to the attention and FFN operations turns out to be critically important for training stability. There are two main approaches:

Post-Norm (Original Transformer)

In the original Transformer (Vaswani et al., 2017), normalization is applied after the residual addition:

output = LayerNorm(x + Sublayer(x))

Where Sublayer is either the attention operation or the FFN. The sequence is:

  1. Compute the sublayer output: Sublayer(x)
  2. Add the residual: x + Sublayer(x)
  3. Normalize the sum: LayerNorm(x + Sublayer(x))

This is called post-norm because the normalization comes after the sublayer and the residual addition.

Pre-Norm (Modern Standard)

In the pre-norm configuration, normalization is applied before the sublayer:

output = x + Sublayer(LayerNorm(x))

The sequence is:

  1. Normalize the input: LayerNorm(x)
  2. Compute the sublayer output: Sublayer(LayerNorm(x))
  3. Add the residual: x + Sublayer(LayerNorm(x))

This is called pre-norm because the normalization comes before the sublayer.
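The two placements differ by a single line of code. In this sketch the sublayer is a hypothetical stand-in (a small random projection followed by tanh), and the LayerNorm has no learnable parameters; both are simplifications chosen to keep the comparison short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable parameters (gamma = 1, beta = 0)."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + eps)

def post_norm(x, sublayer):
    # Original Transformer: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # Modern placement: normalize only the sublayer's input.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) * 0.1

def toy_sublayer(v):
    """Hypothetical stand-in for attention or the FFN."""
    return np.tanh(v @ W)

x = rng.standard_normal((4, d)) * 20.0  # large-magnitude input
post_out = post_norm(x, toy_sublayer)
pre_out = pre_norm(x, toy_sublayer)

print("post-norm per-token variance:", post_out.var(axis=-1).round(3))
print("pre-norm max change to x:    ", np.abs(pre_out - x).max().round(3))
```

Post-norm rescales everything, including the residual path, while pre-norm leaves x itself untouched and only adds the sublayer's output to it.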

Why Pre-Norm Won

Xiong et al. (2020) published a detailed analysis titled “On Layer Normalization in the Transformer Architecture” (ICML 2020) that explained why pre-norm is more stable for training. Their key findings:

  1. In post-norm Transformers, the expected gradients of parameters near the output layer are large at initialization. This means the model needs a careful learning rate warm-up phase at the start of training: the learning rate must start very small and gradually increase. Without this warm-up, training is unstable and often diverges.

  2. In pre-norm Transformers, the gradients are well-behaved at initialization. The warm-up stage can be removed entirely, and training is stable from the start. This simplifies the training process and reduces the amount of hyperparameter tuning needed.

The intuitive reason: in pre-norm, the residual connection creates a direct path from the input to the output that bypasses the normalization. This means the gradient can flow directly through the residual path without being distorted by the normalization operation. In post-norm, the normalization sits on the main path, and the gradient must pass through it, which can amplify or suppress gradients in problematic ways.

Source: Xiong et al., “On Layer Normalization in the Transformer Architecture,” ICML 2020 (arXiv:2002.04745). Showed that pre-norm Transformers have well-behaved gradients at initialization and can be trained without learning rate warm-up.

GPT-2 (Radford et al., 2019) was one of the first major models to adopt pre-norm, and most large decoder-only language models since have followed suit. The combination of pre-norm placement with RMSNorm has become the de facto standard in modern LLMs.

The Complete Pre-Norm Block

With pre-norm RMSNorm, a single Transformer layer looks like this:

# Attention sub-block
h = x + Attention(RMSNorm(x))

# FFN sub-block
output = h + FFN(RMSNorm(h))

Each sub-block has the same structure: normalize, compute, add residual. The normalization ensures the input to each sublayer is well-scaled, and the residual connection ensures the output preserves the original information plus whatever the sublayer computed.


Residual Connections: The Gradient Highway

The Problem Residual Connections Solve

Even with normalization, deep networks face a fundamental challenge: information and gradients must travel through many layers. In a 48-layer model like LLaMA 4 Maverick, a gradient signal from the loss function must travel backward through 48 attention operations and 48 FFN operations to reach the first layer. At each step, the gradient is transformed by the layer’s weights, and these transformations can distort or diminish the signal.

Residual connections (also called skip connections) solve this by providing a direct path that bypasses each layer’s computation. Instead of computing:

output = Layer(input)

A residual connection computes:

output = input + Layer(input)

The “input” term passes through unchanged, creating a shortcut that the gradient can flow through without any transformation. The “Layer(input)” term adds whatever the layer computed on top of the original input. The layer only needs to learn the residual: the difference between the desired output and the input. This is typically easier to learn than the full transformation from scratch.

Origin: ResNet

Residual connections were introduced by He et al. in their landmark paper “Deep Residual Learning for Image Recognition” (arXiv:1512.03385, December 2015; published at CVPR 2016). They demonstrated that residual connections enabled training of networks with 152 layers, which was 8x deeper than VGG networks (about 19 layers). An ensemble of their residual networks won first place in the ILSVRC 2015 image classification competition with 3.57% top-5 error on ImageNet.

The key insight from He et al.: without residual connections, adding more layers to a network actually made it perform worse (the “degradation problem”). A 56-layer network performed worse than a 20-layer network, not because of overfitting, but because the optimization process could not effectively train the deeper network. Residual connections solved this by making it easy for additional layers to learn the identity function (just output the input unchanged) if they have nothing useful to add. With residual connections, a deeper network can in principle always match a shallower one, because the extra layers can fall back to the identity.

Source: He et al., “Deep Residual Learning for Image Recognition,” CVPR 2016 (arXiv:1512.03385). Won 1st place on ILSVRC 2015 classification with 3.57% top-5 error using an ensemble of residual networks, with single models up to 152 layers deep.

How Residual Connections Help Gradients

The mathematical reason residual connections help is straightforward. Consider a simple chain of layers without residual connections:

y = f_3(f_2(f_1(x)))

The gradient of y with respect to x is:

dy/dx = f_3' * f_2' * f_1'

This is a product of three terms. If any of them is small, the entire gradient shrinks. If any is large, the entire gradient grows. With 48 or 96 layers, this product of many terms is extremely unstable.

Now consider the same chain with residual connections:

y_1 = x + f_1(x)
y_2 = y_1 + f_2(y_1)
y_3 = y_2 + f_3(y_2)

The gradient of y_3 with respect to x includes a direct path:

dy_3/dx = 1 + (terms involving f_1', f_2', f_3')

The “1” term is the gradient flowing directly through the residual connections, bypassing all three layers. Written out in full, the gradient is the product (1 + f_3') * (1 + f_2') * (1 + f_1'), and expanding that product gives the 1 plus every combination of the layer derivatives. The identity term is always exactly 1, regardless of what the layers compute. If the layer gradients f_1', f_2', f_3' are all very small, the total gradient stays close to 1 instead of collapsing toward zero as the plain product does.

In a real Transformer with 48 layers there are 96 sublayers, and a gradient path can either pass through or skip each one, so the expanded product contains 2^96 distinct paths. The gradient at any layer is the sum of contributions from all of these paths, and the direct path (through all residual connections, bypassing all sublayers) always contributes the identity. This is why residual connections are sometimes called a “gradient highway”: they provide an unobstructed path for gradients to flow from the loss all the way back to the first layer.
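A finite-difference check makes the contrast concrete. The sketch uses deliberately trivial scalar layers, f_i(x) = a_i * x with a_i = 0.01, so that each layer derivative is known exactly; these toy layers are an assumption for illustration only.

```python
def make_layers(slopes):
    # Each toy "layer" is f_i(x) = a_i * x, so its derivative is a_i.
    return [lambda x, a=a: a * x for a in slopes]

def plain_chain(x, layers):
    for f in layers:
        x = f(x)          # y = f_3(f_2(f_1(x)))
    return x

def residual_chain(x, layers):
    for f in layers:
        x = x + f(x)      # y_i = y_{i-1} + f_i(y_{i-1})
    return x

def numerical_gradient(fn, x, h=1e-6):
    """Central finite difference approximation of d fn / dx."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

layers = make_layers([0.01, 0.01, 0.01])
grad_plain = numerical_gradient(lambda x: plain_chain(x, layers), 1.0)
grad_residual = numerical_gradient(lambda x: residual_chain(x, layers), 1.0)

print(f"without residuals: {grad_plain:.2e}")    # product of the three slopes
print(f"with residuals:    {grad_residual:.6f}") # product of (1 + slope) terms
```

With three small slopes the plain chain's gradient is 0.01^3, while the residual chain's gradient is 1.01^3, which stays close to 1.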

Residual Connections in the Transformer

The original Transformer uses residual connections around both the attention sublayer and the FFN sublayer. In the pre-norm configuration used by modern models:

# Residual around attention
h = x + Attention(RMSNorm(x))
      ^                        
      |__ this is the residual connection: add the original input

# Residual around FFN
output = h + FFN(RMSNorm(h))
           ^
           |__ another residual connection

Each Transformer layer has two residual connections: one around the attention operation and one around the FFN. In LLaMA 4 Maverick with 48 layers, that is 96 residual connections in total. Each one provides a direct path for information and gradients to flow through.

The Residual Stream

A useful way to think about residual connections in a Transformer is the concept of the residual stream, a framing popularized by Elhage et al. (2021) in their work on mechanistic interpretability at Anthropic. The initial token embedding creates a vector (the “stream”) that flows through the entire model. At each layer, the attention and FFN operations read from this stream and write their outputs back to it by addition.

stream = embedding(token)                    # initial vector

for each layer:
    stream = stream + Attention(RMSNorm(stream))   # attention reads and writes
    stream = stream + FFN(RMSNorm(stream))         # FFN reads and writes

output = RMSNorm(stream)                     # final normalization
logits = output @ W_output                   # project to vocabulary

The stream accumulates information as it passes through layers. Early layers might add basic syntactic information. Middle layers might add semantic and factual information. Later layers might add task-specific reasoning. Each layer’s contribution is added to the stream, and the final output is the sum of the original embedding plus all the contributions from all layers.

This additive structure has an important consequence: the model can, in principle, learn to “skip” layers that are not useful for a particular input. If a layer’s attention and FFN outputs are near zero for a given token, the residual connection passes the token’s vector through essentially unchanged. The model does not need to use every layer for every token.
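The additive structure can be verified directly: if every write to the stream is recorded, the final stream equals the embedding plus the sum of all recorded writes. The attention and FFN below are hypothetical stand-ins (one small random matrix each), since only the bookkeeping matters here.

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    """RMSNorm with gamma fixed at 1."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
hidden_size, num_layers = 16, 4

# Hypothetical stand-ins for the attention and FFN sublayers of each layer.
attn_weights = [rng.standard_normal((hidden_size, hidden_size)) * 0.05
                for _ in range(num_layers)]
ffn_weights = [rng.standard_normal((hidden_size, hidden_size)) * 0.05
               for _ in range(num_layers)]

embedding = rng.standard_normal(hidden_size)
stream = embedding.copy()
contributions = []  # every vector written to the stream

for W_attn, W_ffn in zip(attn_weights, ffn_weights):
    attn_write = rms_norm(stream) @ W_attn           # read, then write
    stream = stream + attn_write
    contributions.append(attn_write)

    ffn_write = np.tanh(rms_norm(stream) @ W_ffn)    # read, then write
    stream = stream + ffn_write
    contributions.append(ffn_write)

reconstructed = embedding + np.sum(contributions, axis=0)
print("writes recorded:", len(contributions))
print("stream == embedding + sum of writes:", np.allclose(stream, reconstructed))
```

A sublayer whose write is all zeros would leave the stream unchanged, which is the "skipping" behavior described above.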

Source: Elhage et al., “A Mathematical Framework for Transformer Circuits,” Anthropic, 2021 (transformer-circuits.pub). Introduced the “residual stream” framing for understanding how information flows through Transformer layers.


The Complete Transformer Block

Now we can put all four components together to form the complete Transformer block. This is the fundamental repeating unit of every modern language model.

The Block Structure

Using the pre-norm RMSNorm configuration used by the modern LLMs covered in this chapter:

Input: x (shape [seq_len, hidden_size])

Step 1: Pre-attention normalization
  x_norm = RMSNorm(x)

Step 2: Multi-head attention (Chapter 8)
  attn_out = MultiHeadAttention(x_norm)

Step 3: Residual connection
  h = x + attn_out

Step 4: Pre-FFN normalization
  h_norm = RMSNorm(h)

Step 5: Feed-forward network (Chapter 9)
  ffn_out = SwiGLU_FFN(h_norm)

Step 6: Residual connection
  output = h + ffn_out

Output: output (shape [seq_len, hidden_size])

The output has the same shape as the input. This is essential because it allows blocks to be stacked: the output of one block becomes the input to the next. The hidden_size dimension remains constant throughout the entire model, from the first layer to the last.

What Each Component Contributes

Component                    Purpose                                   Parameters (LLaMA 3 8B)
RMSNorm (pre-attention)      Stabilize input to attention              4,096 (gamma only)
Multi-Head Attention (GQA)   Gather information from other tokens      41,943,040
RMSNorm (pre-FFN)            Stabilize input to FFN                    4,096 (gamma only)
SwiGLU FFN                   Process each token independently          176,160,768
Residual connections         Gradient flow, information preservation   0 (no parameters)

Total per block: approximately 218.1 million parameters.

Notice that the two RMSNorm layers contribute only 8,192 parameters per block (4,096 each), which is negligible compared to the attention and FFN parameters. The residual connections add zero parameters; they are purely structural. The vast majority of each block’s parameters are in the FFN (about 80.8%, as computed in Chapter 9) and the attention mechanism (about 19.2%).
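The per-block figures can be reproduced from the configuration values listed in the source note. This sketch assumes the attention and FFN projections have no bias terms, which is what the parameter counts in the table imply.

```python
# LLaMA 3 8B configuration values
hidden_size = 4096
intermediate_size = 14336
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128

# Attention with GQA: full-width Q and O, narrower K and V (assumes no biases).
q_params = hidden_size * num_attention_heads * head_dim
k_params = hidden_size * num_key_value_heads * head_dim
v_params = hidden_size * num_key_value_heads * head_dim
o_params = num_attention_heads * head_dim * hidden_size
attention_params = q_params + k_params + v_params + o_params

# SwiGLU FFN: gate, up, and down projections.
ffn_params = 3 * hidden_size * intermediate_size

# Two RMSNorm layers with one gamma vector each.
norm_params = 2 * hidden_size

block_params = attention_params + ffn_params + norm_params

print(f"attention: {attention_params:>12,}  ({attention_params / block_params:.1%})")
print(f"FFN:       {ffn_params:>12,}  ({ffn_params / block_params:.1%})")
print(f"norms:     {norm_params:>12,}")
print(f"block:     {block_params:>12,}")
```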

Source: LLaMA 3 8B architecture from Meta (April 18, 2024): hidden_size = 4,096, intermediate_size = 14,336, num_attention_heads = 32, num_key_value_heads = 8, head_dim = 128, num_hidden_layers = 32.


Stacking Blocks: From One Layer to a Hundred

A single Transformer block performs one round of attention (inter-token communication) followed by one round of FFN processing (per-token computation). This is useful but limited. A single attention operation can only capture direct relationships between tokens. A single FFN can only perform a relatively simple transformation. To build the deep understanding needed for language modeling, models stack many blocks on top of each other.

How Many Layers Do Real Models Use?

Model                  Layers              hidden_size   Total Parameters    Year
Original Transformer   6 + 6 (enc + dec)   512           ~65M                2017
GPT-2 (small)          12                  768           117M                2019
GPT-2 (XL)             48                  1,600         1.5B                2019
GPT-3                  96                  12,288        175B                2020
LLaMA 3 8B             32                  4,096         8B                  2024
Mistral 7B             32                  4,096         7.3B                2023
DeepSeek-V3            61                  7,168         671B (37B active)   2024
LLaMA 4 Scout          48                  5,120         109B (17B active)   2025
LLaMA 4 Maverick       48                  5,120         400B (17B active)   2025

From 2017 to 2020, models grew both wider (larger hidden_size) and deeper (more layers), culminating in GPT-3 with 96 layers and a hidden dimension of 12,288. More recent models have not simply continued that trend. MoE models like LLaMA 4 use 48 layers with a hidden dimension of 5,120 and add capacity through many expert FFN blocks per layer instead.

Sources: Original Transformer from Vaswani et al. (2017), 6 encoder + 6 decoder layers, d_model = 512. GPT-2 from Radford et al. (2019). GPT-3 from Brown et al. (2020), 96 layers, d_model = 12,288, 96 attention heads. LLaMA 3 8B from Meta (April 2024), 32 layers, hidden_size = 4,096. Mistral 7B from Mistral AI (September 2023), 32 layers, hidden_size = 4,096. DeepSeek-V3 from technical report (arXiv:2412.19437), 61 layers, hidden_size = 7,168, 671B total / 37B active. LLaMA 4 from HuggingFace Transformers Llama4TextConfig: 48 layers, hidden_size = 5,120.

The Full Model Architecture

A complete Transformer language model is not just a stack of blocks. It also includes an embedding layer at the input, a final normalization layer, and an output projection. Here is the full architecture for LLaMA 4 Maverick:

1. Token Embedding (Chapter 5)
   Input: token IDs, shape [seq_len]
   Embedding table: [202,048 x 5,120]
   Output: [seq_len x 5,120]

2. Positional Encoding (Chapter 6)
   RoPE applied inside each attention layer (not a separate step)

3. Transformer Blocks x 48
   For each of the 48 layers:
     a) RMSNorm (gamma: [5,120])
     b) Multi-Head Attention with GQA
        40 query heads, 8 KV heads, head_dim = 128
     c) Residual connection (add)
     d) RMSNorm (gamma: [5,120])
     e) FFN (SwiGLU, dense or MoE depending on layer)
        Dense layers: intermediate_size_mlp = 16,384
        MoE layers: intermediate_size = 8,192 per expert, 128 routed experts
     f) Residual connection (add)

4. Final RMSNorm
   gamma: [5,120]
   Normalizes the output of the last Transformer block

5. Output Projection (Language Model Head)
   Weight matrix: [5,120 x 202,048]
   Produces logits (raw scores) for each token in the vocabulary
   Softmax converts logits to probabilities

Output: [seq_len x 202,048] probability distribution over vocabulary

The final RMSNorm (step 4) is important. Without it, the output of the last Transformer block might have an unpredictable magnitude, which would make the output projection unstable. The final normalization ensures the input to the output projection is well-scaled.

In many models, the output projection matrix in step 5 shares weights with the embedding table in step 1 (this is called weight tying or tied embeddings). When weight tying is used, the same [vocab_size x hidden_size] matrix serves double duty: as the embedding lookup table (mapping token IDs to vectors) and as the output projection (mapping vectors back to vocabulary scores). This reduces the total parameter count and provides a useful inductive bias: tokens with similar embeddings will have similar output probabilities. However, not all models use weight tying. LLaMA 3, LLaMA 4, and many other modern LLMs keep the embedding and output projection as separate matrices (tie_word_embeddings = False in their configurations), which adds a significant number of parameters but gives the model more flexibility.
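The cost of keeping the two matrices separate is simple to quantify, and the mechanics of tying fit in a few lines. The first part uses the LLaMA 4 Maverick sizes given above; the second part is a toy-sized sketch (the small vocabulary and hidden size are arbitrary choices) of one matrix serving as both lookup table and output projection.

```python
import numpy as np

# Parameter cost at LLaMA 4 Maverick sizes
vocab_size, hidden_size = 202_048, 5_120
matrix_params = vocab_size * hidden_size
print(f"one [vocab x hidden] matrix: {matrix_params:,} parameters")
print(f"untied (two matrices):       {2 * matrix_params:,} parameters")

# Toy demonstration of weight tying
rng = np.random.default_rng(0)
toy_vocab, toy_hidden = 10, 4
embedding_table = rng.standard_normal((toy_vocab, toy_hidden))

token_ids = np.array([3, 7, 3])
vectors = embedding_table[token_ids]      # lookup: [seq_len, hidden]
logits = vectors @ embedding_table.T      # tied projection: [seq_len, vocab]

print("vectors shape:", vectors.shape)
print("logits shape: ", logits.shape)
```

In a real model the vectors would pass through all Transformer blocks and the final normalization before the projection; they are multiplied directly here only to show that the same matrix is used in both directions.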

Source: LLaMA 4 Maverick from HuggingFace Transformers Llama4TextConfig and Ollama model metadata: vocab_size = 202,048, hidden_size = 5,120, num_hidden_layers = 48, num_attention_heads = 40, num_key_value_heads = 8, head_dim = 128, rms_norm_eps = 1e-5. Released April 5, 2025.


What Each Layer Depth Captures

One of the most fascinating findings in the study of Transformer models is that different layers specialize in different types of linguistic processing. This is not something that is explicitly programmed; it emerges naturally from training.

The NLP Pipeline Analogy

Tenney et al. (2019) published a landmark study titled “BERT Rediscovers the Classical NLP Pipeline” (ACL 2019, arXiv:1905.05950). They used probing classifiers to test what linguistic information is captured at each layer of BERT, and found a striking pattern:

  • Early layers (layers 1-4 in a 12-layer model): Capture basic syntactic information like part-of-speech tags. These layers learn to distinguish nouns from verbs, adjectives from adverbs, and so on.
  • Middle layers (layers 5-8): Capture more complex syntactic structure like parse trees and dependency relations, as well as named entity recognition. These layers understand which words modify which other words, how phrases are nested, and which tokens are names of people, places, or organizations.
  • Later layers (layers 9-12): Capture higher-level semantic information like semantic roles (who did what to whom) and coreference resolution (which pronouns refer to which nouns).

This progression mirrors the traditional NLP pipeline, where text processing proceeds from low-level syntax to high-level semantics. The Transformer learns to organize its computation in a similar hierarchy, even though no one told it to.

Source: Tenney et al., “BERT Rediscovers the Classical NLP Pipeline,” ACL 2019 (arXiv:1905.05950). Found that linguistic tasks appear in the expected sequence across layers: POS tagging, parsing, NER, semantic roles, then coreference.

Modern Confirmation

More recent work has confirmed that this hierarchical organization persists in modern, much larger models. Li and Subramani (2025), in a study titled “Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline?” (arXiv:2506.02132), analyzed 25 models spanning from classical architectures (BERT, GPT-2) to modern LLMs (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama 3.1). They found that even in current decoder-only models, the same pattern holds: early layers capture syntax, middle layers handle semantics and entity-level information, and later layers encode discourse phenomena such as coreference resolution and semantic relations. Their analysis also revealed that lexical information concentrates linearly in early layers but becomes increasingly nonlinear deeper in the network, while inflectional morphology remains linearly accessible throughout all layers.

Source: Li and Subramani, “Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline?”, arXiv:2506.02132, June 2025. Analyzed 25 models spanning classical architectures (BERT, GPT-2) to modern LLMs (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama 3.1) across eight linguistic tasks and confirmed hierarchical organization persists in modern LLMs.

This has practical implications. When researchers want to extract syntactic information from a language model (for example, to build a parser), they get the best results from early or middle layers. When they want semantic information (for example, for question answering), later layers are more useful. The model’s depth is not redundant; each layer contributes a different level of linguistic understanding.

What This Means for Model Depth

The hierarchical organization of layers helps explain why deeper models are generally more capable. As a rough intuition rather than a measured result:

  • A 12-layer model can capture basic syntax and some semantics.
  • A 32-layer model (like LLaMA 3 8B) has enough depth for sophisticated semantic understanding and factual knowledge retrieval.
  • A 48-layer model (like LLaMA 4 Maverick) has additional capacity for complex reasoning and multi-step inference.
  • A 96-layer model (like GPT-3) has even more room for nuanced understanding and generation.

However, depth alone is not sufficient. The hidden dimension (width) determines how much information each layer can process and store. A very deep but narrow model would have many processing steps but limited capacity at each step. A very wide but shallow model would have high capacity per step but limited ability to build hierarchical representations. Modern models balance depth and width to achieve the best performance for a given parameter budget.


Hands-On: Implementing the Complete Transformer Block

Let’s implement a complete Transformer block from scratch, combining RMSNorm, multi-head attention with GQA, SwiGLU FFN, and residual connections:

import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: normalize by root mean square, then scale by gamma.
    
    x: input, shape [seq_len, hidden_size] or [hidden_size]
    gamma: learnable scale parameter, shape [hidden_size]
    """
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def swish(x):
    """Swish/SiLU activation."""
    return x * (1 / (1 + np.exp(-np.clip(x, -500, 500))))

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

def gqa_attention(x, W_Q, W_K, W_V, W_O, n_q_heads, n_kv_heads, head_dim, mask):
    """Grouped Query Attention."""
    seq_len = x.shape[0]
    group_size = n_q_heads // n_kv_heads
    Q = (x @ W_Q).reshape(seq_len, n_q_heads, head_dim)
    K = (x @ W_K).reshape(seq_len, n_kv_heads, head_dim)
    V = (x @ W_V).reshape(seq_len, n_kv_heads, head_dim)
    heads = []
    for q in range(n_q_heads):
        kv = q // group_size
        heads.append(attention(Q[:, q], K[:, kv], V[:, kv], mask))
    return np.concatenate(heads, axis=-1) @ W_O

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN."""
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

def transformer_block(x, params, mask):
    """One complete Transformer block: Norm -> Attn -> Add -> Norm -> FFN -> Add.
    
    x: input, shape [seq_len, hidden_size]
    params: dict of weight matrices and normalization parameters
    mask: causal mask, shape [seq_len, seq_len]
    """
    # Step 1-3: RMSNorm -> Attention -> Residual
    x_norm = rms_norm(x, params['gamma_attn'])
    attn_out = gqa_attention(
        x_norm,
        params['W_Q'], params['W_K'], params['W_V'], params['W_O'],
        params['n_q_heads'], params['n_kv_heads'], params['head_dim'],
        mask
    )
    h = x + attn_out  # residual connection

    # Step 4-6: RMSNorm -> FFN -> Residual
    h_norm = rms_norm(h, params['gamma_ffn'])
    ffn_out = swiglu_ffn(h_norm, params['W_gate'], params['W_up'], params['W_down'])
    output = h + ffn_out  # residual connection

    return output


# Build a small Transformer block matching LLaMA-style architecture
np.random.seed(42)
seq_len = 6
hidden_size = 64
n_q_heads = 4
n_kv_heads = 2
head_dim = hidden_size // n_q_heads  # 16
intermediate_size = int(hidden_size * 3.5)  # 224, matching LLaMA's ~3.5x ratio

# Initialize parameters
scale_attn = 0.02  # small initialization, as in real models
scale_ffn = 0.02
params = {
    'gamma_attn': np.ones(hidden_size),
    'gamma_ffn': np.ones(hidden_size),
    'W_Q': np.random.randn(hidden_size, n_q_heads * head_dim) * scale_attn,
    'W_K': np.random.randn(hidden_size, n_kv_heads * head_dim) * scale_attn,
    'W_V': np.random.randn(hidden_size, n_kv_heads * head_dim) * scale_attn,
    'W_O': np.random.randn(n_q_heads * head_dim, hidden_size) * scale_attn,
    'W_gate': np.random.randn(hidden_size, intermediate_size) * scale_ffn,
    'W_up': np.random.randn(hidden_size, intermediate_size) * scale_ffn,
    'W_down': np.random.randn(intermediate_size, hidden_size) * scale_ffn,
    'n_q_heads': n_q_heads,
    'n_kv_heads': n_kv_heads,
    'head_dim': head_dim,
}

# Create input and causal mask
x = np.random.randn(seq_len, hidden_size) * 0.5
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Run one Transformer block
output = transformer_block(x, params, mask)

print(f"Input shape:  {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Shapes match: {x.shape == output.shape}")
print()

# Verify the residual connection preserves information
# The output should be close to the input (since weights are random and small)
diff = np.mean(np.abs(output - x))
print(f"Mean absolute difference from input: {diff:.4f}")
print(f"Mean absolute value of input:        {np.mean(np.abs(x)):.4f}")
print("(With random weights, the block's contribution is small,")
print(" so the output is close to the input via the residual connection.)")
print()

# Count parameters
n_params = (
    hidden_size +  # gamma_attn
    hidden_size +  # gamma_ffn
    hidden_size * n_q_heads * head_dim +  # W_Q
    hidden_size * n_kv_heads * head_dim +  # W_K
    hidden_size * n_kv_heads * head_dim +  # W_V
    n_q_heads * head_dim * hidden_size +   # W_O
    hidden_size * intermediate_size +       # W_gate
    hidden_size * intermediate_size +       # W_up
    intermediate_size * hidden_size         # W_down
)
print(f"Total parameters in this block: {n_params:,}")
print(f"  RMSNorm params: {2 * hidden_size:,} ({2 * hidden_size / n_params:.1%})")
attn_params = (hidden_size * n_q_heads * head_dim +
               2 * hidden_size * n_kv_heads * head_dim +
               n_q_heads * head_dim * hidden_size)
ffn_params = 3 * hidden_size * intermediate_size
print(f"  Attention params: {attn_params:,} ({attn_params / n_params:.1%})")
print(f"  FFN params: {ffn_params:,} ({ffn_params / n_params:.1%})")

When you run this, you will see that the output has the same shape as the input (confirming blocks can be stacked), and that the output is relatively close to the input. This second observation demonstrates the residual connection at work: because the weights are initialized at a small scale (0.02), the attention and FFN outputs are small, so the residual path dominates and the output is approximately equal to the input. During training, the model gradually learns to make the attention and FFN contributions more meaningful.


Stacking Multiple Blocks

Let’s extend our implementation to stack multiple Transformer blocks and observe how the representation evolves through layers:

import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def swish(x):
    return x * (1 / (1 + np.exp(-np.clip(x, -500, 500))))

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

def gqa_block(x, params, mask):
    """One Transformer block."""
    seq_len = x.shape[0]
    n_q, n_kv, hd = params['n_q'], params['n_kv'], params['hd']
    group_size = n_q // n_kv

    # Norm -> Attention -> Residual
    xn = rms_norm(x, params['g1'])
    Q = (xn @ params['WQ']).reshape(seq_len, n_q, hd)
    K = (xn @ params['WK']).reshape(seq_len, n_kv, hd)
    V = (xn @ params['WV']).reshape(seq_len, n_kv, hd)
    heads = [attention(Q[:, q], K[:, q // group_size], V[:, q // group_size], mask)
             for q in range(n_q)]
    h = x + np.concatenate(heads, axis=-1) @ params['WO']

    # Norm -> FFN -> Residual
    hn = rms_norm(h, params['g2'])
    return h + (swish(hn @ params['Wg']) * (hn @ params['Wu'])) @ params['Wd']


# Configuration
np.random.seed(42)
seq_len, hidden, n_q, n_kv, hd = 8, 64, 4, 2, 16
inter = int(hidden * 3.5)
n_layers = 8
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Initialize all layers
s = (2 / hidden) ** 0.5
layers = []
for _ in range(n_layers):
    layers.append({
        'g1': np.ones(hidden), 'g2': np.ones(hidden),
        'WQ': np.random.randn(hidden, n_q * hd) * s,
        'WK': np.random.randn(hidden, n_kv * hd) * s,
        'WV': np.random.randn(hidden, n_kv * hd) * s,
        'WO': np.random.randn(n_q * hd, hidden) * s,
        'Wg': np.random.randn(hidden, inter) * s,
        'Wu': np.random.randn(hidden, inter) * s,
        'Wd': np.random.randn(inter, hidden) * (2 / inter) ** 0.5,
        'n_q': n_q, 'n_kv': n_kv, 'hd': hd,
    })

# Forward pass through all layers, tracking statistics
x = np.random.randn(seq_len, hidden) * 0.5
print(f"{'Layer':<8} {'Mean':>10} {'Std':>10} {'Max':>10} {'RMS':>10}")
print(f"{'Input':<8} {np.mean(x):>10.4f} {np.std(x):>10.4f} "
      f"{np.max(np.abs(x)):>10.4f} {np.sqrt(np.mean(x**2)):>10.4f}")

for i, layer_params in enumerate(layers):
    x = gqa_block(x, layer_params, mask)
    print(f"{'Layer ' + str(i+1):<8} {np.mean(x):>10.4f} {np.std(x):>10.4f} "
          f"{np.max(np.abs(x)):>10.4f} {np.sqrt(np.mean(x**2)):>10.4f}")

# Final normalization (as in real models)
gamma_final = np.ones(hidden)
x_final = rms_norm(x, gamma_final)
print(f"\n{'Final':<8} {np.mean(x_final):>10.4f} {np.std(x_final):>10.4f} "
      f"{np.max(np.abs(x_final)):>10.4f} {np.sqrt(np.mean(x_final**2)):>10.4f}")
print("\nNote: RMSNorm at each layer keeps the values from exploding or vanishing.")
print("Without normalization, values would grow exponentially through 8 layers.")

This code demonstrates how RMSNorm and residual connections work together. You will notice that the RMS of the activations grows through the 8 layers (because each layer’s contribution is added via the residual connection), but the growth is moderate and controlled. The RMSNorm before each sublayer ensures that the inputs to the attention and FFN operations are well-scaled, even as the residual stream accumulates information. The final RMSNorm at the end brings the output back to unit RMS. Without any normalization, the growth would be much more severe and unpredictable, eventually making the network untrainable.
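The claim about removing normalization can be checked with a minimal sketch. It is a simplification, not the listing above: it uses single-head attention, fixes gamma at 1, masks with -inf, and runs 6 layers (a conservative choice so the unnormalized values stay well within floating-point range). The names block and rms_trace and the use_norm flag are illustrative.

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    """RMSNorm with gamma fixed at 1 (a simplification for this sketch)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def swish(x):
    return x / (1 + np.exp(-np.clip(x, -500, 500)))

def block(x, p, mask, use_norm):
    """Single-head pre-norm block; use_norm=False feeds raw activations to both sublayers."""
    xn = rms_norm(x) if use_norm else x
    q, k, v = xn @ p["WQ"], xn @ p["WK"], xn @ p["WV"]
    scores = np.where(mask, -np.inf, q @ k.T / np.sqrt(q.shape[-1]))
    h = x + (softmax(scores) @ v) @ p["WO"]          # residual add
    hn = rms_norm(h) if use_norm else h
    return h + (swish(hn @ p["Wg"]) * (hn @ p["Wu"])) @ p["Wd"]  # residual add

def rms_trace(use_norm, n_layers=6, seq_len=8, hidden=64, inter=224, seed=0):
    """Return the stream RMS for the input and after each layer."""
    rng = np.random.default_rng(seed)  # same seed gives both runs identical weights
    s = (2 / hidden) ** 0.5
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    x = rng.standard_normal((seq_len, hidden)) * 0.5
    trace = [np.sqrt(np.mean(x ** 2))]
    for _ in range(n_layers):
        p = {name: rng.standard_normal((hidden, hidden)) * s
             for name in ("WQ", "WK", "WV", "WO")}
        p["Wg"] = rng.standard_normal((hidden, inter)) * s
        p["Wu"] = rng.standard_normal((hidden, inter)) * s
        p["Wd"] = rng.standard_normal((inter, hidden)) * (2 / inter) ** 0.5
        x = block(x, p, mask, use_norm)
        trace.append(np.sqrt(np.mean(x ** 2)))
    return trace

with_norm = rms_trace(use_norm=True)
without_norm = rms_trace(use_norm=False)
for i, (a, b) in enumerate(zip(with_norm, without_norm)):
    print(f"layer {i}: with norm {a:10.3f}   without norm {b:12.4g}")
```

With normalization, every sublayer sees unit-RMS input, so each layer adds a contribution of similar size. Without it, a larger stream produces larger sublayer outputs, which enlarge the stream further, and the growth compounds from layer to layer.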


Visualizing the Residual Stream

Let’s visualize how the residual stream accumulates information through layers:

import numpy as np
import matplotlib.pyplot as plt

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

# Simulate a simplified Transformer with 12 layers
np.random.seed(42)
hidden_size = 32
n_layers = 12

# Initial embedding
x = np.random.randn(hidden_size) * 0.5

# Track the residual stream and each layer's contribution
stream_history = [x.copy()]
contributions = []

for layer in range(n_layers):
    gamma = np.ones(hidden_size)
    x_norm = rms_norm(x, gamma)

    # Simulate attention + FFN contribution (random for illustration)
    W = np.random.randn(hidden_size, hidden_size) * 0.15
    contribution = np.tanh(x_norm @ W) * 0.3  # small contribution

    x = x + contribution  # residual connection
    stream_history.append(x.copy())
    contributions.append(contribution.copy())

# Plot 1: Residual stream magnitude through layers
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

magnitudes = [np.linalg.norm(s) for s in stream_history]
axes[0].plot(range(len(magnitudes)), magnitudes, 'b-o', markersize=5)
axes[0].set_xlabel('Layer')
axes[0].set_ylabel('Vector magnitude (L2 norm)')
axes[0].set_title('Residual Stream Magnitude')
axes[0].set_xticks(range(0, n_layers + 1, 2))
axes[0].grid(True, alpha=0.3)

# Plot 2: Each layer's contribution magnitude
contrib_mags = [np.linalg.norm(c) for c in contributions]
axes[1].bar(range(1, n_layers + 1), contrib_mags, color='orange', alpha=0.7)
axes[1].set_xlabel('Layer')
axes[1].set_ylabel('Contribution magnitude (L2 norm)')
axes[1].set_title("Each Layer's Contribution to the Stream")
axes[1].set_xticks(range(1, n_layers + 1))
axes[1].grid(True, alpha=0.3, axis='y')

# Plot 3: Cosine similarity between stream at each layer and final output
final = stream_history[-1]
similarities = []
for s in stream_history:
    cos_sim = np.dot(s, final) / (np.linalg.norm(s) * np.linalg.norm(final) + 1e-8)
    similarities.append(cos_sim)
axes[2].plot(range(len(similarities)), similarities, 'g-o', markersize=5)
axes[2].set_xlabel('Layer')
axes[2].set_ylabel('Cosine similarity with final output')
axes[2].set_title('How the Stream Converges to Final Output')
axes[2].set_xticks(range(0, n_layers + 1, 2))
axes[2].set_ylim(0, 1.05)
axes[2].grid(True, alpha=0.3)

plt.suptitle('The Residual Stream Through 12 Transformer Layers', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('residual_stream.png', dpi=150, bbox_inches='tight')
plt.show()
print("Plot saved to residual_stream.png")

The three plots show:

  1. Stream magnitude: The residual stream’s magnitude grows gradually as each layer adds its contribution. With normalization and small contributions, this growth is controlled and stable.

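  2. Layer contributions: Each layer adds a relatively small contribution to the stream. Because each sublayer’s output is added to the stream rather than replacing it, a layer with a small output makes an incremental refinement instead of overwriting the representation. (In this simulation, the contribution size is set by the 0.15 weight scale and the 0.3 multiplier.)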

  3. Convergence: The cosine similarity between the stream at each layer and the final output increases through layers, showing how the representation gradually converges toward the final output. Early layers are less similar to the final output (they have not yet accumulated enough information), while later layers are very similar (most of the computation is done).
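The accumulation behind these plots can also be checked numerically. The sketch below mirrors the simulation's recipe (random tanh contributions standing in for attention and FFN, an assumption made for illustration) and confirms that the final stream is the initial embedding plus the sum of every layer's contribution.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, n_layers = 32, 12

def rms_norm(x, eps=1e-5):
    # RMSNorm with gamma fixed at 1 (simplification)
    return x / np.sqrt(np.mean(x ** 2) + eps)

x0 = rng.standard_normal(hidden_size) * 0.5   # stand-in for the token embedding
x = x0.copy()
contributions = []
for _ in range(n_layers):
    W = rng.standard_normal((hidden_size, hidden_size)) * 0.15
    c = np.tanh(rms_norm(x) @ W) * 0.3        # stand-in for a sublayer's output
    contributions.append(c)
    x = x + c                                  # residual connection

# The stream is purely additive: embedding + all contributions
reconstructed = x0 + np.sum(contributions, axis=0)
print("Stream equals embedding + sum of contributions:",
      np.allclose(x, reconstructed))
```

This additive structure explains why the cosine similarity with the final output rises layer by layer: each intermediate stream already contains a growing share of the terms that make up the final sum.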


The RMSNorm Epsilon: A Small but Critical Detail

You may have noticed the epsilon parameter (eps) in the RMSNorm formula:

RMS(x) = sqrt((1/d) * sum(x_i^2) + eps)

This tiny constant (typically 1e-5 or 1e-6) prevents division by zero when all elements of x happen to be zero or very close to zero. Without epsilon, a zero vector would produce RMS = 0, and the division would produce NaN (Not a Number) values. These do not stop the program. Instead, they silently propagate through every later operation and corrupt the entire forward pass.

In practice, the epsilon value is a model configuration parameter. LLaMA 4 uses rms_norm_eps = 1e-5 (0.00001). This is small enough to have negligible effect on normal computations but large enough to prevent numerical instability.

Source: LLaMA 4 configuration from HuggingFace Transformers Llama4TextConfig: rms_norm_eps = 1e-05.
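A minimal sketch of the failure mode described above, using a zero vector. The eps=0.0 call is for demonstration only, and the variable names are illustrative.

```python
import numpy as np

def rms_norm(x, gamma, eps):
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

gamma = np.ones(4)
zero = np.zeros(4)

# Without epsilon: 0 / 0 yields NaN (warnings suppressed for the demo)
with np.errstate(divide="ignore", invalid="ignore"):
    without_eps = rms_norm(zero, gamma, eps=0.0)

# With epsilon: the denominator is sqrt(1e-5), so the result is a clean zero vector
with_eps = rms_norm(zero, gamma, eps=1e-5)

# On an ordinary vector, epsilon barely changes the result
x = np.array([1.2, -0.8, 0.5, 0.3])
shift = np.max(np.abs(rms_norm(x, gamma, 1e-5) - rms_norm(x, gamma, 0.0)))

print("zero vector, eps=0:   ", without_eps)
print("zero vector, eps=1e-5:", with_eps)
print(f"largest change on a normal vector: {shift:.2e}")
```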


Worked Example: Tracing Through a Complete Block

Let’s trace through a single Transformer block with concrete numbers, using a tiny hidden_size of 4 for readability. This example shows exactly how normalization, attention, residual connections, and the FFN interact.

Setup

Input vector for a single token (after previous layers):

x = [1.2, -0.8, 0.5, 0.3]

RMSNorm gamma parameters:

gamma_attn = [1.0, 1.0, 1.0, 1.0]   (initialized to 1.0)
gamma_ffn  = [1.0, 1.0, 1.0, 1.0]

Step 1: Pre-Attention RMSNorm

Compute RMS of x:

RMS(x) = sqrt((1.2^2 + (-0.8)^2 + 0.5^2 + 0.3^2) / 4 + 1e-5)
       = sqrt((1.44 + 0.64 + 0.25 + 0.09) / 4 + 1e-5)
       = sqrt(2.42 / 4 + 1e-5)
       = sqrt(0.605 + 0.00001)
       = sqrt(0.60501)
       = 0.7778

Normalize:

x_norm = gamma * x / RMS(x)
       = [1.0, 1.0, 1.0, 1.0] * [1.2, -0.8, 0.5, 0.3] / 0.7778
       = [1.543, -1.029, 0.643, 0.386]

Every element has been scaled by the same factor (1 / 0.7778, or about 1.286), so the relative proportions between dimensions are unchanged, but the overall RMS is now approximately 1.0. This ensures the attention computation receives well-scaled inputs.

Step 2: Attention (Simplified)

For this example, assume the attention operation (using the normalized input) produces:

attn_out = [0.15, -0.08, 0.22, -0.05]

(In a real model, this would involve Q, K, V projections, scaled dot-product attention, and the output projection, as detailed in Chapter 8.)

Step 3: Residual Connection (Post-Attention)

h = x + attn_out
  = [1.2, -0.8, 0.5, 0.3] + [0.15, -0.08, 0.22, -0.05]
  = [1.35, -0.88, 0.72, 0.25]

The original input x is preserved, and the attention output is added on top. The attention contribution is relatively small compared to the input, which is typical: each layer makes an incremental refinement rather than a dramatic transformation.

Step 4: Pre-FFN RMSNorm

Compute RMS of h:

RMS(h) = sqrt((1.35^2 + (-0.88)^2 + 0.72^2 + 0.25^2) / 4 + 1e-5)
       = sqrt((1.8225 + 0.7744 + 0.5184 + 0.0625) / 4 + 1e-5)
       = sqrt(3.1778 / 4 + 1e-5)
       = sqrt(0.79445 + 0.00001)
       = 0.8913

Normalize:

h_norm = gamma * h / RMS(h)
       = [1.35, -0.88, 0.72, 0.25] / 0.8913
       = [1.515, -0.987, 0.808, 0.280]

Step 5: FFN (Simplified)

Assume the SwiGLU FFN produces:

ffn_out = [-0.12, 0.18, -0.06, 0.14]

Step 6: Residual Connection (Post-FFN)

output = h + ffn_out
       = [1.35, -0.88, 0.72, 0.25] + [-0.12, 0.18, -0.06, 0.14]
       = [1.23, -0.70, 0.66, 0.39]

The final output of this Transformer block is [1.23, -0.70, 0.66, 0.39]. This becomes the input to the next block. Notice how the output is similar to the original input [1.2, -0.8, 0.5, 0.3] but has been refined by the attention and FFN operations. The residual connections ensure that the core information is preserved while the sublayers add incremental updates.
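The six steps above can be reproduced in a few lines of NumPy. The sketch uses the same input, the same assumed attn_out and ffn_out vectors, and gamma set to 1. Because the sublayer outputs are fixed by assumption, x_norm and h_norm are computed only to confirm the hand arithmetic.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    return gamma * x / np.sqrt(np.mean(x ** 2) + eps)

x = np.array([1.2, -0.8, 0.5, 0.3])
gamma = np.ones(4)
attn_out = np.array([0.15, -0.08, 0.22, -0.05])  # assumed, as in Step 2
ffn_out = np.array([-0.12, 0.18, -0.06, 0.14])   # assumed, as in Step 5

x_norm = rms_norm(x, gamma)      # Step 1: what the attention sublayer would read
h = x + attn_out                 # Step 3: residual add uses the un-normalized x
h_norm = rms_norm(h, gamma)      # Step 4: what the FFN sublayer would read
output = h + ffn_out             # Step 6: residual add uses the un-normalized h

for name, vec in [("x_norm", x_norm), ("h", h), ("h_norm", h_norm), ("output", output)]:
    print(f"{name:<7}", np.round(vec, 3))
```

Note that the normalized vectors feed only the sublayers. The residual additions always use the un-normalized stream, which is what defines pre-norm placement.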


Real Parameter Counts: The Full Model

Let’s compute the complete parameter count for LLaMA 3 8B, including all components:

Per-Layer Parameters

Component                  Calculation         Parameters
RMSNorm (pre-attention)    4,096                    4,096
W_Q                        4,096 x 4,096       16,777,216
W_K                        4,096 x 1,024        4,194,304
W_V                        4,096 x 1,024        4,194,304
W_O                        4,096 x 4,096       16,777,216
RMSNorm (pre-FFN)          4,096                    4,096
W_gate                     4,096 x 14,336      58,720,256
W_up                       4,096 x 14,336      58,720,256
W_down                     14,336 x 4,096      58,720,256
Layer total                                   218,112,000

Model-Wide Parameters

Component                  Calculation           Parameters
Token embedding            128,256 x 4,096      525,336,576
32 Transformer layers      32 x 218,112,000   6,979,584,000
Final RMSNorm              4,096                      4,096
Output projection          128,256 x 4,096      525,336,576
Model total                                   ~8.03 billion

LLaMA 3 8B does not use weight tying (tie_word_embeddings = False in its configuration), so the output projection is a separate matrix with the same shape as the embedding table. This accounts for the full ~8 billion parameter count.

The breakdown shows that the 32 Transformer layers contain about 87% of the model’s parameters. The embedding table and output projection together account for about 13%. The RMSNorm layers contribute a negligible fraction of the total, and the residual connections add no parameters at all, since they are plain additions.

Source: LLaMA 3 8B from Meta (April 18, 2024): vocab_size = 128,256, hidden_size = 4,096, intermediate_size = 14,336, num_attention_heads = 32, num_key_value_heads = 8, head_dim = 128, num_hidden_layers = 32, tie_word_embeddings = False.
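The tables can be reproduced from the configuration values in the source note. The sketch below is plain arithmetic on those values; the variable names and dictionary layout are illustrative, not a HuggingFace API.

```python
cfg = {
    "vocab_size": 128_256, "hidden_size": 4_096, "intermediate_size": 14_336,
    "num_attention_heads": 32, "num_key_value_heads": 8, "head_dim": 128,
    "num_hidden_layers": 32, "tie_word_embeddings": False,
}

h = cfg["hidden_size"]
q_dim = cfg["num_attention_heads"] * cfg["head_dim"]     # 4,096
kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]    # 1,024

attn = 2 * h * q_dim + 2 * h * kv_dim        # W_Q, W_O, W_K, W_V
ffn = 3 * h * cfg["intermediate_size"]       # W_gate, W_up, W_down
norms = 2 * h                                # two RMSNorm gammas per layer
layer_total = attn + ffn + norms

embed = cfg["vocab_size"] * h
lm_head = 0 if cfg["tie_word_embeddings"] else embed     # separate output matrix
model_total = embed + cfg["num_hidden_layers"] * layer_total + h + lm_head

print(f"Per layer:   {layer_total:,}")
print(f"  attention: {attn:,} ({attn / layer_total:.1%})")
print(f"  FFN:       {ffn:,} ({ffn / layer_total:.1%})")
print(f"  RMSNorm:   {norms:,} ({norms / layer_total:.4%})")
print(f"Model total: {model_total:,}")
print(f"Share in Transformer layers: "
      f"{cfg['num_hidden_layers'] * layer_total / model_total:.1%}")
```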


Hands-On: Comparing Normalization Strategies

Let’s implement and compare LayerNorm, RMSNorm, and no normalization to see the practical difference:

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm: center, scale, then apply learnable params."""
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: scale by RMS, then apply learnable gamma."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def simulate_deep_network(x, n_layers, norm_fn):
    """Simulate a deep network with the given normalization."""
    hidden = x.shape[-1]
    magnitudes = [np.sqrt(np.mean(x ** 2))]

    for _ in range(n_layers):
        # Simulate a layer transformation (random matrix multiply)
        W = np.random.randn(hidden, hidden) * (2 / hidden) ** 0.5
        contribution = np.tanh(x @ W) * 0.5

        x = x + contribution  # residual connection

        if norm_fn is not None:
            x = norm_fn(x)

        magnitudes.append(np.sqrt(np.mean(x ** 2)))

    return magnitudes


np.random.seed(42)
hidden_size = 64
n_layers = 32
x_init = np.random.randn(8, hidden_size) * 0.5  # 8 tokens

# Normalization functions with pre-initialized parameters
gamma = np.ones(hidden_size)
beta = np.zeros(hidden_size)

results = {}

# No normalization
np.random.seed(42)
results['No normalization'] = simulate_deep_network(
    x_init.copy(), n_layers, norm_fn=None
)

# LayerNorm
np.random.seed(42)
results['LayerNorm'] = simulate_deep_network(
    x_init.copy(), n_layers,
    norm_fn=lambda x: layer_norm(x, gamma, beta)
)

# RMSNorm
np.random.seed(42)
results['RMSNorm'] = simulate_deep_network(
    x_init.copy(), n_layers,
    norm_fn=lambda x: rms_norm(x, gamma)
)

# Print results
print(f"{'Layer':<8} {'No Norm':>12} {'LayerNorm':>12} {'RMSNorm':>12}")
print("-" * 46)
for i in range(0, n_layers + 1, 4):
    print(f"{i:<8} {results['No normalization'][i]:>12.4f} "
          f"{results['LayerNorm'][i]:>12.4f} "
          f"{results['RMSNorm'][i]:>12.4f}")

print(f"\nFinal RMS values after {n_layers} layers:")
for name, mags in results.items():
    print(f"  {name}: {mags[-1]:.4f}")
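print("\nWith normalization, the RMS is reset to about 1 at every layer.")
print("Without normalization, the RMS tends to keep growing with depth.")

This experiment shows the practical effect of normalization. Note that, unlike the pre-norm blocks above, this simulation normalizes the stream itself after each residual addition, so with LayerNorm or RMSNorm the RMS is reset to approximately 1 at every layer. Without any normalization, nothing resets the scale: each layer’s contribution accumulates in the stream, and the RMS tends to keep growing with depth. The growth is gradual here only because tanh bounds each contribution; with unbounded sublayers it would be much faster.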


The Transformer Block in the Full Pipeline

Let’s put everything together and trace a complete forward pass through a small but complete Transformer model, from token IDs to next-token probabilities:

import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def swish(x):
    return x * (1 / (1 + np.exp(-np.clip(x, -500, 500))))

def attention(Q, K, V, mask):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

def forward_pass(token_ids, embedding_table, layers, final_gamma):
    """Complete forward pass through a Transformer LM."""
    # Step 1: Embedding lookup
    x = embedding_table[token_ids]  # [seq_len, hidden]
    seq_len, hidden = x.shape
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    # Step 2: Pass through each Transformer layer
    for p in layers:
        n_q, n_kv, hd = p['n_q'], p['n_kv'], p['hd']
        gs = n_q // n_kv

        # Norm -> Attention -> Residual
        xn = rms_norm(x, p['g1'])
        Q = (xn @ p['WQ']).reshape(seq_len, n_q, hd)
        K = (xn @ p['WK']).reshape(seq_len, n_kv, hd)
        V = (xn @ p['WV']).reshape(seq_len, n_kv, hd)
        heads = [attention(Q[:, q], K[:, q // gs], V[:, q // gs], mask)
                 for q in range(n_q)]
        x = x + np.concatenate(heads, axis=-1) @ p['WO']

        # Norm -> FFN -> Residual
        hn = rms_norm(x, p['g2'])
        x = x + (swish(hn @ p['Wg']) * (hn @ p['Wu'])) @ p['Wd']

    # Step 3: Final normalization
    x = rms_norm(x, final_gamma)

    # Step 4: Output projection (using embedding table = weight tying)
    logits = x @ embedding_table.T  # [seq_len, vocab_size]

    # Step 5: Softmax to get probabilities
    probs = softmax(logits, axis=-1)
    return logits, probs


# Build a tiny model
np.random.seed(42)
vocab_size = 20
hidden = 32
n_q, n_kv, hd = 4, 2, 8
inter = int(hidden * 3.5)
n_layers = 4

embedding_table = np.random.randn(vocab_size, hidden) * 0.1
s = (2 / hidden) ** 0.5

layers = []
for _ in range(n_layers):
    layers.append({
        'g1': np.ones(hidden), 'g2': np.ones(hidden),
        'WQ': np.random.randn(hidden, n_q * hd) * s,
        'WK': np.random.randn(hidden, n_kv * hd) * s,
        'WV': np.random.randn(hidden, n_kv * hd) * s,
        'WO': np.random.randn(n_q * hd, hidden) * s,
        'Wg': np.random.randn(hidden, inter) * s,
        'Wu': np.random.randn(hidden, inter) * s,
        'Wd': np.random.randn(inter, hidden) * (2 / inter) ** 0.5,
        'n_q': n_q, 'n_kv': n_kv, 'hd': hd,
    })
final_gamma = np.ones(hidden)

# Forward pass
token_ids = np.array([3, 7, 12, 1, 15])  # 5 tokens
logits, probs = forward_pass(token_ids, embedding_table, layers, final_gamma)

print(f"Input tokens: {token_ids}")
print(f"Logits shape: {logits.shape}  (seq_len x vocab_size)")
print(f"Probs shape:  {probs.shape}")
print()

# Show predictions for the last token
last_probs = probs[-1]
top5 = np.argsort(last_probs)[-5:][::-1]
print("Next-token predictions (from last position):")
for rank, idx in enumerate(top5):
    print(f"  #{rank+1}: token {idx} with probability {last_probs[idx]:.4f}")

print(f"\nAll probabilities sum to: {np.sum(last_probs):.6f}")
print(f"(Should be 1.0 due to softmax)")

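This is a complete, runnable Transformer language model in miniature (4 layers, 32-dimensional hidden state, 20-token vocabulary). For brevity it omits explicit positional encoding and reuses the embedding table as the output projection (weight tying, unlike LLaMA 3 8B), but it wires together the core components we have covered in Chapters 4 through 10: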

  1. Embedding lookup (Chapter 5): Convert token IDs to vectors
  2. Multi-head attention with GQA (Chapter 8): Gather information from other tokens
  3. SwiGLU FFN (Chapter 9): Process each token independently
  4. RMSNorm (this chapter): Stabilize values before each sublayer
  5. Residual connections (this chapter): Preserve information and enable gradient flow
  6. Output projection + softmax: Convert final vectors to next-token probabilities

The model produces a probability distribution over the vocabulary for each position. The prediction at the last position tells us what token the model thinks should come next. With random weights, these predictions are essentially random. After training on real text (which we will cover in Chapter 14), the model would learn to make meaningful predictions.


Key Takeaways

  • Layer normalization keeps the values flowing through a deep network in a stable range, preventing the exploding and vanishing activation problem. Without normalization, values would drift to extreme magnitudes after passing through dozens of layers, making the network untrainable.

  • RMSNorm is the normalization method used by many major modern LLMs, including LLaMA, Mistral, and DeepSeek. It normalizes each token’s vector by dividing by the root mean square of its elements, then scaling by a learnable parameter gamma. RMSNorm is simpler and faster than the original LayerNorm because it skips the mean-centering step.

  • Pre-norm placement (normalizing before each sublayer) is the standard choice in modern LLMs, preferred over post-norm (normalizing after). Xiong et al. (ICML 2020) showed that pre-norm produces well-behaved gradients at initialization, which allows training without the learning rate warm-up stage that post-norm requires and supports stable training of very deep models.

  • Residual connections (skip connections) add the input of each sublayer directly to its output: output = input + Sublayer(input). This creates a direct path for gradients to flow through the network without being transformed by layer weights, solving the vanishing gradient problem. Introduced by He et al. (CVPR 2016) for image recognition, residual connections are essential for training networks with dozens or hundreds of layers.

  • The complete Transformer block follows the pattern: RMSNorm, Attention, Residual Add, RMSNorm, FFN, Residual Add. Each block has two residual connections and two normalization operations. The output has the same shape as the input, allowing blocks to be stacked.

  • Modern LLMs stack 32 to 96+ Transformer blocks. LLaMA 3 8B uses 32 layers, LLaMA 4 Maverick uses 48 layers, DeepSeek-V3 uses 61 layers, and GPT-3 uses 96 layers. Each layer adds a different level of linguistic understanding, from basic syntax in early layers to complex semantics and reasoning in later layers.

  • Research by Tenney et al. (ACL 2019) showed that Transformer layers naturally organize into a hierarchy: early layers capture syntax (part-of-speech, basic grammar), middle layers capture semantics (meaning, entity recognition), and later layers capture discourse-level phenomena (coreference, semantic relations). Li and Subramani (arXiv:2506.02132, 2025) confirmed this pattern persists in modern large-scale models across 25 architectures.

  • The residual stream (a framing from Elhage et al., 2021) is a useful mental model: the token embedding creates a vector that flows through the entire model, with each layer’s attention and FFN operations reading from and writing to this stream via addition. The final output is the sum of the original embedding plus all layers’ contributions.

  • In LLaMA 3 8B, each Transformer block contains approximately 218 million parameters, of which the FFN accounts for 80.8%, attention for 19.2%, and RMSNorm for less than 0.01%. Across all 32 layers, the Transformer blocks contain about 87% of the model’s total ~8 billion parameters, with the embedding table and output projection accounting for the remaining 13%.


What’s Next

You now understand all the components of a single Transformer block and how they work together: RMSNorm stabilizes values, attention gathers context, the FFN processes information, and residual connections preserve the gradient highway. You have seen how stacking these blocks creates deep networks capable of sophisticated language understanding. In Chapter 11, we will zoom out to examine model sizes and what they mean: how the number of parameters, hidden dimensions, and layer counts translate into real-world capabilities, memory requirements, and computational costs.