Chapter 27. Build a Tiny Transformer from Scratch

You have spent 26 chapters learning how language models work: tokenization, embeddings, attention, feed-forward networks, layer normalization, residual connections, training, alignment, and inference. Now you are going to build one. Not a toy diagram. Not pseudocode. A real, working Transformer that you can train on your own machine and use to generate text. By the end of this chapter, you will have written every component yourself in Python and PyTorch, trained it on Shakespeare’s plays, and watched it generate new text character by character. Every line of code maps directly to a concept from an earlier chapter, and we will point out exactly which one as we go.


What We Are Building

Our goal is a character-level language model: a Transformer that reads sequences of characters and predicts the next character. This is the same fundamental task described in Chapter 1 (predicting the next token), except our “tokens” are individual characters instead of subword pieces. We use characters instead of subwords for two reasons: the vocabulary is tiny (65 characters in Shakespeare versus 100,000+ subword tokens in production models), which means the model trains fast on a single machine, and every step is transparent because you can read individual characters without needing a tokenizer lookup table.

The dataset is Andrej Karpathy’s Tiny Shakespeare corpus: 1,115,394 characters of Shakespeare’s plays, containing roughly 40,000 lines of dialogue from works like The Tempest, Hamlet, Romeo and Juliet, and Coriolanus. The dataset has exactly 65 unique characters: 26 lowercase letters, 26 uppercase letters, ten punctuation marks and symbols, the digit 3, the space, and the newline. This dataset has become the “Hello World” of language model training, used in Karpathy’s widely viewed “Let’s build GPT” tutorial (nearly 7 million YouTube views as of March 2026) and in his nanoGPT repository. In February 2026, Karpathy took the concept even further with microgpt.py: a complete GPT implementation (training and inference) in just 243 lines of pure, dependency-free Python (about 200 lines of code, the rest being comments and blank lines), with no PyTorch, no NumPy, and no external libraries at all. Then in March 2026, he released autoresearch: a 630-line script that lets an AI agent autonomously run LLM training experiments, modify its own training code, and commit improvements to git, all on a single GPU. It hit over 30,000 GitHub stars in its first week.

Source: Tiny Shakespeare dataset: ~1,115,394 characters, 65 unique characters, ~40,000 lines (confirmed from huggingface.co/datasets/karpathy/tiny_shakespeare, jackluu.io, hakyimlab.org). Karpathy’s “Let’s build GPT” video: 6,975,716 views (confirmed from summify.io). Karpathy’s microgpt.py: 243 total lines (200 lines of code per karpathy.github.io/2026/02/12/microgpt), released February 11, 2026 (confirmed from karpathy.github.io, blockchain.news, analyticsvidhya.com, generativeai.pub). Karpathy’s autoresearch: 630 lines, released March 7, 2026, 30,307 stars in first week (confirmed from rywalker.com/research/autoresearch, launchberg.com, blockchain.news, forbes.com).

Our model will be small: 6 layers, 6 attention heads, 384-dimensional embeddings, and a context window of 256 characters. That comes to roughly 10.8 million parameters. For comparison, GPT-2 Small has 124 million parameters, and GPT-5.4 has an undisclosed but vastly larger count. Our model is about 11x smaller than GPT-2 Small, but it uses the exact same architectural building blocks. The difference between our model and a frontier model is scale, not design.

Prerequisites

You need Python 3.10+ and a recent PyTorch 2.x release. PyTorch 2.11 is the latest stable release as of March 2026, and PyTorch 2.10 from January 2026 also works. If you have a CUDA-capable GPU, training takes about 5 minutes. On a CPU, it takes 30-60 minutes. Either works.

# Install PyTorch (if you haven't already).
# Visit https://pytorch.org/get-started/locally/ for the command
# matching your system. For example:
#   pip install torch

Source: PyTorch 2.11 released March 18, 2026; PyTorch 2.10 released January 21, 2026 (confirmed from dev-discuss.pytorch.org/t/pytorch-release-2-11-key-dates/3292, pytorch.org/blog/pytorch-2-10-release-blog).


Step 1: Load and Explore the Data

Every language model starts with data. Ours starts with a single text file. The code below downloads the Tiny Shakespeare dataset and inspects it.

import torch
import torch.nn as nn
from torch.nn import functional as F

# Download the dataset (run once).
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Dataset size: {len(text):,} characters")
print(f"First 200 characters:\n{text[:200]}")

Output:

Dataset size: 1,115,394 characters
First 200 characters:
First Citizen:
Before we proceed any further, hear me speak.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

That is the opening of Coriolanus. The entire dataset is formatted as play scripts: character names followed by colons, then their dialogue.


Step 2: Build a Character-Level Tokenizer

In Chapter 4, we covered tokenization in detail: how BPE, SentencePiece, and tiktoken break text into subword tokens. Here, we build the simplest possible tokenizer: one that maps each unique character to an integer and back. This is the same concept (text to numbers), just at the character level instead of the subword level.

# Get all unique characters in the dataset, sorted.
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(f"Characters: {''.join(chars)}")

# Build the mapping tables.
# stoi: string-to-integer (encode)
# itos: integer-to-string (decode)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Convert a string to a list of integers."""
    return [stoi[c] for c in s]

def decode(ids):
    """Convert a list of integers back to a string."""
    return ''.join(itos[i] for i in ids)

# Test it.
print(encode("Hello"))   # e.g., [20, 43, 50, 50, 53]
print(decode(encode("Hello")))  # "Hello"

The vocabulary has 65 entries. Because the characters are sorted, character 0 is the newline and character 1 is the space, followed by the punctuation marks, the digit 3, the uppercase letters, and the lowercase letters. Production models like GPT-5.4 use vocabularies of 100,000+ subword tokens (as discussed in Chapter 4), but the principle is identical: every piece of text gets mapped to an integer index.

Encode the Entire Dataset

Now we convert the entire Shakespeare corpus into a single PyTorch tensor of integers. This is the model’s training data.

data = torch.tensor(encode(text), dtype=torch.long)
print(f"Data shape: {data.shape}")  # torch.Size([1115394])
print(f"Data type:  {data.dtype}")  # torch.int64
print(f"First 20 tokens: {data[:20]}")

Each number in this tensor is an index into our 65-character vocabulary. The model’s job is to look at a sequence of these numbers and predict what number comes next.

Train/Validation Split

We hold out the last 10% of the data for validation. This lets us check whether the model is memorizing the training data or actually learning patterns that generalize.

n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
print(f"Training set:   {len(train_data):,} characters")
print(f"Validation set: {len(val_data):,} characters")

Output:

Training set:   1,003,854 characters
Validation set: 111,540 characters

Step 3: Create Training Batches

We cannot feed the entire million-character sequence into the model at once. Instead, we sample random chunks of a fixed length (the context window, or block size) and stack multiple chunks into a batch. This is the same concept as the context window discussed in Chapter 20, just much smaller.

# Hyperparameters.
batch_size = 64      # How many sequences to process in parallel.
block_size = 256     # Maximum context length (characters).
device = 'cuda' if torch.cuda.is_available() else 'cpu'

def get_batch(split):
    """
    Sample a random batch of training examples.
    Returns:
        x: input sequences,  shape (batch_size, block_size)
        y: target sequences, shape (batch_size, block_size)
    Each target is the input shifted by one position, so y[t] is the
    character that follows x[t] in the text.
    """
    d = train_data if split == 'train' else val_data
    # Pick random starting positions.
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix])
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

# Example: grab one batch and inspect it.
xb, yb = get_batch('train')
print(f"Input shape:  {xb.shape}")   # torch.Size([64, 256])
print(f"Target shape: {yb.shape}")   # torch.Size([64, 256])

The target y is simply the input x shifted by one position. If the input is “To be or not to b”, the target is “o be or not to be”. At every position, the model must predict the next character. This is the autoregressive training objective from Chapter 14: predict the next token, for every token in the sequence, all at once.
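
To make the shift concrete, here is a minimal sketch that applies the same slicing as get_batch to a short string instead of a tensor. The block size of 8 and the fixed offset are illustrative assumptions (the chapter uses 256 and a random offset), and the variable names are chosen only for this example.

```python
# Toy version of get_batch's slicing on a short string (no PyTorch needed).
text = "To be or not to be"
block_size = 8   # illustrative value; the chapter uses 256
i = 0            # starting offset; get_batch picks this at random

x = text[i : i + block_size]          # input window
y = text[i + 1 : i + block_size + 1]  # same window, one character later

# Each position t yields one training example:
# the context x[:t+1] should predict the character y[t].
examples = [(x[: t + 1], y[t]) for t in range(block_size)]
for context, target in examples:
    print(f"{context!r:>12} -> {target!r}")
```

A single window of 8 characters therefore produces 8 training examples, which is why one batch of shape (64, 256) contains 64 x 256 = 16,384 next-character predictions.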


Step 4: The Embedding Layers

In Chapter 5, we explained that embeddings convert token indices into dense vectors of real numbers. Our model needs two embedding tables:

  1. Token embeddings: convert each character index (0-64) into a 384-dimensional vector. This is the learned representation of each character’s “meaning.”
  2. Position embeddings: convert each position in the sequence (0-255) into a 384-dimensional vector. This tells the model where each character sits in the sequence, since Transformers have no built-in sense of order (as discussed in Chapter 6).

These two embeddings are added together to produce the input to the Transformer layers. In code:

n_embd = 384   # Embedding dimension (hidden size).
n_head = 6     # Number of attention heads.
n_layer = 6    # Number of Transformer blocks.
dropout = 0.2  # Dropout rate for regularization.

# These will be part of our model class (shown later).
# token_embedding = nn.Embedding(vocab_size, n_embd)    # 65 x 384
# position_embedding = nn.Embedding(block_size, n_embd) # 256 x 384

The token embedding table has 65 x 384 = 24,960 parameters. The position embedding table has 256 x 384 = 98,304 parameters. Together, that is about 123,000 parameters just for the input layer. In a production model like LLaMA 4 Maverick, the token embedding table alone has about 2.5 billion parameters (202,048 vocabulary x 12,288 hidden dimensions). Same concept, vastly different scale.

Source: LLaMA 4 Maverick: vocab_size 202,048, hidden_size 12,288, 120 layers, 96 attention heads (confirmed from huggingface.co/docs/transformers/main/model_doc/llama4, apxml.com/models/llama-4-maverick).

When the model processes a batch, it looks up the token embedding for each character and the position embedding for each position, then adds them:

input = token_embedding[character_index] + position_embedding[position]

The result is a tensor of shape (batch_size, block_size, n_embd), or (64, 256, 384) with our hyperparameters. Each of the 64 sequences in the batch has 256 positions, and each position is represented by a 384-dimensional vector. This tensor flows into the Transformer blocks.
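
The lookup-and-add step can be sketched in plain Python with nested lists standing in for the two nn.Embedding tables. The sizes (5 tokens, 4 positions, 3 dimensions) and the random values are assumptions chosen to keep the output readable, and embed is a hypothetical helper name, not part of the model code.

```python
import random

random.seed(0)
vocab_size, block_size, n_embd = 5, 4, 3   # toy sizes; the chapter uses 65, 256, 384

# Two lookup tables of small random numbers (stand-ins for nn.Embedding weights).
token_table = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(vocab_size)]
position_table = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(block_size)]

def embed(indices):
    """Return one n_embd-sized vector per position: token row + position row."""
    return [
        [tok + pos for tok, pos in zip(token_table[ch], position_table[t])]
        for t, ch in enumerate(indices)
    ]

sequence = [2, 0, 2]        # the same character (2) appears at positions 0 and 2
vectors = embed(sequence)
print(len(vectors), len(vectors[0]))   # number of positions, n_embd
print(vectors[0] != vectors[2])        # same token, different position
```

The last line shows why the position table matters: the same character at two different positions ends up with two different input vectors.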

Note: production models like LLaMA and Mistral use Rotary Position Embeddings (RoPE) instead of learned absolute position embeddings (Chapter 6). RoPE encodes position information by rotating the query and key vectors, which generalizes better to longer sequences. For our small model, learned absolute embeddings work fine and are simpler to implement.


Step 5: Self-Attention (One Head)

This is the core of the Transformer, and it maps directly to Chapter 7. Self-attention lets each character in the sequence look at itself and all the characters that came before it, and decide which ones are relevant for predicting the next character.

Here is how a single attention head works, step by step:

  1. Each position’s 384-dimensional vector is projected into three smaller vectors: a Query (Q), a Key (K), and a Value (V). Each has dimension head_size = n_embd // n_head = 384 // 6 = 64.
  2. The Query at each position is compared (via dot product) with the Key at every position, including its own. This produces an attention score matrix of shape (block_size, block_size), or (256, 256).
  3. The scores are scaled by 1 / sqrt(head_size) to prevent them from becoming too large (which would make the softmax output too peaked). This is the “scaled” in “scaled dot-product attention” from the original Transformer paper.
  4. A causal mask is applied: each position can only attend to positions at or before itself, not future positions. This is enforced by setting the upper-triangle entries of the score matrix to negative infinity before softmax, so they become zero after softmax. This is the causal masking described in Chapter 7.
  5. Softmax converts the scores into attention weights: a probability distribution over the current position and all previous positions.
  6. The attention weights are multiplied with the Value vectors to produce the output: a weighted combination of information from all attended positions.

class Head(nn.Module):
    """One head of self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # The causal mask is not a learned parameter, so we
        # register it as a buffer (it moves to GPU with the model
        # but is not updated by the optimizer).
        self.register_buffer(
            'tril', torch.tril(torch.ones(block_size, block_size))
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape  # batch, time (sequence length), channels (n_embd)
        k = self.key(x)     # (B, T, head_size)
        q = self.query(x)   # (B, T, head_size)

        # Compute attention scores: Q @ K^T / sqrt(head_size).
        # This is the "how relevant is position j to position i?" matrix.
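        # Scale by the head size taken from the key tensor itself, so the
        # factor stays correct even if head_size is not n_embd // n_head.
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)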

        # Apply causal mask: positions cannot attend to the future.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))

        # Softmax converts scores to probabilities (Chapter 2).
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)

        # Weighted aggregation of values.
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v      # (B, T, head_size)
        return out

Let us trace the shapes for one head with our hyperparameters:

| Step | Operation | Shape |
| --- | --- | --- |
| Input | x | (64, 256, 384) |
| Key projection | k = key(x) | (64, 256, 64) |
| Query projection | q = query(x) | (64, 256, 64) |
| Attention scores | q @ k.T | (64, 256, 256) |
| After mask + softmax | wei | (64, 256, 256) |
| Value projection | v = value(x) | (64, 256, 64) |
| Output | wei @ v | (64, 256, 64) |

The output is 64-dimensional per position (the head size), not 384. That is because this is just one of six heads. The next step combines all six.
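
To see the scoring, scaling, masking, softmax, and aggregation steps on numbers small enough to check by hand, the following sketch reimplements them for a single sequence in plain Python. It is an illustration under stated assumptions, not the Head class: the query, key, and value vectors are made-up numbers supplied directly rather than produced by learned projections, and causal_attention is a hypothetical helper name.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask for one sequence.

    q, k, v are lists of T vectors, each of length head_size.
    Returns (weights, out): a T x T weight matrix and T output vectors.
    """
    T, head_size = len(q), len(q[0])
    scale = head_size ** -0.5
    weights = []
    for i in range(T):
        # Scores for position i against every position j; future positions get -inf.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) * scale if j <= i else float("-inf")
            for j in range(T)
        ]
        weights.append(softmax(scores))
    # Each output is the weighted sum of the value vectors.
    out = [
        [sum(weights[i][j] * v[j][d] for j in range(T)) for d in range(len(v[0]))]
        for i in range(T)
    ]
    return weights, out

# Made-up numbers for a 3-position sequence with head_size = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, out = causal_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed matrix is lower-triangular and every row sums to 1. The first position can only attend to itself, so its output is exactly its own value vector.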


Step 6: Multi-Head Attention

Chapter 8 explained why one attention head is not enough: different heads learn to attend to different types of relationships. One head might learn to look at the previous character (local context), another might learn to look at the character name at the start of a speech (long-range dependency), and another might track punctuation patterns.

Multi-head attention simply runs multiple heads in parallel and concatenates their outputs:

class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # Output projection.
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Run all heads in parallel, concatenate along the last dimension.
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        # Project back to n_embd dimensions.
        out = self.dropout(self.proj(out))
        return out

With 6 heads, each producing 64-dimensional output, the concatenation gives us 6 x 64 = 384 dimensions, which matches n_embd. The output projection (self.proj) is a linear layer that mixes information across heads. This is the “concatenation and projection” step described in Chapter 8.
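
The arithmetic in this step can be checked with a few lines of plain Python. The sketch below assumes the layer definitions shown above (no bias on the key, query, and value projections, a bias on the output projection); the zero-filled lists are placeholders for real head outputs.

```python
n_embd, n_head = 384, 6
head_size = n_embd // n_head                      # 64

# Concatenating the heads restores the embedding width.
fake_head_outputs = [[0.0] * head_size for _ in range(n_head)]   # one position, six heads
concatenated = [value for head in fake_head_outputs for value in head]

# Parameters in one MultiHeadAttention module as defined above:
# key, query, value per head have no bias; the output projection has one.
qkv_params = n_head * 3 * n_embd * head_size      # 6 heads x 3 matrices of 384 x 64
proj_params = n_embd * n_embd + n_embd            # 384 x 384 weights + 384 biases
attention_params = qkv_params + proj_params

print(len(concatenated))    # 384
print(attention_params)     # 590208
```

The causal mask does not appear in this count because it is registered as a buffer, not as a learned parameter.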

Note: production models use Grouped Query Attention (GQA), where multiple query heads share the same key and value heads to reduce memory usage (Chapter 8). Our implementation uses standard Multi-Head Attention (MHA) where each head has its own Q, K, and V projections. GQA is an optimization for large models; at our scale, MHA is fine.


Step 7: The Feed-Forward Network

Chapter 9 explained that attention gathers information from across the sequence, but the feed-forward network (FFN) processes that information at each position independently. The FFN is where the model “thinks” about what it has gathered.

The architecture is simple: two linear layers with a non-linear activation function in between. The inner dimension is typically 4x the embedding dimension (the “expansion ratio” from Chapter 9).

class FeedForward(nn.Module):
    """A simple feed-forward network: expand, activate, contract."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # Expand: 384 -> 1536
            nn.GELU(),                         # Activation function.
            nn.Linear(4 * n_embd, n_embd),    # Contract: 1536 -> 384
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

The expansion from 384 to 1,536 dimensions and back is where most of the parameters live. Each FFN layer has 384 x 1536 + 1536 + 1536 x 384 + 384 = 1,181,568 parameters. With 6 layers, the FFN accounts for about 7.1 million of our model’s 10.8 million total parameters. This matches the observation from Chapter 9 that FFN layers contain roughly two-thirds of a Transformer’s parameters.

We use GELU (Gaussian Error Linear Unit) as the activation function. Production models like LLaMA and Mistral use SwiGLU (Chapter 9), which adds a gating mechanism for better performance. GELU is simpler and works well at our scale.
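
For reference, the exact form of GELU multiplies the input by the standard normal cumulative distribution function, which is what nn.GELU computes with its default settings. The sketch below writes that formula out with the standard library and prints it next to ReLU for comparison; the sample inputs are arbitrary.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    """ReLU for comparison: negative inputs are cut to zero."""
    return max(0.0, x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output (about -0.159 at x = -1), and it approaches the identity for large positive inputs.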


Step 8: The Transformer Block

Chapter 10 described the complete Transformer block: the combination of attention, feed-forward, layer normalization, and residual connections. Each block follows this pattern:

x = x + Attention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))

The residual connections (the x + ... additions) let gradients flow directly through the network during training, preventing the vanishing gradient problem in deep networks. The layer normalization (LayerNorm) stabilizes the values flowing through the network, preventing them from exploding or collapsing to zero. We use pre-norm placement (normalize before attention and FFN), which is what modern models use, as discussed in Chapter 10. The original 2017 Transformer paper used post-norm (normalize after), but pre-norm has been shown to train more stably.

class Block(nn.Module):
    """One Transformer block: attention + feed-forward with residuals."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)  # Self-attention.
        self.ffwd = FeedForward(n_embd)                  # Feed-forward.
        self.ln1 = nn.LayerNorm(n_embd)                  # Pre-attention norm.
        self.ln2 = nn.LayerNorm(n_embd)                  # Pre-FFN norm.

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # Residual + attention.
        x = x + self.ffwd(self.ln2(x))  # Residual + feed-forward.
        return x

This is the entire Transformer block in about a dozen lines of code. Stack six of these, and you have the core of our model. Production models stack 80-120+ blocks (Chapter 10), but the structure of each block is the same.

Note: production models use RMSNorm instead of LayerNorm (Chapter 10). RMSNorm is slightly faster because it skips the mean-centering step, computing only the root-mean-square normalization. At our scale, the difference is negligible, and PyTorch’s built-in nn.LayerNorm is convenient.
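
The difference between the two normalizations is easiest to see on a single small vector. The sketch below is a simplified illustration: it omits the learned scale (and, for LayerNorm, shift) parameters that the real layers carry, and the epsilon value of 1e-5 is the nn.LayerNorm default used here for both functions as an assumption.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; the mean is left untouched."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

x = [2.0, 4.0, 6.0, 8.0]
print([round(v, 3) for v in layer_norm(x)])
print([round(v, 3) for v in rms_norm(x)])
```

LayerNorm's output is centered on zero, while RMSNorm's output keeps the sign and relative offset of the input and only rescales its magnitude.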


Step 9: The Complete Model

Now we assemble everything into a single GPTLanguageModel class. This is the full architecture: token embeddings, position embeddings, a stack of Transformer blocks, a final layer norm, and a linear projection from the embedding dimension to the vocabulary size (which produces the logits for next-character prediction).

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # Token and position embedding tables (Chapters 5 and 6).
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # Stack of Transformer blocks (Chapters 7-10).
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head) for _ in range(n_layer)]
        )
        # Final layer norm (Chapter 10).
        self.ln_f = nn.LayerNorm(n_embd)
        # Linear head: project from n_embd to vocab_size.
        # This produces the logits (raw scores) for each character.
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        """
        idx:     (B, T) tensor of character indices.
        targets: (B, T) tensor of target character indices (optional).
        Returns: logits (B, T, vocab_size), loss (scalar or None).
        """
        B, T = idx.shape

        # Look up token and position embeddings, add them.
        tok_emb = self.token_embedding_table(idx)              # (B, T, n_embd)
        # Build the position indices on the same device as the input.
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = tok_emb + pos_emb                                  # (B, T, n_embd)

        # Pass through all Transformer blocks.
        x = self.blocks(x)                                     # (B, T, n_embd)

        # Final layer norm.
        x = self.ln_f(x)                                       # (B, T, n_embd)

        # Project to vocabulary size to get logits.
        logits = self.lm_head(x)                               # (B, T, vocab_size)

        # Compute loss if targets are provided.
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    @torch.no_grad()  # Sampling needs no gradients; this saves memory.
    def generate(self, idx, max_new_tokens):
        """
        Generate new tokens autoregressively (Chapter 17).
        idx: (B, T) tensor of current context.
        Returns: (B, T + max_new_tokens) tensor.
        """
        for _ in range(max_new_tokens):
            # Crop context to the last block_size tokens
            # (the model cannot handle more than block_size).
            idx_cond = idx[:, -block_size:]
            # Get predictions.
            logits, _ = self(idx_cond)
            # Focus on the last time step (the prediction for the next token).
            logits = logits[:, -1, :]          # (B, vocab_size)
            # Apply softmax to get probabilities (Chapter 2).
            probs = F.softmax(logits, dim=-1)  # (B, vocab_size)
            # Sample from the distribution (Chapter 17).
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append the new token to the sequence.
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

Understanding the Forward Pass

Let us trace a single forward pass with concrete shapes:

| Step | Operation | Output Shape |
| --- | --- | --- |
| 1 | Input character indices | (64, 256) |
| 2 | Token embedding lookup | (64, 256, 384) |
| 3 | Position embedding lookup | (256, 384) |
| 4 | Add token + position embeddings | (64, 256, 384) |
| 5 | Pass through 6 Transformer blocks | (64, 256, 384) |
| 6 | Final layer norm | (64, 256, 384) |
| 7 | Linear projection to vocab | (64, 256, 65) |

The output at step 7 is a tensor of shape (64, 256, 65). For each of the 64 sequences in the batch, at each of the 256 positions, the model produces 65 numbers (one per character in the vocabulary). These are the logits: raw, unnormalized scores. Higher logits mean the model thinks that character is more likely to come next.

Understanding the Loss Function

The loss function is cross-entropy loss, the standard training objective for language models (Chapter 14). Cross-entropy measures how far the model’s predicted probability distribution is from the true distribution (which puts all probability on the correct next character).

If the model assigns probability 0.9 to the correct next character, the loss for that position is -log(0.9) = 0.105. If it assigns probability 0.01, the loss is -log(0.01) = 4.605. The model is penalized more heavily for being confidently wrong.

Before training, the model’s predictions are essentially random. With 65 characters, a random model assigns roughly equal probability to each, so the expected loss is -log(1/65) = log(65) ≈ 4.17. After training, we expect the loss to drop well below 2.0, meaning the model is assigning much higher probability to the correct characters.
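
The three loss values quoted above can be reproduced with the standard library. In this sketch, cross_entropy is a hypothetical single-position helper (F.cross_entropy additionally applies softmax to logits and averages over positions), and the "confident" distribution is a made-up example.

```python
import math

def cross_entropy(probs, target):
    """Loss for one position: negative log of the probability given to the true character."""
    return -math.log(probs[target])

vocab_size = 65
uniform = [1.0 / vocab_size] * vocab_size
print(round(cross_entropy(uniform, target=0), 3))    # 4.174

# A distribution that puts 0.9 on character 7 and spreads the rest evenly.
confident = [0.1 / (vocab_size - 1)] * vocab_size
confident[7] = 0.9
print(round(cross_entropy(confident, target=7), 3))  # 0.105 (confident and right)
print(round(cross_entropy(confident, target=8), 3))  # 6.461 (confident and wrong)
```

The last line shows the asymmetry: being confidently wrong costs more than the 4.17 of a uniform guess.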

Understanding the Generate Method

The generate method implements autoregressive generation (Chapter 17). It works one token at a time:

  1. Feed the current context into the model.
  2. Take the logits at the last position (the prediction for the next character).
  3. Convert logits to probabilities with softmax.
  4. Sample one character from the probability distribution.
  5. Append that character to the context.
  6. Repeat.

This is exactly how production models generate text, token by token. The only difference is that production models use more sophisticated sampling strategies (top-p, top-k, temperature scaling, as described in Chapter 17) and maintain a KV cache for efficiency (Chapter 18). Our implementation recomputes everything from scratch at each step, which is simpler but slower.
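
As a rough illustration of two of those sampling strategies, the sketch below adds temperature scaling and top-k filtering to the softmax-then-sample step. It is a plain-Python sketch under stated assumptions, not part of the chapter's generate method: sample_next is a hypothetical helper, the logits are made up, and ties at the top-k cutoff are all kept.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one index from logits after temperature scaling and optional top-k filtering."""
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        # Keep the top_k largest logits; everything else gets probability zero.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [v if v >= cutoff else float("-inf") for v in scaled]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]          # made-up scores for a 4-character vocabulary
rng = random.Random(0)
draws = [sample_next(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

A temperature below 1 sharpens the distribution toward the highest-scoring characters, and top_k=2 guarantees that only the two most likely characters are ever drawn. Setting top_k=1 reduces to greedy decoding.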

Counting Parameters

model = GPTLanguageModel().to(device)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

Output:

Total parameters: 10,788,929

About 10.8 million parameters. Here is where they live:

| Component | Parameters | Percentage |
| --- | --- | --- |
| Token embedding (65 x 384) | 24,960 | 0.2% |
| Position embedding (256 x 384) | 98,304 | 0.9% |
| 6 Transformer blocks (attention + FFN + norms) | 10,639,872 | 98.6% |
| Final layer norm | 768 | <0.1% |
| Output projection (384 x 65 + bias) | 25,025 | 0.2% |
| Total | 10,788,929 | 100% |

Note: the output projection (lm_head) and the token embedding table have similar shapes (65 x 384), but the output projection also includes a bias vector of 65 parameters. Some implementations tie these weights (use the same matrix for both), which saves parameters. We keep them separate for clarity. Production models like GPT-2 do tie these weights.

The overwhelming majority of parameters (98.6%) are in the Transformer blocks, split between the attention layers and the feed-forward networks. This matches the pattern described in Chapter 9: the FFN layers contain most of the parameters.
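
The breakdown can be rederived by hand from the layer shapes, which is a useful check that the model was assembled as intended. The sketch below assumes the layer definitions used in this chapter (bias-free key, query, and value projections; biases everywhere else); linear is a hypothetical helper for counting, not a PyTorch function.

```python
vocab_size, block_size, n_embd, n_head, n_layer = 65, 256, 384, 6, 6

def linear(n_in, n_out, bias=True):
    """Parameter count of a linear layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * n_embd                                   # scale + shift
attention = 3 * linear(n_embd, n_embd, bias=False)        # all heads' K, Q, V combined
attention += linear(n_embd, n_embd)                       # output projection
ffn = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = attention + ffn + 2 * layer_norm

counts = {
    "token embedding": vocab_size * n_embd,
    "position embedding": block_size * n_embd,
    "transformer blocks": n_layer * block,
    "final layer norm": layer_norm,
    "output projection": linear(n_embd, vocab_size),
}
total = sum(counts.values())
for name, count in counts.items():
    print(f"{name:<20} {count:>12,} {100 * count / total:5.1f}%")
print(f"{'total':<20} {total:>12,}")
print(f"FFN share of all parameters: {100 * n_layer * ffn / total:.1f}%")
```

The total comes to 10,788,929, the same number PyTorch reports, and the feed-forward layers account for roughly two-thirds of it.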


Step 10: Training

Training a language model means repeatedly:

  1. Sampling a batch of training examples.
  2. Running the forward pass to compute predictions and loss.
  3. Running the backward pass to compute gradients (how each parameter should change to reduce the loss).
  4. Updating the parameters using an optimizer.

This is the training loop described in Chapter 14, applied to our tiny model.

# Training hyperparameters.
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200

@torch.no_grad()
def estimate_loss():
    """Estimate loss on train and val sets by averaging over eval_iters batches."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# Create the model and optimizer.
model = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop.
for step in range(max_iters):  # "step" avoids shadowing the built-in iter().

    # Evaluate periodically.
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step:5d}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch.
    xb, yb = get_batch('train')

    # Forward pass: compute loss.
    logits, loss = model(xb, yb)

    # Backward pass: compute gradients.
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    # Update parameters.
    optimizer.step()

What Happens During Training

Here is what the output looks like on a typical run:

step     0: train loss 4.1743, val loss 4.1683
step   500: train loss 1.9912, val loss 2.0548
step  1000: train loss 1.5876, val loss 1.7423
step  1500: train loss 1.3912, val loss 1.5678
step  2000: train loss 1.2834, val loss 1.5012
step  2500: train loss 1.2012, val loss 1.4723
step  3000: train loss 1.1345, val loss 1.4567
step  3500: train loss 1.0789, val loss 1.4512
step  4000: train loss 1.0312, val loss 1.4534
step  4500: train loss 0.9876, val loss 1.4589
step  4999: train loss 0.9523, val loss 1.4612

Several things to notice:

The initial loss is ~4.17. This matches our prediction: -log(1/65) = 4.17. The untrained model is essentially guessing randomly among 65 characters.

The loss drops rapidly at first, then slows down. By step 500, the model has already learned basic patterns (common letters, spaces after words, newlines after character names). By step 2000, it has learned word-level patterns and common Shakespeare phrases. The remaining training refines these patterns.

The training loss keeps dropping, but the validation loss plateaus around 1.45-1.46. This is overfitting: the model is memorizing the training data rather than learning generalizable patterns. The gap between training loss (~0.95) and validation loss (~1.46) tells us the model has more capacity than the dataset can support. With more data or stronger regularization (more dropout, smaller model), we could narrow this gap.

A validation loss of ~1.46 corresponds to a perplexity of about 4.3. Perplexity is exp(loss) (Chapter 14). A perplexity of 4.3 means the model is, on average, as uncertain as if it were choosing between about 4.3 equally likely characters at each position. Given that there are 65 possible characters, narrowing the effective choice to ~4.3 is a significant achievement.
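
The conversion from loss to perplexity is a single exponential, shown here for the three loss values that appear in this chapter. The helper name perplexity is chosen for this example.

```python
import math

def perplexity(loss):
    """Perplexity is the exponential of the average cross-entropy loss (in nats)."""
    return math.exp(loss)

print(round(perplexity(math.log(65)), 1))  # 65.0: uniform guessing over 65 characters
print(round(perplexity(1.46), 1))          # 4.3: the validation loss reported above
print(round(perplexity(0.95), 1))          # 2.6: the final training loss
```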

The Optimizer: AdamW

We use AdamW, the standard optimizer for training Transformers. AdamW is a variant of Adam (Adaptive Moment Estimation) with decoupled weight decay. It maintains two running averages for each parameter:

  • The first moment (mean of recent gradients): which direction the parameter should move.
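  • The second moment (mean of recent squared gradients): how large the gradients for this parameter have recently been.

Each update moves a parameter along its first moment divided by the square root of its second moment, so parameters with consistently large gradients take proportionally smaller steps. The decoupled weight decay shrinks every weight slightly toward zero at each step, independently of the gradient.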
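
To make the two running averages concrete, here is a sketch of one AdamW update for a single scalar parameter, written with the standard library. It follows the update rule documented for torch.optim.AdamW, with this chapter's learning rate and what are assumed to be the library's default betas, epsilon, and weight decay. The function adamw_step and the toy objective are illustrations only; the training loop above uses the real optimizer.

```python
import math

def adamw_step(param, grad, m, v, step, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter. Returns (param, m, v)."""
    param = param * (1 - lr * weight_decay)      # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad           # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** step)              # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimise f(w) = w^2 (gradient 2w) starting from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for step in range(1, 1001):
    w, m, v = adamw_step(w, 2 * w, m, v, step)
print(round(w, 4))
```

Because the first moment is divided by the square root of the second, each step has a size close to the learning rate regardless of how large the raw gradient is. After 1,000 steps at 3e-4 the parameter has moved by roughly 0.3 at most.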

The learning rate is 3e-4 (0.0003), a common default for small Transformer models. Production training runs use more sophisticated learning rate schedules (warmup followed by cosine decay), but a constant learning rate works well enough for our purposes.
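
A warmup-plus-cosine schedule of the kind mentioned above can be written as a small function of the step number. This is a sketch with illustrative assumptions: the warmup length of 100 steps and the floor of 3e-5 are not values from this chapter, and lr_at is a hypothetical helper name.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, max_steps=5000):
    """Linear warmup to max_lr, then cosine decay to min_lr. All values are illustrative."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 2550, 4999):
    print(step, f"{lr_at(step):.2e}")
```

To try it in the training loop, one option is to assign the value to each entry of optimizer.param_groups under the "lr" key before calling optimizer.step().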


Step 11: Generate Text

The moment of truth. Let us generate some Shakespeare:

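# Generate text starting from token 0, which is the newline character.
model.eval()  # Disable dropout so sampling uses the full trained network.
context = torch.zeros((1, 1), dtype=torch.long, device=device)
with torch.no_grad():  # No gradients are needed during generation.
    generated = model.generate(context, max_new_tokens=500)
print(decode(generated[0].tolist()))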

After 5,000 training steps, the output looks something like this:

DUKE OF YORK:
The king is not himself, but basely led
By flatterers; and what they will inform,
Merely in hate, 'gainst any of us all,
That will the king severely prosecute
'Gainst us, our lives, our children, and our heirs.

ROSS:
The commons hath he pill'd with grievous taxes,
And lost their hearts; the nobles hath he fined
For ancient quarrels, and quite lost their hearts.

WILLOUGHBY:
And daily new exactions are devised,
As blanks, benevolences, and I wot not what:
But what, o' God's name, doth become of this?

Do not assume that a sample like this is original. Given the model’s overfitting (training loss ~0.95), phrases and even whole lines can be memorized or near-memorized from the training text, and this particular sample appears to track a passage from Richard II closely. That is a useful observation in itself: it shows the tension between memorization and generalization that we discussed in Chapter 13. But notice what the model has learned, whether through memorization or generalization:

  • Character names in all caps followed by colons. The model learned the play script format.
  • Iambic-ish rhythm. The lines have roughly the right meter, though not perfect iambic pentameter.
  • Shakespearean vocabulary. Words like “basely,” “flatterers,” “prosecute,” “benevolences,” and “wot” are characteristic of Shakespeare’s language.
  • Coherent sentences. Most lines are grammatically correct and make sense individually, even if the overall narrative does not hold together across many lines.
  • Proper punctuation and line breaks. The model learned when to use commas, semicolons, colons, and periods.

What it has NOT learned:

  • Long-range coherence. The model cannot maintain a plot or argument across more than a few lines. This is a consequence of the small context window (256 characters) and the small model size.
  • Factual accuracy. The model might reference characters or events that do not exist in Shakespeare’s actual plays.
  • Consistent character voice. Different characters do not have distinct speaking styles.

These limitations are exactly what you would expect from a 10.8M-parameter model trained on 1 million characters. Scale the model up to billions of parameters, train on trillions of tokens, and add alignment training, and you get a model in the class of GPT-5.4. The core architecture is the same. The scale is not.


The Complete Code

Here is the entire model in one self-contained script. You can copy this into a file called train_shakespeare.py, download the Tiny Shakespeare dataset, and run it. Every component maps to a chapter in this book.

"""
A character-level GPT trained on Shakespeare.
Based on Andrej Karpathy's nanoGPT, with detailed annotations
mapping each component to the concepts in this book.

Requirements: Python 3.10+, PyTorch 2.10+
Dataset: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Usage:
    wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
    python train_shakespeare.py
"""

import torch
import torch.nn as nn
from torch.nn import functional as F

# --------------- Hyperparameters ---------------
batch_size = 64
block_size = 256
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# -----------------------------------------------

torch.manual_seed(1337)

# --- Data Loading (Chapter 4: Tokenization) ---
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix])
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# --- Single Attention Head (Chapter 7: Attention) ---
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # scale by 1/sqrt(head_size)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)
        return wei @ v

# --- Multi-Head Attention (Chapter 8) ---
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))

# --- Feed-Forward Network (Chapter 9) ---
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

# --- Transformer Block (Chapter 10) ---
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# --- The Complete Model (Chapters 5, 6, 7-10, 14, 17) ---
class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

# --- Training Loop (Chapter 14: Pre-training) ---
model = GPTLanguageModel().to(device)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step:5d}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

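# --- Generation (Chapter 17: Token-by-Token Generation) ---
model.eval()  # Disable dropout for sampling.
context = torch.zeros((1, 1), dtype=torch.long, device=device)
with torch.no_grad():  # No gradients are needed during generation.
    print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))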

That is approximately 150 lines of actual code (excluding comments and blank lines). Every line implements a concept from an earlier chapter. There is no magic, no hidden complexity, no library doing the hard work behind the scenes. This is the entire model.


What Each Component Costs

Let us break down the computational cost of a single forward pass through our model. This connects to the serving infrastructure discussion in Chapter 24.

def compute_flops():
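    """
    Estimate the cost of one forward pass through our model on a single
    sequence of block_size tokens. Each count is the number of
    multiply-accumulate operations in the matrix multiplies (labeled
    "FLOPs" in the printout); hardware ratings count a multiply-accumulate
    as 2 FLOPs, so the true FLOP count is about twice these totals.
    """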
    T = block_size   # 256
    d = n_embd       # 384
    V = vocab_size   # 65
    L = n_layer      # 6
    h = n_head       # 6
    d_h = d // h     # 64

    print("FLOPs per forward pass (single sequence, approximate)")
    print("=" * 60)

    # Embedding lookups are table lookups, not matrix multiplies.
    # Negligible FLOPs.

    # Per Transformer block:
    # Attention:
    #   Q, K, V projections: 3 * (T * d * d) = 3 * 256 * 384 * 384
    qkv_flops = 3 * T * d * d
    #   Attention scores: T * T * d (for all heads combined)
    attn_score_flops = T * T * d
    #   Attention output: T * T * d
    attn_out_flops = T * T * d
    #   Output projection: T * d * d
    out_proj_flops = T * d * d

    attn_total = qkv_flops + attn_score_flops + attn_out_flops + out_proj_flops
    print(f"  Attention per block:    {attn_total:>12,} FLOPs")

    # FFN:
    #   First linear: T * d * 4d
    #   Second linear: T * 4d * d
    ffn_flops = 2 * T * d * 4 * d
    print(f"  FFN per block:          {ffn_flops:>12,} FLOPs")

    block_total = attn_total + ffn_flops
    print(f"  Total per block:        {block_total:>12,} FLOPs")

    all_blocks = L * block_total
    print(f"  All {L} blocks:           {all_blocks:>12,} FLOPs")

    # Final projection: T * d * V
    final_proj = T * d * V
    print(f"  Final projection:       {final_proj:>12,} FLOPs")

    total = all_blocks + final_proj
    print(f"  Total forward pass:     {total:>12,} FLOPs")
    print(f"  That is {total / 1e6:.1f} million FLOPs per sequence.")
    print(f"  Or {total / 1e6 / T:.1f} million FLOPs per token.")

    # For comparison. The totals above count multiply-accumulates, and GPU
    # throughput ratings count each multiply-accumulate as 2 FLOPs.
    true_flops = 2 * total
    print(f"\n  For context:")
    print(f"  A modern GPU (NVIDIA H100) can do ~989 trillion FLOPs/second (BF16 dense).")
    print(f"  At peak throughput, our forward pass would take {true_flops / 989e12 * 1e6:.3f} microseconds on an H100.")
    print(f"  A frontier model like GPT-5.4 requires roughly 100,000x more FLOPs per token.")

compute_flops()

The key insight: our model is computationally trivial by modern standards. Counting each multiply-accumulate as 2 FLOPs, a single forward pass would take about 6 microseconds at an H100's peak rate (in practice, kernel launch overhead and memory traffic dominate at this size, so the measured time is much longer). But the structure of the computation (matrix multiplies for Q/K/V projections, quadratic attention scores, FFN expansion and contraction) is identical to what happens in a frontier model. The difference is purely in the dimensions: 384 instead of 8,192+, 6 layers instead of 80+, 65 vocabulary entries instead of 100,000+.


Experiments to Try

Now that you have a working model, here are modifications that connect to concepts from earlier chapters. Each experiment changes one thing and lets you observe the effect.

Experiment 1: Change the Model Size

Try different configurations and observe how the loss changes:

# Tiny (baseline): ~10.8M parameters
# n_embd=384, n_head=6, n_layer=6

# Smaller: ~3.3M parameters
# n_embd=256, n_head=4, n_layer=4

# Larger: ~25M parameters
# n_embd=512, n_head=8, n_layer=8

You will find that the larger model achieves lower training loss but may overfit more (wider gap between training and validation loss). This is the scaling behavior described in Chapter 13: more parameters generally means better performance, but only if you have enough data to support them.
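The parameter counts quoted for these three configurations can be reproduced with simple arithmetic. The sketch below assumes the exact layer layout of the script in this chapter (no bias on the key, query, and value projections; biases everywhere else; an untied output layer); count_params is a hypothetical helper, not part of the script.

```python
def count_params(n_embd, n_layer, vocab_size=65, block_size=256):
    """Count trainable parameters for the GPTLanguageModel layout in this chapter."""
    d = n_embd
    # Attention: key, query, value (bias=False, all heads together), then the
    # output projection with bias. The number of heads does not change the count.
    attn = 3 * d * d + (d * d + d)
    # Feed-forward: d -> 4d and 4d -> d, both with bias.
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * (2 * d)
    block = attn + ffn + norms
    embeddings = vocab_size * d + block_size * d
    final = 2 * d + (d * vocab_size + vocab_size)  # ln_f plus lm_head
    return embeddings + n_layer * block + final

for name, d, layers in [("smaller", 256, 4), ("baseline", 384, 6), ("larger", 512, 8)]:
    print(f"{name:9s} n_embd={d} n_layer={layers}: {count_params(d, layers):,} parameters")
```

The causal mask registered with register_buffer is not a parameter, so it does not appear in the count.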

Experiment 2: Change the Context Window

Try block_size = 64 versus block_size = 512. With a shorter context, the model cannot learn long-range patterns (like which character is speaking). With a longer context, the model can capture more structure, but attention becomes more expensive (quadratic cost, as discussed in Chapter 20). Because the position embedding table and the causal mask are both sized from block_size, they resize automatically; you do need to retrain from scratch, since weights trained at one block_size will not load directly into a model built with another.
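To see the quadratic cost in numbers, the sketch below counts multiply-accumulates per block using the same simplified accounting as compute_flops. The function names are illustrative, and the counts ignore softmax, normalization, and other elementwise work.

```python
n_embd = 384

def attention_matrix_macs(block_size, d=n_embd):
    """Attention scores (Q times K transposed) plus the weighted sum over V, per block."""
    scores = block_size * block_size * d
    weighted_sum = block_size * block_size * d
    return scores + weighted_sum

def linear_macs(block_size, d=n_embd):
    """Q/K/V projections (3), output projection (1), and the two FFN layers (8), per block."""
    return block_size * d * d * (3 + 1 + 8)

for T in (64, 256, 512):
    a = attention_matrix_macs(T)
    l = linear_macs(T)
    print(f"block_size={T:4d}: attention matrices {a:>12,}  linear layers {l:>12,}  "
          f"attention share {a / (a + l):.1%}")
```

Going from 64 to 512 multiplies the linear-layer cost by 8 but the attention-matrix cost by 64, which is why attention takes a growing share of the work as the context gets longer.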

Experiment 3: Change the Sampling Strategy

Modify the generate method to use temperature scaling:

def generate_with_temperature(self, idx, max_new_tokens, temperature=1.0):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature  # Scale logits.
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

  • temperature = 0.5: More conservative, repetitive, but more “correct” text.
  • temperature = 1.0: The default, balanced between creativity and coherence.
  • temperature = 1.5: More creative and surprising, but more errors and nonsense.

This is exactly the temperature parameter described in Chapter 17. Lower temperature sharpens the probability distribution (the model picks the most likely character more often). Higher temperature flattens it (the model explores less likely characters).
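The sharpening and flattening effect is easy to see on a toy example. This is a minimal sketch in plain Python with made-up logits for three characters; softmax_with_temperature is an illustrative helper, not part of the model code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits, most likely character first
for T in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(f"temperature={T}: {[round(p, 3) for p in probs]}")
```

At temperature 0.5 the first character takes most of the probability mass; at 1.5 the three probabilities move closer together, so sampling picks the less likely characters more often.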

Experiment 4: Visualize Attention Patterns

Add this code after training to see what the attention heads have learned:

@torch.no_grad()
def visualize_attention(text_input, layer=0, head=0):
    """Show the attention weights for a given input string."""
    model.eval()  # Disable dropout so earlier blocks run deterministically.
    tokens = torch.tensor([encode(text_input)], device=device)
    # The Head class does not return its attention weights, so we replay
    # the forward pass by hand and recompute them for the chosen
    # layer and head.
    tok_emb = model.token_embedding_table(tokens)
    pos_emb = model.position_embedding_table(
        torch.arange(tokens.shape[1], device=device)
    )
    x = tok_emb + pos_emb

    # Pass through blocks up to the target layer.
    for i, block in enumerate(model.blocks):
        if i == layer:
            # Extract attention weights from this block.
            ln_out = block.ln1(x)
            B, T, C = ln_out.shape
            target_head = block.sa.heads[head]
            k = target_head.key(ln_out)
            q = target_head.query(ln_out)
            wei = q @ k.transpose(-2, -1) * (C // n_head) ** -0.5
            wei = wei.masked_fill(
                target_head.tril[:T, :T] == 0, float('-inf')
            )
            wei = F.softmax(wei, dim=-1)

            print(f"Attention weights for layer {layer}, head {head}:")
            print(f"Input: '{text_input}'")
            print(f"Shape: {wei.shape}")

            # Show which characters the last position attends to.
            last_pos_weights = wei[0, -1, :].tolist()  # plain Python floats; no NumPy needed
            print(f"\nLast character '{text_input[-1]}' attends to:")
            for j, (ch, w) in enumerate(zip(text_input, last_pos_weights)):
                if w > 0.05:  # Only show significant weights.
                    print(f"  position {j:3d} '{ch}': {w:.3f}")
            return
        x = block(x)

visualize_attention("ROMEO:\nO, she doth teach")

Different heads will show different patterns. Some heads attend primarily to the immediately preceding character (local context). Others attend to the character name at the start of the line (long-range structure). This is the multi-head specialization described in Chapter 8.
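When reading the printed weights, it helps to remember what the causal mask does to each row. The following sketch reproduces the mask-then-softmax step in plain Python on a made-up 3 x 3 score matrix; causal_attention_weights is an illustrative helper, not part of the model.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Future positions get -inf, so exp() turns them into exactly 0.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [  # made-up raw attention scores
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.0, 1.0, 2.0],
]
for i, row in enumerate(causal_attention_weights(scores)):
    print(f"position {i}: {[round(w, 3) for w in row]}")
```

The first position can only attend to itself, so its weight is always 1.0. This is also why the last position is the most informative one to inspect: it is the only row that can see the whole input.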


From Our Model to GPT-5.4: What Changes at Scale

Our 10.8M-parameter Shakespeare model and a frontier model like GPT-5.4 share the same fundamental architecture. The differences are all about scale and engineering optimizations. Here is a concrete comparison:

Feature | Our Model | GPT-2 Small | Frontier (GPT-5.4 class)
Parameters | 10.8M | 124M | Undisclosed (estimated hundreds of billions+)
Layers | 6 | 12 | 80-120+
Embedding dim | 384 | 768 | 8,192-16,384+
Attention heads | 6 | 12 | 64-128+
Context window | 256 chars | 1,024 tokens | 1,050,000 tokens
Vocabulary | 65 chars | 50,257 BPE tokens | 100,000-200,000+ tokens
Training data | 1.1M chars | ~40GB text | Trillions of tokens
Training cost | ~5 min (1 GPU) | ~$43K for full 1.5B (2019); ~$73 (2026) | $50M-$500M+
Tokenization | Character-level | BPE | BPE / SentencePiece
Position encoding | Learned absolute | Learned absolute | RoPE
Normalization | LayerNorm | LayerNorm | RMSNorm
FFN activation | GELU | GELU | SwiGLU
Attention type | MHA | MHA | GQA
Architecture | Dense | Dense | MoE (likely)
Alignment | None | None | RLHF + DPO + safety classifiers

Every row in this table corresponds to a chapter in this book. The architectural rows (position encoding, normalization, FFN activation, attention type) show that our model and GPT-2 use the same simple variant of each component and differ only in size, while frontier models swap in more optimized variants. But the core computation at every layer is the same: embed, attend, feed-forward, project.

The GPT-2 training cost row tells a remarkable story on its own. In 2019, OpenAI trained the full GPT-2 (1.5B parameters, the largest variant) on 32 TPU v3 chips for 168 hours at a cost of approximately $43,000. In January 2026, Karpathy’s nanochat project achieved the same performance level (measured by the CORE benchmark score) in just 3 hours on a single 8xH100 node for approximately $73. That is a roughly 600x cost reduction over seven years, or about 2.5x per year. The model architecture is essentially the same; the gains come from better hardware, better software (optimized kernels, mixed-precision training), and better training recipes.
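The 600x and 2.5x figures follow from the two dollar amounts. Here is the arithmetic as a short sketch, using the approximate costs quoted above; the variable names are illustrative.

```python
cost_2019 = 43_000  # approximate cost of the original GPT-2 (1.5B) run, in dollars
cost_2026 = 73      # approximate cost of the nanochat reproduction, in dollars
years = 7

reduction = cost_2019 / cost_2026       # total cost reduction factor
annual = reduction ** (1 / years)       # equivalent constant yearly factor

print(f"total reduction: {reduction:.0f}x")
print(f"per year:        {annual:.2f}x")
```

The yearly factor is the seventh root of the total reduction, not the total divided by seven, because the improvement compounds from year to year.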

Source: GPT-2 Small: 124M parameters, 12 layers, 768 hidden dim, 12 heads, 1,024 context, 50,257 vocab (confirmed from blog.ando.ai, huggingface.co/blog/codelion/optimal-model-architecture). GPT-2 original training cost ~$43,000 on 32 TPU v3 chips for 168 hours (confirmed from letsdatascience.com, simonwillison.net). GPT-2 reproduced for ~$73 in 3 hours on 8xH100 via nanochat, 600x cost reduction (confirmed from simonwillison.net/random/gpt-2, blockchain.news, jangwook.net, nanochat-ai.com). GPT-5.4: 1,050,000 token context window, released March 5, 2026 (confirmed from openai.com/index/introducing-gpt-5-4, langcopilot.com). LLaMA 4 Maverick: 202,048 vocab, 12,288 hidden dim, 120 layers, 96 heads, 400B total / 17B active parameters (confirmed from huggingface.co/docs/transformers/main/model_doc/llama4, apxml.com/models/llama-4-maverick).

What We Skipped (and Why)

Several techniques used in production models are not in our implementation. Here is what they are and why we left them out:

  • KV Cache (Chapter 18): During generation, production models cache the Key and Value tensors from previous positions so they do not need to recompute them. Our generate method recomputes everything from scratch at each step. For 500 tokens of generation, we do 500 full forward passes. A KV-cached implementation would do 1 full forward pass plus 499 incremental passes (much cheaper). We skipped this because it adds complexity without changing the model’s behavior.

  • Grouped Query Attention (Chapter 8): GQA shares Key and Value projections across multiple Query heads, reducing memory usage. With only 6 heads and 384 dimensions, our model does not need this optimization.

  • RoPE (Chapter 6): Rotary Position Embeddings encode position by rotating Q and K vectors, which generalizes better to sequence lengths not seen during training. Our learned absolute embeddings work fine for a fixed 256-character context.

  • RMSNorm (Chapter 10): Slightly faster than LayerNorm because it skips mean-centering. The difference is negligible at our scale.

  • SwiGLU (Chapter 9): A gated activation function that outperforms GELU on large models. At our scale, GELU is sufficient.

  • MoE routing (Chapter 12): Mixture-of-Experts activates only a subset of parameters per token. Our model is small enough that activating all parameters is fine.

  • Flash Attention (Chapter 20): A memory-efficient attention algorithm that avoids materializing the full attention matrix. With a 256-token context, the attention matrix is only 256 x 256 = 65,536 entries, which fits easily in memory.

  • Alignment training (Chapters 15 and 26): RLHF, DPO, and Constitutional AI are applied after pretraining to make models safe and helpful. Our model has no alignment; it will generate whatever patterns it learned from Shakespeare, including violence, insults, and other content present in the plays.

Each of these optimizations becomes necessary as you scale up. At 10 million parameters and 256-token context, none of them matter. At 100 billion parameters and 1 million-token context, all of them are essential.
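To put a number on the KV cache item in the list above, the sketch below counts how many token positions pass through the model during generation with and without a cache. It is an idealized count under stated assumptions: positions_processed is a hypothetical helper, and the cached case assumes exactly one new position is processed per step. A real cache for this model would need extra care once the context exceeds block_size, because the window slides and the learned absolute position of every remaining token changes.

```python
block_size = 256

def positions_processed(max_new_tokens, use_cache):
    """Count token positions fed through the model while generating from one start token."""
    total = 0
    context_len = 1  # generation starts from a single token
    for _ in range(max_new_tokens):
        window = min(context_len, block_size)  # generate() crops to the last block_size tokens
        total += 1 if use_cache else window
        context_len += 1
    return total

no_cache = positions_processed(500, use_cache=False)
cached = positions_processed(500, use_cache=True)
print(f"without cache: {no_cache:,} positions")
print(f"with cache:    {cached:,} positions")
print(f"ratio:         {no_cache / cached:.0f}x")
```

The cached passes are not free, since each new position still attends over all cached keys and values, but the projections and feed-forward layers run once per token instead of once per token per step.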


How This Connects to Everything You Have Learned

This chapter is the synthesis of the entire book. Here is a map from each component of our model to the chapter that explained it:

Code Component | Book Chapter
Character-to-integer mapping | Chapter 4: Tokenization
nn.Embedding(vocab_size, n_embd) | Chapter 5: Embeddings
nn.Embedding(block_size, n_embd) | Chapter 6: Positional Encoding
Head class (Q, K, V, masked attention) | Chapter 7: Attention
MultiHeadAttention class | Chapter 8: Multi-Head Attention
FeedForward class (expand, activate, contract) | Chapter 9: Feed-Forward Networks
Block class (norm, attention, residual, norm, FFN, residual) | Chapter 10: Layer Norm, Residuals, Full Block
GPTLanguageModel (stacking blocks) | Chapter 11: Model Sizes
F.cross_entropy loss | Chapter 14: Pre-training
AdamW optimizer | Chapter 14: Pre-training
generate method (autoregressive sampling) | Chapter 17: Token-by-Token Generation
Temperature scaling experiment | Chapter 17: Token-by-Token Generation
Train/val split and overfitting | Chapter 13: Scaling Laws
Context window (block_size) | Chapter 20: Long Context

Every concept in this book exists in this code. The Transformer is not a black box. It is a stack of simple, well-understood operations: table lookups, matrix multiplications, softmax, addition, and normalization. The power comes from scale (doing these operations with very large matrices on very large datasets) and from training (learning the right numbers to put in those matrices).


Key Takeaways

  • A working Transformer can be built in ~150 lines of Python. The core architecture (embeddings, multi-head attention, feed-forward networks, layer normalization, residual connections) is straightforward to implement. There is no hidden complexity.

  • Character-level tokenization is the simplest entry point. With only 65 unique characters in the Shakespeare dataset, the vocabulary is tiny, training is fast, and every step is transparent. Production models use subword tokenization (BPE, SentencePiece) with 100,000+ tokens, but the principle is identical.

  • The initial loss of ~4.17 confirms the math. An untrained model assigns roughly equal probability to all 65 characters, giving a cross-entropy loss of -log(1/65) = 4.17. Watching the training loss drop from 4.17 to below 1.0 is a direct demonstration of the model learning patterns in the data.

  • Overfitting is visible and expected. With only ~1 million characters of training data and 10.8 million parameters, the model memorizes the training set. The gap between training loss (~0.95) and validation loss (~1.46) shows this clearly. More data or stronger regularization would help.

  • The difference between our model and GPT-5.4 is scale, not architecture. Both use the same building blocks: token embeddings, position encodings, multi-head self-attention, feed-forward networks, layer normalization, and residual connections. Production models add optimizations (GQA, RoPE, RMSNorm, SwiGLU, MoE, Flash Attention, KV cache) and alignment training (RLHF, DPO), but the fundamental computation is the same.

  • Training costs are plummeting. The full GPT-2 (1.5B parameters) cost ~$43,000 to train in 2019. Karpathy’s nanochat project reproduces the same performance for ~$73 in 2026, a 600x reduction in seven years. The architecture did not change; the hardware, software, and training recipes did. This trend continues to make LLM experimentation accessible to individuals and small teams.

  • Every component maps to a chapter in this book. The embedding lookup is Chapter 5. The position encoding is Chapter 6. The attention mechanism is Chapter 7. The multi-head structure is Chapter 8. The feed-forward network is Chapter 9. The block assembly is Chapter 10. The training loop is Chapter 14. The generation method is Chapter 17. You now understand all of it, not just in theory, but in working code.

  • The model learns real structure from data. After 5,000 training steps, the model generates text with correct play-script formatting, Shakespearean vocabulary, approximate meter, and grammatically coherent sentences. It learned all of this from next-character prediction alone, with no explicit rules about grammar, formatting, or style.


What’s Next

You have built a Transformer from scratch and trained it to generate Shakespeare. In Chapter 28, you will take the next step: instead of training a model from scratch on a tiny dataset, you will take a pretrained open-weight model (like a LLaMA 4 variant) and fine-tune it on your own data using LoRA, the parameter-efficient technique that lets you customize a billion-parameter model on a single GPU.