Chapter 14. Pre-training, Learning Language
Every component you have seen in this book so far (attention, feed-forward networks, MoE routing, the full Transformer block) is just architecture: an empty structure with random weights. A model fresh off the drawing board cannot do anything useful. It does not know that “Paris” is a city, that Python uses indentation for blocks, or that 2 + 2 = 4. All of that knowledge comes from pre-training: the process of exposing the model to trillions of tokens of text and letting it learn, one gradient update at a time, to predict what comes next. Pre-training is where a language model goes from a blank slate to something that understands language, and it is by far the most expensive, most time-consuming, and most consequential step in building an LLM.
The Objective: Predict the Next Token
The pre-training objective for modern LLMs is remarkably simple: given a sequence of tokens, predict the next one. This is called causal language modeling (or next-token prediction), and it is the same objective we introduced in Chapter 1.
Formally, given a sequence of tokens x_1, x_2, …, x_n, the model computes a probability distribution over the entire vocabulary for the next token x_{n+1}. The training objective is to maximize the probability assigned to the correct next token, or equivalently, to minimize the cross-entropy loss between the model’s predicted distribution and the true next token.
The cross-entropy loss for a single token prediction is:
L = -log(P(x_correct))
where P(x_correct) is the probability the model assigns to the actual next token. If the model is confident and correct (assigns high probability to the right token), the loss is low. If the model is wrong or uncertain, the loss is high.
For a full sequence of T tokens, the total loss is the average across all positions:
L = -(1/T) * sum(log(P(x_t | x_1, ..., x_{t-1}))) for t = 1 to T
This is the same cross-entropy loss from Chapter 3, applied at every position in the sequence. The model processes the entire sequence in parallel (thanks to causal masking, as described in Chapter 7), computing predictions for all positions simultaneously. Each position only attends to tokens before it, so the model is genuinely predicting the next token at every position, not peeking ahead.
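The averaged loss can be sketched directly from a matrix of logits. This is a minimal illustration, not a production loss function: the toy vocabulary size, the sequence length, and the function name `sequence_cross_entropy` are assumptions made for the example.

```python
import numpy as np

def sequence_cross_entropy(logits, tokens):
    """Average next-token loss (in nats) over one sequence.

    logits: array of shape (T, V); row t scores the token that follows tokens[0..t].
    tokens: integer array of length T + 1 (the inputs plus the final target).
    """
    targets = tokens[1:]                              # position t predicts token t + 1
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out log P(correct token) at every position, then average
    correct = log_probs[np.arange(len(targets)), targets]
    return -correct.mean()

rng = np.random.default_rng(0)
vocab_size, seq_len = 50, 6                           # toy sizes, chosen for illustration
tokens = rng.integers(0, vocab_size, size=seq_len + 1)

# An untrained model: every token equally likely at every position
uniform_logits = np.zeros((seq_len, vocab_size))
uniform_loss = sequence_cross_entropy(uniform_logits, tokens)
print(f"Uniform model loss:   {uniform_loss:.3f} nats (ln {vocab_size} = {np.log(vocab_size):.3f})")

# A better model: raise the logit of the correct token at every position
confident_logits = uniform_logits.copy()
confident_logits[np.arange(seq_len), tokens[1:]] = 5.0
confident_loss = sequence_cross_entropy(confident_logits, tokens)
print(f"Confident model loss: {confident_loss:.3f} nats")
```

The one-position shift between inputs and targets (`tokens[1:]`) is the whole of "next-token prediction": every row of logits is scored against the token that actually came next.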
In practice, training sequences are often packed with multiple documents concatenated together to fill the full sequence length efficiently. LLaMA 3 uses an additional attention mask that prevents tokens in one document from attending to tokens in a different document within the same sequence. This ensures the model does not learn spurious cross-document patterns. The LLaMA 3 paper notes this had limited impact during initial pre-training but was helpful during the long-context training phase.
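One way to picture the document-aware mask is as the ordinary causal mask intersected with a "same document" mask. The sketch below assumes each token carries a document index (`doc_ids`, a hypothetical name); it illustrates the idea and is not Meta's implementation.

```python
import numpy as np

def packed_causal_mask(doc_ids):
    """Boolean mask of shape (S, S): True where query i may attend to key j."""
    doc_ids = np.asarray(doc_ids)
    positions = np.arange(len(doc_ids))
    causal = positions[:, None] >= positions[None, :]     # no looking ahead
    same_doc = doc_ids[:, None] == doc_ids[None, :]       # no crossing document boundaries
    return causal & same_doc

# Three documents of lengths 3, 2, and 3 packed into one 8-token sequence
doc_ids = [0, 0, 0, 1, 1, 2, 2, 2]
mask = packed_causal_mask(doc_ids)
print(mask.astype(int))
```

The printed matrix is block-diagonal, with a lower-triangular block for each document: the first token of a new document sees only itself.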
Why Next-Token Prediction Works
It may seem surprising that simply predicting the next word can produce a model that appears to “understand” language, write code, solve math problems, and reason about complex topics. The key insight is that accurate next-token prediction requires an enormous amount of implicit knowledge.
Consider predicting the next token in: “The capital of France is”. To assign high probability to “Paris,” the model must have learned geography. For “The derivative of x^2 is,” it must have learned calculus. For “def fibonacci(n):\n if n <= 1:\n return,” it must have learned Python syntax and the Fibonacci algorithm.
As the training data grows to trillions of tokens spanning every domain of human knowledge, the model is forced to internalize an increasingly comprehensive representation of the world, all in service of the simple objective of predicting what comes next.
A Concrete Example
Let’s see what the loss looks like for a single training example:
import numpy as np
# Simulated vocabulary of 128,000 tokens (like LLaMA 3's tokenizer)
vocab_size = 128_000
# A training sequence: "The cat sat on the mat"
# After tokenization, suppose this becomes token IDs:
tokens = [464, 3857, 3290, 319, 278, 1775]
# (These are illustrative IDs, not actual tokenizer output)
# For each position, the model outputs a probability distribution
# over all 128,000 tokens. Let's simulate one position:
# The model is predicting the token after "The cat sat on the"
# The correct next token is 1775 ("mat")
# A random (untrained) model assigns roughly equal probability:
random_prob = 1.0 / vocab_size
random_loss = -np.log(random_prob)
print(f"Random model probability for correct token: {random_prob:.8f}")
print(f"Random model loss: {random_loss:.2f} nats")
# A well-trained model might assign 15% probability to "mat":
trained_prob = 0.15
trained_loss = -np.log(trained_prob)
print(f"\nTrained model probability for correct token: {trained_prob:.2f}")
print(f"Trained model loss: {trained_loss:.2f} nats")
# A frontier model might assign 40% probability to "mat":
frontier_prob = 0.40
frontier_loss = -np.log(frontier_prob)
print(f"\nFrontier model probability for correct token: {frontier_prob:.2f}")
print(f"Frontier model loss: {frontier_loss:.2f} nats")
print(f"\nNote: Loss of {random_loss:.2f} nats for random model vs")
print(f"{frontier_loss:.2f} nats for frontier model.")
print(f"The scaling laws from Chapter 13 describe how this loss")
print(f"decreases as you scale up model size, data, and compute.")
The random model’s loss of approximately 11.76 nats reflects the fact that it is guessing uniformly among 128,000 tokens. A frontier model achieving a loss around 1.7-2.0 nats on web text (as discussed in Chapter 13) is assigning meaningful probability to the correct token at every position, which requires deep understanding of language, facts, and reasoning patterns.
Training Data: What Models Learn From
The quality and composition of training data is arguably the single most important factor in determining a model’s capabilities. As we saw in Chapter 13, the industry has moved from hundreds of billions of tokens (GPT-3, 2020) to tens of trillions (Qwen 3, 2025). But raw quantity is not enough; the data must be diverse, high-quality, and carefully curated.
Sources of Training Data
Modern LLM training datasets are assembled from multiple sources, each contributing different types of knowledge:
Web text is the largest source by volume. The primary raw material is Common Crawl, a nonprofit organization that has been crawling the web since 2008 and maintains a freely available archive of over 300 billion web pages totaling petabytes of data. Each monthly crawl captures roughly 2.3 to 3 billion web pages (for example, the August 2025 crawl contained 2.44 billion pages at 424 TiB uncompressed, while the October 2025 crawl was larger at 2.61 billion pages and 468 TiB). However, raw Common Crawl data is extremely noisy: it contains spam, boilerplate HTML, navigation menus, cookie notices, duplicate content, and low-quality text. Extensive processing is required to extract useful training data. In November 2025, an investigation by The Atlantic revealed that Common Crawl had not fully honored publisher requests to remove paywalled content from its archives, raising ongoing questions about the legal and ethical dimensions of web-scale data collection for AI training.
Code from public repositories (primarily GitHub) provides the model with programming knowledge. Code data is particularly valuable because it is structured, logical, and often accompanied by comments and documentation. Meta’s LLaMA 3 technical report describes using code data as a significant component of their training mix, and notes that they used classifiers to identify and upsample high-quality code and reasoning content.
Books and academic papers provide long-form, well-structured text with deep reasoning. Sources include digitized books, scientific papers from repositories like arXiv and Semantic Scholar, and educational content. This data tends to be higher quality than web text but is available in much smaller quantities.
Multilingual data ensures the model can handle languages beyond English. LLaMA 3 was trained on data spanning multiple languages, and Qwen 3 explicitly covers 119 languages in its 36 trillion token training corpus.
Mathematical and reasoning data includes textbooks, problem sets, proofs, and structured reasoning examples. This category has become increasingly important as labs push models toward stronger reasoning capabilities.
Synthetic data generated by earlier models (as discussed in Chapter 13) supplements human-generated data, particularly for domains like code and mathematics where correctness can be verified. Qwen 3’s training data explicitly includes synthetic code and math content generated by earlier Qwen models.
Sources: Common Crawl (commoncrawl.org): August 2025 crawl contained 2.44 billion web pages, 424 TiB uncompressed; October 2025 crawl contained 2.61 billion pages, 468 TiB uncompressed. Total archive exceeds 250 billion pages across all crawls since 2008. The Atlantic, November 4, 2025: Alex Reisner, “The Company Quietly Funneling Paywalled Articles to AI Developers.” Meta, “The Llama 3 Herd of Models,” arXiv:2407.21783, July 2024. Qwen 3 from Alibaba (April 29, 2025): approximately 36T training tokens spanning 119 languages, including synthetic data.
The Data Processing Pipeline
Raw data from these sources cannot be fed directly into a model. It must go through an extensive processing pipeline that typically includes the following stages:
Stage 1: Extraction and Parsing
For web data, the first step is extracting clean text from raw HTML. This involves removing HTML tags, JavaScript, CSS, navigation elements, headers, footers, cookie banners, and other boilerplate. The goal is to isolate the main content of each page. Meta’s LLaMA 3 technical report describes building a custom HTML parser for this purpose, reflecting how important this seemingly mundane step is.
For code data, extraction involves parsing repository structures, identifying source files, and extracting code along with associated documentation, comments, and README files.
Stage 2: Language Identification
Multilingual datasets require identifying the language of each document so that the training mix can be controlled. Language identification models (typically small classifiers) assign a language label and confidence score to each document. Documents with low confidence scores or mixed languages may be filtered out or handled specially.
Stage 3: Quality Filtering
This is the most critical and nuanced stage. The goal is to remove low-quality content while retaining diverse, informative text. Quality filtering typically uses a combination of:
Heuristic filters remove documents based on simple rules:
- Documents that are too short (fewer than a threshold number of words)
- Documents with excessive repetition (e.g., the same sentence repeated many times)
- Documents with too many special characters, URLs, or non-text content
- Documents with abnormal word length distributions (indicating garbled text)
- Documents that consist primarily of lists, tables, or navigation elements
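Rules like these are simple to express in code. The following sketch implements a few of them; every threshold and the function name `passes_heuristics` are illustrative assumptions, not values taken from any published pipeline.

```python
import re
from collections import Counter

def passes_heuristics(text, min_words=50, max_dup_line_frac=0.3,
                      max_symbol_frac=0.1, min_mean_word_len=3.0,
                      max_mean_word_len=10.0):
    """Return True if a document survives a few simple rule-based checks."""
    words = text.split()
    if len(words) < min_words:                        # too short
        return False

    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    counts = Counter(lines)
    duplicated = sum(c for c in counts.values() if c > 1)
    if duplicated / len(lines) > max_dup_line_frac:   # excessive repetition
        return False

    symbols = len(re.findall(r"[^\w\s]", text))
    if symbols / len(text) > max_symbol_frac:         # too many special characters
        return False

    mean_word_len = sum(len(w) for w in words) / len(words)
    if not min_mean_word_len <= mean_word_len <= max_mean_word_len:
        return False                                  # abnormal word lengths
    return True

good_doc = "\n".join(
    f"Sentence {i} explains one step of the data pipeline in plain words."
    for i in range(10)
)
spam_doc = "\n".join(["Click here to win a free prize today!!!"] * 20)
short_doc = "Too short."

for name, doc in [("good", good_doc), ("spam", spam_doc), ("short", short_doc)]:
    print(f"{name:<6} keep = {passes_heuristics(doc)}")
```

Real pipelines tune such thresholds empirically, typically by checking how each rule changes the quality of models trained on the filtered data.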
Model-based classifiers use a trained classifier to score each document’s quality. A common approach is to train a classifier to distinguish between high-quality reference text (like Wikipedia or curated educational content) and random web text. Documents scoring below a threshold are removed. HuggingFace’s FineWeb project, whose main dataset contains 15 trillion tokens derived from 96 Common Crawl snapshots, documents the impact of each filtering step on downstream model performance, and its FineWeb-Edu subset is built with a classifier of this kind that scores the educational value of each page.
Perplexity filtering uses a small language model to compute the perplexity (a measure of how “surprising” the text is to the model) of each document. Documents with very high perplexity (garbled or nonsensical text) or very low perplexity (repetitive boilerplate) are removed.
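Perplexity is the exponential of the average negative log-likelihood per token, so the filter reduces to a band check. In this sketch the per-token probabilities and both thresholds are made-up placeholders; in practice they would come from a small reference language model and from tuning.

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), logs in nats."""
    return float(np.exp(-np.mean(token_log_probs)))

def keep_by_perplexity(ppl, low=5.0, high=1000.0):
    """Keep documents inside an assumed perplexity band (thresholds are placeholders)."""
    return low <= ppl <= high

# Hypothetical per-token probabilities assigned by a small reference model
docs = {
    "boilerplate": np.log([0.9, 0.95, 0.9, 0.92]),     # almost perfectly predictable
    "ordinary":    np.log([0.1, 0.02, 0.3, 0.05]),     # plausible prose
    "garbled":     np.log([1e-5, 1e-6, 1e-5, 1e-6]),   # essentially unpredictable
}
decisions = {}
for name, log_probs in docs.items():
    ppl = perplexity(log_probs)
    decisions[name] = keep_by_perplexity(ppl)
    print(f"{name:<12} perplexity = {ppl:>12.1f}  keep = {decisions[name]}")
```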
Source: Penedo et al., “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale,” arXiv:2406.17557, June 2024. HuggingFace. 15 trillion tokens from 96 Common Crawl snapshots.
Stage 4: Deduplication
Web crawls contain enormous amounts of duplicate content: the same article republished on multiple sites, boilerplate text shared across pages, and near-identical documents with minor variations. Training on duplicate data wastes compute and can cause the model to memorize specific passages rather than learning general patterns.
Deduplication operates at multiple levels:
URL-level deduplication removes pages with identical URLs across different crawl snapshots.
Exact document deduplication uses hash functions to identify and remove documents with identical content.
Near-duplicate detection uses techniques like MinHash (a locality-sensitive hashing algorithm) to identify documents that are highly similar but not identical. MinHash works by computing a set of hash signatures for each document based on its n-grams (short sequences of words). Documents whose MinHash signatures are sufficiently similar (typically above a Jaccard similarity threshold of 0.8) are considered near-duplicates, and all but one copy are removed.
Line-level deduplication removes individual lines that appear frequently across many documents (such as copyright notices, navigation text, or common disclaimers).
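The near-duplicate step can be sketched with the standard library alone. This is a minimal MinHash illustration under stated assumptions: word 3-grams as shingles, 128 hash functions built by salting BLAKE2b, and direct pairwise comparison. Production systems add locality-sensitive banding so that they never compare every pair of documents.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams ("shingles") for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")             # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank at dusk"
doc_c = "gradient descent updates every weight a little after each batch of training data"

sh = {name: shingles(d) for name, d in [("a", doc_a), ("b", doc_b), ("c", doc_c)]}
sig = {name: minhash_signature(s) for name, s in sh.items()}

est_ab = estimated_jaccard(sig["a"], sig["b"])
est_ac = estimated_jaccard(sig["a"], sig["c"])
print(f"a vs b: exact = {exact_jaccard(sh['a'], sh['b']):.3f}, MinHash estimate = {est_ab:.3f}")
print(f"a vs c: exact = {exact_jaccard(sh['a'], sh['c']):.3f}, MinHash estimate = {est_ac:.3f}")
```

The signature has a fixed length regardless of document size, which is what makes comparison cheap at web scale.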
The impact of deduplication is substantial. Lee et al. (2022) demonstrated that deduplication reduces memorization by 10x and improves model performance, requiring fewer training steps to achieve the same loss.
Source: Lee et al., “Deduplicating Training Data Makes Language Models Better,” ACL 2022 (arXiv:2107.06499). Demonstrated 10x reduction in memorized text and improved perplexity with deduplicated training data.
Stage 5: Toxicity and PII Removal
Training data must be filtered to remove personally identifiable information (PII) such as phone numbers, email addresses, social security numbers, and physical addresses. It must also be filtered to reduce toxic, harmful, or illegal content. This typically involves:
- Regular expression patterns to detect and remove common PII formats
- Toxicity classifiers trained to identify harmful content
- Domain-level blocklists to exclude known sources of harmful content
Meta’s LLaMA 3 report specifically mentions PII removal as an early step in their data processing pipeline.
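The regular-expression part of this stage might look like the sketch below. The patterns are deliberately simple, US-centric assumptions chosen for illustration; real pipelines use far more careful detectors and often combine them with learned models.

```python
import re

# Illustrative patterns only: each one misses many real-world formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"),
}

def scrub_pii(text):
    """Replace each detected span with a placeholder tag such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Contact jane.doe@example.com or call (555) 123-4567. SSN: 123-45-6789."
scrubbed = scrub_pii(sample)
print(scrubbed)
```

Replacing matches with a tag, rather than deleting them, keeps the sentence structure intact for the model while removing the sensitive value.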
Stage 6: Data Mixing
The final step is combining data from all sources into a training mixture with carefully chosen proportions. The mix ratio determines what the model learns to prioritize. A mix with more code produces a better coder; a mix with more multilingual data produces better multilingual capabilities; a mix with more reasoning data produces stronger reasoning.
Labs determine optimal mix ratios through extensive experimentation, typically by training small models on different mixes and evaluating performance on a suite of benchmarks. Meta’s LLaMA 3 report describes using classifiers to identify and upsample high-quality code and reasoning content, and adjusting the data mix during training to emphasize different capabilities at different stages.
Unlike many labs, Meta disclosed the approximate data mix for LLaMA 3 in their technical report:
| Category | Share | Description |
|---|---|---|
| General knowledge | ~50% | Web text, encyclopedic content, news, forums |
| Mathematical and reasoning | ~25% | Textbooks, problem sets, structured reasoning |
| Code | ~17% | Source code from public repositories, documentation |
| Multilingual | ~8% | Non-English text across multiple languages |
These proportions were not fixed for the whole run. As noted above, Meta adjusted the data mix during training to emphasize different capabilities at different stages (for example, upsampling math and code data later in training).
Other frontier labs are less transparent. The general pattern across the industry is that web text constitutes the largest share, followed by code, reasoning data, and multilingual content, but the exact ratios vary by lab and are closely guarded.
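Mechanically, a data mix is a sampling distribution over sources. The sketch below draws the source of each training sequence according to the approximate LLaMA 3 proportions from the table above; the batch size and source names are assumptions for illustration.

```python
import numpy as np

# Approximate LLaMA 3 proportions from the table above
mix = {"general": 0.50, "math_reasoning": 0.25, "code": 0.17, "multilingual": 0.08}

sources = list(mix)
weights = np.array([mix[s] for s in sources])
assert np.isclose(weights.sum(), 1.0)

rng = np.random.default_rng(0)
batch_size = 10_000                      # number of sequences to draw (illustrative)
draws = rng.choice(len(sources), size=batch_size, p=weights)
counts = np.bincount(draws, minlength=len(sources))

for source, target, n in zip(sources, weights, counts):
    print(f"{source:<16} target {target:5.1%}   sampled {n / batch_size:5.1%}")
```

Changing the mix during training amounts to changing `weights` between phases, which is one way the late-stage upsampling of math and code could be expressed.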
Source: Meta (arXiv:2407.21783), Section 3.1: LLaMA 3 data mix of approximately 50% general knowledge, 25% mathematical and reasoning tokens, 17% code, and 8% multilingual data.
Visualizing the Data Pipeline
Let’s trace the journey of web data from raw crawl to training-ready tokens:
import numpy as np
# Simulated data pipeline for a Common Crawl snapshot
# Numbers are approximate and illustrative of real pipelines
stages = [
("Raw Common Crawl snapshot", 2_400_000_000, "documents"),
("After HTML extraction", 2_400_000_000, "documents"),
("After language identification", 800_000_000, "documents (English only)"),
("After heuristic filtering", 400_000_000, "documents"),
("After model-based quality filter", 150_000_000, "documents"),
("After exact deduplication", 120_000_000, "documents"),
("After MinHash near-dedup", 80_000_000, "documents"),
("After PII/toxicity removal", 75_000_000, "documents"),
("After tokenization", 75_000_000, "documents"),
]
print("Data Pipeline: One Common Crawl Snapshot → Training Data")
print("=" * 70)
initial = stages[0][1]
for name, count, unit in stages:
    pct = (count / initial) * 100
    print(f" {name:<45} {count:>14,} {unit:<10} ({pct:5.1f}%)")
# Estimate token yield
avg_tokens_per_doc = 2000 # rough average after filtering
total_tokens = stages[-1][1] * avg_tokens_per_doc
print(f"\nEstimated token yield: {total_tokens/1e9:.0f}B tokens from one snapshot")
print(f"FineWeb used 96 snapshots to produce 15T tokens total")
print(f"That is roughly {15e12 / 96 / 1e9:.0f}B tokens per snapshot after all processing")
This pipeline shows why data processing is so labor-intensive. A single Common Crawl snapshot starts with billions of documents but yields only a fraction of usable training tokens after filtering. The FineWeb dataset, one of the most carefully documented open training datasets, processed 96 Common Crawl snapshots to produce 15 trillion tokens, with each processing step measurably improving downstream model performance.
The Training Process: From Random Weights to Language Understanding
With the training data prepared, the actual training process can begin. This section walks through the mechanics of how a model learns from data.
Initialization
Before training starts, all of the model’s weights are initialized to small random values. The specific initialization scheme matters: weights that are too large cause the model’s outputs to explode; weights that are too small cause gradients to vanish. Modern Transformers typically use a variant of Xavier or Kaiming initialization, scaled by the number of layers to ensure stable signal propagation through deep networks.
At initialization, the model’s predictions are essentially random. For a vocabulary of 128,000 tokens, the model assigns roughly equal probability (about 1/128,000 = 0.0000078) to every token at every position. The initial cross-entropy loss is approximately ln(128,000) = 11.76 nats, as we computed earlier.
The Training Loop
Training proceeds in iterations, where each iteration processes a batch of training sequences. Here is the core loop:
1. Sample a batch: Select a batch of B sequences, each of length S tokens, from the training data. A typical batch might contain millions of tokens. LLaMA 3 405B started with a smaller batch size for training stability and gradually increased it during training for efficiency. With an 8,192-token sequence length, a batch of 2,048 sequences would contain approximately 16.8 million tokens.
2. Forward pass: Feed the batch through the model. The model processes all tokens in parallel (using causal masking to prevent looking ahead) and produces a probability distribution over the vocabulary at each position. This is the same forward pass described in Chapters 7-10, applied to every token in the batch simultaneously.
3. Compute loss: Compare the model’s predictions to the actual next tokens and compute the cross-entropy loss averaged across all positions and all sequences in the batch.
4. Backward pass: Compute the gradient of the loss with respect to every weight in the model using backpropagation (Chapter 3). This tells us how to adjust each weight to reduce the loss.
5. Update weights: Use an optimizer to update the weights based on the gradients. The standard optimizer for LLM training is AdamW, which maintains running averages of both the gradients (first moment) and the squared gradients (second moment) for each parameter, and uses these to adaptively scale the learning rate for each weight. AdamW also applies weight decay, a form of regularization that gently pushes weights toward zero to prevent overfitting.
6. Repeat: Go back to step 1 with the next batch.
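The loop above can be run end to end on a toy problem. The sketch below is deliberately tiny and makes several simplifying assumptions: the "model" is a single table of logits indexed by the previous token (a bigram model, not a Transformer), the corpus is one repeated sentence, and the optimizer is plain SGD rather than AdamW.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: word-level "tokens" from one repeated sentence
words = ("the cat sat on the mat . " * 50).split()
vocab = sorted(set(words))
ids = np.array([vocab.index(w) for w in words])
V = len(vocab)

W = rng.normal(0.0, 0.01, size=(V, V))       # the whole model: one logit row per previous token
lr, batch_size, num_steps = 1.0, 32, 200     # illustrative hyperparameters

def loss_and_grad(W, x, y):
    logits = W[x]                                          # forward pass, shape (B, V)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(y)), y]).mean()     # cross-entropy, averaged over the batch
    dlogits = probs.copy()
    dlogits[np.arange(len(y)), y] -= 1.0                   # gradient of softmax + cross-entropy
    dlogits /= len(y)
    grad = np.zeros_like(W)
    np.add.at(grad, x, dlogits)                            # backward pass: scatter into rows of W
    return loss, grad

losses = []
for step in range(num_steps):
    start = rng.integers(0, len(ids) - 1, size=batch_size)  # 1. sample a batch
    x, y = ids[start], ids[start + 1]                        #    inputs and next-token targets
    loss, grad = loss_and_grad(W, x, y)                      # 2-4. forward, loss, backward
    W -= lr * grad                                           # 5. update weights (plain SGD)
    losses.append(loss)                                      # 6. repeat

print(f"Initial loss: {losses[0]:.3f} nats (ln {V} = {np.log(V):.3f})")
print(f"Final loss (mean of last 20 steps): {np.mean(losses[-20:]):.3f} nats")
```

The loss starts near ln(V), as expected for a near-uniform model, and falls as the table learns which word follows which. It cannot reach zero because "the" is followed by either "cat" or "mat", so some uncertainty is irreducible.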
The Optimizer: AdamW
AdamW is the dominant optimizer for LLM pre-training. It was introduced by Loshchilov and Hutter (2017) as a correction to the original Adam optimizer, which improperly coupled weight decay with the adaptive learning rate. AdamW decouples these two mechanisms, resulting in better generalization.
For each parameter w, AdamW maintains two state variables:
- m (first moment): an exponentially weighted moving average of the gradients
- v (second moment): an exponentially weighted moving average of the squared gradients
The update rule at each step t is:
m_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t
v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2
m_hat = m_t / (1 - beta_1^t) # bias correction
v_hat = v_t / (1 - beta_2^t) # bias correction
w_t = w_{t-1} - lr * (m_hat / (sqrt(v_hat) + epsilon) + lambda * w_{t-1})
where beta_1 is typically 0.9, beta_2 is typically 0.95 (for LLM training), lr is the learning rate, epsilon is a small constant (e.g., 1e-8) for numerical stability, and lambda is the weight decay coefficient.
The critical implication of AdamW for memory is that it stores two additional float32 values (m and v) per parameter. For a model with N parameters, AdamW requires 8N bytes of optimizer state (4 bytes each for m and v in float32), on top of the model weights themselves. For LLaMA 3 405B, this means approximately 3.2 TB of optimizer state alone, which is why training requires many GPUs even beyond what the model weights demand.
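The update rule translates almost line for line into NumPy. The sketch below follows the equations above; the toy quadratic objective, the learning rate, and the choice to switch off weight decay in the demonstration are assumptions made so the result is easy to check.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    """One AdamW update for parameters w with gradient g (t counts steps from 1)."""
    m = beta1 * m + (1 - beta1) * g              # first moment
    v = beta2 * v + (1 - beta2) * g**2           # second moment
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# Toy objective: f(w) = 0.5 * ||w - target||^2, whose gradient is (w - target)
target = np.array([1.0, -2.0, 3.0])

# After the very first step every coordinate moves by about lr, regardless of gradient size
w_first, _, _ = adamw_step(np.zeros(3), -target, np.zeros(3), np.zeros(3),
                           t=1, lr=1e-2, weight_decay=0.0)
print("After one step:", w_first)

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 2001):
    g = w - target
    w, m, v = adamw_step(w, g, m, v, t, lr=1e-2, weight_decay=0.0)
print("After 2000 steps:", np.round(w, 3))

# Optimizer state: two FP32 values (m and v) per parameter
state_tb = 405e9 * 8 / 1e12
print(f"AdamW state for 405B parameters: {state_tb:.2f} TB")
```

Weight decay is set to zero here only so that the optimum is exactly `target`; with a nonzero coefficient the solution is pulled slightly toward zero, which is the regularizing effect described above.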
Source: Loshchilov and Hutter, “Decoupled Weight Decay Regularization,” ICLR 2019 (arXiv:1711.05101). Originally published November 2017.
Learning Rate Schedule
The learning rate controls how large each weight update is. Too high, and the model’s loss oscillates or diverges. Too low, and training is painfully slow. Modern LLM training uses a carefully designed learning rate schedule that changes the learning rate over the course of training:
Warmup phase: The learning rate starts at zero (or near zero) and linearly increases to the peak learning rate over a set number of steps (typically 1,000-2,000 steps). This prevents the model from making large, destabilizing updates early in training, when the randomly initialized weights produce large, noisy gradients and AdamW’s moment estimates have not yet stabilized.
Decay phase: After warmup, the learning rate gradually decreases following a cosine schedule, which smoothly reduces the learning rate from the peak value to a minimum value (typically 10% of the peak, or sometimes zero) over the remaining training steps. The cosine schedule is the most common choice for LLM pre-training.
import numpy as np
def cosine_lr_schedule(step, total_steps, warmup_steps, peak_lr, min_lr):
    """Standard cosine learning rate schedule with linear warmup."""
    if step < warmup_steps:
        # Linear warmup
        return peak_lr * (step / warmup_steps)
    else:
        # Cosine decay
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + np.cos(np.pi * progress))
# Example: illustrative values for a large training run
# (not the hyperparameters published for LLaMA 3 405B)
total_steps = 100_000 # approximate total training steps
warmup_steps = 2_000 # warmup over first 2,000 steps
peak_lr = 1.5e-4 # peak learning rate
min_lr = 1.5e-5 # minimum learning rate (10% of peak)
steps = np.arange(total_steps)
lrs = [cosine_lr_schedule(s, total_steps, warmup_steps, peak_lr, min_lr) for s in steps]
# Print key points
print("Learning Rate Schedule (Cosine with Linear Warmup)")
print(f"{'Step':>8} {'Learning Rate':>15} {'Phase'}")
print("-" * 40)
for s in [0, 500, 1000, 2000, 10000, 50000, 90000, 99999]:
    lr = cosine_lr_schedule(s, total_steps, warmup_steps, peak_lr, min_lr)
    phase = "Warmup" if s < warmup_steps else "Cosine decay"
    print(f"{s:>8,} {lr:>15.2e} {phase}")
The learning rate schedule is one of the most important hyperparameters in LLM training. Getting it wrong can waste millions of dollars in compute. This is why labs run extensive small-scale experiments to determine the optimal schedule before committing to a full training run, as described in Chapter 13’s discussion of scaling law methodology.
Mixed Precision Training
Modern LLM training uses mixed precision arithmetic to reduce memory usage and increase throughput. Instead of storing all values in 32-bit floating point (FP32), which uses 4 bytes per number, training uses a combination of lower-precision formats:
BF16 (BrainFloat16) uses 16 bits (2 bytes) per number. It has the same exponent range as FP32 (8 exponent bits) but reduced precision (7 mantissa bits vs. 23 for FP32). The wide exponent range prevents overflow and underflow, making BF16 more stable than the older FP16 format for training. BF16 has become the standard precision for LLM pre-training, with hardware support on NVIDIA A100 and newer GPUs.
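The range-versus-precision trade-off is easy to see numerically. NumPy has no native BF16 type, so the sketch below simulates it by keeping only the top 16 bits of a float32 (sign, 8 exponent bits, 7 mantissa bits). Truncation is an assumption made for brevity; real hardware typically rounds to nearest.

```python
import numpy as np

def to_bf16(x):
    """Simulate BF16 by zeroing the low 16 bits of each float32 value."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x.view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

values = np.array([1.0 + 2**-8, 3.14159265, 1e38], dtype=np.float32)
bf16 = to_bf16(values)
with np.errstate(over="ignore"):          # FP16 overflows on the last value
    fp16 = values.astype(np.float16)

for original, b, h in zip(values, bf16, fp16):
    print(f"FP32 {float(original):.8g} -> BF16 {float(b):.8g}, FP16 {float(h):.8g}")
```

BF16 loses the small increment above 1.0 that FP16 can still represent, but it keeps the very large value finite where FP16 overflows to infinity. For training, surviving large and tiny magnitudes matters more than the extra mantissa bits.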
FP8 uses only 8 bits (1 byte) per number and is supported on H100 GPUs via the Transformer Engine. DeepSeek-V3 was among the first models to report FP8 training at very large scale, using it for most matrix multiplications while keeping critical accumulations in higher precision. FP8 training can provide significant speedups but requires careful handling of numerical precision.
In a typical mixed-precision training setup:
- Model weights are stored in BF16 (2 bytes per parameter)
- Gradients are computed in BF16
- Optimizer states (AdamW’s m and v) are stored in FP32 (4 bytes each)
- A master copy of weights is maintained in FP32 for the optimizer update
This means the total memory per parameter during training is approximately:
- 2 bytes (BF16 weights) + 4 bytes (FP32 master weights) + 4 bytes (m) + 4 bytes (v) + 2 bytes (gradients) = 16 bytes per parameter
For LLaMA 3 405B with 405 billion parameters, this translates to approximately 6.5 TB of memory just for model state, before accounting for activations or the training data itself.
| Component | Precision | Bytes/Param | 405B Model |
|---|---|---|---|
| Model weights (training) | BF16 | 2 | 810 GB |
| Master weights (optimizer) | FP32 | 4 | 1,620 GB |
| First moment (m) | FP32 | 4 | 1,620 GB |
| Second moment (v) | FP32 | 4 | 1,620 GB |
| Gradients | BF16 | 2 | 810 GB |
| Total | | 16 | 6,480 GB |
This is why training a 405B model requires a cluster of thousands of GPUs: no single GPU (or even a single server with 8 GPUs) can hold all of this state.
Distributed Training: Splitting the Work Across Thousands of GPUs
A single NVIDIA H100 GPU has 80 GB of HBM3 memory. Training LLaMA 3 405B requires approximately 6.5 TB of memory for model state alone. That is over 80 H100s just to hold the model, before any activations or data. And even if memory were not a constraint, training on 15.6 trillion tokens with a single GPU would take thousands of years. The solution is distributed training: splitting the work across thousands of GPUs running in parallel.
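A back-of-the-envelope estimate shows how far out of reach single-GPU training is. The calculation below rests on two assumptions that are not from this chapter's sources: the common approximation of 6 FLOPs per parameter per token for a forward and backward pass, and a dense BF16 peak of roughly 989 TFLOP/s for an H100. The 40% utilization figure sits inside the 38-43% MFU range Meta reports.

```python
params = 405e9                  # LLaMA 3 405B parameters
tokens = 15.6e12                # training tokens
flops_per_param_token = 6       # ASSUMPTION: standard forward + backward estimate
h100_peak_flops = 989e12        # ASSUMPTION: approximate H100 dense BF16 peak, FLOP/s
mfu = 0.40                      # within the 38-43% range Meta reports

total_flops = flops_per_param_token * params * tokens
seconds = total_flops / (h100_peak_flops * mfu)

years_single_gpu = seconds / (365 * 24 * 3600)
days_on_cluster = seconds / 16_384 / (24 * 3600)   # assumes the same utilization at scale

print(f"Total training compute: {total_flops:.2e} FLOPs")
print(f"One H100:     ~{years_single_gpu:,.0f} years")
print(f"16,384 H100s: ~{days_on_cluster:.0f} days")
```

Under these assumptions a single GPU would need on the order of three thousand years, while the full cluster brings the same work down to a couple of months.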
Modern LLM training uses a combination of four parallelism strategies, often called 4D parallelism (the term used by Meta in the LLaMA 3 technical report):
1. Data Parallelism (DP)
Data parallelism is the simplest form of distributed training. The model is replicated across multiple GPUs, and each GPU processes a different subset of the training batch. After each forward and backward pass, the gradients from all GPUs are averaged (using an all-reduce operation), and each GPU updates its copy of the weights identically.
The benefit is straightforward: if you have 8 GPUs, you can process 8x as many tokens per step, which means training finishes 8x faster (in theory; communication overhead reduces the actual speedup).
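The reason data parallelism produces the same update as single-device training is that averaging per-shard gradients recovers the full-batch gradient when the shards are equal in size. The sketch below checks this on a toy linear-regression loss; the "workers" are simulated in one process, and `np.mean` stands in for the all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, per_worker_batch, dim = 8, 16, 4

# Toy model: loss = mean((X @ w - y)^2)
w = rng.normal(size=dim)
X = rng.normal(size=(num_workers * per_worker_batch, dim))
y = rng.normal(size=num_workers * per_worker_batch)

def grad(w, X, y):
    """Gradient of the mean squared error with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

full_grad = grad(w, X, y)                                  # single-device reference

X_shards = np.split(X, num_workers)                        # each worker gets a slice of the batch
y_shards = np.split(y, num_workers)
local_grads = [grad(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
averaged = np.mean(local_grads, axis=0)                    # what an averaging all-reduce returns

print("Max difference:", np.abs(full_grad - averaged).max())
```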
The limitation is that every GPU must hold a complete copy of the model and its optimizer state. For a 405B model requiring 6.5 TB, pure data parallelism is impossible because no single GPU can hold the full model.
Fully Sharded Data Parallelism (FSDP), developed by Meta and integrated into PyTorch, addresses this by sharding the model parameters, gradients, and optimizer states across all GPUs in the data-parallel group. Each GPU holds only a fraction of the total state. When a layer needs its full parameters for computation, the shards are gathered from all GPUs (using an all-gather operation), the computation is performed, and the shards are released. This trades communication bandwidth for memory savings, allowing data parallelism to scale to much larger models.
2. Tensor Parallelism (TP)
Tensor parallelism splits individual layers across multiple GPUs. Instead of each GPU holding a complete copy of a layer’s weight matrices, the matrices are divided (sharded) across GPUs, and each GPU computes its portion of the result.
For example, consider a feed-forward layer with a weight matrix of shape [16384, 53248] (as in LLaMA 3 405B’s FFN, where hidden_size=16384 and intermediate_size=53248). With tensor parallelism across 8 GPUs, each GPU holds a slice of shape [16384, 6656] and computes its portion of the matrix multiplication. The partial results are then combined (using an all-reduce or reduce-scatter operation) to produce the full output.
Tensor parallelism is highly efficient because the communication happens within a single server (across GPUs connected by NVLink, which provides 900 GB/s bandwidth on H100 systems). It is typically applied within a single node of 8 GPUs.
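The sharded matrix multiplication can be verified in a few lines. The sketch below uses scaled-down dimensions (64 and 208 in place of 16384 and 53248) and simulates the 8 GPUs as list entries. The second half pairs a column-split first matrix with a row-split second matrix, the pattern popularized by Megatron-LM, and uses ReLU as a stand-in nonlinearity, which is an assumption rather than LLaMA 3's actual activation.

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 8
hidden, intermediate = 64, 208              # scaled-down stand-ins for 16384 and 53248
x = rng.normal(size=(4, hidden))
W1 = rng.normal(size=(hidden, intermediate))
W2 = rng.normal(size=(intermediate, hidden))

# Column-parallel: each "GPU" holds a [hidden, intermediate / tp] slice of W1
W1_shards = np.split(W1, tp, axis=1)
partials = [x @ shard for shard in W1_shards]
gathered = np.concatenate(partials, axis=1)              # concatenating slices rebuilds x @ W1

# Row-parallel second matrix: partial outputs are summed (the all-reduce)
relu = lambda a: np.maximum(a, 0.0)
reference = relu(x @ W1) @ W2
W2_shards = np.split(W2, tp, axis=0)
partial_outputs = [relu(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
all_reduced = np.sum(partial_outputs, axis=0)

print("Shard shape:", W1_shards[0].shape)
print("Column-parallel matches:", np.allclose(gathered, x @ W1))
print("Full FFN matches:       ", np.allclose(all_reduced, reference))
```

Because the nonlinearity is applied elementwise, each GPU can run both matrix multiplications on its own slice, and only one summation across GPUs is needed at the end of the block.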
This approach was pioneered by NVIDIA’s Megatron-LM framework, which demonstrated efficient tensor and pipeline parallelism for training models with hundreds of billions of parameters.
Source: Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism,” arXiv:1909.08053, September 2019. NVIDIA. Narayanan et al., “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM,” arXiv:2104.04473, April 2021. Demonstrated 502 petaFLOP/s on 3,072 GPUs with 1T parameter model.
3. Pipeline Parallelism (PP)
Pipeline parallelism splits the model’s layers across multiple GPUs sequentially. If a model has 126 layers (like LLaMA 3 405B) and 8 pipeline stages, you might assign layers 1-16 to GPU group 1, layers 17-32 to GPU group 2, and so on, with the final group taking the remaining 14 layers. Each GPU group processes its assigned layers and passes the activations to the next group.
The challenge with naive pipeline parallelism is the pipeline bubble: while GPU group 1 is processing the first micro-batch, GPU groups 2-8 are idle, waiting for their input. To mitigate this, modern implementations use micro-batching: the batch is split into multiple smaller micro-batches, and the pipeline is kept busy by overlapping the processing of different micro-batches at different stages.
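The effect of micro-batching on the bubble can be quantified with a simple model. The formula below assumes a basic, non-interleaved schedule in which every stage takes the same time per micro-batch: a step then lasts m + p - 1 time slots, of which each stage does useful work in only m. It is a textbook approximation, not a description of any particular framework's scheduler.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a simple pipeline schedule with equal-cost stages.

    ASSUMPTION: a step lasts (m + p - 1) time slots and each stage is busy for m of them.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

for m in [1, 4, 16, 64]:
    print(f"stages=8, micro-batches={m:>2}: bubble = {bubble_fraction(8, m):.1%}")
```

With a single micro-batch, an 8-stage pipeline is idle seven-eighths of the time. Splitting the batch into many micro-batches shrinks the bubble, which is why the number of micro-batches is usually chosen to be several times the number of stages.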
Meta’s LLaMA 3 training used an interleaved pipeline schedule to minimize bubble overhead, achieving high GPU utilization even with deep pipelines.
4. Context Parallelism (CP)
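Context parallelism is the newest addition, designed specifically for training on long sequences. It splits the input sequence across multiple GPUs, with each GPU processing a different chunk of the sequence. The attention mechanism requires communication between GPUs (since each token needs to attend to tokens held on other GPUs). Ring-style implementations pass key and value chunks between neighboring GPUs and overlap this communication with computation; LLaMA 3 instead uses an all-gather approach, in which each GPU first collects the key and value tensors for the whole sequence and then computes attention for its local chunk of queries.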
Context parallelism was essential for LLaMA 3’s 128K context length training. Without it, the memory required for attention activations on long sequences would exceed what is available on a single GPU.
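Splitting attention along the sequence can be checked against ordinary attention in a small simulation. The sketch below takes the all-gather formulation as one possible way to organize the communication: each simulated rank keeps only its own chunk of queries, gathers every rank's keys and values, and applies the causal mask using global positions. Single-head attention and the tiny sizes are assumptions for illustration.

```python
import numpy as np

def causal_attention(q, k, v, q_offset=0):
    """Softmax attention where query i (global index q_offset + i) sees keys 0..q_offset+i."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    q_pos = q_offset + np.arange(q.shape[0])[:, None]
    k_pos = np.arange(k.shape[0])[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)     # causal mask in global coordinates
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, dim, cp = 16, 8, 4
q, k, v = (rng.normal(size=(seq_len, dim)) for _ in range(3))

reference = causal_attention(q, k, v)                      # single-device result

chunk = seq_len // cp
q_chunks, k_chunks, v_chunks = (np.split(t, cp) for t in (q, k, v))
k_all = np.concatenate(k_chunks)                           # simulated all-gather of keys
v_all = np.concatenate(v_chunks)                           # simulated all-gather of values
outputs = [
    causal_attention(q_chunks[rank], k_all, v_all, q_offset=rank * chunk)
    for rank in range(cp)
]
combined = np.concatenate(outputs, axis=0)

print("Matches single-device attention:", np.allclose(reference, combined))
```

Each rank materializes attention scores for only `chunk` queries instead of the full sequence, which is where the activation-memory saving comes from.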
Putting It All Together: 4D Parallelism
In practice, all four strategies are combined. For LLaMA 3 405B training on 16,384 H100 GPUs:
- Tensor parallelism across 8 GPUs within each server (using NVLink)
- Pipeline parallelism across multiple servers (splitting the 126 layers into pipeline stages)
- Data parallelism across groups of servers (each group processes different data)
- Context parallelism for long-sequence training phases
The result is that each GPU holds only a small fraction of the model and processes only a small fraction of the data, but the collective computation of all 16,384 GPUs produces the same result as if a single impossibly large GPU were processing the entire batch.
# Illustrate the parallelism dimensions for LLaMA 3 405B training
# (TP=8, PP=16, DP=128 matches the 8K-context configuration on 16,384 GPUs
# reported in Meta's LLaMA 3 paper; long-context phases trade data-parallel
# replicas for context-parallel ranks.)
total_gpus = 16_384
tp = 8 # tensor parallelism: 8 GPUs per node
pp = 16 # pipeline parallelism: 16 pipeline stages
dp = total_gpus // (tp * pp) # data parallelism: remaining dimension
print("LLaMA 3 405B: 4D Parallelism Configuration")
print(f" Total GPUs: {total_gpus:,}")
print(f" Tensor Parallelism: {tp} GPUs (within one server)")
print(f" Pipeline Parallelism: {pp} stages (across servers)")
print(f" Data Parallelism: {dp} replicas")
print(f" Total: {tp} x {pp} x {dp} = {tp * pp * dp:,} GPUs")
print()
# Memory per GPU (approximate)
total_params_B = 405
bytes_per_param = 16 # weights + gradients + AdamW state in mixed precision
total_memory_TB = total_params_B * 1e9 * bytes_per_param / 1e12
# Each GPU holds 1/(tp*pp) of the model (sharded across TP and PP).
# Sharding optimizer state across data-parallel replicas would lower this further.
memory_per_gpu_GB = total_memory_TB * 1000 / (tp * pp)
print(f" Total model state: {total_memory_TB:.1f} TB")
print(f" Per-GPU model state: ~{memory_per_gpu_GB:.1f} GB (before activations)")
print(f" H100 memory: 80 GB")
print(f" Remaining for activations and buffers: ~{80 - memory_per_gpu_GB:.0f} GB")
Source: Meta, “The Llama 3 Herd of Models,” arXiv:2407.21783, July 2024. LLaMA 3.1 405B trained on up to 16,384 H100 GPUs using 4D parallelism (tensor, pipeline, data, and context parallelism). Achieved 38-43% BF16 Model FLOPs Utilization (MFU). Network: RDMA over Converged Ethernet (RoCE) fabric with 400 Gbps interconnect for the 405B training run. Each server: 8 GPUs connected by NVLink, 2 CPUs. Storage: Meta’s Tectonic distributed file system, 240 PB storage, 2 TB/s sustained throughput.
Model FLOPs Utilization (MFU)
A key metric for distributed training efficiency is Model FLOPs Utilization (MFU): the ratio of the actual floating-point operations performed for model computation to the theoretical peak of the hardware. An MFU of 100% would mean every GPU is performing useful model computation at its maximum theoretical rate, with zero overhead for communication, memory operations, or idle time.
In practice, MFU is always well below 100% due to communication overhead, pipeline bubbles, memory operations, and other inefficiencies. Meta reported 38-43% MFU for LLaMA 3 405B training on 16,384 H100 GPUs. This means that roughly 40% of the GPUs’ theoretical compute capacity was used for actual model computation, with the remaining 60% consumed by overhead.
An H100 SXM5 GPU has a theoretical peak of 989 TFLOPS for BF16 dense tensor operations (without structured sparsity; the sparsity-enabled peak is 1,979 TFLOPS, but training does not use structured sparsity). At 40% MFU, each GPU delivers approximately 396 TFLOPS of useful computation. Across 16,384 GPUs, this is approximately 6.5 exaFLOPS of effective compute.
Source: NVIDIA H100 Tensor Core GPU Datasheet: H100 SXM5 BF16 Tensor Core peak is 989 TFLOPS dense (1,979 TFLOPS with 2:4 structured sparsity). 80 GB HBM3, 3.35 TB/s memory bandwidth, 900 GB/s NVLink, 700W TDP.
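The arithmetic above can be reproduced in a few lines, and extended into a rough cross-check of the training duration. The sketch below assumes the common approximation that training takes about 6 floating-point operations per parameter per token, and uses the 15.6 trillion token count quoted for the 405B run elsewhere in this chapter. Both the approximation and the choice of 40% as a representative MFU are assumptions, so treat the result as an order-of-magnitude estimate rather than a reported figure.

```python
peak_tflops = 989   # H100 SXM5 dense BF16 peak, from the datasheet
mfu = 0.40          # representative value within Meta's reported 38-43%
gpus = 16_384

per_gpu_tflops = peak_tflops * mfu
cluster_exaflops = per_gpu_tflops * gpus / 1e6  # 1 exaFLOP/s = 1e6 TFLOP/s

# Assumption: total training compute ~= 6 * parameters * tokens.
params = 405e9
tokens = 15.6e12
total_flops = 6 * params * tokens

seconds = total_flops / (cluster_exaflops * 1e18)
days = seconds / 86_400

print(f"Useful compute per GPU: {per_gpu_tflops:.0f} TFLOP/s")
print(f"Useful compute, whole cluster: {cluster_exaflops:.2f} exaFLOP/s")
print(f"Estimated training compute: {total_flops:.2e} FLOPs")
print(f"Estimated duration at {mfu:.0%} MFU: {days:.0f} days")
```

Because duration scales inversely with MFU, a few percentage points of utilization translate directly into days of cluster time, which is why teams invest so heavily in tuning parallelism configurations.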
The Cost of Pre-training
Pre-training a frontier language model is one of the most expensive computational tasks ever undertaken. The costs span hardware, energy, engineering, and data.
Hardware Costs
The primary hardware for LLM training in 2024-2025 was the NVIDIA H100 SXM5 GPU. Each H100 has a thermal design power (TDP) of 700 watts and costs between $25,000 and $40,000 depending on the purchasing arrangement. A single DGX H100 server (containing 8 H100 GPUs) costs approximately $300,000-$400,000.
For cloud-based training, H100 GPU-hours typically cost $2-3 per GPU-hour, though large customers negotiate significant discounts.
As of early 2026, the hardware landscape is shifting to NVIDIA’s Blackwell generation. The B200 (shipped in 2025) offers 192 GB of HBM3e memory and approximately 4x faster training throughput than the H100. The B300 (“Blackwell Ultra,” shipped January 2026) pushes further with 288 GB of HBM3e and approximately 13.5 PFLOPS of dense FP4 compute per chip (108 PFLOPS dense across 8 GPUs in a DGX B300 node, per NVIDIA’s official specifications). Frontier training runs starting in 2026 will increasingly use Blackwell hardware, but the H100 remains the reference point for all publicly documented training runs discussed in this chapter.
Source: NVIDIA B200 and B300 specifications from official product announcements. B200: 192 GB HBM3e, 8 TB/s bandwidth. B300: 288 GB HBM3e, 8 TB/s bandwidth, ~13.5 PFLOPS dense FP4 per chip (108 PFLOPS dense per DGX B300 node of 8 GPUs), 1,400W TDP. Shipped January 2026.
Real Training Costs
The Stanford AI Index Report (2025 edition) and Epoch AI have compiled estimates of frontier model training costs:
| Model | Year | Estimated Training Cost | GPU-Hours | Hardware |
|---|---|---|---|---|
| GPT-3 (175B) | 2020 | ~$4-5M | Not disclosed | NVIDIA V100 |
| GPT-4 | 2023 | ~$78-100M+ | Not disclosed | Not disclosed |
| Gemini Ultra 1.0 | 2023 | ~$191M | Not disclosed | Google TPU v4 |
| LLaMA 3 70B | 2024 | ~$13M (compute only) | ~6.5-7M H100-hours | NVIDIA H100 |
| LLaMA 3.1 405B | 2024 | ~$60-90M (compute only) | 30.84M H100-hours | 16,384 NVIDIA H100 |
| DeepSeek-V3 (671B) | 2024 | ~$5.6M (compute only) | 2.788M H800-hours | 2,048 NVIDIA H800 |
The total GPU-hours across all LLaMA 3 models (8B, 70B, and 405B) was approximately 39.3 million H100 GPU-hours, with the 405B model accounting for the vast majority at 30.84 million.
The DeepSeek-V3 figure of 2.788M H800 GPU-hours breaks down as approximately 2.664M hours for pre-training on 14.8 trillion tokens and 119K hours for context length extension to 128K tokens, with the remainder for post-training.
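The compute-only costs in the table are simply GPU-hours multiplied by an assumed hourly price, which you can verify directly. The sketch below uses the $2 per H800-hour rental rate that reproduces DeepSeek's $5.576M figure and the $2-3 per H100-hour cloud range quoted earlier; the function name is illustrative, and the 5K post-training hours are the remainder implied by the totals above.

```python
def compute_cost_usd(gpu_hours: int, usd_per_gpu_hour: float) -> float:
    """Compute-only cost: GPU-hours times an assumed rental price."""
    return gpu_hours * usd_per_gpu_hour


# DeepSeek-V3 breakdown in H800 GPU-hours.
deepseek_hours = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,  # remainder implied by the 2.788M total
}
deepseek_total = sum(deepseek_hours.values())
deepseek_cost = compute_cost_usd(deepseek_total, 2.0)

# LLaMA 3.1 405B at the $2-3 per H100-hour cloud range.
llama_hours = 30_840_000
llama_low = compute_cost_usd(llama_hours, 2.0)
llama_high = compute_cost_usd(llama_hours, 3.0)

print(f"DeepSeek-V3: {deepseek_total:,} GPU-hours -> ${deepseek_cost / 1e6:.3f}M")
print(f"LLaMA 3.1 405B: {llama_hours:,} GPU-hours -> "
      f"${llama_low / 1e6:.1f}M to ${llama_high / 1e6:.1f}M")
```

The calculation makes the main driver of the gap visible: the two runs assume similar hourly prices, but the LLaMA 3.1 405B run used roughly eleven times as many GPU-hours.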
The DeepSeek-V3 cost stands out as unusually low. At $5.6 million for a 671B parameter model trained on 14.8 trillion tokens, DeepSeek achieved frontier-level performance at a fraction of the cost of comparable models. This was enabled by their efficient MoE architecture (only 37B active parameters), FP8 training, and the auxiliary-loss-free load balancing described in Chapter 12. The training run was also notably stable: the team reported no irrecoverable loss spikes and performed no rollbacks throughout the entire training process. However, the $5.6 million figure represents only the GPU compute cost for the final training run; it does not include the cost of research, failed experiments, data preparation, or the development of DeepSeek-V2 that informed V3’s design.
The same caveat applies to every figure in the table: each represents only the compute cost of the final successful training run. The total cost of developing a frontier model is much higher, because it includes:
- Multiple failed or abandoned training runs
- Extensive hyperparameter search and scaling law experiments
- Data collection, processing, and curation
- Engineering salaries and infrastructure
- Post-training (fine-tuning, RLHF, safety training)
Anthropic’s CEO Dario Amodei stated in June 2024 (on the In Good Company podcast with Nicolai Tangen, episode published June 26, 2024) that AI models then in development cost up to $1 billion to train, and predicted that training costs would reach $10 billion or even $100 billion within three years.
Sources: Stanford HAI AI Index Report 2025: GPT-4 training cost estimated at $78M (compute only), Gemini Ultra 1.0 at $191M. Meta (arXiv:2407.21783): LLaMA 3.1 405B required 30.84M H100 GPU-hours; total across all LLaMA 3 models was approximately 39.3M H100 GPU-hours (1.5M for 8B, ~7M for 70B, ~31M for 405B). AWS SageMaker blog: LLaMA 3 70B required 6.5M H100 GPU-hours. DeepSeek-V3 technical report (arXiv:2412.19437): 2.788M H800 GPU-hours, approximately $5.576M compute cost; training was remarkably stable with no irrecoverable loss spikes or rollbacks. Dario Amodei, In Good Company with Nicolai Tangen podcast, episode published June 26, 2024: models in development cost up to $1B, predicted $10B-$100B within three years.
Energy Consumption
Training frontier models consumes enormous amounts of electricity. Each H100 GPU draws up to 700 watts at full load, so a cluster of 16,384 H100 GPUs draws approximately 11.5 megawatts for the GPUs alone. Host CPUs, networking, storage, and cooling add substantially to that; with an overhead factor approaching 2x, the full data center load comes to roughly 20-25 MW.
# Energy consumption estimate for LLaMA 3.1 405B training
gpus = 16_384
gpu_power_watts = 700 # H100 TDP
overhead_factor = 1.8 # assumed multiplier for host CPUs, networking, storage, and cooling (not a measured PUE)
# GPU power alone
gpu_power_mw = gpus * gpu_power_watts / 1e6
total_power_mw = gpu_power_mw * overhead_factor
# Training duration: 30.84M GPU-hours / 16,384 GPUs ≈ 1,882 hours ≈ 78 days
# (Meta reported reliability statistics over a 54-day observation window.
# The full training duration, including restarts and downtime, is not
# precisely disclosed. The 78-day figure is a lower-bound estimate
# assuming 100% utilization; actual wall-clock time was likely longer.)
training_days = 78
training_hours = training_days * 24
# Total energy
energy_mwh = total_power_mw * training_hours
energy_gwh = energy_mwh / 1000
print("Energy Consumption: LLaMA 3.1 405B Pre-training")
print(f" GPUs: {gpus:,} x H100 @ {gpu_power_watts}W")
print(f" GPU power: {gpu_power_mw:.1f} MW")
print(f" Total facility power (assumed {overhead_factor}x overhead): {total_power_mw:.1f} MW")
print(f" Training duration: ~{training_days} days")
print(f" Total energy: ~{energy_gwh:.1f} GWh")
print()
# Context: average US household uses about 10,500 kWh per year
households = energy_mwh * 1000 / 10_500
print(f" Equivalent to powering ~{households:,.0f} US homes for one year")
# Assumption: ~2.5 residents per US household (residential electricity only)
print(f" Or the home electricity of a town of ~{households * 2.5:,.0f} people for a year")
# CO2 emissions: Meta's location-based figure covers all LLaMA 3.1 training runs
print(f"\n Meta reported CO2 emissions: 11,390 tons CO2e (8B + 70B + 405B combined)")
print(f" (Meta matches its electricity use with renewable energy purchases)")
Meta reported location-based emissions of 11,390 tons of CO2e across the LLaMA 3.1 training runs (8B, 70B, and 405B combined), of which 8,930 tons are attributed to the 405B model. Meta states that it matches its electricity use with renewable energy purchases and maintains net-zero greenhouse gas emissions for its global operations.
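Source: Meta (Llama 3.1 model card; technical report arXiv:2407.21783): 11,390 tons CO2e location-based total across the LLaMA 3.1 8B, 70B, and 405B training runs, of which 8,930 tons for the 405B model. The Register, July 23, 2024: “Meta says all new Llama 3.1 405B model bests OpenAI’s GPT-4.”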
Reliability: When Hardware Fails
At the scale of thousands of GPUs running for months, hardware failures are not exceptional events; they are routine. Meta’s LLaMA 3 technical report provides the most detailed public account of training reliability at scale.
During a 54-day snapshot of LLaMA 3 405B pre-training on 16,384 H100 GPUs, Meta experienced:
- 466 total job interruptions (419 unexpected, 47 planned maintenance)
- That is approximately one unexpected failure every 3 hours
- 148 interruptions (30.1%) were caused by GPU faults (including NVLink failures)
- 72 interruptions (17.2%) were caused by HBM3 memory failures
- Only 2 CPU failures occurred during the entire period
- GPU and HBM3 issues together accounted for over half of all unexpected failures
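The rates quoted in this list follow directly from the raw counts. The sketch below recomputes them and adds one derived quantity that is not in Meta's report: the implied time between unexpected interruptions per GPU slot, obtained by spreading the 419 interruptions over all 16,384 GPUs. That per-slot figure is a back-of-the-envelope illustration, since many interruptions were not caused by GPUs at all.

```python
window_hours = 54 * 24
gpus = 16_384

total_interruptions = 466
planned = 47
unexpected = total_interruptions - planned

# Cluster-level mean time between unexpected interruptions.
mtbf_hours = window_hours / unexpected

# Share of unexpected interruptions from GPU faults plus HBM3 failures.
gpu_faults, hbm3_faults = 148, 72
combined_share = (gpu_faults + hbm3_faults) / unexpected

# Illustrative: the same interruptions viewed per GPU slot.
per_gpu_years = gpus * window_hours / unexpected / (24 * 365)

print(f"Unexpected interruptions: {unexpected}")
print(f"Cluster MTBF: {mtbf_hours:.2f} hours")
print(f"GPU + HBM3 share: {combined_share:.1%}")
print(f"Implied interval per GPU slot: {per_gpu_years:.1f} years")
```

The contrast between the two intervals is the point: each individual GPU slot is quite reliable, but with 16,384 of them running in lockstep, something breaks every few hours.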
Despite this failure rate, Meta achieved approximately 90% effective training uptime. This was possible because of several reliability engineering practices:
Checkpointing
The model’s state (weights, optimizer state, learning rate schedule position, data loader position) is periodically saved to persistent storage. When a failure occurs, training resumes from the most recent checkpoint rather than starting over. The frequency of checkpointing involves a tradeoff: more frequent checkpoints mean less work is lost per failure, but checkpointing itself takes time and consumes storage bandwidth.
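At the scale of LLaMA 3 405B, a single checkpoint runs to several terabytes: the full 16-bytes-per-parameter training state is approximately 6.5 TB, and a checkpoint that omits gradients is still on the order of 5 TB. Writing this to storage takes significant time, even with high-bandwidth parallel file systems.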
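The checkpoint-frequency tradeoff has a classic first-order answer known as Young's approximation: the interval that minimizes total lost time is roughly the square root of twice the checkpoint cost times the mean time between failures. The sketch below applies it to the failure rate from Meta's 54-day window. The checkpoint stall times are hypothetical values chosen for illustration, and nothing here implies that Meta used this formula to set its own checkpoint schedule.

```python
import math


def young_interval(checkpoint_seconds: float, mtbf_seconds: float) -> float:
    """First-order optimal time between checkpoints (Young's approximation)."""
    return math.sqrt(2 * checkpoint_seconds * mtbf_seconds)


# Mean time between unexpected interruptions: 419 in 54 days (~3.1 hours).
mtbf = 54 * 24 * 3600 / 419

# Hypothetical time that training stalls while a checkpoint is written.
for cost in (10, 60, 300):
    interval = young_interval(cost, mtbf)
    overhead = cost / interval  # fraction of time spent writing checkpoints
    print(f"checkpoint cost {cost:>3}s -> checkpoint every {interval / 60:5.1f} min "
          f"({overhead:.1%} overhead)")
```

The square root is what makes cheap checkpoints so valuable: cutting the stall time by a factor of four only halves the optimal interval, so both the checkpointing overhead and the work lost per failure shrink together.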
Automated Failure Detection and Recovery
Meta’s training infrastructure automatically detects hardware failures, identifies the affected GPU(s), removes them from the training job, and restarts training from the most recent checkpoint. This process is largely automated, minimizing the time between failure and recovery.
Redundancy and Hot Spares
Large training clusters maintain spare GPUs that can be swapped in when failures occur. The cluster management software handles the reassignment of work to replacement hardware.
Diurnal Variations
An interesting detail from Meta’s report: they observed 1-2% throughput variations correlated with the time of day, caused by environmental temperature fluctuations affecting GPU clock speeds. Warmer ambient temperatures (during the day) caused GPUs to throttle slightly, reducing throughput. This level of sensitivity illustrates how precisely these systems must be monitored and managed.
Source: Meta (arXiv:2407.21783): 466 interruptions over 54 days, 419 unexpected. 148 GPU faults, 72 HBM3 failures. 90% effective uptime. 1-2% diurnal throughput variation.
The Training Timeline: Phases of Pre-training
Pre-training is not a single monolithic process. It typically proceeds in multiple phases, each with different configurations:
Phase 1: Main Pre-training (Short Context)
The bulk of training happens with a relatively short context length (typically 4K-8K tokens). This is computationally efficient because attention cost scales quadratically with sequence length (as discussed in Chapter 7). LLaMA 3 405B was trained primarily with an 8K context length during this phase.
During this phase, the batch size is typically ramped up gradually. LLaMA 3 started with a smaller batch size for training stability and increased it during training for efficiency. The learning rate follows the warmup-then-cosine-decay schedule described earlier.
This phase consumes the vast majority of the training compute and processes the vast majority of the training tokens.
Phase 2: Context Length Extension
After the main pre-training phase, the context length is gradually increased. LLaMA 3 extended from 8K to 128K tokens over approximately 800 billion additional tokens of training. This is done gradually (not in a single jump) to allow the model to adapt its positional encodings (RoPE, as described in Chapter 6) to longer sequences.
During this phase, the training data is enriched with longer documents to ensure the model sees examples that actually use the extended context. The batch size (measured in tokens) may be adjusted to accommodate the longer sequences.
Phase 3: Annealing
Some training runs include a final annealing phase where the learning rate is reduced to a very small value and the data mix is adjusted to emphasize high-quality data. This phase is designed to squeeze out the last bit of performance by fine-tuning the model on the best available data at a very low learning rate, allowing the model to settle into a good minimum of the loss landscape.
Meta’s LLaMA 3 report describes an annealing phase on the final 40 million tokens, where they linearly annealed the learning rate to zero while upsampling very high-quality data sources (particularly code and mathematical data). They also averaged model checkpoints produced during annealing to obtain the final pre-trained model, a technique that smooths out noise from individual checkpoints. Interestingly, Meta found that annealing improved benchmark scores for the smaller LLaMA 3 8B model (on GSM8k and MATH) but had limited impact on the 405B model, suggesting that larger models may already be closer to their optimal loss landscape minimum before annealing begins.
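Both ingredients of this phase, the linear learning-rate decay and checkpoint averaging, are simple to express in code. The sketch below is a minimal illustration using plain Python lists in place of tensors. The function names, the toy checkpoints, and the starting learning rate are assumptions made for the example, and it should not be read as Meta's implementation.

```python
def annealed_lr(step: int, total_steps: int, start_lr: float) -> float:
    """Linear decay from start_lr at step 0 down to zero at total_steps."""
    return start_lr * max(0.0, 1 - step / total_steps)


def average_checkpoints(checkpoints):
    """Element-wise mean of checkpoints that share parameter names and shapes."""
    count = len(checkpoints)
    return {
        name: [sum(vals) / count for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


# Learning rate at a few points during a hypothetical 1,000-step annealing phase.
for step in (0, 250, 500, 1000):
    print(f"step {step:>4}: lr = {annealed_lr(step, 1000, 1e-5):.2e}")

# Three toy checkpoints saved during annealing.
ckpts = [
    {"w": [1.0, 2.0], "b": [0.5]},
    {"w": [1.2, 1.8], "b": [0.7]},
    {"w": [0.8, 2.2], "b": [0.6]},
]
averaged = average_checkpoints(ckpts)
print(averaged)
```

Averaging works here because the checkpoints come from the same run at a very low learning rate, so they sit close together in parameter space; averaging unrelated models in this way would generally not produce a useful result.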
Monitoring Training
Throughout all phases, the training team monitors several key metrics:
Training loss: The primary metric. It should decrease smoothly over time, following the scaling law predictions from Chapter 13. Sudden spikes in loss indicate problems (data issues, hardware failures, numerical instability).
Gradient norm: The magnitude of the gradients. Extremely large gradient norms indicate instability; extremely small norms indicate the model has stopped learning. Gradient clipping (capping the gradient norm at a maximum value) is used to prevent instability.
Learning rate: Tracked to ensure the schedule is proceeding as planned.
Throughput: Tokens processed per second. Drops in throughput indicate hardware issues or communication bottlenecks.
Evaluation metrics: Periodically, the model is evaluated on held-out benchmarks to track capability development. This is done infrequently (e.g., every few thousand steps) because evaluation is expensive and interrupts training.
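Gradient norm monitoring and gradient clipping are usually implemented together, because clipping needs the norm anyway. The sketch below shows clipping by global norm on plain Python lists: it measures the combined L2 norm of all gradients and, if that exceeds a threshold, scales every gradient by the same factor. The threshold of 1.0 and the toy gradient values are assumptions for illustration; real frameworks apply the same idea to tensors.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping,
    which is the value a training dashboard would log.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


# Two toy "parameter tensors"; global norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))

print(f"norm before clipping: {norm_before:.2f}")
print(f"norm after clipping:  {norm_after:.4f}")
```

Scaling every gradient by the same factor preserves the direction of the update and only limits its size, which is why this form of clipping is preferred over clipping each value independently.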
Loss Spikes and Recovery
Training loss does not always decrease smoothly. Occasionally, the loss spikes sharply upward, indicating that the model has encountered a problematic batch of data or has entered an unstable region of the loss landscape. When this happens, the training team may:
- Roll back to a recent checkpoint (before the spike)
- Skip the problematic data batch
- Reduce the learning rate temporarily
- Investigate the cause (data quality issue, numerical overflow, hardware problem)
Meta’s LLaMA 3 report mentions that they developed automated systems to detect and recover from loss spikes, minimizing the need for manual intervention.
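A simple way to flag loss spikes automatically is to compare each new loss value against a running average of recent steps. The sketch below is a hypothetical heuristic, not a description of Meta's system: the window size, the 1.5x threshold, and the synthetic loss curve are all assumptions chosen to make the example easy to follow.

```python
from collections import deque


def find_loss_spikes(losses, window=20, threshold=1.5):
    """Return step indices where loss exceeds threshold x the mean of recent steps."""
    recent = deque(maxlen=window)
    spikes = []
    for step, loss in enumerate(losses):
        if len(recent) == window and loss > threshold * (sum(recent) / window):
            spikes.append(step)
        else:
            recent.append(loss)  # keep spikes out of the baseline
    return spikes


# Synthetic loss curve: slow, smooth decrease with one injected spike.
losses = [2.0 - 0.001 * i for i in range(100)]
losses[60] = 4.5

print(find_loss_spikes(losses))
```

A detector like this only raises the alarm. Deciding whether to roll back, skip the batch, or lower the learning rate still depends on diagnosing the cause.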
A Complete Training Run: Putting It All Together
Let’s trace the full lifecycle of a frontier model training run, using LLaMA 3.1 405B as our reference:
# Timeline of a frontier model training run (LLaMA 3.1 405B)
phases = [
    ("Research & scaling experiments", "6-12 months",
     "Train hundreds of small models (100M-1B params) to fit scaling laws.\n"
     "Determine optimal model size, data mix, and hyperparameters.\n"
     "Estimated cost: $1-5M in compute for experiments."),
    ("Data preparation", "3-6 months (overlapping)",
     "Process Common Crawl snapshots through the full pipeline.\n"
     "Collect and process code, books, multilingual data.\n"
     "Build quality classifiers, run deduplication.\n"
     "Produce ~15T tokens of training-ready data."),
    ("Infrastructure setup", "1-3 months (overlapping)",
     "Configure 16,384 H100 GPUs across ~2,048 servers.\n"
     "Set up 4D parallelism (TP=8, PP=16, DP=128).\n"
     "Test communication, checkpointing, failure recovery.\n"
     "Validate with short training runs."),
    ("Main pre-training (8K context)", "~78 days",
     "Process ~15.6T tokens at 8K context length.\n"
     "Batch size ramped up gradually for efficiency.\n"
     "30.84M H100 GPU-hours total.\n"
     "419 unexpected hardware failures (in a 54-day reporting window), 90% uptime.\n"
     "Estimated compute cost: $60-90M."),
    ("Context extension (8K -> 128K)", "~2 weeks",
     "Gradually extend context from 8K to 128K tokens.\n"
     "Train on ~800B additional tokens with long documents.\n"
     "Enable context parallelism for long sequences."),
    ("Evaluation & validation", "1-2 weeks",
     "Run comprehensive benchmark suite.\n"
     "Compare to scaling law predictions.\n"
     "Identify capability gaps for post-training."),
]
print("Frontier Model Training Run: LLaMA 3.1 405B")
print("=" * 65)
for i, (phase, duration, details) in enumerate(phases, 1):
    print(f"\nPhase {i}: {phase} ({duration})")
    print("-" * 65)
    for line in details.strip().split("\n"):
        print(f" {line}")
print("\n" + "=" * 65)
print("Total timeline: approximately 12-18 months from research start")
print("to pre-trained model ready for post-training (Chapter 15).")
How Pre-training Differs from What Comes After
Pre-training produces a base model (also called a foundation model): a model that has learned the statistical patterns of language from trillions of tokens but has not been taught to follow instructions, be helpful, or avoid harmful outputs. A base model is a powerful text completion engine, but it is not yet a useful assistant.
If you prompt a base model with “What is the capital of France?”, it might respond with:
- “What is the capital of Germany? What is the capital of Italy?” (continuing the pattern of questions)
- “The capital of France is Paris. The capital of France is a common trivia question…” (providing the answer but then rambling)
- Something else entirely, depending on what patterns in the training data match the prompt
The base model has the knowledge (it knows Paris is the capital of France), but it has not learned the behavior of answering questions helpfully and concisely. That behavior comes from the post-training process described in Chapter 15: supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and other alignment techniques.
This distinction is important: pre-training provides the knowledge and capabilities; post-training shapes the behavior. A model that is poorly pre-trained cannot be rescued by post-training (you cannot teach a model facts it never learned). But a well-pre-trained model that is poorly post-trained will be knowledgeable but unhelpful or unsafe.
Key Takeaways
Pre-training is the process of training a language model on trillions of tokens of text to learn the statistical patterns of language. The objective is simple: predict the next token. This single objective, applied at massive scale, produces models that internalize knowledge about language, facts, reasoning, and code.
The cross-entropy loss measures how well the model predicts the next token. A random model on a 128K vocabulary has a loss of approximately 11.76 nats. Frontier models achieve losses of 1.7-2.0 nats on web text, approaching the estimated irreducible loss (the entropy of natural language).
Training data is assembled from multiple sources: web text (primarily from Common Crawl), code (from GitHub and similar repositories), books, academic papers, multilingual content, and increasingly, synthetic data generated by earlier models. Raw data goes through an extensive pipeline of extraction, language identification, quality filtering, deduplication (using techniques like MinHash), toxicity removal, and data mixing. Meta disclosed the LLaMA 3 data mix: approximately 50% general knowledge, 25% math and reasoning, 17% code, and 8% multilingual data.
The data processing pipeline is critical. HuggingFace’s FineWeb dataset demonstrates that a single Common Crawl snapshot of 2.4 billion pages yields only a fraction of usable training tokens after filtering. FineWeb processed 96 snapshots to produce 15 trillion tokens, with each pipeline step measurably improving model quality.
AdamW is the standard optimizer for LLM pre-training. It maintains two state variables (first and second moment estimates) per parameter, requiring approximately 16 bytes per parameter total during training (weights + optimizer state + gradients in mixed precision). For a 405B model, this is approximately 6.5 TB of memory.
Mixed precision training using BF16 (and increasingly FP8) reduces memory usage and increases throughput. BF16 has become the standard for LLM training due to its wide dynamic range, which prevents overflow and underflow.
Distributed training uses 4D parallelism to split the work across thousands of GPUs: tensor parallelism (within a server), pipeline parallelism (across servers), data parallelism (across groups of servers), and context parallelism (for long sequences). Meta trained LLaMA 3.1 405B on 16,384 H100 GPUs, achieving 38-43% Model FLOPs Utilization.
Training costs for frontier models range from $5.6 million (DeepSeek-V3, compute only, using efficient MoE and FP8) to $78-191 million (GPT-4 and Gemini Ultra, compute estimates from Stanford AI Index 2025). Total development costs including research, failed runs, and engineering are much higher; Anthropic’s CEO said in mid-2024 that models then in development cost up to $1 billion to train.
Hardware reliability is a major challenge at scale. During LLaMA 3 405B training on 16,384 H100 GPUs, Meta experienced 419 unexpected failures in 54 days (one every 3 hours), with GPU and HBM3 memory issues causing over half of failures. Checkpointing, automated recovery, and redundancy enabled 90% effective training uptime.
Energy consumption is substantial. A 16,384-GPU H100 cluster draws approximately 11.5 MW for GPUs alone (20+ MW with cooling and infrastructure). Meta reported 11,390 tons of CO2e (location-based) across the LLaMA 3.1 training runs, with 8,930 tons attributed to the 405B model.
Pre-training produces a base model that has learned language patterns and knowledge but has not been taught to follow instructions or be helpful. The base model is the foundation; post-training (Chapter 15) shapes its behavior into a useful assistant.
What’s Next
You now understand how pre-training works: the next-token prediction objective, the data pipeline from raw web crawls to training-ready tokens, the mechanics of distributed training across thousands of GPUs, and the staggering costs in compute, energy, and engineering. In Chapter 15, we will explore what happens after pre-training: how supervised fine-tuning, reinforcement learning from human feedback (RLHF), and other alignment techniques transform a raw base model into a helpful, harmless, and honest assistant.