Chapter 11. Model Sizes and What They Mean
When someone says a model has “70 billion parameters,” what does that actually mean? How does that translate into real-world capabilities, memory requirements, and cost? In this chapter, we will break down exactly what parameters are, how they add up across the components you learned about in Chapters 5 through 10, and what the numbers look like for every major model family as of March 2026. We will also confront a reality of the current AI landscape: the biggest labs increasingly keep their architecture details secret, and we will explain why.
What Is a Parameter?
A parameter is a single number in a weight matrix. That is the entire definition. Every weight matrix you have seen in this book (embedding tables, attention projections, FFN layers, normalization scales) is made up of individual floating-point numbers, and each one of those numbers is a parameter.
When we say LLaMA 3 8B has “8 billion parameters,” we mean there are approximately 8 billion individual numbers stored in the model’s weight matrices. During training, each of these numbers is adjusted by gradient descent (Chapter 3) to minimize the loss function. During inference, these numbers are fixed: the model reads them from disk, loads them into GPU memory, and uses them to compute predictions.
Parameters are not code. They are not rules. They are just numbers, learned from data. The “intelligence” of a language model is entirely encoded in the specific values of its billions of parameters. Two models with identical architectures but different parameter values can behave completely differently, because the parameters determine what the model has learned.
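To make the definition concrete, here is a minimal sketch of a toy linear layer in plain Python. The helper names `linear` and `count_parameters` and the matrices `W_a` and `W_b` are made up for illustration and do not come from any real model. Both toy "models" share the same architecture (a 3 x 2 weight matrix, so 6 parameters) but hold different values, and they give different outputs for the same input.

```python
def linear(x, W):
    """Multiply an input vector x by a weight matrix W (rows = inputs, cols = outputs)."""
    n_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]


def count_parameters(W):
    """Every individual number in the matrix counts as one parameter."""
    return sum(len(row) for row in W)


# Two hypothetical weight matrices: same shape, different learned values.
W_a = [[0.5, -1.0], [0.0, 2.0], [1.5, 0.25]]
W_b = [[-0.5, 1.0], [2.0, 0.0], [0.25, 1.5]]

x = [1.0, 2.0, 3.0]
out_a = linear(x, W_a)
out_b = linear(x, W_b)

print("parameters:", count_parameters(W_a), count_parameters(W_b))
print("model A output:", out_a)
print("model B output:", out_b)
```

The parameter count depends only on the shape of the matrix, while the output depends on the values stored in it.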
Where Parameters Live
In Chapters 5 through 10, you learned about every component of a Transformer. Here is where the parameters are:
| Component | What the parameters are | Parameter count |
|---|---|---|
| Embedding table (Ch. 5) | One vector per vocabulary token | vocab_size x hidden_size |
| Attention Q projection (Ch. 8) | Matrix mapping hidden state to queries | hidden_size x (num_q_heads x head_dim) |
| Attention K projection (Ch. 8) | Matrix mapping hidden state to keys | hidden_size x (num_kv_heads x head_dim) |
| Attention V projection (Ch. 8) | Matrix mapping hidden state to values | hidden_size x (num_kv_heads x head_dim) |
| Attention O projection (Ch. 8) | Matrix combining head outputs | (num_q_heads x head_dim) x hidden_size |
| FFN W_gate (Ch. 9) | Gate projection for SwiGLU | hidden_size x intermediate_size |
| FFN W_up (Ch. 9) | Up projection for SwiGLU | hidden_size x intermediate_size |
| FFN W_down (Ch. 9) | Down projection for SwiGLU | intermediate_size x hidden_size |
| RMSNorm gamma (Ch. 10) | Scale parameter, one per dimension | hidden_size (x2 per layer) |
| Output projection | Matrix mapping hidden state to vocabulary | hidden_size x vocab_size |
The embedding table and output projection are model-wide (not per-layer). Everything else is repeated for every Transformer layer, with each layer holding its own independently learned weights of the same shapes. A model with 32 layers has 32 sets of attention weights, 32 sets of FFN weights, and 64 RMSNorm layers (2 per Transformer block), plus one final RMSNorm before the output projection.
Counting Parameters: A Complete Walkthrough
Let’s count every parameter in LLaMA 3 8B, the model we have been using as our primary example throughout this book. This will make the concept of “8 billion parameters” completely concrete.
LLaMA 3 8B Architecture
From Meta’s release (April 18, 2024):
vocab_size: 128,256
hidden_size: 4,096
num_hidden_layers: 32
num_attention_heads: 32
num_key_value_heads: 8
head_dim: 128
intermediate_size: 14,336
tie_word_embeddings: False
Step 1: Embedding Table
The embedding table maps each token ID to a vector of size hidden_size:
128,256 x 4,096 = 525,336,576 parameters (525.3M)
Step 2: Per-Layer Attention Parameters
LLaMA 3 8B uses Grouped Query Attention (Chapter 8) with 32 query heads and 8 KV heads:
W_Q: 4,096 x (32 x 128) = 4,096 x 4,096 = 16,777,216
W_K: 4,096 x (8 x 128) = 4,096 x 1,024 = 4,194,304
W_V: 4,096 x (8 x 128) = 4,096 x 1,024 = 4,194,304
W_O: (32 x 128) x 4,096 = 4,096 x 4,096 = 16,777,216
Total = 41,943,040 (41.9M)
Step 3: Per-Layer FFN Parameters
LLaMA 3 8B uses SwiGLU (Chapter 9) with three weight matrices:
W_gate: 4,096 x 14,336 = 58,720,256
W_up: 4,096 x 14,336 = 58,720,256
W_down: 14,336 x 4,096 = 58,720,256
Total = 176,160,768 (176.2M)
Step 4: Per-Layer RMSNorm Parameters
Two RMSNorm layers per block, each with hidden_size gamma parameters:
2 x 4,096 = 8,192 (0.008M)
Step 5: Per-Layer Total
Attention: 41,943,040
FFN: 176,160,768
RMSNorm: 8,192
Layer total: 218,112,000 (218.1M)
Step 6: All Layers
32 layers x 218,112,000 = 6,979,584,000 (6.98B)
Step 7: Model-Wide Components
Embedding table: 525,336,576
Final RMSNorm: 4,096
Output projection: 525,336,576 (separate from embedding, not tied)
Step 8: Grand Total
Layers: 6,979,584,000
Embedding: 525,336,576
Final RMSNorm: 4,096
Output projection: 525,336,576
─────────────
Total: 8,030,261,248 (~8.03 billion)
That is where the “8B” comes from. Every single one of those 8.03 billion numbers was learned during training on trillions of tokens of text.
Source: LLaMA 3 8B architecture from Meta (April 18, 2024). Configuration from HuggingFace Transformers: vocab_size=128,256, hidden_size=4,096, intermediate_size=14,336, num_attention_heads=32, num_key_value_heads=8, head_dim=128, num_hidden_layers=32, tie_word_embeddings=False.
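Two configuration choices in the walkthrough have a visible effect on the total: the untied output projection and the use of 8 KV heads instead of 32. The sketch below recomputes the total under two hypothetical variants that Meta did not ship (tied embeddings, and full multi-head attention with 32 KV heads). The function name `total_params` is ours, and the variants are what-if calculations, not real models.

```python
# LLaMA 3 8B configuration values from the walkthrough above.
VOCAB, HIDDEN, LAYERS = 128_256, 4_096, 32
N_Q, N_KV, HEAD_DIM, INTER = 32, 8, 128, 14_336


def total_params(n_kv_heads=N_KV, tie_embeddings=False):
    """Total parameter count for the LLaMA 3 8B shape, with two knobs to vary."""
    q_and_o = 2 * HIDDEN * N_Q * HEAD_DIM          # W_Q and W_O
    k_and_v = 2 * HIDDEN * n_kv_heads * HEAD_DIM   # W_K and W_V
    ffn = 3 * HIDDEN * INTER                       # W_gate, W_up, W_down
    norms = 2 * HIDDEN                             # two RMSNorm layers per block
    per_layer = q_and_o + k_and_v + ffn + norms
    embed = VOCAB * HIDDEN
    output = 0 if tie_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return LAYERS * per_layer + embed + output + final_norm


baseline = total_params()                     # as released
tied = total_params(tie_embeddings=True)      # hypothetical: shared embedding/output matrix
full_mha = total_params(n_kv_heads=N_Q)       # hypothetical: 32 KV heads, no GQA

print(f"As released:        {baseline:,}")
print(f"Tied embeddings:    {tied:,}  ({baseline - tied:,} fewer)")
print(f"Full MHA (32 KV):   {full_mha:,}  ({full_mha - baseline:,} more)")
```

Tying the embeddings would remove one 525M-parameter matrix, while dropping GQA would add about 0.8B parameters across the 32 layers.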
Parameter Distribution
Let’s visualize where those 8 billion parameters actually live:
import numpy as np
# LLaMA 3 8B parameter breakdown
vocab_size = 128_256
hidden_size = 4_096
num_layers = 32
num_q_heads = 32
num_kv_heads = 8
head_dim = 128
intermediate_size = 14_336
# Per-layer counts
attn_per_layer = (
    hidden_size * num_q_heads * head_dim +   # W_Q
    hidden_size * num_kv_heads * head_dim +  # W_K
    hidden_size * num_kv_heads * head_dim +  # W_V
    num_q_heads * head_dim * hidden_size     # W_O
)
ffn_per_layer = 3 * hidden_size * intermediate_size # W_gate, W_up, W_down
norm_per_layer = 2 * hidden_size # 2 RMSNorm layers
# Model-wide counts
embedding = vocab_size * hidden_size
output_proj = vocab_size * hidden_size # not tied
final_norm = hidden_size
# Totals
total_layers = num_layers * (attn_per_layer + ffn_per_layer + norm_per_layer)
total = embedding + total_layers + final_norm + output_proj
print("LLaMA 3 8B Parameter Breakdown")
print("=" * 55)
print(f"{'Component':<30} {'Parameters':>12} {'Share':>8}")
print("-" * 55)
print(f"{'Embedding table':<30} {embedding:>12,} {embedding/total:>8.1%}")
print(f"{'Attention (32 layers)':<30} {num_layers*attn_per_layer:>12,} {num_layers*attn_per_layer/total:>8.1%}")
print(f"{'FFN (32 layers)':<30} {num_layers*ffn_per_layer:>12,} {num_layers*ffn_per_layer/total:>8.1%}")
print(f"{'RMSNorm (all)':<30} {num_layers*norm_per_layer+final_norm:>12,} {(num_layers*norm_per_layer+final_norm)/total:>8.1%}")
print(f"{'Output projection':<30} {output_proj:>12,} {output_proj/total:>8.1%}")
print("-" * 55)
print(f"{'TOTAL':<30} {total:>12,}")
print(f"\nThat is {total/1e9:.2f} billion parameters.")
When you run this, you will see that the FFN layers dominate (about 70% of total parameters), followed by the attention layers (about 17%), with the embedding and output projection accounting for the remaining 13%. The RMSNorm parameters are negligible (less than 0.01%).
The Hidden Dimension: The Most Important Number
If you had to pick a single number that defines a model’s “size class,” it would be the hidden dimension (also called hidden_size, d_model, or model dimension). This is the width of the vector that represents each token as it flows through the Transformer. Every component in the model is sized relative to this number:
- The embedding table has hidden_size columns
- The attention projections map to and from hidden_size
- The FFN expands from hidden_size and contracts back to hidden_size
- The output projection maps from hidden_size to the vocabulary
A larger hidden dimension means each token is represented by a longer vector with more dimensions, which gives the model more capacity to encode nuanced information about each token. It also means every weight matrix in the model is larger, which is why the hidden dimension has such a dramatic effect on total parameter count.
Hidden Dimensions in Production Models
| Model | hidden_size | Category |
|---|---|---|
| GPT-2 Small | 768 | Tiny |
| Mistral 7B | 4,096 | Small |
| LLaMA 3 8B | 4,096 | Small |
| LLaMA 4 Scout / Maverick | 5,120 | Medium |
| DeepSeek-V3 | 7,168 | Medium-Large |
| LLaMA 3 70B | 8,192 | Large |
| Qwen 2.5 72B | 8,192 | Large |
| GPT-3 | 12,288 | Very Large |
| Mistral Large 2 | 12,288 | Very Large |
| LLaMA 3.1 405B | 16,384 | Frontier |
Sources: GPT-2 from Radford et al. (2019). Mistral 7B: hidden_size=4,096 (Mistral AI, September 2023). LLaMA 3 8B: hidden_size=4,096 (Meta, April 2024). LLaMA 4: hidden_size=5,120 (HuggingFace Transformers Llama4TextConfig). DeepSeek-V3: hidden_size=7,168 (arXiv:2412.19437). LLaMA 3 70B: hidden_size=8,192 (Meta, April 2024). Qwen 2.5 72B: hidden_size=8,192 (HuggingFace config.json). GPT-3: d_model=12,288 (Brown et al., 2020). Mistral Large 2: hidden_size=12,288 (Ollama model metadata, 88 layers, 96 attention heads). LLaMA 3.1 405B: hidden_size=16,384 (Meta, July 2024, config.json).
Why Parameters Scale Quadratically with Hidden Dimension
The total parameter count of a Transformer scales roughly as the square of the hidden dimension. Here is why:
The attention projections (W_Q, W_K, W_V, W_O) each have shape [hidden_size x something proportional to hidden_size]. For standard multi-head attention, W_Q and W_O are both [hidden_size x hidden_size], so each contributes hidden_size^2 parameters. The FFN matrices are [hidden_size x intermediate_size], and intermediate_size is typically 3 to 4 times hidden_size, so each FFN matrix contributes roughly 3.5 x hidden_size^2 parameters.
This means doubling the hidden dimension roughly quadruples the parameter count per layer. A model with hidden_size=8,192 has approximately 4x the parameters per layer as a model with hidden_size=4,096 (assuming the same number of heads and expansion ratio).
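Before looking at real models, the quadratic claim can be checked on a synthetic family in which every width scales with the hidden dimension. This is a sketch under stated assumptions: the function `per_layer_params` is ours, and the fixed head_dim of 128, the 4:1 ratio of query heads to KV heads, and the 3.5x FFN expansion are borrowed from LLaMA 3 8B and held constant, which real model families do not do exactly.

```python
def per_layer_params(hidden, head_dim=128, gqa_group=4, ffn_ratio=3.5):
    """Per-layer parameter count when all widths scale with `hidden` (assumed ratios)."""
    n_q = hidden // head_dim            # query heads grow with hidden size
    n_kv = n_q // gqa_group             # KV heads kept at a fixed fraction of query heads
    attn = 2 * hidden * n_q * head_dim + 2 * hidden * n_kv * head_dim  # W_Q, W_O, W_K, W_V
    ffn = 3 * hidden * int(hidden * ffn_ratio)                          # SwiGLU matrices
    norm = 2 * hidden                                                   # two RMSNorm layers
    return attn + ffn + norm


small = per_layer_params(4096)
large = per_layer_params(8192)
ratio = large / small

print(f"hidden=4096: {small:,} parameters per layer")
print(f"hidden=8192: {large:,} parameters per layer")
print(f"ratio: {ratio:.4f}")
```

The ratio lands just under 4 because the RMSNorm term grows linearly rather than quadratically, but that term is a tiny share of the layer.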
Let’s verify this with real numbers:
import numpy as np
def count_layer_params(hidden, n_q, n_kv, head_dim, inter):
    """Count parameters in one Transformer layer."""
    attn = (hidden * n_q * head_dim +    # W_Q
            hidden * n_kv * head_dim +   # W_K
            hidden * n_kv * head_dim +   # W_V
            n_q * head_dim * hidden)     # W_O
    ffn = 3 * hidden * inter             # SwiGLU: W_gate, W_up, W_down
    norm = 2 * hidden                    # 2 RMSNorm
    return attn, ffn, norm
# (name, hidden, n_q, n_kv, head_dim, intermediate_size, layers)
models = [
    ("LLaMA 3 8B", 4096, 32, 8, 128, 14336, 32),
    ("LLaMA 3 70B", 8192, 64, 8, 128, 28672, 80),
    ("LLaMA 3.1 405B", 16384, 128, 8, 128, 53248, 126),
]
print(f"{'Model':<18} {'hidden':>7} {'Layers':>7} {'Params/Layer':>14} {'Total Params':>14}")
print("-" * 65)
for name, h, nq, nkv, hd, inter, layers in models:
    attn, ffn, norm = count_layer_params(h, nq, nkv, hd, inter)
    per_layer = attn + ffn + norm
    # Approximate total (layers + embedding + output)
    vocab = 128_256
    total = layers * per_layer + 2 * vocab * h + h  # embed + output + final norm
    print(f"{name:<18} {h:>7,} {layers:>7} {per_layer:>14,} {total:>14,}")
    print(f"{'':18} {'':>7} {'':>7} ({per_layer/1e6:>8.1f}M) ({total/1e9:>8.2f}B)")
This code shows how the per-layer parameter count grows dramatically with hidden dimension. LLaMA 3.1 405B has roughly 15x the parameters per layer compared to LLaMA 3 8B, driven primarily by the 4x larger hidden dimension (16,384 vs. 4,096) and the correspondingly larger FFN. The ratio falls slightly short of the 16x that pure quadratic scaling predicts because the 405B model’s FFN expansion ratio is 3.25 (53,248 / 16,384) rather than 3.5, and its key and value projections still use only 8 KV heads.
Real Model Sizes as of March 2026
The landscape of language models in March 2026 spans from tiny models that run on a phone to frontier models that require entire data centers. Here is a comprehensive survey of the major model families, organized by size class.
Open-Weight Models (Architecture Details Published)
These models have publicly available weights and documented architectures. We can verify every number.
| Model | Total Params | Active Params | Layers | hidden_size | Architecture | Release |
|---|---|---|---|---|---|---|
| Mistral 7B | 7.3B | 7.3B (dense) | 32 | 4,096 | Dense Transformer | Sep 2023 |
| LLaMA 3 8B | 8.0B | 8.0B (dense) | 32 | 4,096 | Dense Transformer | Apr 2024 |
| Qwen 2.5 7B | 7.6B | 7.6B (dense) | 28 | 3,584 | Dense Transformer | Sep 2024 |
| Mixtral 8x7B | 46.7B | 12.9B | 32 | 4,096 | MoE (8 experts, top-2) | Dec 2023 |
| Qwen 2.5 72B | 72.7B | 72.7B (dense) | 80 | 8,192 | Dense Transformer | Sep 2024 |
| LLaMA 3 70B | 70.6B | 70.6B (dense) | 80 | 8,192 | Dense Transformer | Apr 2024 |
| Mistral Large 2 | 123B | 123B (dense) | 88 | 12,288 | Dense Transformer | Jul 2024 |
| Mixtral 8x22B | 141B | 39B | 56 | 6,144 | MoE (8 experts, top-2) | Apr 2024 |
| LLaMA 4 Scout | 109B | 17B | 48 | 5,120 | MoE (16 experts, top-1) | Apr 2025 |
| Qwen 3 235B-A22B | 235B | 22B | 94 | 4,096 | MoE (128 experts, top-8) | Apr 2025 |
| LLaMA 4 Maverick | 400B | 17B | 48 | 5,120 | MoE (128 experts, top-1) | Apr 2025 |
| LLaMA 3.1 405B | 405B | 405B (dense) | 126 | 16,384 | Dense Transformer | Jul 2024 |
| DeepSeek-V3 | 671B | 37B | 61 | 7,168 | MoE (256+1 experts, top-8) | Dec 2024 |
| Qwen 3.5 397B-A17B | 397B | 17B | 60 | 4,096 | MoE (512 experts, top-10), hybrid attention | Feb 2026 |
Sources: Mistral 7B from Mistral AI (September 27, 2023): 32 layers, hidden_size=4,096, intermediate_size=14,336, vocab_size=32,000 (HuggingFace config.json, Mistral-7B-v0.1). LLaMA 3 8B from Meta (April 18, 2024). Qwen 2.5 7B from Alibaba (September 2024): 28 layers, hidden_size=3,584, intermediate_size=18,944, vocab_size=152,064, 28 query heads, 4 KV heads (HuggingFace config.json). Mixtral 8x7B from Mistral AI (December 2023): 46.7B total, 12.9B active, 8 experts with top-2 routing. Qwen 2.5 72B from Alibaba (September 2024): 80 layers, hidden_size=8,192, intermediate_size=29,568, vocab_size=152,064 (HuggingFace config.json). LLaMA 3 70B from Meta (April 2024): 80 layers, hidden_size=8,192, intermediate_size=28,672, 64 query heads, 8 KV heads. Mistral Large 2 from Mistral AI (July 24, 2024): 123B parameters, 88 layers, hidden_size=12,288, 96 attention heads, 8 KV heads, intermediate_size=28,672, vocab_size=32,768, 128K context (Ollama model metadata). Mixtral 8x22B from Mistral AI (April 2024): 141B total, 39B active. LLaMA 4 Scout/Maverick from Meta (April 5, 2025): 48 layers, hidden_size=5,120, 17B active. Qwen 3 235B-A22B from Alibaba (April 29, 2025): 94 layers, hidden_size=4,096, intermediate_size=12,288, moe_intermediate_size=1,536, vocab_size=151,936, 64 query heads, 4 KV heads, 128 experts, top-8 routing (HuggingFace config.json). LLaMA 3.1 405B from Meta (July 23, 2024): 126 layers, hidden_size=16,384, intermediate_size=53,248, 128 query heads, 8 KV heads. DeepSeek-V3 from DeepSeek (December 26, 2024): 61 layers, hidden_size=7,168, 671B total, 37B active (arXiv:2412.19437). Qwen 3.5 397B-A17B from Alibaba (February 16, 2026): 60 layers, hidden_size=4,096, moe_intermediate_size=1,024, vocab_size=248,320, 32 query heads, 2 KV heads, head_dim=256, 512 experts, top-10 routing, hybrid linear/full attention architecture, natively multimodal, Apache 2.0 license (HuggingFace config.json).
Closed-Source Models (Architecture Details Not Published)
These models are only accessible through APIs. The companies behind them do not publish architecture details, parameter counts, or training data composition.
| Model | Estimated Size | What We Know | Release |
|---|---|---|---|
| GPT-4 | ~1.8T total (leaked) | Rumored MoE with 16 experts of ~111B each | Mar 2023 |
| GPT-4o | Not disclosed | Unified multimodal Transformer, 128K context | May 2024 |
| GPT-5 | Not disclosed | 400K context, three tiers (GPT-5, Mini, Nano) | Aug 2025 |
| Claude Sonnet 4 | Not disclosed | 200K context (expanded to 1M via API, August 12, 2025), extended thinking | May 2025 |
| Claude Sonnet 4.6 | Not disclosed | 1M context, near-Opus performance | Feb 2026 |
| Gemini 2.5 Pro | Not disclosed | 1M context, multimodal, advanced reasoning | Mar 2025 |
| Grok 3 | ~3T (estimated) | 1M context (marketed; practical API limit ~131K), MoE architecture, trained on 200K H100 GPUs | Feb 2025 |
The GPT-4 architecture details come from unverified leaks in mid-2023, which claimed approximately 1.8 trillion total parameters across 120 layers using a Mixture-of-Experts design with 16 experts of roughly 111 billion parameters each. OpenAI has never confirmed or denied these numbers. For GPT-5 and later models, OpenAI has not disclosed architecture details. Grok 3’s parameter count of approximately 3 trillion comes from Elon Musk’s statement to Ron Baron in November 2025, where he said “Grok-3 and -4 are based on a 3 trillion parameter model.” xAI has not published official architecture details. The marketed 1 million token context window appears to have a practical API ceiling of approximately 131,000 tokens, based on developer reports and API documentation.
Sources: GPT-4 leaked details from multiple reports (July 2023), unverified. GPT-4 technical report (arXiv:2303.08774, March 2023) explicitly states: “this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” GPT-4o released May 2024, 128K context. GPT-5 released August 7, 2025, per OpenAI; 400K context, three tiers (GPT-5, Mini, Nano). Claude Sonnet 4 released May 22, 2025, per Anthropic; 200K context (expanded to 1M on August 12, 2025, per Anthropic announcement). Claude Sonnet 4.6 released February 17, 2026, per Anthropic; 1M context. Gemini 2.5 Pro from Google DeepMind (experimental March 25, 2025; GA June 17, 2025), 1M context. Grok 3 from xAI (February 17, 2025), trained on Colossus supercluster with 200,000 H100 GPUs; ~3T parameter estimate per Elon Musk’s statement to Ron Baron (November 2025, reported by Benzinga and LifeArchitect.ai); 1M context marketed, practical API limit ~131K tokens per developer reports and Oracle API documentation.
Models Announced but Not Released
| Model | Reported Size | Status |
|---|---|---|
| LLaMA 4 Behemoth | ~2T total, 288B active, 16 experts | Effectively shelved; Meta shifted focus to “Avocado” proprietary model |
Source: LLaMA 4 Behemoth announced by Meta in April 2025 with approximately 2 trillion total parameters and 288 billion active parameters across 16 experts. One source (Glenn K. Lockwood) describes it as “aborted before it was ever released due to poor performance.” Multiple reports from mid-2025 describe repeated delays (from summer to fall 2025 and beyond) due to performance falling short of expectations. As of March 2026, Meta has shifted focus to a proprietary model codenamed “Avocado,” which itself has been delayed from a planned March 2026 debut to at least May 2026 after internal benchmarks showed it trailing competitors from Google and OpenAI (per The New York Times, March 12, 2026, and Reuters).
Why Major Labs Don’t Publish Architecture Details
If you look at the table of closed-source models above, you will notice a pattern: none of the major commercial AI labs (OpenAI, Anthropic, Google DeepMind) publish the architecture details of their frontier models. This was not always the case. OpenAI published the full architecture of GPT-2 (2019) and GPT-3 (2020), including parameter counts, layer counts, hidden dimensions, and training details. Google published the original Transformer paper (2017) with complete architecture specifications.
The shift toward secrecy began around 2022-2023, driven by two factors:
Competitive pressure: As the commercial value of frontier models became clear, companies began treating architecture details as trade secrets. If a competitor knows your exact architecture, they can replicate it more easily. OpenAI’s GPT-4 technical report (March 2023) explicitly stated: “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”
Safety concerns: Some labs argue that publishing detailed architecture information makes it easier for bad actors to build dangerous systems. This argument is controversial; many researchers believe that openness enables better safety research.
The result is a two-tier landscape. Open-weight models (LLaMA, Mistral, DeepSeek, Qwen) publish their architectures and release their weights, allowing anyone to inspect, modify, and run them. Closed-source models (GPT-4/5, Claude, Gemini) are accessible only through APIs, and their internal details are unknown or based on unverified leaks.
For this book, we focus primarily on open-weight models when discussing specific architecture details, because those are the numbers we can verify. When we reference closed-source models, we clearly distinguish between confirmed facts (context window sizes, API pricing, benchmark scores) and unverified claims (parameter counts, architecture details).
Source: GPT-4 Technical Report, arXiv:2303.08774, March 2023. The quote about withholding architecture details appears on page 2 of the report.
Dense vs. MoE: Total Parameters vs. Active Parameters
One of the most important distinctions in the model size table above is between total parameters and active parameters. This distinction matters because of Mixture-of-Experts (MoE) architectures, which we will cover in depth in Chapter 12.
In a dense model like LLaMA 3 8B or LLaMA 3.1 405B, every parameter is used for every token. When the model processes a token, all 8 billion (or 405 billion) parameters participate in the computation. Total parameters equals active parameters.
In a MoE model like LLaMA 4 Maverick or DeepSeek-V3, only a fraction of the parameters are used for any given token. The model has many “expert” FFN blocks (Chapter 9), but a router selects only a few of them for each token. The rest sit idle.
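The selection step can be pictured with a minimal sketch, assuming a simplified router that just keeps the highest-scoring experts for one token. The function `route` and the score values are hypothetical. Real routers also normalize the scores and weight each chosen expert's output, details that belong to Chapter 12.

```python
def route(router_scores, top_k):
    """Return the indices of the top_k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return sorted(ranked[:top_k])


# Hypothetical router scores for a layer with 8 experts.
scores = [0.1, 2.3, -0.7, 1.9, 0.0, 0.4, -1.2, 0.8]

chosen = route(scores, top_k=2)
idle = len(scores) - len(chosen)

print("experts used for this token:", chosen)
print("experts left idle:", idle)
```

A different token would produce different scores and therefore a different pair of experts, which is how all experts end up being used across a long sequence.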
Consider LLaMA 4 Maverick:
- Total parameters: 400 billion (all the weights stored on disk and loaded into memory)
- Active parameters: 17 billion (the weights actually used to process each token)
The 400B total includes 128 routed expert FFN blocks per MoE layer, but only 1 routed expert (plus 1 shared expert) is activated per token. The attention layers and shared experts are always active, contributing to the 17B active count.
This distinction has major practical implications:
- Memory: You need enough GPU memory to hold all 400B parameters, even though only 17B are used per token. At float16 precision, that is approximately 800 GB just for the weights.
- Compute: The computational cost per token is proportional to the active parameters (17B), not the total parameters (400B). This is why MoE models can achieve “big model quality at small model cost.”
- Quality: The model’s knowledge capacity is related to the total parameters (400B), because different experts can store different knowledge. The model has access to all 400B parameters’ worth of knowledge, even though it only uses 17B parameters’ worth of computation per token.
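The memory and compute bullets can be turned into a rough calculation. The sketch below assumes the common approximation of about 2 floating-point operations per active parameter per token for a forward pass (one multiply and one add), which ignores attention-score computation. The function name `moe_footprint` is ours.

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_param=2):
    """Return (weight memory in GB, approximate forward GFLOPs per token).

    Memory follows total parameters; compute follows active parameters.
    Assumes ~2 FLOPs per active parameter per token (a rough approximation).
    """
    weight_memory_gb = total_params_b * bytes_per_param
    gflops_per_token = 2 * active_params_b
    return weight_memory_gb, gflops_per_token


maverick = moe_footprint(400.0, 17.0)     # MoE: 400B total, 17B active
llama_405b = moe_footprint(405.0, 405.0)  # dense: every parameter is active

print(f"LLaMA 4 Maverick: {maverick[0]:.0f} GB weights, ~{maverick[1]:.0f} GFLOPs/token")
print(f"LLaMA 3.1 405B:   {llama_405b[0]:.0f} GB weights, ~{llama_405b[1]:.0f} GFLOPs/token")
print(f"Compute ratio: {llama_405b[1] / maverick[1]:.1f}x")
```

Under these assumptions the two models need nearly the same weight memory, while the dense model does roughly 24x more arithmetic per token.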
import numpy as np
# Compare dense vs MoE models
models = [
    ("LLaMA 3 8B", 8.0, 8.0, "Dense"),
    ("LLaMA 3 70B", 70.6, 70.6, "Dense"),
    ("LLaMA 3.1 405B", 405.0, 405.0, "Dense"),
    ("Mixtral 8x7B", 46.7, 12.9, "MoE"),
    ("Mixtral 8x22B", 141.0, 39.0, "MoE"),
    ("LLaMA 4 Scout", 109.0, 17.0, "MoE"),
    ("Qwen 3 235B-A22B", 235.0, 22.0, "MoE"),
    ("LLaMA 4 Maverick", 400.0, 17.0, "MoE"),
    ("Qwen 3.5 397B", 397.0, 17.0, "MoE"),
    ("DeepSeek-V3", 671.0, 37.0, "MoE"),
]
print(f"{'Model':<22} {'Total':>8} {'Active':>8} {'Ratio':>8} {'Type':<6}")
print("-" * 58)
for name, total, active, arch in models:
    ratio = active / total
    print(f"{name:<22} {total:>7.1f}B {active:>7.1f}B {ratio:>7.1%} {arch:<6}")
print("\nKey insight: MoE models store far more knowledge (total params)")
print("while using similar compute per token (active params) as smaller dense models.")
print(f"\nDeepSeek-V3 has {671/37:.0f}x more total params than active params.")
print(f"LLaMA 4 Maverick has {400/17:.0f}x more total params than active params.")
The output reveals a striking pattern: MoE models achieve enormous total parameter counts (and thus knowledge capacity) while keeping active parameters comparable to much smaller dense models. DeepSeek-V3 has 671B total parameters but only 37B active, meaning it uses roughly the same compute per token as a 37B dense model while storing about 18x as many parameters. Both LLaMA 4 Maverick and Qwen 3.5 converge on the same design point: approximately 400B total parameters with only 17B active, which suggests this ratio may have become a sweet spot for frontier MoE models.
Memory Requirements: From Bytes to Terabytes
Understanding how model size translates to memory requirements is essential for anyone who wants to run, deploy, or even just understand the infrastructure behind language models. The calculation is straightforward once you know the precision format.
Bytes Per Parameter
Every parameter is stored as a number in some numeric format, either floating-point or, after quantization, a low-bit integer. The format determines how many bytes each parameter occupies:
| Format | Bytes per Parameter | Bits | Typical Use |
|---|---|---|---|
| float32 (FP32) | 4 bytes | 32 bits | Training (legacy) |
| bfloat16 (BF16) | 2 bytes | 16 bits | Training (modern standard) |
| float16 (FP16) | 2 bytes | 16 bits | Inference |
| int8 (INT8) | 1 byte | 8 bits | Quantized inference |
| int4 (INT4) | 0.5 bytes | 4 bits | Aggressive quantization |
The formula for model weight memory is:
Memory (bytes) = number_of_parameters x bytes_per_parameter
Or equivalently:
Memory (GB) = parameters_in_billions x bytes_per_parameter
This works because 1 GB is 1 billion bytes. Tools that report memory in binary units use GiB (1 GiB = 1,073,741,824 bytes), which makes the same weights look about 7% smaller, but the approximation is close enough for practical purposes.
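As a quick check of the formula, the sketch below converts the exact LLaMA 3 8B parameter count from the walkthrough into both decimal gigabytes (10^9 bytes) and binary gibibytes (2^30 bytes). The helper name `weight_memory` is ours.

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte (1,073,741,824 bytes)


def weight_memory(num_params, bytes_per_param):
    """Return weight memory as (decimal GB, binary GiB)."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / GB, total_bytes / GIB


# Exact LLaMA 3 8B parameter count, stored at 2 bytes per parameter (FP16/BF16).
gb, gib = weight_memory(8_030_261_248, 2)

print(f"{gb:.2f} GB (decimal)")
print(f"{gib:.2f} GiB (binary)")
```

The two figures differ by about 7%, which is worth remembering when a tool reports memory in GiB and a spec sheet quotes GB.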
Memory Calculations for Real Models
Let’s compute the weight memory for several models at different precisions:
import numpy as np
models = [
    ("Mistral 7B", 7.3),
    ("LLaMA 3 8B", 8.0),
    ("LLaMA 3 70B", 70.6),
    ("LLaMA 4 Scout", 109.0),
    ("Qwen 3 235B-A22B", 235.0),
    ("LLaMA 4 Maverick", 400.0),
    ("Qwen 3.5 397B", 397.0),
    ("LLaMA 3.1 405B", 405.0),
    ("DeepSeek-V3", 671.0),
]
precisions = [
    ("FP32 (4B)", 4),
    ("FP16/BF16 (2B)", 2),
    ("INT8 (1B)", 1),
    ("INT4 (0.5B)", 0.5),
]
print(f"{'Model':<22}", end="")
for name, _ in precisions:
    print(f" {name:>16}", end="")
print()
print("-" * 90)
for model_name, params_b in models:
    print(f"{model_name:<22}", end="")
    for prec_name, bytes_per in precisions:
        mem_gb = params_b * bytes_per
        if mem_gb >= 1000:
            print(f" {mem_gb/1000:>13.1f} TB", end="")
        else:
            print(f" {mem_gb:>13.1f} GB", end="")
    print()
print("\nNote: These are WEIGHT-ONLY memory requirements.")
print("Actual GPU memory usage is higher due to KV cache,")
print("activation memory, and framework overhead.")
Some key observations from this table:
A 70B model in float16 requires approximately 140 GB of GPU memory just for the weights. This exceeds the memory of a single NVIDIA H100 GPU (80 GB), so the model must be split across at least 2 GPUs.
LLaMA 4 Maverick at float16 requires approximately 800 GB for weights alone. Even with 8x H100 GPUs (640 GB total), you would need quantization or more GPUs.
DeepSeek-V3 at float16 requires approximately 1.34 TB. This is why frontier MoE models require multi-node GPU clusters for deployment.
INT4 quantization reduces memory by 4x compared to float16, making it possible to run a 70B model on a single high-end GPU (about 35 GB for weights). This is why quantization (covered in Chapter 24) is so important for practical deployment.
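The GPU counts in these observations follow from a simple ceiling division. The sketch below assumes 80 GB per GPU (the H100 figure quoted above) and counts weights only, so real deployments need headroom beyond these minimums. The function name `min_gpus_for_weights` is ours.

```python
import math


def min_gpus_for_weights(params_b, bytes_per_param, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory holds the weights alone."""
    weight_gb = params_b * bytes_per_param
    return math.ceil(weight_gb / gpu_memory_gb)


cases = [
    ("LLaMA 3 70B, FP16", 70.6, 2),
    ("LLaMA 3 70B, INT4", 70.6, 0.5),
    ("LLaMA 4 Maverick, FP16", 400.0, 2),
    ("DeepSeek-V3, FP16", 671.0, 2),
]

for label, params_b, bytes_per in cases:
    gpus = min_gpus_for_weights(params_b, bytes_per)
    print(f"{label:<26} {params_b * bytes_per:>8.1f} GB -> at least {gpus} x 80 GB GPUs")
```

These are lower bounds: KV cache, activations, and framework overhead all come on top of the weights.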
Beyond Weights: Total Memory
The weight memory calculation above is only part of the story. During inference, the model also needs memory for:
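KV cache (Chapter 18): Stores the key and value vectors for all previous tokens in the sequence. For LLaMA 3 70B (80 layers, 8 KV heads, head_dim 128), a single 4,096-token sequence needs about 1.3 GB at float16. Without Grouped Query Attention (64 KV heads) the same sequence would need about 10.7 GB, and serving several sequences at once multiplies the cost by the batch size.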
Activation memory: Temporary storage for intermediate computations during the forward pass. This is typically smaller than the weights but can be significant for long sequences.
Framework overhead: PyTorch, CUDA, and other software consume GPU memory for internal bookkeeping. This is typically 1-3 GB.
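The KV cache entry can be estimated directly from the architecture numbers. The sketch below uses the LLaMA 3 70B configuration quoted in this chapter (80 layers, 8 KV heads, head_dim 128) and compares it with a hypothetical variant that keeps one KV head per query head (64 KV heads). The function name `kv_cache_bytes` is ours, and the result covers one sequence at float16 unless a batch size is given.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: one key and one value vector per token, per KV head, per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# LLaMA 3 70B with Grouped Query Attention (8 KV heads).
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)

# Hypothetical variant without GQA: 64 KV heads, one per query head.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)

print(f"GQA (8 KV heads):  {gqa / 1e9:.2f} GB per 4,096-token sequence")
print(f"MHA (64 KV heads): {mha / 1e9:.2f} GB per 4,096-token sequence")
print(f"GQA, batch of 8:   {8 * gqa / 1e9:.2f} GB")
```

The cache grows linearly with both sequence length and batch size, so long contexts or many concurrent requests can push it well past the single-sequence figure.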
A practical rule of thumb: for inference at float16 precision, budget approximately 2.5 GB per billion parameters (rather than the theoretical 2 GB) to account for KV cache and overhead. For training, the memory requirement is much higher (roughly 4-8x the weight memory) because you also need to store gradients, optimizer states, and activation checkpoints.
Source: Memory calculation rule of thumb from multiple sources including Modal (“approximately 2GB of GPU memory per 1B parameters in FP16”) and Spheron Network (“A Llama 3.1 70B model’s weights consume approximately 140 GB at FP16, but the total memory footprint in production can exceed 200 GB”).
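The upper end of the training multiplier can be reproduced with a per-parameter byte count. The sketch below assumes a common mixed-precision setup with the Adam optimizer: 16-bit weights and gradients, a 32-bit master copy of the weights, and two 32-bit optimizer states. The function name `training_bytes_per_param` and this exact configuration are assumptions for illustration; other optimizers and sharding schemes change the numbers, and activation memory is not included.

```python
def training_bytes_per_param(weight_bytes=2, grad_bytes=2,
                             master_bytes=4, adam_state_bytes=8):
    """Bytes of training state per parameter (excluding activations).

    Defaults assume mixed-precision Adam: 16-bit weights and gradients,
    a 32-bit master copy, and two 32-bit Adam moment estimates.
    """
    return weight_bytes + grad_bytes + master_bytes + adam_state_bytes


per_param = training_bytes_per_param()
multiple_of_fp16_weights = per_param / 2      # relative to 2-byte inference weights
train_gb_8b = 8.03 * per_param                # LLaMA 3 8B, in decimal GB

print(f"{per_param} bytes per parameter")
print(f"{multiple_of_fp16_weights:.0f}x the FP16 weight memory")
print(f"LLaMA 3 8B: about {train_gb_8b:.0f} GB before activations")
```

Under these assumptions an 8B model that fits on one GPU for inference needs well over 100 GB of state to train, before any activation memory is counted.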
How Size Relates to Capability
A natural question: does a bigger model always mean a better model? The short answer is “usually, but not always.” The relationship between model size and capability is real but nuanced.
The General Trend
Within the same model family and training approach, larger models consistently outperform smaller ones. LLaMA 3 70B is substantially more capable than LLaMA 3 8B across virtually every benchmark. LLaMA 3.1 405B outperforms LLaMA 3 70B. This is not surprising: more parameters means more capacity to store knowledge, recognize patterns, and perform complex reasoning.
But model size is only one factor. The other critical factors are:
Training data quantity and quality: A smaller model trained on more (or better) data can outperform a larger model trained on less data. The Chinchilla scaling laws (Chapter 13) formalize this relationship.
Architecture efficiency: MoE models achieve better performance per active parameter than dense models, because they can store more knowledge in their total parameters while keeping compute costs low. LLaMA 4 Maverick (17B active, 400B total) competes with models that have far more active parameters.
Training techniques: Reinforcement learning from human feedback (RLHF, Chapter 15), extended thinking (Chapter 16), and other post-training techniques can dramatically improve a model’s capabilities without changing its parameter count.
Distillation: Smaller models can be trained to mimic the behavior of larger models, transferring knowledge from a “teacher” to a “student.” This is why some small models punch above their weight class.
The Diminishing Returns Problem
Each doubling of model size produces smaller improvements than the previous doubling. Going from 1B to 8B parameters produces a dramatic improvement in capability. Going from 8B to 70B produces a significant but smaller improvement. Going from 70B to 405B produces a noticeable but even smaller improvement. This pattern of diminishing returns is one of the central challenges in scaling AI, and we will explore it in detail in Chapter 13.
Size Classes in Practice
As of March 2026, the industry has settled into rough size classes, each with different use cases:
| Size Class | Parameter Range | Typical Use | Example Models |
|---|---|---|---|
| Tiny | 0.5B - 3B | On-device, edge, mobile | Qwen 2.5 0.5B/1.5B/3B, Qwen 3 0.6B/1.7B |
| Small | 7B - 14B | Single-GPU inference, local use | Mistral 7B, LLaMA 3 8B, Qwen 2.5 14B, Qwen 3 8B/14B |
| Medium | 30B - 72B | Multi-GPU inference, enterprise | LLaMA 3 70B, Qwen 2.5 72B, Qwen 3 32B |
| Large | 100B - 400B | GPU cluster, API serving | LLaMA 4 Maverick, Qwen 3.5 397B, Mistral Large 2 |
| Frontier | 400B+ | Data center scale | LLaMA 3.1 405B, DeepSeek-V3, GPT-5 |
The “small” category (7B-14B) has become the sweet spot for local deployment. These models can run on a single consumer GPU with quantization, and modern 7B-8B models are remarkably capable for their size. The “medium” category (30B-72B) offers a significant step up in quality and is the standard for enterprise deployments. The “frontier” category represents the cutting edge, accessible primarily through cloud APIs due to the enormous hardware requirements.
Hands-On: Computing Model Sizes
Let’s build a general-purpose parameter counter that works for any Transformer model:
def count_params(
    vocab_size,
    hidden_size,
    num_layers,
    num_q_heads,
    num_kv_heads,
    head_dim,
    intermediate_size,
    tie_embeddings=False,
    num_experts=1,
    num_experts_per_tok=1,
    num_shared_experts=0,
    moe_intermediate_size=None,
    dense_layers=0,
):
    """Count total and active parameters for a Transformer model.

    Supports both dense and MoE architectures.
    """
    # Embedding
    embed = vocab_size * hidden_size
    output_proj = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    # Per-layer attention (same for dense and MoE layers): W_Q, W_K, W_V, W_O
    attn = (hidden_size * num_q_heads * head_dim +
            hidden_size * num_kv_heads * head_dim +
            hidden_size * num_kv_heads * head_dim +
            num_q_heads * head_dim * hidden_size)
    norm = 2 * hidden_size
    # Dense FFN (gate, up, down)
    dense_ffn = 3 * hidden_size * intermediate_size
    # MoE FFN (per expert)
    moe_inter = moe_intermediate_size or intermediate_size
    expert_ffn = 3 * hidden_size * moe_inter
    # Count layers
    moe_layers = num_layers - dense_layers
    # Total params
    dense_layer_params = attn + dense_ffn + norm
    # MoE layer: attention + norm + (num_experts routed + shared) experts + router
    router_params = hidden_size * num_experts if num_experts > 1 else 0
    shared_ffn = num_shared_experts * expert_ffn
    moe_layer_params = attn + norm + num_experts * expert_ffn + shared_ffn + router_params
    total = embed + output_proj + final_norm
    total += dense_layers * dense_layer_params
    total += moe_layers * (moe_layer_params if num_experts > 1 else dense_layer_params)
    # Active params per token
    active_expert_ffn = num_experts_per_tok * expert_ffn + shared_ffn
    active_layer = attn + norm + (active_expert_ffn if num_experts > 1 else dense_ffn)
    active_dense_layer = attn + dense_ffn + norm
    active = embed + output_proj + final_norm
    active += dense_layers * active_dense_layer
    active += moe_layers * (active_layer if num_experts > 1 else active_dense_layer)
    return total, active
# Real model configurations
configs = {
    "LLaMA 3 8B": dict(
        vocab_size=128_256, hidden_size=4096, num_layers=32,
        num_q_heads=32, num_kv_heads=8, head_dim=128,
        intermediate_size=14336,
    ),
    "LLaMA 3 70B": dict(
        vocab_size=128_256, hidden_size=8192, num_layers=80,
        num_q_heads=64, num_kv_heads=8, head_dim=128,
        intermediate_size=28672,
    ),
    "LLaMA 3.1 405B": dict(
        vocab_size=128_256, hidden_size=16384, num_layers=126,
        num_q_heads=128, num_kv_heads=16, head_dim=128,
        intermediate_size=53248,
    ),
    "Qwen 2.5 72B": dict(
        vocab_size=152_064, hidden_size=8192, num_layers=80,
        num_q_heads=64, num_kv_heads=8, head_dim=128,
        intermediate_size=29568,
    ),
    "Qwen 2.5 7B": dict(
        vocab_size=152_064, hidden_size=3584, num_layers=28,
        num_q_heads=28, num_kv_heads=4, head_dim=128,
        intermediate_size=18944,
    ),
    "Mistral 7B": dict(
        vocab_size=32_000, hidden_size=4096, num_layers=32,
        num_q_heads=32, num_kv_heads=8, head_dim=128,
        intermediate_size=14336,
    ),
    "Mistral Large 2": dict(
        vocab_size=32_768, hidden_size=12288, num_layers=88,
        num_q_heads=96, num_kv_heads=8, head_dim=128,
        intermediate_size=28672,
    ),
}
print(f"{'Model':<20} {'Total Params':>14} {'Active Params':>14}")
print("-" * 52)
for name, cfg in configs.items():
    total, active = count_params(**cfg)
    print(f"{name:<20} {total/1e9:>13.2f}B {active/1e9:>13.2f}B")
print("\n--- MoE Models ---")
# LLaMA 4 Maverick (alternating dense/MoE layers, interleave_moe_layer_step=2)
# 48 layers total: 24 dense, 24 MoE
total_mav, active_mav = count_params(
    vocab_size=202_048, hidden_size=5120, num_layers=48,
    num_q_heads=40, num_kv_heads=8, head_dim=128,
    intermediate_size=16384,  # dense layer FFN
    num_experts=128, num_experts_per_tok=1,
    num_shared_experts=1,
    moe_intermediate_size=8192,
    dense_layers=24,  # every other layer is dense (interleave step = 2)
)
print(f"{'LLaMA 4 Maverick':<20} {total_mav/1e9:>13.2f}B {active_mav/1e9:>13.2f}B")
print("  (Approximate; actual alternating dense/MoE pattern per interleave_moe_layer_step=2)")
# Qwen 3 235B-A22B (all layers are MoE, decoder_sparse_step=1)
total_q3, active_q3 = count_params(
    vocab_size=151_936, hidden_size=4096, num_layers=94,
    num_q_heads=64, num_kv_heads=4, head_dim=128,
    intermediate_size=12288,  # dense FFN (not used; all layers are MoE)
    num_experts=128, num_experts_per_tok=8,
    moe_intermediate_size=1536,
    dense_layers=0,
)
print(f"{'Qwen 3 235B-A22B':<20} {total_q3/1e9:>13.2f}B {active_q3/1e9:>13.2f}B")
print("\nMemory at FP16 (weights only):")
for name, cfg in configs.items():
    total, _ = count_params(**cfg)
    mem_gb = total * 2 / 1e9  # 2 bytes per parameter at FP16
    print(f"  {name:<20} {mem_gb:>8.1f} GB")
This code computes parameter counts for real model configurations and shows the memory requirements. You can modify the configurations to explore how changing the hidden dimension, number of layers, or number of experts affects the total parameter count.
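As one way to explore the effect of the hidden dimension, the sketch below sweeps `hidden_size` for a single LLaMA-style layer. The helper `layer_params` is hypothetical and bakes in several assumptions: `head_dim=128`, 8 key/value heads, one query head per 128 hidden dimensions, and an FFN intermediate size of 3.5x the hidden size (the ratio used by the LLaMA 3 8B and 70B configurations above).

```python
def layer_params(hidden_size, head_dim=128, num_kv_heads=8):
    """Parameters in one dense Transformer layer under the stated assumptions."""
    num_q_heads = hidden_size // head_dim        # assumption: one query head per 128 dims
    intermediate_size = hidden_size * 7 // 2     # assumption: FFN width is 3.5x hidden
    # Attention: W_Q and W_O are hidden x (q_heads * head_dim); W_K and W_V use the KV heads.
    attn = 2 * hidden_size * num_q_heads * head_dim + 2 * hidden_size * num_kv_heads * head_dim
    # Gated FFN: gate, up, and down projections.
    ffn = 3 * hidden_size * intermediate_size
    # Two RMSNorm scale vectors per layer.
    norm = 2 * hidden_size
    return attn + ffn + norm


previous = None
for hidden in [2048, 4096, 8192, 16384]:
    params = layer_params(hidden)
    growth = f"{params / previous:.2f}x" if previous else "-"
    print(f"hidden_size={hidden:>6}  per-layer params={params / 1e6:>9.1f}M  growth={growth}")
    previous = params
```

Each doubling of `hidden_size` multiplies the per-layer count by a bit less than 4, because the key and value projections grow only linearly when the number of KV heads is held fixed.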
Visualizing the Model Size Landscape
Let’s create a visualization that shows how model sizes have evolved and how dense and MoE models compare:
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
# Model data: (name, total_params_B, active_params_B, year, is_moe)
models = [
    ("GPT-2\n(117M)", 0.117, 0.117, 2019, False),
    ("GPT-3\n(175B)", 175.0, 175.0, 2020, False),
    ("Mistral 7B", 7.3, 7.3, 2023, False),
    ("Mixtral\n8x7B", 46.7, 12.9, 2023, True),
    ("LLaMA 3\n8B", 8.0, 8.0, 2024, False),
    ("LLaMA 3\n70B", 70.6, 70.6, 2024, False),
    ("Mixtral\n8x22B", 141.0, 39.0, 2024, True),
    ("LLaMA 3.1\n405B", 405.0, 405.0, 2024, False),
    ("DeepSeek\n-V3", 671.0, 37.0, 2024, True),
    ("Qwen 3\n235B", 235.0, 22.0, 2025, True),
    ("LLaMA 4\nScout", 109.0, 17.0, 2025, True),
    ("LLaMA 4\nMaverick", 400.0, 17.0, 2025, True),
    ("Qwen 3.5\n397B", 397.0, 17.0, 2026, True),
]
fig, ax = plt.subplots(figsize=(14, 7))
for name, total, active, year, is_moe in models:
    color = '#e74c3c' if is_moe else '#3498db'
    marker = 'D' if is_moe else 'o'
    # Plot total params
    ax.scatter(year, total, s=120, c=color, marker=marker, zorder=5,
               edgecolors='black', linewidth=0.5)
    # For MoE models, also show active params with a connected line
    if is_moe:
        ax.scatter(year, active, s=60, c=color, marker=marker, alpha=0.4, zorder=4)
        ax.plot([year, year], [active, total], color=color, alpha=0.4,
                linewidth=1.5, linestyle='--')
    # Label each point with the model name
    ax.annotate(name, (year, total), textcoords="offset points",
                xytext=(8, 5), fontsize=7, ha='left')
ax.set_yscale('log')
ax.set_xlabel('Release Year', fontsize=12)
ax.set_ylabel('Parameters (Billions, log scale)', fontsize=12)
ax.set_title('LLM Model Sizes: Dense vs. MoE (2019-2026)', fontsize=14)
ax.set_xlim(2018.5, 2026.8)
ax.set_ylim(0.05, 1500)
ax.grid(True, alpha=0.3, which='both')
# Legend built from proxy handles, one per marker style
legend_elements = [
    Line2D([0], [0], marker='o', color='w', markerfacecolor='#3498db',
           markersize=10, label='Dense (total = active)'),
    Line2D([0], [0], marker='D', color='w', markerfacecolor='#e74c3c',
           markersize=10, label='MoE (total params)'),
    Line2D([0], [0], marker='D', color='w', markerfacecolor='#e74c3c',
           markersize=8, alpha=0.4, label='MoE (active params)'),
]
ax.legend(handles=legend_elements, loc='upper left', fontsize=10)
plt.tight_layout()
plt.savefig('model_sizes.png', dpi=150, bbox_inches='tight')
plt.show()
print("Plot saved to model_sizes.png")
This visualization reveals two important trends:
The MoE revolution: Starting in late 2023 with Mixtral, the industry shifted toward MoE architectures that decouple total parameters from active parameters. The gap between total and active parameters (shown by the dashed lines) has grown dramatically, with DeepSeek-V3 having 18x more total than active parameters.
The plateau in active parameters: While total parameter counts have continued to grow (from 175B for GPT-3 to 671B for DeepSeek-V3), the active parameter counts of recent MoE models have actually decreased, from 37B for DeepSeek-V3 to 22B for Qwen 3 235B and 17B for LLaMA 4 Maverick. Maverick's 17B active parameters are less than a quarter of LLaMA 3 70B's total. The industry has learned that you can achieve frontier-level quality by having many experts (large total parameters) while keeping per-token compute low (small active parameters).
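To make the gap between total and active parameters concrete, here is a minimal sketch that computes the total-to-active ratio for the MoE models in the plot data above. The dictionary name `moe_models` and the helper `sparsity_ratio` are illustrative names, not part of any library.

```python
# (total, active) parameter counts in billions, copied from the plot data above.
moe_models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "Mixtral 8x22B": (141.0, 39.0),
    "DeepSeek-V3": (671.0, 37.0),
    "Qwen 3 235B": (235.0, 22.0),
    "LLaMA 4 Scout": (109.0, 17.0),
    "LLaMA 4 Maverick": (400.0, 17.0),
    "Qwen 3.5 397B": (397.0, 17.0),
}


def sparsity_ratio(total_b, active_b):
    """How many times more parameters are stored than are used per token."""
    return total_b / active_b


ratios = {name: sparsity_ratio(t, a) for name, (t, a) in moe_models.items()}
for name, ratio in ratios.items():
    total_b, active_b = moe_models[name]
    print(f"{name:<18} total={total_b:>6.1f}B  active={active_b:>5.1f}B  ratio={ratio:>5.1f}x")
```

A higher ratio means more stored capacity per unit of per-token compute, but also more memory that must be provisioned for weights that any single token does not touch.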
The Relationship Between Size and Hardware
Understanding model sizes is not just an academic exercise. It directly determines what hardware you need to run a model. Here is a practical guide:
Single Consumer GPU (16-24 GB VRAM)
With INT4 quantization, you can run models up to about 30-40B total parameters on a single consumer GPU like an NVIDIA RTX 4090 (24 GB VRAM):
- Mistral 7B at INT4: ~3.7 GB (fits easily)
- LLaMA 3 8B at INT4: ~4.0 GB (fits easily)
- LLaMA 3 70B at INT4: ~35 GB (does not fit on 24 GB)
Single Server GPU (80 GB VRAM)
An NVIDIA H100 with 80 GB VRAM can handle:
- LLaMA 3 70B at INT4: ~35 GB (fits)
- LLaMA 3 70B at FP16: ~140 GB (does not fit on one GPU)
- LLaMA 4 Scout at INT4: ~55 GB (fits)
Multi-GPU Server (2-8 GPUs)
With 2-8 H100 GPUs (160-640 GB total VRAM):
- LLaMA 3 70B at FP16: ~140 GB (2 GPUs)
- LLaMA 3.1 405B at INT4: ~200 GB (3-4 GPUs)
- LLaMA 4 Maverick at FP8: ~400 GB (5-6 GPUs)
Multi-Node Cluster
For the largest models:
- LLaMA 3.1 405B at FP16: ~810 GB (requires 2+ nodes of 8 GPUs each)
- DeepSeek-V3 at FP16: ~1.34 TB (requires multiple nodes)
These are rough estimates for weight memory only. Real deployments need additional memory for KV cache, activations, and framework overhead, so the actual GPU requirements are typically 20-50% higher than the weight-only calculation suggests.
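The estimates above follow from one multiplication and one division, sketched below. The function names, the `BYTES_PER_PARAM` table, and the default `overhead=0.2` (the low end of the 20-50% range mentioned above) are assumptions for illustration; real deployments depend on context length, batch size, and the serving framework.

```python
import math

# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, dtype):
    """Weight-only memory in GB: parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[dtype]


def gpus_needed(params_billions, dtype, vram_gb=80, overhead=0.2):
    """Rough GPU count, padding weight memory by a fractional overhead.

    overhead is an assumed allowance for KV cache, activations, and framework
    buffers; pass overhead=0.0 to reproduce the weight-only figures.
    """
    required_gb = weight_memory_gb(params_billions, dtype) * (1 + overhead)
    return math.ceil(required_gb / vram_gb)


for name, params_b, dtype in [("LLaMA 3 70B", 70.6, "fp16"),
                              ("LLaMA 3.1 405B", 405.0, "int4"),
                              ("DeepSeek-V3", 671.0, "fp16")]:
    mem = weight_memory_gb(params_b, dtype)
    print(f"{name:<16} {dtype:<5} weights={mem:>7.1f} GB  "
          f"H100s (weights only)={gpus_needed(params_b, dtype, overhead=0.0)}  "
          f"H100s (+20%)={gpus_needed(params_b, dtype)}")
```

Comparing the weight-only column with the padded column shows how quickly the overhead can push a model onto an additional GPU.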
Worked Example: Estimating Memory for a New Model
Suppose you encounter a new open-weight model with the following configuration:
hidden_size: 6,144
num_hidden_layers: 64
num_attention_heads: 48
num_key_value_heads: 8
head_dim: 128
intermediate_size: 21,504
vocab_size: 150,000
tie_word_embeddings: False
Let’s estimate its total parameters and memory requirements step by step.
Step 1: Per-Layer Attention
W_Q: 6,144 x (48 x 128) = 6,144 x 6,144 = 37,748,736
W_K: 6,144 x (8 x 128) = 6,144 x 1,024 = 6,291,456
W_V: 6,144 x (8 x 128) = 6,144 x 1,024 = 6,291,456
W_O: (48 x 128) x 6,144 = 6,144 x 6,144 = 37,748,736
Total = 88,080,384 (88.1M)
Step 2: Per-Layer FFN
W_gate: 6,144 x 21,504 = 132,120,576
W_up: 6,144 x 21,504 = 132,120,576
W_down: 21,504 x 6,144 = 132,120,576
Total = 396,361,728 (396.4M)
Step 3: Per-Layer Total
Attention: 88,080,384
FFN: 396,361,728
RMSNorm: 12,288 (2 x 6,144)
Layer total: 484,454,400 (484.5M)
Step 4: Full Model
64 layers: 31,005,081,600
Embedding: 921,600,000 (150,000 x 6,144)
Output proj: 921,600,000
Final RMSNorm: 6,144
Total: 32,848,287,744 (~32.8B)
Step 5: Memory Requirements
FP32: 32.8B x 4 bytes = 131.4 GB
FP16: 32.8B x 2 bytes = 65.7 GB
INT8: 32.8B x 1 byte = 32.8 GB
INT4: 32.8B x 0.5 = 16.4 GB
This hypothetical 32.8B model would fit on a single H100 GPU at FP16 (65.7 GB < 80 GB), or on a consumer RTX 4090 at INT4 (16.4 GB < 24 GB). At FP32, it would require at least 2 GPUs.
# Verify our manual calculation
def estimate_model(hidden, layers, n_q, n_kv, hd, inter, vocab, tied=False):
    # Attention: W_Q, W_K, W_V (first term) plus W_O (second term)
    attn = hidden * (n_q + 2 * n_kv) * hd + n_q * hd * hidden
    ffn = 3 * hidden * inter
    norm = 2 * hidden
    per_layer = attn + ffn + norm
    embed = vocab * hidden
    out = 0 if tied else vocab * hidden
    final = hidden
    total = layers * per_layer + embed + out + final
    return total
total = estimate_model(6144, 64, 48, 8, 128, 21504, 150_000)
print(f"Total parameters: {total:,} ({total/1e9:.2f}B)")
print(f"FP16 memory: {total * 2 / 1e9:.1f} GB")
print(f"INT4 memory: {total * 0.5 / 1e9:.1f} GB")
The Evolution of Model Sizes: A Brief History
To appreciate where we are in March 2026, it helps to see how model sizes have evolved:
| Year | Milestone Model | Parameters | Key Innovation |
|---|---|---|---|
| 2017 | Original Transformer (base) | ~65M | Attention is all you need |
| 2018 | BERT-Large | 340M | Bidirectional pre-training |
| 2019 | GPT-2 | 1.5B | Unsupervised multitask learning |
| 2020 | GPT-3 | 175B | Few-shot learning through scale |
| 2023 | LLaMA 2 70B | 70B | Open-weight frontier model |
| 2023 | Mixtral 8x7B | 46.7B (12.9B active) | Open-weight MoE |
| 2024 | LLaMA 3 8B/70B | 8B-70B | Improved data quality |
| 2024 | LLaMA 3.1 405B | 405B | Largest open dense model |
| 2024 | DeepSeek-V3 | 671B (37B active) | Efficient MoE at scale |
| 2025 | Qwen 3 235B-A22B | 235B (22B active) | Open MoE with hybrid thinking |
| 2025 | LLaMA 4 Maverick | 400B (17B active) | Natively multimodal MoE |
| 2026 | Qwen 3.5 397B-A17B | 397B (17B active) | Hybrid attention MoE (512 experts, linear + full attention) |
Sources: Original Transformer from Vaswani et al. (2017), ~65M parameters for the base model. BERT-Large from Devlin et al. (2018), 340M parameters. GPT-2 from Radford et al. (2019), 1.5B parameters. GPT-3 from Brown et al. (2020), 175B parameters, 96 layers, d_model=12,288. LLaMA 2 from Meta (July 2023). Mixtral 8x7B from Mistral AI (December 2023). LLaMA 3 from Meta (April 2024). LLaMA 3.1 405B from Meta (July 2024). DeepSeek-V3 from DeepSeek (December 26, 2024). Qwen 3 from Alibaba (April 29, 2025). LLaMA 4 from Meta (April 5, 2025). Qwen 3.5 from Alibaba (February 16, 2026).
The growth from 65 million to 671 billion parameters in seven years (2017 to 2024) represents a roughly 10,000x increase. But the trend is not simply “make everything bigger.” The shift to MoE architectures means that the active parameter count (which determines per-token compute cost) has actually stabilized or even decreased, while total parameter count (which determines knowledge capacity) continues to grow. This is a fundamental shift in how the industry thinks about model size: the goal is no longer to maximize active parameters, but to maximize the ratio of knowledge to compute. By early 2026, multiple labs have independently converged on the same design point: roughly 400B total parameters with 17B active (LLaMA 4 Maverick and Qwen 3.5 both hit this target), suggesting this ratio represents a practical optimum for current hardware and training methods.
Key Takeaways
A parameter is a single number in a weight matrix. A model with 8 billion parameters has 8 billion individual floating-point numbers that were learned during training. These parameters are distributed across the embedding table, attention projections, FFN layers, normalization scales, and output projection.
The hidden dimension (hidden_size) is the most important architectural number. It determines the width of the token representation vector and directly affects the size of every weight matrix in the model. Parameter count scales roughly as the square of the hidden dimension. Production models range from hidden_size=768 (GPT-2 Small) to hidden_size=16,384 (LLaMA 3.1 405B).
As of March 2026, open-weight model sizes range from 0.5B (Qwen 2.5 0.5B) to 671B total parameters (DeepSeek-V3). Major open-weight families include LLaMA (Meta), Mistral/Mixtral (Mistral AI), DeepSeek (DeepSeek), and Qwen (Alibaba). The Qwen family has been particularly prolific, releasing Qwen 3 (April 2025) and Qwen 3.5 (February 2026) in rapid succession. Closed-source frontier models (GPT-5, Claude, Gemini) do not publish their parameter counts.
Major labs increasingly keep architecture details secret. OpenAI’s GPT-4 technical report (March 2023) explicitly stated it would not disclose architecture details, citing competitive and safety concerns. This trend has continued with GPT-5, Claude, and Gemini. Open-weight models remain the primary source of verified architecture information.
MoE models decouple total parameters from active parameters. LLaMA 4 Maverick has 400B total parameters but only 17B active per token. Qwen 3.5 independently converges on the same 397B/17B design point. DeepSeek-V3 has 671B total but only 37B active. This means MoE models need GPU memory for all parameters but only use a fraction of them for computation, achieving “big model quality at small model cost.”
Memory requirements follow a simple formula: parameters x bytes per parameter. A 70B model in float16 (2 bytes per parameter) requires approximately 140 GB for weights alone. INT4 quantization (0.5 bytes per parameter) reduces this to about 35 GB. Actual deployment memory is 20-50% higher due to KV cache, activations, and framework overhead.
The industry has settled into size classes: tiny (0.5B-3B) for on-device use, small (7B-14B) for single-GPU deployment, medium (30B-72B) for enterprise, large (100B-400B) for GPU clusters, and frontier (400B+) for data center scale.
Model size alone does not determine capability. Training data quality, architecture efficiency (MoE vs. dense), post-training techniques (RLHF, distillation), and inference-time compute (extended thinking) all play critical roles. A well-trained 7B model can outperform a poorly trained 70B model.
What’s Next
You now understand what model sizes mean in concrete terms: how parameters are counted, where they live, how they translate to memory requirements, and how the landscape of model sizes looks as of March 2026. In Chapter 12, we will dive deep into the Mixture-of-Experts architecture that has become the dominant design pattern for frontier models, explaining exactly how routing works, why MoE enables such dramatic efficiency gains, and how models like DeepSeek-V3 and LLaMA 4 Maverick achieve their remarkable balance of knowledge capacity and computational efficiency.