Chapter 11. Model Sizes and What They Mean

Part 4. Scaling, From Toy Models to Frontier

When someone says a model has “70 billion parameters,” what does that actually mean? How does that translate into real-world capabilities, memory requirements, and cost? In this chapter, we will break down exactly what parameters are, how they add up across the components you learned about in Chapters 5 through 10, and what the numbers look like for every major model family as of March 2026. We will also confront a reality of the current AI landscape: the biggest labs increasingly keep their architecture details secret, and we will explain why.

What Is a Parameter?

A parameter is a single learned number in one of the model’s weight matrices or vectors. That is the entire definition. Every weight tensor you have seen in this book (embedding tables, attention projections, FFN layers, normalization scales) is made up of individual floating-point numbers, and each one of those numbers is a parameter.

When we say LLaMA 3 8B has “8 billion parameters,” we mean there are approximately 8 billion individual numbers stored in the model’s weight matrices. During training, each of these numbers is adjusted by gradient descent (Chapter 3) to minimize the loss function. During inference, these numbers are fixed: the model reads them from disk, loads them into GPU memory, and uses them to compute predictions.

Parameters are not code. They are not rules. They are just numbers, learned from data. The “intelligence” of a language model is entirely encoded in the specific values of its billions of parameters. Two models with identical architectures but different parameter values will behave completely differently, because the parameters determine what the model has learned.
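To make this concrete, here is a minimal sketch using a toy two-matrix model in NumPy. The layer sizes, the seeds, and the helper names (`init_params`, `forward`, `count_parameters`) are illustrative assumptions and do not correspond to any real model. The point is only that the parameter count is the number of stored values, and that the same architecture with different values computes a different function.

```python
import numpy as np

def init_params(seed, d_in=4, d_hidden=8, d_out=2):
    """Create a toy model: two weight matrices filled with random numbers."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((d_in, d_hidden)),
        "W2": rng.standard_normal((d_hidden, d_out)),
    }

def forward(params, x):
    """Tiny two-layer network: matrix multiply, tanh, matrix multiply."""
    hidden = np.tanh(x @ params["W1"])
    return hidden @ params["W2"]

def count_parameters(params):
    """A parameter is one stored number, so the count is the total array size."""
    return sum(w.size for w in params.values())

model_a = init_params(seed=0)
model_b = init_params(seed=1)   # same architecture, different values
x = np.ones(4)

print("Parameters per model:", count_parameters(model_a))  # 4*8 + 8*2
print("Model A output:", forward(model_a, x))
print("Model B output:", forward(model_b, x))
```

Both toy models have the same shapes and the same parameter count, yet they map the same input to different outputs, because only the stored numbers differ.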

Where Parameters Live

In Chapters 5 through 10, you learned about every component of a Transformer. Here is where the parameters are:

Component	What the parameters are	Parameter count (shape)
Embedding table (Ch. 5)	One vector per vocabulary token	vocab_size x hidden_size
Attention Q projection (Ch. 8)	Matrix mapping hidden state to queries	hidden_size x (num_q_heads x head_dim)
Attention K projection (Ch. 8)	Matrix mapping hidden state to keys	hidden_size x (num_kv_heads x head_dim)
Attention V projection (Ch. 8)	Matrix mapping hidden state to values	hidden_size x (num_kv_heads x head_dim)
Attention O projection (Ch. 8)	Matrix combining head outputs	(num_q_heads x head_dim) x hidden_size
FFN W_gate (Ch. 9)	Gate projection for SwiGLU	hidden_size x intermediate_size
FFN W_up (Ch. 9)	Up projection for SwiGLU	hidden_size x intermediate_size
FFN W_down (Ch. 9)	Down projection for SwiGLU	intermediate_size x hidden_size
RMSNorm gamma (Ch. 10)	Scale parameter, one per dimension	hidden_size (x2 per layer)
Output projection	Matrix mapping hidden state to vocabulary	hidden_size x vocab_size

The embedding table and output projection are model-wide (not per-layer). Everything else is repeated for every Transformer layer, with each layer learning its own values. A model with 32 layers has 32 separate sets of attention weights, 32 separate sets of FFN weights, and 64 RMSNorm layers (2 per Transformer block), plus one final RMSNorm before the output projection.

Counting Parameters: A Complete Walkthrough

Let’s count every parameter in LLaMA 3 8B, the model we have been using as our primary example throughout this book. This will make the concept of “8 billion parameters” completely concrete.

LLaMA 3 8B Architecture

From Meta’s release (April 18, 2024):

vocab_size:          128,256
hidden_size:         4,096
num_hidden_layers:   32
num_attention_heads: 32
num_key_value_heads: 8
head_dim:            128
intermediate_size:   14,336
tie_word_embeddings: False

Step 1: Embedding Table

The embedding table maps each token ID to a vector of size hidden_size:

128,256 x 4,096 = 525,336,576 parameters (525.3M)

Step 2: Per-Layer Attention Parameters

LLaMA 3 8B uses Grouped Query Attention (Chapter 8) with 32 query heads and 8 KV heads:

W_Q: 4,096 x (32 x 128) = 4,096 x 4,096 = 16,777,216
W_K: 4,096 x (8 x 128)  = 4,096 x 1,024 =  4,194,304
W_V: 4,096 x (8 x 128)  = 4,096 x 1,024 =  4,194,304
W_O: (32 x 128) x 4,096 = 4,096 x 4,096 = 16,777,216
                                    Total = 41,943,040 (42.0M)
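To see how much Grouped Query Attention saves, the sketch below compares the released configuration with a hypothetical variant that gives every query head its own KV head (standard multi-head attention). The 32-KV-head variant is an assumption made purely for comparison; it is not a model Meta released.

```python
def attention_params(hidden_size, num_q_heads, num_kv_heads, head_dim):
    """Parameters in W_Q, W_K, W_V, W_O for one layer (no biases)."""
    q = hidden_size * num_q_heads * head_dim
    k = hidden_size * num_kv_heads * head_dim
    v = hidden_size * num_kv_heads * head_dim
    o = num_q_heads * head_dim * hidden_size
    return q + k + v + o

gqa = attention_params(4096, 32, 8, 128)    # LLaMA 3 8B as released
mha = attention_params(4096, 32, 32, 128)   # hypothetical: one KV head per query head

print(f"GQA (8 KV heads):  {gqa:,}")
print(f"MHA (32 KV heads): {mha:,}")
print(f"Saved per layer:   {mha - gqa:,} ({1 - gqa / mha:.1%})")
```

Under these assumptions, sharing KV heads removes a bit more than a third of the attention parameters in each layer.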

Step 3: Per-Layer FFN Parameters

LLaMA 3 8B uses SwiGLU (Chapter 9) with three weight matrices:

W_gate: 4,096 x 14,336 = 58,720,256
W_up:   4,096 x 14,336 = 58,720,256
W_down: 14,336 x 4,096 = 58,720,256
                  Total = 176,160,768 (176.2M)

Step 4: Per-Layer RMSNorm Parameters

Two RMSNorm layers per block, each with hidden_size gamma parameters:

2 x 4,096 = 8,192 (0.008M)

Step 5: Per-Layer Total

Attention:  41,943,040
FFN:       176,160,768
RMSNorm:         8,192
Layer total: 218,112,000 (218.1M)

Step 6: All Layers

32 layers x 218,112,000 = 6,979,584,000 (6.98B)

Step 7: Model-Wide Components

Embedding table:    525,336,576
Final RMSNorm:            4,096
Output projection:  525,336,576  (separate from embedding, not tied)

Step 8: Grand Total

Layers:             6,979,584,000
Embedding:            525,336,576
Final RMSNorm:              4,096
Output projection:    525,336,576
                   ─────────────
Total:             8,030,261,248  (~8.03 billion)

That is where the “8B” comes from. Every single one of those 8.03 billion numbers was learned during training on trillions of tokens of text.

Source: LLaMA 3 8B architecture from Meta (April 18, 2024). Configuration from HuggingFace Transformers: vocab_size=128,256, hidden_size=4,096, intermediate_size=14,336, num_attention_heads=32, num_key_value_heads=8, head_dim=128, num_hidden_layers=32, tie_word_embeddings=False.
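The walkthrough counts the output projection separately because tie_word_embeddings is False. As a rough sketch of what that setting costs, the snippet below recomputes the grand total for a hypothetical tied variant in which the output projection reuses the embedding table. The tied variant is an assumption for illustration only; LLaMA 3 8B does not tie its embeddings.

```python
vocab_size, hidden_size = 128_256, 4_096
layers_total = 6_979_584_000          # Step 6: all 32 layers
final_norm = hidden_size              # final RMSNorm

embedding = vocab_size * hidden_size

untied = layers_total + final_norm + 2 * embedding   # as released
tied = layers_total + final_norm + embedding         # hypothetical tied variant

print(f"Untied (released): {untied:,} ({untied / 1e9:.2f}B)")
print(f"Tied (hypothetical): {tied:,} ({tied / 1e9:.2f}B)")
print(f"Difference: {untied - tied:,}")
```

The difference is exactly one embedding-table's worth of parameters, which is why small models with large vocabularies often tie the two matrices.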

Parameter Distribution

Let’s tabulate where those 8 billion parameters actually live:

# LLaMA 3 8B parameter breakdown
vocab_size = 128_256
hidden_size = 4_096
num_layers = 32
num_q_heads = 32
num_kv_heads = 8
head_dim = 128
intermediate_size = 14_336

# Per-layer counts
attn_per_layer = (
    hidden_size * num_q_heads * head_dim +   # W_Q
    hidden_size * num_kv_heads * head_dim +   # W_K
    hidden_size * num_kv_heads * head_dim +   # W_V
    num_q_heads * head_dim * hidden_size      # W_O
)
ffn_per_layer = 3 * hidden_size * intermediate_size  # W_gate, W_up, W_down
norm_per_layer = 2 * hidden_size  # 2 RMSNorm layers

# Model-wide counts
embedding = vocab_size * hidden_size
output_proj = vocab_size * hidden_size  # not tied
final_norm = hidden_size

# Totals
total_layers = num_layers * (attn_per_layer + ffn_per_layer + norm_per_layer)
total = embedding + total_layers + final_norm + output_proj

print("LLaMA 3 8B Parameter Breakdown")
print("=" * 55)
print(f"{'Component':<30} {'Parameters':>12} {'Share':>8}")
print("-" * 55)
print(f"{'Embedding table':<30} {embedding:>12,} {embedding/total:>8.1%}")
print(f"{'Attention (32 layers)':<30} {num_layers*attn_per_layer:>12,} {num_layers*attn_per_layer/total:>8.1%}")
print(f"{'FFN (32 layers)':<30} {num_layers*ffn_per_layer:>12,} {num_layers*ffn_per_layer/total:>8.1%}")
print(f"{'RMSNorm (all)':<30} {num_layers*norm_per_layer+final_norm:>12,} {(num_layers*norm_per_layer+final_norm)/total:>8.1%}")
print(f"{'Output projection':<30} {output_proj:>12,} {output_proj/total:>8.1%}")
print("-" * 55)
print(f"{'TOTAL':<30} {total:>12,}")
print(f"\nThat is {total/1e9:.2f} billion parameters.")

When you run this, you will see that the FFN layers dominate (about 70% of total parameters), followed by the attention layers (about 17%), with the embedding and output projection accounting for the remaining 13%. The RMSNorm parameters are negligible (less than 0.01%).

The Hidden Dimension: The Most Important Number

If you had to pick a single number that defines a model’s “size class,” it would be the hidden dimension (also called hidden_size, d_model, or model dimension). This is the width of the vector that represents each token as it flows through the Transformer. Every component in the model is sized relative to this number:

The embedding table has hidden_size columns
The attention projections map to and from hidden_size
The FFN expands from hidden_size and contracts back to hidden_size
The output projection maps from hidden_size to the vocabulary

A larger hidden dimension means each token is represented by a longer vector with more dimensions, which gives the model more capacity to encode nuanced information about each token. It also means every weight matrix in the model is larger, which is why the hidden dimension has such a dramatic effect on total parameter count.

Hidden Dimensions in Production Models

Model	hidden_size	Category
GPT-2 Small	768	Tiny
Mistral 7B	4,096	Small
LLaMA 3 8B	4,096	Small
LLaMA 4 Scout / Maverick	5,120	Medium
DeepSeek-V3	7,168	Medium-Large
LLaMA 3 70B	8,192	Large
Qwen 2.5 72B	8,192	Large
GPT-3	12,288	Very Large
Mistral Large 2	12,288	Very Large
LLaMA 3.1 405B	16,384	Frontier

Sources: GPT-2 from Radford et al. (2019). Mistral 7B: hidden_size=4,096 (Mistral AI, September 2023). LLaMA 3 8B: hidden_size=4,096 (Meta, April 2024). LLaMA 4: hidden_size=5,120 (HuggingFace Transformers Llama4TextConfig). DeepSeek-V3: hidden_size=7,168 (arXiv:2412.19437). LLaMA 3 70B: hidden_size=8,192 (Meta, April 2024). Qwen 2.5 72B: hidden_size=8,192 (HuggingFace config.json). GPT-3: d_model=12,288 (Brown et al., 2020). Mistral Large 2: hidden_size=12,288 (Ollama model metadata, 88 layers, 96 attention heads). LLaMA 3.1 405B: hidden_size=16,384 (Meta, July 2024, config.json).

Why Parameters Scale Quadratically with Hidden Dimension

The total parameter count of a Transformer scales roughly as the square of the hidden dimension. Here is why:

The attention projections (W_Q, W_K, W_V, W_O) each have shape [hidden_size x something proportional to hidden_size]. For standard multi-head attention, all four are [hidden_size x hidden_size], so each contributes hidden_size^2 parameters (with Grouped Query Attention, W_K and W_V are smaller). The FFN matrices are [hidden_size x intermediate_size], and intermediate_size is typically 3 to 4 times hidden_size, so each FFN matrix contributes roughly 3.5 x hidden_size^2 parameters.

This means doubling the hidden dimension roughly quadruples the parameter count per layer. A model with hidden_size=8,192 has approximately 4x as many parameters per layer as a model with hidden_size=4,096 (assuming the attention projections stay proportional to hidden_size and the FFN expansion ratio is unchanged).

Let’s verify this with real numbers:

def count_layer_params(hidden, n_q, n_kv, head_dim, inter):
    """Count parameters in one Transformer layer."""
    attn = (hidden * n_q * head_dim +       # W_Q
            hidden * n_kv * head_dim +       # W_K
            hidden * n_kv * head_dim +       # W_V
            n_q * head_dim * hidden)         # W_O
    ffn = 3 * hidden * inter                 # SwiGLU: W_gate, W_up, W_down
    norm = 2 * hidden                        # 2 RMSNorm
    return attn, ffn, norm

models = [
    ("LLaMA 3 8B",      4096,  32,  8, 128, 14336,  32),
    ("LLaMA 3 70B",     8192,  64,  8, 128, 28672,  80),
    ("LLaMA 3.1 405B", 16384, 128,  8, 128, 53248, 126),
]

print(f"{'Model':<18} {'hidden':>7} {'Layers':>7} {'Params/Layer':>14} {'Total Params':>14}")
print("-" * 65)
for name, h, nq, nkv, hd, inter, layers in models:
    attn, ffn, norm = count_layer_params(h, nq, nkv, hd, inter)
    per_layer = attn + ffn + norm
    # Approximate total (layers + embedding + output)
    vocab = 128_256
    total = layers * per_layer + 2 * vocab * h + h  # embed + output + final norm
    print(f"{name:<18} {h:>7,} {layers:>7} {per_layer:>14,} {total:>14,}")
    print(f"{'':18} {'':>7} {'':>7} ({per_layer/1e6:>8.1f}M)    ({total/1e9:>8.2f}B)")

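This code shows how the per-layer parameter count grows dramatically with hidden dimension. LLaMA 3.1 405B has roughly 15x the parameters per layer of LLaMA 3 8B, driven primarily by the 4x larger hidden dimension (16,384 vs. 4,096) and the correspondingly larger FFN. That is slightly below the 16x that pure quadratic scaling would predict, because its KV projections and its FFN expansion ratio (53,248 / 16,384 = 3.25, versus 3.5 for the 8B model) are proportionally smaller.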
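The quadratic rule can also be written in closed form. As a sketch, assume num_q_heads x head_dim equals hidden_size (true for the two models below) and ignore the tiny RMSNorm terms. The attention and FFN parameters per layer are then c x hidden_size^2, where the multiplier c depends only on the KV-to-query head ratio and the FFN expansion ratio. The helper name `layer_coefficient` is our own.

```python
def layer_coefficient(n_q, n_kv, hidden, inter):
    """Multiplier c such that attention + FFN params per layer = c * hidden**2.

    Assumes n_q * head_dim == hidden and excludes the RMSNorm parameters.
    """
    attn = 2 + 2 * n_kv / n_q      # W_Q and W_O, plus W_K and W_V shrunk by GQA
    ffn = 3 * inter / hidden       # three SwiGLU matrices
    return attn + ffn

configs = [
    ("LLaMA 3 8B",  4096, 32, 8, 14336),
    ("LLaMA 3 70B", 8192, 64, 8, 28672),
]
per_layer = {}
for name, h, nq, nkv, inter in configs:
    c = layer_coefficient(nq, nkv, h, inter)
    per_layer[name] = int(c * h * h)
    print(f"{name:<12} c = {c:.2f}  ->  {per_layer[name]:,} params per layer (excluding norms)")

ratio = per_layer["LLaMA 3 70B"] / per_layer["LLaMA 3 8B"]
print(f"Doubling hidden_size multiplies the per-layer count by {ratio:.2f}")
```

Because c barely changes between the two models, the hidden_size^2 factor does almost all the work, and the ratio lands just under 4.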

Real Model Sizes as of March 2026

The landscape of language models in March 2026 spans from tiny models that run on a phone to frontier models that require entire data centers. Here is a survey of the major model families, grouped by whether their architecture details are public.

Open-Weight Models (Architecture Details Published)

These models have publicly available weights and documented architectures. We can verify every number.

Model	Total Params	Active Params	Layers	hidden_size	Architecture	Release
Mistral 7B	7.3B	7.3B (dense)	32	4,096	Dense Transformer	Sep 2023
LLaMA 3 8B	8.0B	8.0B (dense)	32	4,096	Dense Transformer	Apr 2024
Qwen 2.5 7B	7.6B	7.6B (dense)	28	3,584	Dense Transformer	Sep 2024
Mixtral 8x7B	46.7B	12.9B	32	4,096	MoE (8 experts, top-2)	Dec 2023
Qwen 2.5 72B	72.7B	72.7B (dense)	80	8,192	Dense Transformer	Sep 2024
LLaMA 3 70B	70.6B	70.6B (dense)	80	8,192	Dense Transformer	Apr 2024
Mistral Large 2	123B	123B (dense)	88	12,288	Dense Transformer	Jul 2024
Mixtral 8x22B	141B	39B	56	6,144	MoE (8 experts, top-2)	Apr 2024
LLaMA 4 Scout	109B	17B	48	5,120	MoE (16 experts, top-1)	Apr 2025
Qwen 3 235B-A22B	235B	22B	94	4,096	MoE (128 experts, top-8)	Apr 2025
LLaMA 4 Maverick	400B	17B	48	5,120	MoE (128 experts, top-1)	Apr 2025
LLaMA 3.1 405B	405B	405B (dense)	126	16,384	Dense Transformer	Jul 2024
DeepSeek-V3	671B	37B	61	7,168	MoE (256+1 experts, top-8)	Dec 2024
Qwen 3.5 397B-A17B	397B	17B	60	4,096	MoE (512 experts, top-10), hybrid attention	Feb 2026

Sources: Mistral 7B from Mistral AI (September 27, 2023): 32 layers, hidden_size=4,096, intermediate_size=14,336, vocab_size=32,000 (HuggingFace config.json, Mistral-7B-v0.1). LLaMA 3 8B from Meta (April 18, 2024). Qwen 2.5 7B from Alibaba (September 2024): 28 layers, hidden_size=3,584, intermediate_size=18,944, vocab_size=152,064, 28 query heads, 4 KV heads (HuggingFace config.json). Mixtral 8x7B from Mistral AI (December 2023): 46.7B total, 12.9B active, 8 experts with top-2 routing. Qwen 2.5 72B from Alibaba (September 2024): 80 layers, hidden_size=8,192, intermediate_size=29,568, vocab_size=152,064 (HuggingFace config.json). LLaMA 3 70B from Meta (April 2024): 80 layers, hidden_size=8,192, intermediate_size=28,672, 64 query heads, 8 KV heads. Mistral Large 2 from Mistral AI (July 24, 2024): 123B parameters, 88 layers, hidden_size=12,288, 96 attention heads, 8 KV heads, intermediate_size=28,672, vocab_size=32,768, 128K context (Ollama model metadata). Mixtral 8x22B from Mistral AI (April 2024): 141B total, 39B active. LLaMA 4 Scout/Maverick from Meta (April 5, 2025): 48 layers, hidden_size=5,120, 17B active. Qwen 3 235B-A22B from Alibaba (April 29, 2025): 94 layers, hidden_size=4,096, intermediate_size=12,288, moe_intermediate_size=1,536, vocab_size=151,936, 64 query heads, 4 KV heads, 128 experts, top-8 routing (HuggingFace config.json). LLaMA 3.1 405B from Meta (July 23, 2024): 126 layers, hidden_size=16,384, intermediate_size=53,248, 128 query heads, 8 KV heads. DeepSeek-V3 from DeepSeek (December 26, 2024): 61 layers, hidden_size=7,168, 671B total, 37B active (arXiv:2412.19437). Qwen 3.5 397B-A17B from Alibaba (February 16, 2026): 60 layers, hidden_size=4,096, moe_intermediate_size=1,024, vocab_size=248,320, 32 query heads, 2 KV heads, head_dim=256, 512 experts, top-10 routing, hybrid linear/full attention architecture, natively multimodal, Apache 2.0 license (HuggingFace config.json).
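The configuration values quoted in these sources are enough to cross-check a table entry. The sketch below applies the same counting method as the LLaMA 3 8B walkthrough to Qwen 2.5 7B. Three things here are assumptions rather than quoted values: head_dim is derived as hidden_size / query heads (128), the embeddings are treated as untied, and any bias terms are ignored, so the result is an estimate rather than an exact count.

```python
def dense_param_count(vocab, hidden, layers, n_q, n_kv, head_dim, inter, tied=False):
    """Weight-matrix and RMSNorm parameters of a dense SwiGLU Transformer."""
    attn = 2 * hidden * n_q * head_dim + 2 * hidden * n_kv * head_dim  # Q, O + K, V
    ffn = 3 * hidden * inter                                           # gate, up, down
    norms = 2 * hidden                                                 # two RMSNorms
    embed = vocab * hidden
    head = 0 if tied else vocab * hidden
    return layers * (attn + ffn + norms) + embed + head + hidden       # + final norm

qwen25_7b = dense_param_count(
    vocab=152_064, hidden=3_584, layers=28,
    n_q=28, n_kv=4,
    head_dim=3_584 // 28,      # derived (128), not quoted in the sources above
    inter=18_944,
)
print(f"Qwen 2.5 7B estimate: {qwen25_7b:,} ({qwen25_7b / 1e9:.2f}B)")
```

The estimate rounds to the 7.6B listed in the table, which suggests the quoted configuration and the reported total are consistent with each other.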

Closed-Source Models (Architecture Details Not Published)

These models are only accessible through APIs. The companies behind them do not publish architecture details, parameter counts, or training data composition.

Model	Estimated Size	What We Know	Release
GPT-4	~1.8T total (leaked)	Rumored MoE with 16 experts of ~111B each	Mar 2023
GPT-4o	Not disclosed	Unified multimodal Transformer, 128K context	May 2024
GPT-5	Not disclosed	400K context, three tiers (GPT-5, Mini, Nano)	Aug 2025
Claude Sonnet 4	Not disclosed	200K context (expanded to 1M via API, August 12, 2025), extended thinking	May 2025
Claude Sonnet 4.6	Not disclosed	1M context, near-Opus performance	Feb 2026
Gemini 2.5 Pro	Not disclosed	1M context, multimodal, advanced reasoning	Mar 2025
Grok 3	~3T (estimated)	1M context (marketed; practical API limit ~131K), MoE architecture, trained on 200K H100 GPUs	Feb 2025

The GPT-4 architecture details come from unverified leaks in mid-2023, which claimed approximately 1.8 trillion total parameters across 120 layers using a Mixture-of-Experts design with 16 experts of roughly 111 billion parameters each. OpenAI has never confirmed or denied these numbers. For GPT-5 and later models, OpenAI has not disclosed architecture details. Grok 3’s parameter count of approximately 3 trillion comes from Elon Musk’s statement to Ron Baron in November 2025, where he said “Grok-3 and -4 are based on a 3 trillion parameter model.” xAI has not published official architecture details. The marketed 1 million token context window appears to have a practical API ceiling of approximately 131,000 tokens, based on developer reports and API documentation.

Sources: GPT-4 leaked details from multiple reports (July 2023), unverified. GPT-4 technical report (arXiv:2303.08774, March 2023) explicitly states: “this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” GPT-4o released May 2024, 128K context. GPT-5 released August 7, 2025, per OpenAI; 400K context, three tiers (GPT-5, Mini, Nano). Claude Sonnet 4 released May 22, 2025, per Anthropic; 200K context (expanded to 1M on August 12, 2025, per Anthropic announcement). Claude Sonnet 4.6 released February 17, 2026, per Anthropic; 1M context. Gemini 2.5 Pro from Google DeepMind (experimental March 25, 2025; GA June 17, 2025), 1M context. Grok 3 from xAI (February 17, 2025), trained on Colossus supercluster with 200,000 H100 GPUs; ~3T parameter estimate per Elon Musk’s statement to Ron Baron (November 2025, reported by Benzinga and LifeArchitect.ai); 1M context marketed, practical API limit ~131K tokens per developer reports and Oracle API documentation.

Models Announced but Not Released

Model	Reported Size	Status
LLaMA 4 Behemoth	~2T total, 288B active, 16 experts	Effectively shelved; Meta shifted focus to “Avocado” proprietary model

Source: LLaMA 4 Behemoth announced by Meta in April 2025 with approximately 2 trillion total parameters and 288 billion active parameters across 16 experts. One source (Glenn K. Lockwood) describes it as “aborted before it was ever released due to poor performance.” Multiple reports from mid-2025 describe repeated delays (from summer to fall 2025 and beyond) due to performance falling short of expectations. As of March 2026, Meta has shifted focus to a proprietary model codenamed “Avocado,” which itself has been delayed from a planned March 2026 debut to at least May 2026 after internal benchmarks showed it trailing competitors from Google and OpenAI (per The New York Times, March 12, 2026, and Reuters).

Why Major Labs Don’t Publish Architecture Details

If you look at the table of closed-source models above, you will notice a pattern: none of the major commercial AI labs (OpenAI, Anthropic, Google DeepMind) publish the architecture details of their frontier models. This was not always the case. OpenAI published the full architecture of GPT-2 (2019) and GPT-3 (2020), including parameter counts, layer counts, hidden dimensions, and training details. Google published the original Transformer paper (2017) with complete architecture specifications.

The shift toward secrecy began around 2022-2023, driven by two factors:

Competitive pressure: As the commercial value of frontier models became clear, companies began treating architecture details as trade secrets. If a competitor knows your exact architecture, they can replicate it more easily. OpenAI’s GPT-4 technical report (March 2023) explicitly stated: “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”
Safety concerns: Some labs argue that publishing detailed architecture information makes it easier for bad actors to build dangerous systems. This argument is controversial; many researchers believe that openness enables better safety research.

The result is a two-tier landscape. Open-weight models (LLaMA, Mistral, DeepSeek, Qwen) publish their architectures and release their weights, allowing anyone to inspect, modify, and run them. Closed-source models (GPT-4/5, Claude, Gemini) are accessible only through APIs, and their internal details are unknown or based on unverified leaks.

For this book, we focus primarily on open-weight models when discussing specific architecture details, because those are the numbers we can verify. When we reference closed-source models, we clearly distinguish between confirmed facts (context window sizes, API pricing, benchmark scores) and unverified claims (parameter counts, architecture details).

Source: GPT-4 Technical Report, arXiv:2303.08774, March 2023. The quote about withholding architecture details appears on page 2 of the report.

Dense vs. MoE: Total Parameters vs. Active Parameters

One of the most important distinctions in the model size table above is between total parameters and active parameters. This distinction matters because of Mixture-of-Experts (MoE) architectures, which we will cover in depth in Chapter 12.

In a dense model like LLaMA 3 8B or LLaMA 3.1 405B, every parameter is used for every token. When the model processes a token, all 8 billion (or 405 billion) parameters participate in the computation. Total parameters equals active parameters.

In a MoE model like LLaMA 4 Maverick or DeepSeek-V3, only a fraction of the parameters are used for any given token. The model has many “expert” FFN blocks (Chapter 9), but a router selects only a few of them for each token. The rest sit idle.

Consider LLaMA 4 Maverick:

Total parameters: 400 billion (all the weights stored on disk and loaded into memory)
Active parameters: 17 billion (the weights actually used to process each token)

The 400B total includes 128 routed expert FFN blocks per MoE layer, but only 1 routed expert (plus 1 shared expert) is activated per token. The attention layers and shared experts are always active, contributing to the 17B active count.

This distinction has major practical implications:

Memory: You need enough GPU memory to hold all 400B parameters, even though only 17B are used per token. At float16 precision, that is approximately 800 GB just for the weights.
Compute: The computational cost per token is proportional to the active parameters (17B), not the total parameters (400B). This is why MoE models can achieve “big model quality at small model cost.”
Quality: The model’s knowledge capacity is related to the total parameters (400B), because different experts can store different knowledge. The model has access to all 400B parameters’ worth of knowledge, even though it only uses 17B parameters’ worth of computation per token.

# Compare dense vs MoE models
models = [
    ("LLaMA 3 8B",       8.0,   8.0, "Dense"),
    ("LLaMA 3 70B",     70.6,  70.6, "Dense"),
    ("LLaMA 3.1 405B", 405.0, 405.0, "Dense"),
    ("Mixtral 8x7B",    46.7,  12.9, "MoE"),
    ("Mixtral 8x22B",  141.0,  39.0, "MoE"),
    ("LLaMA 4 Scout",  109.0,  17.0, "MoE"),
    ("Qwen 3 235B-A22B",235.0, 22.0, "MoE"),
    ("LLaMA 4 Maverick",400.0, 17.0, "MoE"),
    ("Qwen 3.5 397B",  397.0,  17.0, "MoE"),
    ("DeepSeek-V3",    671.0,  37.0, "MoE"),
]

print(f"{'Model':<22} {'Total':>8} {'Active':>8} {'Ratio':>8} {'Type':<6}")
print("-" * 58)
for name, total, active, arch in models:
    ratio = active / total
    print(f"{name:<22} {total:>7.1f}B {active:>7.1f}B {ratio:>7.1%} {arch:<6}")

print("\nKey insight: MoE models store far more knowledge (total params)")
print("while using similar compute per token (active params) as smaller dense models.")
print(f"\nDeepSeek-V3 has {671/37:.0f}x more total params than active params.")
print(f"LLaMA 4 Maverick has {400/17:.0f}x more total params than active params.")

The output reveals a striking pattern: MoE models achieve enormous total parameter counts (and thus knowledge capacity) while keeping active parameters comparable to much smaller dense models. DeepSeek-V3 has 671B total parameters but only 37B active, meaning it uses roughly the same compute per token as a 37B dense model while storing about 18x as many parameters. LLaMA 4 Maverick and Qwen 3.5 land on nearly the same design point, approximately 400B total parameters with only 17B active, which suggests this ratio may be emerging as a sweet spot for frontier MoE models.
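The memory-versus-compute split can be sketched in a few lines. The helper below (`moe_costs`, our own name) assumes float16 weights and uses the common approximation that a forward pass costs about 2 floating-point operations per active parameter per token. That approximation ignores attention-score computation and is an assumption of this sketch, not a measured figure.

```python
def moe_costs(total_b, active_b, bytes_per_param=2):
    """Return (weight memory in GB, approximate forward GFLOPs per token).

    Memory follows TOTAL parameters; compute follows ACTIVE parameters.
    Assumes ~2 FLOPs per active parameter per token (one multiply, one add).
    """
    memory_gb = total_b * bytes_per_param
    gflops_per_token = 2 * active_b
    return memory_gb, gflops_per_token

examples = [
    ("LLaMA 3 70B (dense)", 70.6, 70.6),
    ("LLaMA 4 Maverick",   400.0, 17.0),
    ("DeepSeek-V3",        671.0, 37.0),
]
for name, total_b, active_b in examples:
    mem, gflops = moe_costs(total_b, active_b)
    print(f"{name:<22} {mem:>7.1f} GB weights  ~{gflops:>6.1f} GFLOPs/token")
```

Under these assumptions, LLaMA 4 Maverick needs far more memory than LLaMA 3 70B but roughly a quarter of the compute per token.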

Memory Requirements: From Bytes to Terabytes

Understanding how model size translates to memory requirements is essential for anyone who wants to run, deploy, or even just understand the infrastructure behind language models. The calculation is straightforward once you know the precision format.

Bytes Per Parameter

Every parameter is stored in a fixed-width numeric format: floating-point during training, and sometimes integer after quantization. The precision format determines how many bytes each parameter occupies:

Format	Bytes per Parameter	Bits	Typical Use
float32 (FP32)	4 bytes	32 bits	Training (legacy)
bfloat16 (BF16)	2 bytes	16 bits	Training (modern standard)
float16 (FP16)	2 bytes	16 bits	Inference
int8 (INT8)	1 byte	8 bits	Quantized inference
int4 (INT4)	0.5 bytes	4 bits	Aggressive quantization

The formula for model weight memory is:

Memory (bytes) = number_of_parameters x bytes_per_parameter

Or equivalently:

Memory (GB) = parameters_in_billions x bytes_per_parameter

This works because 1 GB is 1 billion bytes. GPU memory is often quoted in binary gigabytes (1 GiB = 1,073,741,824 bytes, about 7% larger than 1 GB), but the approximation is close enough for practical purposes.
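As a quick sanity check on the formula, the sketch below converts an exact parameter count into both decimal gigabytes (10^9 bytes) and binary gibibytes (2^30 bytes). The helper name `weight_memory` is our own, and the parameter count is the LLaMA 3 8B total from earlier in this chapter.

```python
GB = 10**9     # decimal gigabyte
GiB = 2**30    # binary gibibyte: 1,073,741,824 bytes

def weight_memory(num_params, bytes_per_param):
    """Return weight memory as (decimal GB, binary GiB)."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / GB, total_bytes / GiB

gb, gib = weight_memory(8_030_261_248, 2)   # LLaMA 3 8B at FP16/BF16
print(f"{gb:.2f} GB = {gib:.2f} GiB")
```

The two figures differ by about 7%, which is small next to the KV cache and framework overhead discussed later, so the "billions of parameters times bytes per parameter" shortcut is usually good enough.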

Memory Calculations for Real Models

Let’s compute the weight memory for several models at different precisions:

models = [
    ("Mistral 7B",        7.3),
    ("LLaMA 3 8B",        8.0),
    ("LLaMA 3 70B",      70.6),
    ("LLaMA 4 Scout",   109.0),
    ("Qwen 3 235B-A22B", 235.0),
    ("LLaMA 4 Maverick", 400.0),
    ("Qwen 3.5 397B",   397.0),
    ("LLaMA 3.1 405B",  405.0),
    ("DeepSeek-V3",      671.0),
]

precisions = [
    ("FP32 (4B)", 4),
    ("FP16/BF16 (2B)", 2),
    ("INT8 (1B)", 1),
    ("INT4 (0.5B)", 0.5),
]

print(f"{'Model':<22}", end="")
for name, _ in precisions:
    print(f" {name:>16}", end="")
print()
print("-" * 90)

for model_name, params_b in models:
    print(f"{model_name:<22}", end="")
    for prec_name, bytes_per in precisions:
        mem_gb = params_b * bytes_per
        if mem_gb >= 1000:
            print(f" {mem_gb/1000:>13.1f} TB", end="")
        else:
            print(f" {mem_gb:>13.1f} GB", end="")
    print()

print("\nNote: These are WEIGHT-ONLY memory requirements.")
print("Actual GPU memory usage is higher due to KV cache,")
print("activation memory, and framework overhead.")

Some key observations from this table:

A 70B model in float16 requires approximately 140 GB of GPU memory just for the weights. This exceeds the memory of a single NVIDIA H100 GPU (80 GB), so the model must be split across at least 2 GPUs.
LLaMA 4 Maverick at float16 requires approximately 800 GB for weights alone. Even with 8x H100 GPUs (640 GB total), you would need quantization or more GPUs.
DeepSeek-V3 at float16 requires approximately 1.34 TB. This is why frontier MoE models require multi-node GPU clusters for deployment.
INT4 quantization reduces memory by 4x compared to float16, making it possible to run a 70B model on a single high-end GPU (about 35 GB for weights). This is why quantization (covered in Chapter 24) is so important for practical deployment.
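These observations all come from the same division: weight memory over per-GPU memory, rounded up. Here is a minimal sketch, assuming 80 GB GPUs (the H100 figure used above) and counting weights only. Real deployments need headroom for the KV cache and activations, so treat the result as a lower bound. The helper name `min_gpus` is our own.

```python
import math

def min_gpus(params_b, bytes_per_param, gpu_memory_gb=80):
    """Lower bound on GPU count from weight memory alone."""
    weight_gb = params_b * bytes_per_param
    return math.ceil(weight_gb / gpu_memory_gb)

cases = [
    ("LLaMA 3 70B, FP16",      70.6, 2),
    ("LLaMA 3 70B, INT4",      70.6, 0.5),
    ("LLaMA 4 Maverick, FP16", 400.0, 2),
    ("DeepSeek-V3, FP16",      671.0, 2),
]
for label, params_b, bytes_per in cases:
    print(f"{label:<24} needs at least {min_gpus(params_b, bytes_per):>2} x 80 GB GPUs")
```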

Beyond Weights: Total Memory

The weight memory calculation above is only part of the story. During inference, the model also needs memory for:

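KV cache (Chapter 18): Stores the key and value vectors for all previous tokens in the sequence, so it grows linearly with sequence length and batch size. For LLaMA 3 70B (80 layers, 8 KV heads, head_dim 128) at float16, a single 4,096-token sequence needs about 1.3 GB, so a batch of 4 to 8 such sequences adds roughly 5-10 GB.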
Activation memory: Temporary storage for intermediate computations during the forward pass. This is typically smaller than the weights but can be significant for long sequences.
Framework overhead: PyTorch, CUDA, and other software consume GPU memory for internal bookkeeping. This is typically 1-3 GB.
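The KV cache term is the easiest of the three to estimate from the architecture. The sketch below assumes float16 cache entries and the LLaMA 3 70B shape used earlier in this chapter (80 layers, 8 KV heads, head_dim 128). The 64-KV-head variant is hypothetical and included only to show how much Grouped Query Attention shrinks the cache. The helper name `kv_cache_bytes` is our own.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and cached token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size

# LLaMA 3 70B with GQA, one 4,096-token sequence
gqa = kv_cache_bytes(80, 8, 128, seq_len=4096)
# Hypothetical variant with one KV head per query head (64 KV heads)
mha = kv_cache_bytes(80, 64, 128, seq_len=4096)
# Same GQA model serving a batch of 8 sequences
batched = kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=8)

print(f"GQA, batch 1: {gqa / 1e9:.2f} GB")
print(f"MHA, batch 1: {mha / 1e9:.2f} GB")
print(f"GQA, batch 8: {batched / 1e9:.2f} GB")
```

The cache scales linearly in every argument, so doubling the context length or the batch size doubles the memory.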

A practical rule of thumb: for inference at float16 precision, budget approximately 2.5 GB per billion parameters (rather than the theoretical 2 GB) to account for KV cache and overhead. For training, the memory requirement is much higher (roughly 4-8x the weight memory) because you also need to store gradients, optimizer states, and activation checkpoints.
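The rule of thumb translates directly into two small helpers. The 2.5 GB per billion parameters and the 4-8x training multiplier are the rough figures from the paragraph above, not measured values, and the function names are our own.

```python
def inference_budget_gb(params_b, gb_per_billion=2.5):
    """Rough float16 inference budget: weights plus KV cache and overhead."""
    return params_b * gb_per_billion

def training_budget_gb(params_b, bytes_per_param=2, low=4, high=8):
    """Rough training range: 4-8x the weight memory (gradients, optimizer, activations)."""
    weights_gb = params_b * bytes_per_param
    return low * weights_gb, high * weights_gb

for name, params_b in [("LLaMA 3 8B", 8.0), ("LLaMA 3 70B", 70.6)]:
    infer = inference_budget_gb(params_b)
    train_lo, train_hi = training_budget_gb(params_b)
    print(f"{name:<12} inference ~{infer:.0f} GB, training ~{train_lo:.0f}-{train_hi:.0f} GB")
```

These are planning numbers only; actual usage depends on batch size, context length, and the optimizer and parallelism strategy.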

Source: Memory calculation rule of thumb from multiple sources including Modal (“approximately 2GB of GPU memory per 1B parameters in FP16”) and Spheron Network (“A Llama 3.1 70B model’s weights consume approximately 140 GB at FP16, but the total memory footprint in production can exceed 200 GB”).

How Size Relates to Capability

A natural question: does a bigger model always mean a better model? The short answer is “usually, but not always.” The relationship between model size and capability is real but nuanced.

The General Trend

Within the same model family and training approach, larger models consistently outperform smaller ones. LLaMA 3 70B is substantially more capable than LLaMA 3 8B across virtually every benchmark. LLaMA 3.1 405B outperforms LLaMA 3 70B. This is not surprising: more parameters mean more capacity to store knowledge, recognize patterns, and perform complex reasoning.

But model size is only one factor. The other critical factors are:

Training data quantity and quality: A smaller model trained on more (or better) data can outperform a larger model trained on less data. The Chinchilla scaling laws (Chapter 13) formalize this relationship.
Architecture efficiency: MoE models achieve better performance per active parameter than dense models, because they can store more knowledge in their total parameters while keeping compute costs low. LLaMA 4 Maverick (17B active, 400B total) competes with models that have far more active parameters.
Training techniques: Reinforcement learning from human feedback (RLHF, Chapter 15), extended thinking (Chapter 16), and other post-training techniques can dramatically improve a model’s capabilities without changing its parameter count.
Distillation: Smaller models can be trained to mimic the behavior of larger models, transferring knowledge from a “teacher” to a “student.” This is why some small models punch above their weight class.

The Diminishing Returns Problem

Each doubling of model size produces smaller improvements than the previous doubling. Going from 1B to 8B parameters produces a dramatic improvement in capability. Going from 8B to 70B produces a significant but smaller improvement. Going from 70B to 405B produces a noticeable but even smaller improvement. This pattern of diminishing returns is one of the central challenges in scaling AI, and we will explore it in detail in Chapter 13.
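One way to picture diminishing returns is to assume, purely for illustration, that loss falls as a power law in parameter count. The constants below (`floor`, `scale`, `exponent`) are placeholders chosen to give a plausible-looking curve. They are not fitted to any real model, and the function name `illustrative_loss` is our own.

```python
def illustrative_loss(num_params, floor=1.7, scale=400.0, exponent=0.34):
    """Hypothetical power-law curve: floor + scale / N**exponent."""
    return floor + scale / num_params ** exponent

sizes = [1e9 * 2 ** k for k in range(6)]              # 1B, 2B, 4B, ..., 32B
losses = [illustrative_loss(n) for n in sizes]
gains = [a - b for a, b in zip(losses, losses[1:])]   # improvement from each doubling

for n, gain in zip(sizes[1:], gains):
    print(f"doubling to {n / 1e9:>4.0f}B lowers the illustrative loss by {gain:.4f}")
```

Under a power law, every doubling shrinks the remaining reducible loss by the same factor, so the absolute improvement gets smaller each time, which is the pattern described above.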

Size Classes in Practice

As of March 2026, the industry has settled into rough size classes, each with different use cases:

Size Class	Parameter Range	Typical Use	Example Models
Tiny	0.5B - 3B	On-device, edge, mobile	Qwen 2.5 0.5B/1.5B/3B, Qwen 3 0.6B/1.7B
Small	7B - 14B	Single-GPU inference, local use	Mistral 7B, LLaMA 3 8B, Qwen 2.5 14B, Qwen 3 8B/14B
Medium	30B - 72B	Multi-GPU inference, enterprise	LLaMA 3 70B, Qwen 2.5 72B, Qwen 3 32B
Large	100B - 400B	GPU cluster, API serving	LLaMA 4 Maverick, Qwen 3.5 397B, Mistral Large 2
Frontier	400B+	Data center scale	LLaMA 3.1 405B, DeepSeek-V3, GPT-5

The “small” category (7B-14B) has become the sweet spot for local deployment. These models can run on a single consumer GPU with quantization, and modern 7B-8B models are remarkably capable for their size. The “medium” category (30B-72B) offers a significant step up in quality and is the standard for enterprise deployments. The “frontier” category represents the cutting edge, accessible primarily through cloud APIs due to the enormous hardware requirements.
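The size-class table can be turned into a small lookup. The table leaves gaps between classes (for example, between 14B and 30B), so this sketch assumes each class extends up to the lower bound of the next one, and that exactly 400B still counts as Large. Those boundary choices are ours, not an industry standard, and the function name `size_class` is hypothetical.

```python
def size_class(total_params_b):
    """Map a total parameter count (in billions) to a rough size class."""
    if total_params_b < 7:
        return "Tiny"
    if total_params_b < 30:
        return "Small"
    if total_params_b < 100:
        return "Medium"
    if total_params_b <= 400:
        return "Large"
    return "Frontier"

for name, params_b in [
    ("Qwen 2.5 0.5B", 0.5),
    ("Mistral 7B", 7.3),
    ("Qwen 2.5 72B", 72.7),
    ("LLaMA 4 Maverick", 400.0),
    ("LLaMA 3.1 405B", 405.0),
]:
    print(f"{name:<18} {params_b:>6.1f}B  ->  {size_class(params_b)}")
```

Note that the lookup uses total parameters, so an MoE model such as LLaMA 4 Maverick lands in a class driven by its memory footprint rather than its per-token compute.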

Hands-On: Computing Model Sizes

Let’s build a general-purpose parameter counter that works for any Transformer model:

def count_params(
    vocab_size,
    hidden_size,
    num_layers,
    num_q_heads,
    num_kv_heads,
    head_dim,
    intermediate_size,
    tie_embeddings=False,
    num_experts=1,
    num_experts_per_tok=1,
    num_shared_experts=0,
    moe_intermediate_size=None,
    dense_layers=0,
):
    """Count total and active parameters for a Transformer model.
    
    Supports both dense and MoE architectures.
    """
    # Embedding
    embed = vocab_size * hidden_size
    output_proj = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size

    # Per-layer attention (same for dense and MoE layers)
    attn = (hidden_size * num_q_heads * head_dim +
            hidden_size * num_kv_heads * head_dim +
            hidden_size * num_kv_heads * head_dim +
            num_q_heads * head_dim * hidden_size)
    norm = 2 * hidden_size

    # Dense FFN
    dense_ffn = 3 * hidden_size * intermediate_size

    # MoE FFN (per expert)
    moe_inter = moe_intermediate_size or intermediate_size
    expert_ffn = 3 * hidden_size * moe_inter

    # Count layers
    moe_layers = num_layers - dense_layers

    # Total params
    dense_layer_params = attn + dense_ffn + norm
    # MoE layer: attention + norm + (num_experts routed + shared) experts + router
    router_params = hidden_size * num_experts if num_experts > 1 else 0
    shared_ffn = num_shared_experts * expert_ffn
    moe_layer_params = attn + norm + num_experts * expert_ffn + shared_ffn + router_params

    total = embed + output_proj + final_norm
    total += dense_layers * dense_layer_params
    total += moe_layers * (moe_layer_params if num_experts > 1 else dense_layer_params)

    # Active params per token (the router runs for every token, so it counts as active)
    active_expert_ffn = num_experts_per_tok * expert_ffn + shared_ffn
    active_layer = attn + norm + (active_expert_ffn + router_params if num_experts > 1 else dense_ffn)
    active_dense_layer = attn + dense_ffn + norm

    active = embed + output_proj + final_norm
    active += dense_layers * active_dense_layer
    active += moe_layers * (active_layer if num_experts > 1 else active_dense_layer)

    return total, active


# Real model configurations
configs = {
    "LLaMA 3 8B": dict(
        vocab_size=128_256, hidden_size=4096, num_layers=32,
        num_q_heads=32, num_kv_heads=8, head_dim=128,
        intermediate_size=14336,
    ),
    "LLaMA 3 70B": dict(
        vocab_size=128_256, hidden_size=8192, num_layers=80,
        num_q_heads=64, num_kv_heads=8, head_dim=128,
        intermediate_size=28672,
    ),
    "LLaMA 3.1 405B": dict(
        vocab_size=128_256, hidden_size=16384, num_layers=126,
        # 8 KV heads, matching the architecture table in the Llama 3 paper
        num_q_heads=128, num_kv_heads=8, head_dim=128,
        intermediate_size=53248,
    ),
    "Qwen 2.5 72B": dict(
        vocab_size=152_064, hidden_size=8192, num_layers=80,
        num_q_heads=64, num_kv_heads=8, head_dim=128,
        intermediate_size=29568,
    ),
    "Qwen 2.5 7B": dict(
        vocab_size=152_064, hidden_size=3584, num_layers=28,
        num_q_heads=28, num_kv_heads=4, head_dim=128,
        intermediate_size=18944,
    ),
    "Mistral 7B": dict(
        vocab_size=32_000, hidden_size=4096, num_layers=32,
        num_q_heads=32, num_kv_heads=8, head_dim=128,
        intermediate_size=14336,
    ),
    "Mistral Large 2": dict(
        vocab_size=32_768, hidden_size=12288, num_layers=88,
        num_q_heads=96, num_kv_heads=8, head_dim=128,
        intermediate_size=28672,
    ),
}

print(f"{'Model':<20} {'Total Params':>14} {'Active Params':>14}")
print("-" * 52)
for name, cfg in configs.items():
    total, active = count_params(**cfg)
    print(f"{name:<20} {total/1e9:>13.2f}B {active/1e9:>13.2f}B")

print("\n--- MoE Models ---")
# LLaMA 4 Maverick (alternating dense/MoE layers, interleave_moe_layer_step=2)
# 48 layers total: 24 dense, 24 MoE
total_mav, active_mav = count_params(
    vocab_size=202_048, hidden_size=5120, num_layers=48,
    num_q_heads=40, num_kv_heads=8, head_dim=128,
    intermediate_size=16384,  # dense layer FFN
    num_experts=128, num_experts_per_tok=1,
    num_shared_experts=1,
    moe_intermediate_size=8192,
    dense_layers=24,  # every other layer is dense (interleave step = 2)
)
print(f"{'LLaMA 4 Maverick':<20} {total_mav/1e9:>13.2f}B {active_mav/1e9:>13.2f}B")
print("  (Approximate; actual alternating dense/MoE pattern per interleave_moe_layer_step=2)")

# Qwen 3 235B-A22B (all layers are MoE, decoder_sparse_step=1)
total_q3, active_q3 = count_params(
    vocab_size=151_936, hidden_size=4096, num_layers=94,
    num_q_heads=64, num_kv_heads=4, head_dim=128,
    intermediate_size=12288,  # dense FFN (not used; all layers are MoE)
    num_experts=128, num_experts_per_tok=8,
    moe_intermediate_size=1536,
    dense_layers=0,
)
print(f"{'Qwen 3 235B-A22B':<20} {total_q3/1e9:>13.2f}B {active_q3/1e9:>13.2f}B")

print(f"\nMemory at FP16 (weights only):")
for name, cfg in configs.items():
    total, _ = count_params(**cfg)
    mem_gb = total * 2 / 1e9
    print(f"  {name:<20} {mem_gb:>8.1f} GB")

This code computes parameter counts for real model configurations and shows the memory requirements. You can modify the configurations to explore how changing the hidden dimension, number of layers, or number of experts affects the total parameter count.
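
As one example of that kind of exploration, the sketch below compares widening a model against deepening it. It uses a compact per-layer formula (attention plus a three-matrix gated FFN plus two norm vectors) and starts from the LLaMA 3 8B layer shape. The "wide" and "deep" variants are hypothetical configurations invented for illustration, not real models.

```python
def per_layer_params(hidden_size, num_q_heads, num_kv_heads, head_dim, intermediate_size):
    """Parameters in one dense Transformer layer (attention + gated FFN + 2 norms)."""
    attn = 2 * hidden_size * num_q_heads * head_dim      # W_Q and W_O
    attn += 2 * hidden_size * num_kv_heads * head_dim    # W_K and W_V
    ffn = 3 * hidden_size * intermediate_size            # gate, up, down
    norm = 2 * hidden_size                               # two RMSNorm scale vectors
    return attn + ffn + norm

# Baseline: the LLaMA 3 8B layer shape, 32 layers
base_layer = per_layer_params(4096, 32, 8, 128, 14336)
base_stack = 32 * base_layer

# Hypothetical "wide" variant: hidden size, head counts, and FFN width all doubled
wide_layer = per_layer_params(8192, 64, 16, 128, 28672)
wide_stack = 32 * wide_layer

# Hypothetical "deep" variant: same layer shape, twice as many layers
deep_stack = 64 * base_layer

print(f"Baseline layer stack: {base_stack / 1e9:.2f}B")
print(f"2x width:             {wide_stack / 1e9:.2f}B ({wide_stack / base_stack:.2f}x)")
print(f"2x depth:             {deep_stack / 1e9:.2f}B ({deep_stack / base_stack:.2f}x)")
```

Doubling the width roughly quadruples the layer-stack parameters because every large matrix has the hidden dimension on one side and a quantity proportional to it on the other. Doubling the depth only doubles the count. Embeddings are left out here; they grow linearly with hidden size.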

Visualizing the Model Size Landscape

Let’s create a visualization that shows how model sizes have evolved and how dense and MoE models compare:

import matplotlib.pyplot as plt

# Model data: (name, total_params_B, active_params_B, year, is_moe)
# GPT-2 is plotted at its largest released size (1.5B), matching the history table.
models = [
    ("GPT-2\n(1.5B)",         1.5,     1.5, 2019, False),
    ("GPT-3\n(175B)",       175.0,   175.0, 2020, False),
    ("Mistral 7B",            7.3,     7.3, 2023, False),
    ("Mixtral\n8x7B",        46.7,    12.9, 2023, True),
    ("LLaMA 3\n8B",           8.0,     8.0, 2024, False),
    ("LLaMA 3\n70B",         70.6,    70.6, 2024, False),
    ("Mixtral\n8x22B",      141.0,    39.0, 2024, True),
    ("LLaMA 3.1\n405B",     405.0,   405.0, 2024, False),
    ("DeepSeek\n-V3",       671.0,    37.0, 2024, True),
    ("Qwen 3\n235B",        235.0,    22.0, 2025, True),
    ("LLaMA 4\nScout",      109.0,    17.0, 2025, True),
    ("LLaMA 4\nMaverick",   400.0,    17.0, 2025, True),
    ("Qwen 3.5\n397B",      397.0,    17.0, 2026, True),
]

fig, ax = plt.subplots(figsize=(14, 7))

for name, total, active, year, is_moe in models:
    color = '#e74c3c' if is_moe else '#3498db'
    marker = 'D' if is_moe else 'o'

    # Plot total params
    ax.scatter(year, total, s=120, c=color, marker=marker, zorder=5,
               edgecolors='black', linewidth=0.5)
    
    # For MoE models, also show active params with a connected line
    if is_moe:
        ax.scatter(year, active, s=60, c=color, marker=marker, alpha=0.4, zorder=4)
        ax.plot([year, year], [active, total], color=color, alpha=0.4,
                linewidth=1.5, linestyle='--')

    # Label each point with the model name, offset slightly up and to the right
    ax.annotate(name, (year, total), textcoords="offset points",
                xytext=(8, 5), fontsize=7, ha='left')

ax.set_yscale('log')
ax.set_xlabel('Release Year', fontsize=12)
ax.set_ylabel('Parameters (Billions, log scale)', fontsize=12)
ax.set_title('LLM Model Sizes: Dense vs. MoE (2019-2026)', fontsize=14)
ax.set_xlim(2018.5, 2026.8)
ax.set_ylim(0.05, 1500)
ax.grid(True, alpha=0.3, which='both')

# Legend
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], marker='o', color='w', markerfacecolor='#3498db',
           markersize=10, label='Dense (total = active)'),
    Line2D([0], [0], marker='D', color='w', markerfacecolor='#e74c3c',
           markersize=10, label='MoE (total params)'),
    Line2D([0], [0], marker='D', color='w', markerfacecolor='#e74c3c',
           markersize=8, alpha=0.4, label='MoE (active params)'),
]
ax.legend(handles=legend_elements, loc='upper left', fontsize=10)

plt.tight_layout()
plt.savefig('model_sizes.png', dpi=150, bbox_inches='tight')
plt.show()
print("Plot saved to model_sizes.png")

This visualization reveals two important trends:

The MoE revolution: Starting in late 2023 with Mixtral, the industry shifted toward MoE architectures that decouple total parameters from active parameters. The gap between total and active parameters (shown by the dashed lines) has grown dramatically: Mixtral 8x7B has about 3.6x more total than active parameters, DeepSeek-V3 about 18x, and LLaMA 4 Maverick about 23.5x.
The plateau in active parameters: While total parameter counts have continued to grow (from 175B for GPT-3 to 671B for DeepSeek-V3), the active parameter counts of MoE models have stayed in a narrow band of roughly 13B to 39B and have drifted downward since 2024. LLaMA 4 Maverick uses only 17B active parameters, about twice LLaMA 3 8B's total and less than a quarter of LLaMA 3 70B's. The industry has learned that you can achieve frontier-level quality by having many experts (large total parameters) while keeping per-token compute low (small active parameters).
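
The size of that gap is easy to quantify. The sketch below computes the total-to-active ratio for the MoE models in the plot, using the same rounded parameter counts as the `models` list; the helper name `sparsity_ratio` is introduced here for illustration.

```python
# (total_params_B, active_params_B), same rounded figures as the plot data
moe_models = {
    "Mixtral 8x7B":     (46.7, 12.9),
    "Mixtral 8x22B":    (141.0, 39.0),
    "DeepSeek-V3":      (671.0, 37.0),
    "Qwen 3 235B":      (235.0, 22.0),
    "LLaMA 4 Scout":    (109.0, 17.0),
    "LLaMA 4 Maverick": (400.0, 17.0),
}

def sparsity_ratio(total_b, active_b):
    """How many parameters are stored for each parameter used per token."""
    return total_b / active_b

print(f"{'Model':<18} {'Total':>8} {'Active':>8} {'Ratio':>7}")
for name, (total_b, active_b) in moe_models.items():
    ratio = sparsity_ratio(total_b, active_b)
    print(f"{name:<18} {total_b:>7.1f}B {active_b:>7.1f}B {ratio:>6.1f}x")
```

Because the inputs are rounded, the ratios are approximate, but the trend is clear: the early Mixtral models store under 4 parameters per active parameter, while the 2024-2025 models store 10 to 24.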

The Relationship Between Size and Hardware

Understanding model sizes is not just an academic exercise. It directly determines what hardware you need to run a model. Here is a practical guide:

Single Consumer GPU (16-24 GB VRAM)

With INT4 quantization, you can run models up to about 30-40B total parameters on a single consumer GPU like an NVIDIA RTX 4090 (24 GB VRAM):

Mistral 7B at INT4: ~3.7 GB (fits easily)
LLaMA 3 8B at INT4: ~4.0 GB (fits easily)
LLaMA 3 70B at INT4: ~35 GB (does not fit on 24 GB)

Single Server GPU (80 GB VRAM)

An NVIDIA H100 with 80 GB VRAM can handle:

LLaMA 3 70B at INT4: ~35 GB (fits)
LLaMA 3 70B at FP16: ~140 GB (does not fit on one GPU)
LLaMA 4 Scout at INT4: ~55 GB (fits)

Multi-GPU Server (2-8 GPUs)

With 2-8 H100 GPUs (160-640 GB total VRAM):

LLaMA 3 70B at FP16: ~140 GB (2 GPUs)
LLaMA 3.1 405B at INT4: ~200 GB (3-4 GPUs)
LLaMA 4 Maverick at FP8: ~400 GB (6 or more GPUs; five 80 GB cards would leave no headroom)

Multi-Node Cluster

For the largest models:

LLaMA 3.1 405B at FP16: ~810 GB (requires 2+ nodes of 8 GPUs each)
DeepSeek-V3 at FP16: ~1.34 TB (requires multiple nodes)

These are rough estimates for weight memory only. Real deployments need additional memory for KV cache, activations, and framework overhead, so the actual GPU requirements are typically 20-50% higher than the weight-only calculation suggests.
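
The estimates in this section can be reproduced with a few lines of arithmetic. The following is a rough sketch under stated assumptions: the `overhead` factor is a single flat multiplier standing in for KV cache, activations, and framework overhead (real overhead depends on batch size and context length), and the helper names are made up for this example.

```python
import math

# Bytes per parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_b, precision):
    """Weight memory in GB (1 GB = 1e9 bytes) for a count given in billions."""
    return params_b * BYTES_PER_PARAM[precision]

def gpus_needed(params_b, precision, gpu_vram_gb=80, overhead=0.0):
    """Minimum GPU count to hold the weights plus a flat overhead fraction."""
    required_gb = weight_memory_gb(params_b, precision) * (1 + overhead)
    return math.ceil(required_gb / gpu_vram_gb)

examples = [("LLaMA 3 70B", 70, "fp16"), ("LLaMA 3.1 405B", 405, "int4")]
for name, params_b, precision in examples:
    weights = weight_memory_gb(params_b, precision)
    bare = gpus_needed(params_b, precision)                    # weights only
    padded = gpus_needed(params_b, precision, overhead=0.2)    # assumed 20% overhead
    print(f"{name} at {precision}: {weights:.1f} GB weights, "
          f"{bare} GPUs weights-only, {padded} GPUs with 20% overhead")
```

The weights-only count is a lower bound. Adding even the low end of the 20-50% overhead range pushes several of these configurations up by one GPU, which is why the list above gives ranges rather than single numbers.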

Worked Example: Estimating Memory for a New Model

Suppose you encounter a new open-weight model with the following configuration:

hidden_size: 6,144
num_hidden_layers: 64
num_attention_heads: 48
num_key_value_heads: 8
head_dim: 128
intermediate_size: 21,504
vocab_size: 150,000
tie_word_embeddings: False

Let’s estimate its total parameters and memory requirements step by step.

Step 1: Per-Layer Attention

W_Q: 6,144 x (48 x 128) = 6,144 x 6,144 = 37,748,736
W_K: 6,144 x (8 x 128)  = 6,144 x 1,024 =  6,291,456
W_V: 6,144 x (8 x 128)  = 6,144 x 1,024 =  6,291,456
W_O: (48 x 128) x 6,144 = 6,144 x 6,144 = 37,748,736
                                    Total = 88,080,384 (88.1M)

Step 2: Per-Layer FFN

W_gate: 6,144 x 21,504 = 132,120,576
W_up:   6,144 x 21,504 = 132,120,576
W_down: 21,504 x 6,144 = 132,120,576
                  Total = 396,361,728 (396.4M)

Step 3: Per-Layer Total

Attention:   88,080,384
FFN:        396,361,728
RMSNorm:         12,288  (2 x 6,144)
Layer total: 484,454,400 (484.5M)

Step 4: Full Model

64 layers:       31,005,081,600
Embedding:          921,600,000  (150,000 x 6,144)
Output proj:        921,600,000
Final RMSNorm:            6,144
Total:           32,848,287,744  (~32.8B)

Step 5: Memory Requirements

FP32:  32.8B x 4 bytes = 131.4 GB
FP16:  32.8B x 2 bytes =  65.7 GB
INT8:  32.8B x 1 byte  =  32.8 GB
INT4:  32.8B x 0.5     =  16.4 GB

This hypothetical 32.8B model would fit on a single 80 GB H100 GPU at FP16 (65.7 GB < 80 GB), leaving about 14 GB for KV cache and activations, or on a consumer RTX 4090 at INT4 (16.4 GB < 24 GB). At FP32, the weights alone would require at least 2 H100 GPUs.

# Verify our manual calculation
def estimate_model(hidden, layers, n_q, n_kv, hd, inter, vocab, tied=False):
    attn = hidden * (n_q + 2 * n_kv) * hd + n_q * hd * hidden
    ffn = 3 * hidden * inter
    norm = 2 * hidden
    per_layer = attn + ffn + norm
    
    embed = vocab * hidden
    out = 0 if tied else vocab * hidden
    final = hidden
    
    total = layers * per_layer + embed + out + final
    return total

total = estimate_model(6144, 64, 48, 8, 128, 21504, 150_000)
print(f"Total parameters: {total:,} ({total/1e9:.2f}B)")
print(f"FP16 memory: {total * 2 / 1e9:.1f} GB")
print(f"INT4 memory: {total * 0.5 / 1e9:.1f} GB")
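
Weights are only part of the picture. To get a feel for the KV cache share of the overhead mentioned earlier, here is a hedged sketch for the same hypothetical configuration. It assumes standard grouped-query attention with keys and values cached for every layer, an FP16 cache (2 bytes per value), and a batch size of 1; the function name and the 8,192-token context are choices made for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one key vector and one value vector per KV head per layer
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

seq_len = 8192
cache = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, seq_len=seq_len)
print(f"KV cache per token:  {cache // seq_len:,} bytes")
print(f"KV cache at {seq_len:,} tokens: {cache / 1e9:.2f} GB")
```

For this configuration the cache costs 262,144 bytes per token, or about 2.1 GB at 8,192 tokens. That is small next to 65.7 GB of FP16 weights, but it grows linearly with both context length and batch size, which is how serving many long requests at once can consume the remaining headroom.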

The Evolution of Model Sizes: A Brief History

To appreciate where we are in March 2026, it helps to see how model sizes have evolved:

Year	Milestone Model	Parameters	Key Innovation
2017	Original Transformer	~65M (base)	Attention is all you need
2018	BERT-Large	340M	Bidirectional pre-training
2019	GPT-2	1.5B	Unsupervised multitask learning
2020	GPT-3	175B	Few-shot learning through scale
2023	LLaMA 2 70B	70B	Open-weight frontier model
2023	Mixtral 8x7B	46.7B (12.9B active)	Open-weight MoE
2024	LLaMA 3 8B/70B	8B-70B	Improved data quality
2024	LLaMA 3.1 405B	405B	Largest open dense model
2024	DeepSeek-V3	671B (37B active)	Efficient MoE at scale
2025	Qwen 3 235B-A22B	235B (22B active)	Open MoE with hybrid thinking
2025	LLaMA 4 Maverick	400B (17B active)	Natively multimodal MoE
2026	Qwen 3.5 397B-A17B	397B (17B active)	Hybrid attention MoE (512 experts, linear + full attention)

Sources: Original Transformer from Vaswani et al. (2017), ~65M parameters for the base model. BERT-Large from Devlin et al. (2018), 340M parameters. GPT-2 from Radford et al. (2019), 1.5B parameters. GPT-3 from Brown et al. (2020), 175B parameters, 96 layers, d_model=12,288. LLaMA 2 from Meta (July 2023). Mixtral 8x7B from Mistral AI (December 2023). LLaMA 3 from Meta (April 2024). LLaMA 3.1 405B from Meta (July 2024). DeepSeek-V3 from DeepSeek (December 26, 2024). Qwen 3 from Alibaba (April 29, 2025). LLaMA 4 from Meta (April 5, 2025). Qwen 3.5 from Alibaba (February 16, 2026).

The growth from 65 million to 671 billion parameters in seven years represents a roughly 10,000x increase. But the trend is not simply “make everything bigger.” The shift to MoE architectures means that the active parameter count (which determines per-token compute cost) has stabilized or even decreased, while total parameter count (which determines knowledge capacity) continues to grow. This is a fundamental shift in how the industry thinks about model size: the goal is no longer to maximize active parameters, but to maximize the ratio of knowledge to compute. By early 2026, two labs have landed on nearly the same design point, roughly 400B total parameters with 17B active (LLaMA 4 Maverick and Qwen 3.5 397B-A17B), which may indicate that this ratio is a practical sweet spot for current hardware and training methods.

Key Takeaways

A parameter is a single number in a weight matrix. A model with 8 billion parameters has 8 billion individual floating-point numbers that were learned during training. These parameters are distributed across the embedding table, attention projections, FFN layers, normalization scales, and output projection.
The hidden dimension (hidden_size) is the most important architectural number. It determines the width of the token representation vector and directly affects the size of every weight matrix in the model. Per-layer parameter count scales roughly as the square of the hidden dimension, provided the FFN width and head count grow along with it. Production models range from hidden_size=768 (GPT-2 Small) to hidden_size=16,384 (LLaMA 3.1 405B).
As of March 2026, open-weight model sizes range from 0.5B (Qwen 2.5 0.5B) to 671B total parameters (DeepSeek-V3). Major open-weight families include LLaMA (Meta), Mistral/Mixtral (Mistral AI), DeepSeek (DeepSeek), and Qwen (Alibaba). The Qwen family has been particularly prolific, releasing Qwen 3 (April 2025) and Qwen 3.5 (February 2026) in rapid succession. Closed-source frontier models (GPT-5, Claude, Gemini) do not publish their parameter counts.
Major labs increasingly keep architecture details secret. OpenAI’s GPT-4 technical report (March 2023) explicitly stated it would not disclose architecture details, citing competitive and safety concerns. This trend has continued with GPT-5, Claude, and Gemini. Open-weight models remain the primary source of verified architecture information.
MoE models decouple total parameters from active parameters. LLaMA 4 Maverick has 400B total parameters but only 17B active per token. Qwen 3.5 independently converges on the same 397B/17B design point. DeepSeek-V3 has 671B total but only 37B active. This means MoE models need GPU memory for all parameters but only use a fraction of them for computation, achieving “big model quality at small model cost.”
Memory requirements follow a simple formula: parameters x bytes per parameter. A 70B model in float16 (2 bytes per parameter) requires approximately 140 GB for weights alone. INT4 quantization (0.5 bytes per parameter) reduces this to about 35 GB. Actual deployment memory is 20-50% higher due to KV cache, activations, and framework overhead.
The industry has settled into size classes: tiny (0.5B-3B) for on-device use, small (7B-14B) for single-GPU deployment, medium (30B-72B) for enterprise, large (100B-400B) for GPU clusters, and frontier (400B+) for data center scale.
Model size alone does not determine capability. Training data quality, architecture efficiency (MoE vs. dense), post-training techniques (RLHF, distillation), and inference-time compute (extended thinking) all play critical roles. A well-trained 7B model can outperform a poorly trained 70B model.

What’s Next

You now understand what model sizes mean in concrete terms: how parameters are counted, where they live, how they translate to memory requirements, and how the landscape of model sizes looks as of March 2026. In Chapter 12, we will dive deep into the Mixture-of-Experts architecture that has become the dominant design pattern for frontier models, explaining exactly how routing works, why MoE enables such dramatic efficiency gains, and how models like DeepSeek-V3 and LLaMA 4 Maverick achieve their remarkable balance of knowledge capacity and computational efficiency.

Chapter 12. Mixture of Experts (MoE), The Dominant Architecture of 2026