Appendix E. Model Comparison Table (March 2026)

Choosing the right model in March 2026 means navigating a landscape of over two dozen frontier models from nine providers, each with different architectures, context windows, pricing tiers, and licensing terms. This appendix organizes every frontier model referenced in this book into a set of comparison tables, with verified specifications as of March 20, 2026.

The table is designed for practical decision-making. If you need to pick a model for a specific use case, start here.

How to Read This Table

Each column captures a specific dimension of model capability:

Model: The official model name as used in API calls or documentation.
Provider: The company that created and serves the model.
Release Date: When the model became publicly available (API or weights).
Architecture: Dense or Mixture of Experts (MoE), with total and active parameter counts where known.
Context Window: Maximum input tokens supported via API. Some models have different limits for different tiers or require opt-in for extended context.
Max Output: Maximum tokens the model can generate in a single response.
API Pricing: Cost per million input and output tokens via the provider’s own API. Third-party providers (Groq, DeepInfra, Together AI) often offer lower prices for open-weight models.
License: Whether the model is closed (API-only), open-weight with a custom license, or open-weight under a standard permissive license (Apache 2.0, MIT).
Modalities: What input types the model accepts (text, images, audio, video) and what it can generate.
Reasoning: Whether the model supports extended thinking or chain-of-thought reasoning modes.
Chapter: Where this model is discussed in the book.

Pricing reflects the provider’s standard API rates. Cached input tokens, batch processing, and third-party hosting can reduce costs significantly (see Chapter 19 for caching and Chapter 24 for serving infrastructure).
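The per-million-token prices in the tables convert to a per-request cost with simple arithmetic. The following is a minimal sketch, assuming flat (untiered) pricing and ignoring caching and batch discounts; the function name request_cost is illustrative and not part of any provider SDK.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float,
                 output_price_per_mtok: float) -> float:
    """Return the USD cost of one request under flat per-million-token pricing."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Example: 10,000 input tokens and 1,000 output tokens at $2.50 / $15.00
# per million tokens (the standard GPT-5.4 rates listed in E.1).
cost = request_cost(10_000, 1_000, 2.50, 15.00)
print(f"${cost:.4f}")  # prints $0.0400
```

Multiplying the result by expected daily request volume gives a first-order budget estimate before tier surcharges and discounts are applied.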

E.1 Flagship Models (Highest Capability per Provider)

These are the most capable models from each major provider as of March 2026. Use these when you need the best possible performance and cost is secondary.

Model	Provider	Release	Architecture	Context	Max Output	Input $/MTok	Output $/MTok	License	Modalities	Reasoning	Ch
GPT-5.4	OpenAI	Mar 5, 2026	Undisclosed	1.05M (272K standard; 1.05M opt-in via API)	128K	$2.50	$15.00	Closed	Text, image in; text out	Yes (5 effort levels incl. xhigh)	11, 16, 20, 21, 23, 24
Claude Opus 4.6	Anthropic	Feb 5, 2026	Undisclosed	1M (GA since Mar 13, 2026)	128K	$5.00	$25.00	Closed	Text, image in; text out	Yes (adaptive: low/med/high/max)	11, 16, 17, 19, 20
Gemini 3.1 Pro	Google	Feb 19, 2026	Undisclosed	1M	64K	$2.00	$12.00	Closed	Text, image, audio, video in; text out	Yes (thinking levels)	20, 24
Grok 4	xAI	Jul 9, 2025	Undisclosed	256K	16K	$3.00	$15.00	Closed	Text, image in; text out	Yes (reasoning-only)	11, 16
Qwen 3-Max	Alibaba	Sep 2025 (preview); Jan 2026 (TTS upgrade)	MoE, 1T+ total	262K	32K	$1.20	$6.00	Closed (API only)	Text in; text out	Yes (test-time scaling)	13, 16
DeepSeek-V3.2	DeepSeek	Dec 1, 2025	MoE, 685B total / 37B active	128K	64K	$0.28	$0.42	MIT	Text in; text out	Yes (thinking mode)	11, 12, 18, 24
Kimi K2.5	Moonshot AI	Jan 26, 2026	MoE, 1.04T total / 32B active (384 experts, top-8)	256K	64K	$0.60	$3.00	MIT	Text, image in; text out	Yes (thinking mode)	N/A

Notes on flagship models:

GPT-5.4 is OpenAI’s most capable model as of March 2026. It merges the coding capabilities of GPT-5.3-Codex with the general reasoning of GPT-5.2 into a single model. The 1.05M context window is available via the API but requires opt-in; the standard context window is 272K tokens. Requests exceeding 272K are billed at 2x the normal input rate and 1.5x the output rate. GPT-5.4 also introduces native computer-use capabilities, scoring 75% on OSWorld-Verified (above the 72.4% human baseline), and tool search, which reduced total token usage by 47% on the MCP Atlas benchmark.
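The long-context billing rule for GPT-5.4 can be sketched as a small function. This is an illustration under two labeled assumptions, not OpenAI's billing logic: the 272K threshold is measured on input tokens, and once it is crossed the multipliers apply to every token in the request rather than only to the tokens above the threshold. The name gpt54_cost is hypothetical.

```python
STANDARD_LIMIT = 272_000                 # tokens; standard context per the note above
INPUT_PRICE, OUTPUT_PRICE = 2.50, 15.00  # USD per million tokens (E.1)


def gpt54_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost, applying 2x input / 1.5x output above 272K input tokens.

    Assumption: the multipliers cover the whole request once the input
    exceeds the threshold, not just the excess tokens.
    """
    long_context = input_tokens > STANDARD_LIMIT
    in_rate = INPUT_PRICE * (2.0 if long_context else 1.0)
    out_rate = OUTPUT_PRICE * (1.5 if long_context else 1.0)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


print(f"200K-token prompt: ${gpt54_cost(200_000, 2_000):.3f}")
print(f"500K-token prompt: ${gpt54_cost(500_000, 2_000):.3f}")
```

Under these assumptions, a prompt that crosses the threshold costs more than twice as much per input token, so trimming a prompt to stay under 272K can matter more than the raw token savings suggest.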

Claude Opus 4.6 is Anthropic’s flagship, released just three months after Opus 4.5. The 1M context window was initially in beta (requiring a specific API header), but Anthropic made it generally available on March 13, 2026, simultaneously removing the long-context surcharge. Requests of any length up to 1M tokens are now billed at the same per-token rate: $5/$25 per million tokens. Opus 4.6 achieves 80.8% on SWE-bench Verified (79.2% in Thinking mode per vals.ai).

Gemini 3.1 Pro is Google’s most capable Pro-tier model. The official DeepMind model card lists a 1M-token context window. Some third-party sources incorrectly claim 2M; the verified figure from the official model card is 1M. Pricing is $2/$12 per million tokens for prompts up to 200K tokens, rising to $4/$18 for prompts above 200K.

Grok 4 is xAI’s reasoning-first model. It has no non-reasoning mode. The 256K context window is confirmed in official developer documentation. Requests exceeding 128K tokens are billed at a higher extended-context rate. Grok 4 is available via SuperGrok subscriptions starting at $30/month or via the API.

Qwen 3-Max is Alibaba’s largest model, exceeding one trillion total parameters. It is available only via Alibaba Cloud Model Studio (API), not as downloadable weights. The January 2026 upgrade added test-time scaling (TTS), achieving 100% on AIME 2025. Pricing uses tiered rates based on input length: $1.20/$6.00 per million tokens for requests up to 32K input tokens, $2.40/$12.00 for 32K-128K, and $3.00/$15.00 for 128K-252K (International deployment; per official Alibaba Cloud pricing page, updated March 19, 2026).
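The tiered Qwen 3-Max rates quoted above can be expressed as a lookup over input length. This is a sketch only: it assumes "K" means 1,000 tokens, that each tier's upper bound is inclusive, and that the whole request is billed at the rate of the tier its input length falls into. The names QWEN3_MAX_TIERS and qwen3_max_cost are hypothetical.

```python
# (upper bound on input tokens, input $/MTok, output $/MTok), per the tiers above
QWEN3_MAX_TIERS = [
    (32_000, 1.20, 6.00),
    (128_000, 2.40, 12.00),
    (252_000, 3.00, 15.00),
]


def qwen3_max_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost by selecting the pricing tier from the input length."""
    for upper_bound, in_rate, out_rate in QWEN3_MAX_TIERS:
        if input_tokens <= upper_bound:
            return (input_tokens * in_rate
                    + output_tokens * out_rate) / 1_000_000
    raise ValueError("input exceeds the largest priced tier (252K tokens)")


print(f"20K input:  ${qwen3_max_cost(20_000, 1_000):.3f}")
print(f"100K input: ${qwen3_max_cost(100_000, 1_000):.3f}")
```

Because the output rate also rises with the tier, a long prompt raises the cost of every generated token, not just the cost of reading the prompt.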

DeepSeek-V3.2 is the most capable fully open-weight text-only model as of March 2026. It uses the same MoE architecture as DeepSeek-V3 (671B base) with continued pre-training, bringing the total to 685B parameters per the HuggingFace model card. The MIT license allows unrestricted commercial use. At $0.28/$0.42 per million tokens via the DeepSeek API, it is by far the cheapest frontier-class model. DeepSeek-V3.2 scores 70% on SWE-bench Verified and 94.2% on AIME 2026 (per telnyx.com); the V3.2-Speciale reasoning variant scores 73.1% on SWE-bench Verified and 96.0% on AIME 2025 (per beebom.com).

Kimi K2.5 is Moonshot AI’s flagship, released on January 26, 2026. It is the first open-weight native multimodal model at the trillion-parameter scale, built through continued pre-training on approximately 15 trillion mixed visual and text tokens atop the Kimi K2 base model. It uses a 384-expert MoE architecture with top-8 routing (plus one shared expert), activating approximately 32 billion parameters per token. Kimi K2.5 scored 76.8% on SWE-bench Verified at launch in standard mode, and 80.9% with Agent Swarm orchestration (per winbuzzer.com), making it the highest-scoring open-weight model on that benchmark. Its defining feature is Agent Swarm: the model can self-direct up to 100 sub-agents executing 1,500+ parallel tool calls. The MIT license allows unrestricted commercial use. Pricing via the Moonshot API is $0.60/$3.00 per million tokens (per costgoat.com citing official Moonshot platform pricing, updated February 2026); lower rates are available via third-party providers like Fireworks AI and DeepInfra. Cloudflare Workers AI also hosts Kimi K2.5 with a 256K context window.

E.2 Mid-Tier Models (Best Balance of Cost and Capability)

These models offer strong performance at significantly lower cost than the flagships. For most production workloads, one of these is the right choice.

Model	Provider	Release	Architecture	Context	Max Output	Input $/MTok	Output $/MTok	License	Modalities	Reasoning	Ch
GPT-5.4 mini	OpenAI	Mar 17, 2026	Undisclosed	400K	128K	$0.75	$4.50	Closed	Text, image in; text out	Yes	23, 24
Claude Sonnet 4.6	Anthropic	Feb 17, 2026	Undisclosed	1M (GA since Mar 13, 2026)	128K	$3.00	$15.00	Closed	Text, image in; text out	Yes (hybrid)	19
Gemini 3 Flash	Google	Dec 17, 2025	Undisclosed	1M	64K	$0.50	$3.00	Closed	Text, image, audio, video in; text out	Yes (thinking levels)	24
Grok 4 Fast / 4.1 Fast	xAI	Sep 19, 2025 / Nov 2025	Undisclosed	2M	16K	$0.20	$0.50	Closed	Text, image in; text out	Yes (reasoning and non-reasoning SKUs)	20
Grok 4.20 Beta	xAI	Feb 17, 2026 (Beta 2: Mar 3)	MoE, ~3T total (estimated)	2M	256K	$2.00	$6.00	Closed	Text, image in; text, image out	Yes (multi-agent reasoning)	N/A
Qwen 3.5 (397B/17B)	Alibaba	Feb 16, 2026	MoE, 397B total / 17B active	262K (1M via Qwen3.5-Plus hosted API)	32K	$0.60	$3.60	Apache 2.0	Text, image, video in; text out	Yes (hybrid thinking)	12, 17, 21, 22
DeepSeek-R1	DeepSeek	Jan 20, 2025	MoE, 671B total / 37B active	128K	64K	$0.50	$2.18	MIT	Text in; text out	Yes (chain-of-thought)	15, 16

Notes on mid-tier models:

GPT-5.4 mini runs over 2x faster than GPT-5 mini while approaching flagship-level accuracy. It supports native computer-use capabilities and scores 72.1% on OSWorld-Verified, just 2.9 points below the flagship GPT-5.4. On SWE-Bench Pro, it scores 54.4% versus the flagship’s 57.7%, a gap of 3.3 percentage points at 70% lower cost. The 400K context window is confirmed from the official OpenAI announcement.

Claude Sonnet 4.6 delivers what Anthropic describes as “Opus-level intelligence at Sonnet pricing.” It shares the 1M context window with Opus 4.6 (GA since March 13, 2026) and supports hybrid reasoning (combining instant responses with extended thinking). It scores 79.6% on SWE-bench Verified (some sources report 80.2%) and 72.5% on OSWorld-Verified for computer use, within 0.2 percentage points of Opus 4.6. In Claude Code testing, developers preferred Sonnet 4.6 over Sonnet 4.5 70% of the time and over Opus 4.5 59% of the time.

Gemini 3 Flash is the default model in the Gemini app and Google AI Mode in Search. At $0.50/$3.00 per million tokens with a 1M context window, it offers the best value among closed-source models for most use cases. It scores 90.4% on GPQA Diamond and 78% on SWE-bench Verified.

Grok 4 Fast was released in September 2025 with a 2M-token context window, the largest of any closed model in this appendix (matched by Grok 4.20 Beta; only the open-weight LLaMA 4 Scout lists more, at 10M). Grok 4.1 Fast (November 2025) updated the model with improved tool-calling. Both are available at $0.20/$0.50 per million tokens for requests under 128K context, with tiered pricing above that threshold.

Grok 4.20 Beta is xAI’s newest flagship, launched on February 17, 2026, with a second beta iteration on March 3 and a “Beta 0309” reasoning variant on March 9-10. Its defining feature is a native 4-agent collaboration system: four specialized sub-agents (Grok, Harper, Benjamin, Lucas) reason in parallel and debate internally before delivering a unified response. It builds on a ~3T parameter MoE backbone with a 2M-token context window and 256K max output. xAI claims a 65% reduction in hallucinations over Grok 4.1, and Grok 4.20 achieved a 78% non-hallucination rate on the Artificial Analysis Omniscience test, the highest ever recorded by any AI model (per popularaitools.ai). However, on the Artificial Analysis Intelligence Index, Grok 4.20 Beta scores 48 with reasoning enabled, trailing Gemini 3.1 Pro and GPT-5.4 at 57 (per the-decoder.com). Pricing is $2.00/$6.00 per million tokens. Note: Grok 4.20 is still in beta as of March 20, 2026; the official xAI pricing page (docs.x.ai/docs/models) confirms Grok 4.20 as the newest flagship with 2M context but did not render full per-token rates in the table at the time of verification, so the $2/$6 figure is sourced from developer.puter.com and ai-primer.com.

Qwen 3.5 is the most capable open-weight model family as of March 2026. The flagship 397B/17B model uses a hybrid architecture combining Gated DeltaNet (linear attention) with standard attention, plus multi-token prediction. It supports 201 languages and is released under Apache 2.0. The hosted Qwen3.5-Plus API extends the context window to 1M tokens.

DeepSeek-R1 is the model that made open-source reasoning mainstream. Released in January 2025, it matches OpenAI o1 on many reasoning benchmarks while being fully open under the MIT license. The R1-0528 update (May 2025) further closed the gap with o3.

E.3 Budget Models (Lowest Cost for High-Volume Workloads)

When you need to process millions of requests per day, or when the task is simple enough that a smaller model suffices, these models offer the best economics.

Model	Provider	Release	Architecture	Context	Max Output	Input $/MTok	Output $/MTok	License	Modalities	Reasoning	Ch
GPT-5.4 nano	OpenAI	Mar 17, 2026	Undisclosed	128K	32K	$0.20	$1.25	Closed	Text, image in; text out	Limited	24
Claude Haiku 4.5	Anthropic	Oct 15, 2025	Undisclosed	200K	8K	$1.00	$5.00	Closed	Text, image in; text out	No	24
Gemini 3.1 Flash-Lite	Google	Mar 3, 2026	Undisclosed	1M	64K	$0.25	$1.50	Closed	Text, image, audio, video in; text out	No	24
GPT-OSS 120B	OpenAI	Aug 5, 2025	MoE, 117B total / 5.1B active (128 experts, top-4)	128K	32K	$0.15 (Groq)	$0.60 (Groq)	Apache 2.0	Text in; text out	Yes (chain-of-thought)	25
GPT-OSS 20B	OpenAI	Aug 5, 2025	MoE, 21B total / 3.6B active (32 experts, top-4)	128K	32K	$0.075 (Groq)	$0.30 (Groq)	Apache 2.0	Text in; text out	Yes (chain-of-thought)	25

Notes on budget models:

GPT-5.4 nano is OpenAI’s cheapest model in the GPT-5 family. At $0.20 per million input tokens, it undercuts Google’s Gemini 3.1 Flash-Lite on price. OpenAI recommends it for classification, data extraction, ranking, and coding subagents.

Claude Haiku 4.5 is Anthropic’s speed tier. At $1/$5 per million tokens, it is more expensive than the budget options from OpenAI and Google, but it remains the fastest Claude model for latency-sensitive applications.

Gemini 3.1 Flash-Lite launched on March 3, 2026, as Google’s most cost-efficient model. At $0.25/$1.50 per million tokens with a full 1M context window, it offers the largest context window of any budget-tier model. It is 2.5x faster than its predecessor (Gemini 2.5 Flash-Lite).

GPT-OSS models are OpenAI’s first open-weight releases since GPT-2 in 2019. Both use MoE architectures with Apache 2.0 licensing. The 120B model fits on a single 80 GB GPU (H100 or MI300X) and matches o3-mini on many benchmarks. The 20B model runs on devices with just 16 GB of RAM. Pricing shown is via Groq; self-hosting eliminates per-token costs entirely.
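Whether a model's weights fit on a given device follows from parameter count times bits per parameter. The sketch below is a rough estimate that assumes roughly 4-bit quantized weights; the 4.25 bits-per-parameter figure is an illustrative assumption, not a number from this appendix, and the estimate excludes the KV cache and activations. Note that an MoE model must keep all experts in memory, so the total parameter count (not the active count) determines the footprint.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


ASSUMED_BITS = 4.25  # illustrative ~4-bit quantization (assumption)

for name, params in [("GPT-OSS 120B", 117e9), ("GPT-OSS 20B", 21e9)]:
    quantized = weight_memory_gb(params, ASSUMED_BITS)
    half_precision = weight_memory_gb(params, 16)
    print(f"{name}: ~{quantized:.0f} GB at {ASSUMED_BITS} bits, "
          f"~{half_precision:.0f} GB at 16 bits")
```

Under this assumption the 117B-parameter model needs roughly 62 GB for weights, which leaves headroom on an 80 GB GPU, while the same model at 16 bits per parameter would need about 234 GB and would not fit.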

E.4 Open-Weight Models (Downloadable Weights)

These models can be downloaded, self-hosted, and fine-tuned. They are listed separately because their economics are fundamentally different: you pay for compute (GPU hours) rather than per-token API fees. For high-volume workloads, self-hosting can be dramatically cheaper than API access.

Model	Provider	Release	Architecture	Context	License	Modalities	Reasoning	Ch
LLaMA 4 Maverick	Meta	Apr 5, 2025	MoE, 400B total / 17B active (128 experts)	1M	Llama 4 Community License	Text, image in; text out	No	9, 11, 12, 21, 22
LLaMA 4 Scout	Meta	Apr 5, 2025	MoE, 109B total / 17B active (16 experts)	10M	Llama 4 Community License	Text, image in; text out	No	11, 12, 20
Qwen 3.5 (397B/17B)	Alibaba	Feb 16, 2026	MoE, 397B total / 17B active	262K	Apache 2.0	Text, image, video in; text out	Yes (hybrid thinking)	12, 17, 21, 22
Qwen 3.5 small series (0.8B to 35B)	Alibaba	Mar 2, 2026	Dense and MoE variants	262K	Apache 2.0	Text, image, video in; text out	Yes	21, 28
DeepSeek-V3.2	DeepSeek	Dec 1, 2025	MoE, 685B total / 37B active	128K	MIT	Text in; text out	Yes (thinking mode)	11, 12, 18, 24
DeepSeek-R1	DeepSeek	Jan 20, 2025	MoE, 671B total / 37B active	128K	MIT	Text in; text out	Yes (chain-of-thought)	15, 16
Mistral Small 4	Mistral AI	Mar 16, 2026	MoE, 119B total / 6B active (128 experts, top-4)	256K	Apache 2.0	Text, image in; text out	Yes (configurable reasoning_effort)	12, 21, 22
GPT-OSS 120B	OpenAI	Aug 5, 2025	MoE, 117B total / 5.1B active (128 experts, top-4)	128K	Apache 2.0	Text in; text out	Yes	25
GPT-OSS 20B	OpenAI	Aug 5, 2025	MoE, 21B total / 3.6B active (32 experts, top-4)	128K	Apache 2.0	Text in; text out	Yes	25
Kimi K2.5	Moonshot AI	Jan 26, 2026	MoE, 1.04T total / 32B active (384 experts, top-8)	256K	MIT	Text, image in; text out	Yes (thinking mode)	N/A

Notes on open-weight models:

LLaMA 4 Maverick and Scout are Meta’s first MoE models. Both use early fusion with a MetaCLIP vision encoder for native multimodal capabilities. Maverick (400B total, 128 experts) is the higher-capability variant; Scout (109B total, 16 experts) is designed for single-node deployment with a 10M-token context window. The Llama 4 Community License allows commercial use with no revenue restrictions but is not a standard open-source license (it includes specific acceptable use restrictions).

Third-party API pricing for LLaMA 4 Maverick varies by provider: $0.20/$0.60 via Groq (down from $0.50/$0.77 at launch in April 2025), with lower rates via DeepInfra and Together AI. These prices change frequently; check provider pricing pages for current rates.

Qwen 3.5 is the most linguistically diverse open-weight model, supporting 201 languages. The flagship 397B/17B model introduces a hybrid architecture combining Gated DeltaNet (linear attention) with standard attention in a 3:1 ratio, plus multi-token prediction. The small series (0.8B to 35B) shares the same 262K context window and multimodal capabilities, making frontier-class features available on edge devices.

Mistral Small 4 is the newest open-weight model in this table, released on March 16, 2026. It unifies instruct, reasoning, and coding capabilities in a single model with configurable reasoning depth. The official blog lists 6B active parameters per token; the HuggingFace model card lists 6.5B (8B including embedding and output layers). It uses Multi-head Latent Attention (MLA), the same attention mechanism as DeepSeek-V3.

GPT-OSS models mark OpenAI’s return to open weights after six years. Both use SwiGLU activations, RMSNorm, Grouped-Query Attention with 8 KV heads, and RoPE for position encoding. The 120B model uses 36 layers with 128 experts (top-4 routing); the 20B model uses 24 layers with 32 experts (top-4 routing).

Kimi K2.5 is the largest open-weight model in this table by total parameter count (1.04 trillion). It adds native vision capabilities to the Kimi K2 base through a 400-million-parameter vision encoder called MoonViT, enabling multimodal tasks including replicating website user journeys from video demonstrations. The 384-expert MoE architecture activates 32 billion parameters per token (top-8 routing with one additional shared expert). At the NVIDIA GTC 2026 conference (March 18, 2026), Moonshot AI founder Yang Zhilin disclosed the core technology roadmap behind K2.5, including the KDA (Kimi Delta Attention) structure and the Kimi Linear hybrid architecture, which increases decoding speed 5-6x in 128K and 1M super-long context scenarios.

E.5 Reasoning Models (Specialized for Complex Problem-Solving)

These models are specifically designed or optimized for multi-step reasoning tasks: mathematics, coding, scientific analysis, and planning. They use extended thinking (generating internal reasoning tokens before producing a final answer) to improve accuracy on hard problems.

Model	Provider	Release	Context	Input $/MTok	Output $/MTok	Key Benchmark	License	Ch
o3	OpenAI	Apr 16, 2025	200K	$2.00	$8.00	96.7% AIME 2024, 87.7% GPQA Diamond	Closed	16
o4-mini	OpenAI	Apr 16, 2025	200K	$1.10	$4.40	20% better than o3-mini	Closed	16
GPT-5.4 (Thinking mode)	OpenAI	Mar 5, 2026	1.05M	$2.50	$15.00	75% OSWorld-Verified, 77.2% SWE-bench Verified	Closed	16, 23
Claude Opus 4.6 (Thinking)	Anthropic	Feb 5, 2026	1M (GA)	$5.00	$25.00	79.2% SWE-bench Verified (Thinking)	Closed	16
Gemini 2.5 Deep Think	Google	Aug 1, 2025	1M	Premium tier	Premium tier	87.6% LiveCodeBench, Bronze IMO 2025	Closed	16
DeepSeek-R1	DeepSeek	Jan 20, 2025	128K	$0.50	$2.18	79.8% AIME 2024 (pass@1)	MIT	15, 16
DeepSeek-R1-0528	DeepSeek	May 28, 2025	128K	$0.50	$2.18	Matches o3 and Gemini 2.5 Pro on reasoning	MIT	Appendix D
Qwen 3-Max (TTS)	Alibaba	Jan 2026 (TTS upgrade)	262K	$1.20	$6.00	100% AIME 2025	Closed (API)	13, 16

Notes on reasoning models:

The distinction between “reasoning model” and “general model with reasoning mode” is blurring. GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro all support configurable reasoning effort levels, allowing the same model to operate as either a fast general model or a deliberate reasoning model depending on the task. The dedicated reasoning models (o3, o4-mini, DeepSeek-R1) always use extended thinking.

o3 and o4-mini were released together on April 16, 2025. o3 is the more capable model; o4-mini is faster and cheaper. Both support a 200K context window. On June 10, 2025, OpenAI reduced o3 pricing by approximately 80% and released o3-pro for Pro users.

Thinking tokens (the internal reasoning the model generates before its final answer) are billed as output tokens at the standard output rate. This means reasoning models can be significantly more expensive per request than their headline pricing suggests, because a single response might generate thousands of thinking tokens before producing a short answer. Chapter 16 discusses this tradeoff in detail.
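The effect of thinking tokens on cost can be made concrete with a short calculation. This sketch uses the o3 rates from the table above; the token counts are hypothetical and chosen only to illustrate the billing rule, and the function name is illustrative.

```python
def reasoning_request_cost(input_tokens: int, thinking_tokens: int,
                           answer_tokens: int, input_price: float,
                           output_price: float) -> float:
    """USD cost of a request where thinking tokens are billed as output tokens."""
    billed_output = thinking_tokens + answer_tokens
    return (input_tokens * input_price
            + billed_output * output_price) / 1_000_000


O3_INPUT, O3_OUTPUT = 2.00, 8.00  # USD per million tokens (E.5)

# Hypothetical request: 2,000-token prompt, 300-token visible answer.
no_thinking = reasoning_request_cost(2_000, 0, 300, O3_INPUT, O3_OUTPUT)
with_thinking = reasoning_request_cost(2_000, 8_000, 300, O3_INPUT, O3_OUTPUT)

print(f"without thinking tokens: ${no_thinking:.4f}")
print(f"with 8,000 thinking tokens: ${with_thinking:.4f}")
print(f"ratio: {with_thinking / no_thinking:.1f}x")
```

In this hypothetical case the visible answer is identical, but the request costs about 11 times more once the thinking tokens are counted, which is why headline output prices understate the cost of reasoning workloads.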

Note on SWE-bench Verified: In February 2026, OpenAI announced it would discontinue reporting results on SWE-bench Verified, citing significant contamination and test case flaws. OpenAI’s analysis found evidence that all major frontier models had been trained on benchmark solutions, and that 59% of failed test cases contained flaws (per blockchain.news, thenextgentechinsider.com). The benchmark scores listed in this appendix reflect the last reported figures and should be interpreted with this caveat in mind. Chapter 23 discusses the broader challenge of benchmark reliability.

E.6 Pricing Comparison: Input Tokens

This table ranks all models by input token cost, from cheapest to most expensive. It provides a quick reference for cost-sensitive decisions.

Rank	Model	Input $/MTok	Provider	Notes
1	GPT-OSS 20B	$0.075	OpenAI (via Groq)	Open-weight, self-hostable
2	GPT-OSS 120B	$0.15	OpenAI (via Groq)	Open-weight, self-hostable
3	Grok 4.1 Fast	$0.20	xAI	2M context, under 128K pricing
4	GPT-5.4 nano	$0.20	OpenAI	Cheapest GPT-5 family model
5	Gemini 3.1 Flash-Lite	$0.25	Google	1M context at budget pricing
6	DeepSeek-V3.2	$0.28	DeepSeek	Open-weight (MIT), self-hostable
7	Gemini 3 Flash	$0.50	Google	1M context, default Gemini model
8	DeepSeek-R1	$0.50	DeepSeek	Open-weight reasoning model
9	Kimi K2.5	$0.60	Moonshot AI	Open-weight (MIT), 1.04T MoE, self-hostable
10	Qwen 3.5 (397B/17B)	$0.60	Alibaba (via providers)	Open-weight (Apache 2.0)
11	GPT-5.4 mini	$0.75	OpenAI	400K context
12	Claude Haiku 4.5	$1.00	Anthropic	Speed tier
13	o4-mini	$1.10	OpenAI	Reasoning model
14	Qwen 3-Max	$1.20	Alibaba	1T+ parameters, API only
15	Gemini 3.1 Pro	$2.00	Google	Under 200K context
16	Grok 4.20 Beta	$2.00	xAI	Beta; multi-agent; 2M context
17	o3	$2.00	OpenAI	Reasoning model
18	GPT-5.4	$2.50	OpenAI	Under 272K context
19	Grok 4	$3.00	xAI	Reasoning-only model
20	Claude Sonnet 4.6	$3.00	Anthropic	1M context (GA)
21	Claude Opus 4.6	$5.00	Anthropic	1M context (GA)

The price range spans 67x, from $0.075 (GPT-OSS 20B via Groq) to $5.00 (Claude Opus 4.6) per million input tokens. Open-weight models that can be self-hosted (DeepSeek-V3.2, Qwen 3.5, GPT-OSS, Mistral Small 4, Kimi K2.5) eliminate per-token costs entirely, replacing them with fixed GPU infrastructure costs. Chapter 24 covers the economics of self-hosting in detail. Note that Grok 4.20 Beta ($2.00/MTok) and Qwen 3-Max ($1.20/MTok base tier) use tiered pricing that increases with input length; the table shows the lowest tier for each.

E.7 Context Window Comparison

Context window size determines how much information a model can process in a single request. This table ranks models by maximum supported context window.

Rank	Model	Context Window	Notes
1	LLaMA 4 Scout	10,000,000	Open-weight; NIAH accuracy drops to 89% at 10M
2	Grok 4 Fast / 4.1 Fast	2,000,000	Tiered pricing above 128K
3	Grok 4.20 Beta	2,000,000	Beta; multi-agent architecture; 256K max output
4	GPT-5.4	1,050,000	Opt-in via API; 272K standard; surcharge above 272K
5	LLaMA 4 Maverick	1,000,000	Open-weight
6	Claude Opus 4.6	1,000,000	GA since March 13, 2026; standard pricing at all lengths
7	Claude Sonnet 4.6	1,000,000	GA since March 13, 2026; standard pricing at all lengths
8	Gemini 3.1 Pro	1,000,000	Per official DeepMind model card
9	Gemini 3 Flash	1,000,000	Default Gemini model
10	Gemini 3.1 Flash-Lite	1,000,000	Budget tier with full context
11	Qwen 3.5 (via Qwen3.5-Plus API)	1,000,000	Hosted API extends from 262K native
12	GPT-5.4 mini	400,000	Confirmed from official announcement
13	Qwen 3-Max	262,000	API only
14	Qwen 3.5 (open-weight)	262,000	Native context; 1M via hosted API
15	Mistral Small 4	256,000	Open-weight (Apache 2.0)
16	Kimi K2.5	256,000	Open-weight (MIT), native multimodal, 1.04T total
17	Grok 4	256,000	Extended-context pricing above 128K
18	Claude Haiku 4.5	200,000	Speed tier
19	o3 / o4-mini	200,000	Reasoning models
20	DeepSeek-V3.2	128,000	Open-weight (MIT)
21	DeepSeek-R1	128,000	Open-weight (MIT)
22	GPT-5.4 nano	128,000	Budget tier
23	GPT-OSS 120B / 20B	128,000	Open-weight (Apache 2.0)

The range spans 78x, from 128K tokens (DeepSeek, GPT-OSS, GPT-5.4 nano) to 10M tokens (LLaMA 4 Scout). However, raw context window size does not tell the full story. Chapter 20 discusses the “context rot” problem: model performance degrades on information placed in the middle of very long contexts. LLaMA 4 Scout’s NIAH (Needle-in-a-Haystack) accuracy drops from 95%+ at 8M tokens to 89% at 10M. GPT-5.4 shows accuracy degradation on high-complexity reasoning beyond 256K tokens. As of March 13, 2026, Anthropic made its 1M context window generally available for Opus 4.6 and Sonnet 4.6 at standard pricing, removing the previous beta header requirement and long-context surcharge entirely.

E.8 Architecture Comparison: Dense vs. MoE

As of March 2026, every major open-weight frontier model uses a Mixture of Experts architecture. This table highlights the architectural details of models where they are publicly known.

Model	Total Params	Active Params	Experts	Top-K	Architecture Notes
Qwen 3-Max	1T+	Undisclosed	MoE	Undisclosed	Largest known model by total parameters
Kimi K2.5	1.04T	32B	384	Top-8	MoonViT vision encoder, Agent Swarm, KDA architecture
DeepSeek-V3.2	685B	37B	256 routed + 1 shared	Top-8	MLA attention, DeepSeek Sparse Attention
DeepSeek-R1	671B	37B	256 routed + 1 shared	Top-8	Same base as V3, reasoning-tuned
LLaMA 4 Maverick	400B	17B	128	Top-1	Early fusion multimodal, MetaCLIP vision encoder
Qwen 3.5 (flagship)	397B	17B	MoE	Varies	Hybrid Gated DeltaNet + standard attention, multi-token prediction
Mistral Small 4	119B	6B (6.5B per HF)	128	Top-4	MLA attention, YaRN rope scaling
GPT-OSS 120B	117B	5.1B	128	Top-4	GQA with 8 KV heads, SwiGLU, RMSNorm
LLaMA 4 Scout	109B	17B	16	Top-1	Same architecture as Maverick, fewer experts
GPT-OSS 20B	21B	3.6B	32	Top-4	Runs on 16 GB RAM

Key patterns in March 2026 MoE architectures:

The “large total, small active” pattern is universal. Every open-weight model in the table activates between roughly 3% (Kimi K2.5) and 17% (GPT-OSS 20B) of its total parameters per token. This gives models the knowledge capacity of a very large model (stored across all experts) with the inference speed of a much smaller one (only the active parameters are computed per token).
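The active fraction for each open-weight row can be computed directly from the table. The sketch below copies the total and active parameter counts (in billions) from the E.8 table; the dictionary and function names are illustrative.

```python
# name: (total parameters, active parameters), both in billions, from E.8
OPEN_WEIGHT_MOE = {
    "Kimi K2.5": (1040, 32),
    "DeepSeek-V3.2": (685, 37),
    "LLaMA 4 Maverick": (400, 17),
    "Qwen 3.5 (flagship)": (397, 17),
    "Mistral Small 4": (119, 6),
    "GPT-OSS 120B": (117, 5.1),
    "LLaMA 4 Scout": (109, 17),
    "GPT-OSS 20B": (21, 3.6),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of total parameters that are computed for each token."""
    return active_b / total_b


for name, (total, active) in OPEN_WEIGHT_MOE.items():
    fraction = active_fraction(total, active)
    print(f"{name:22s} {fraction:6.1%} active  (~{total / active:.0f}:1)")
```

The printed ratio column is the same quantity expressed as total-to-active, which is the form used when comparing sparsity across model families.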

Three distinct MoE scales have emerged:

Compact MoE (100-120B total, 5-17B active): LLaMA 4 Scout, GPT-OSS 120B, Mistral Small 4. These fit on a single high-end GPU or a small multi-GPU setup.
Mid-range MoE (400-700B total, 17-37B active): LLaMA 4 Maverick, Qwen 3.5, DeepSeek-V3.2. These require multi-GPU setups but offer the best quality-to-cost ratio.
Frontier MoE (1T+ total): Kimi K2.5 (1.04T, 32B active, MIT, open-weight), Qwen 3-Max (1T+, API-only), Grok 4.20 Beta (estimated ~3T, API-only). Kimi K2.5 is the only model in this tier available as downloadable weights.

Chapter 12 covers MoE architecture in detail, including routing mechanisms, load balancing, and the tradeoffs between expert count and active parameter count.

E.9 License Comparison

The licensing landscape for LLMs in March 2026 spans a wide spectrum, from fully closed APIs to permissive open-source licenses.

License Type	Models	Key Terms
Closed (API only)	GPT-5.4, GPT-5.4 mini/nano, o3, o4-mini, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3 Flash, Gemini 3.1 Flash-Lite, Grok 4, Grok 4 Fast/4.1 Fast, Grok 4.20 Beta, Qwen 3-Max	No access to weights. Usage governed by provider terms of service.
Apache 2.0	Qwen 3.5 (all sizes), Mistral Small 4, GPT-OSS 120B, GPT-OSS 20B	Standard permissive license. Commercial use, modification, and redistribution allowed with attribution. No restrictions on downstream use.
MIT	DeepSeek-V3.2, DeepSeek-R1, Kimi K2.5	Most permissive standard license. Commercial use, modification, and redistribution allowed.
Llama 4 Community License	LLaMA 4 Maverick, LLaMA 4 Scout	Custom license from Meta. Commercial use allowed with no revenue restrictions. Includes acceptable use restrictions. Not a standard open-source license.

The trend is clear: open-weight models are converging on standard permissive licenses. In 2023, Meta’s LLaMA 2 used a custom license with a 700M monthly active user threshold. By 2025, OpenAI released GPT-OSS under Apache 2.0, and DeepSeek uses MIT for all its models. In January 2026, Moonshot AI released Kimi K2.5 under MIT, making it the largest open-weight model under a standard permissive license. The practical difference between Apache 2.0 and MIT is minimal for most users; both allow unrestricted commercial use.

E.10 Choosing the Right Model: Decision Framework

With so many options, here is a practical decision framework based on common use cases:

For general-purpose chat and content generation:

Budget: Gemini 3.1 Flash-Lite ($0.25/MTok) or GPT-5.4 nano ($0.20/MTok)
Balanced: Gemini 3 Flash ($0.50/MTok) or GPT-5.4 mini ($0.75/MTok)
Best quality: GPT-5.4 ($2.50/MTok) or Claude Opus 4.6 ($5.00/MTok)

For coding and software engineering:

Budget: DeepSeek-V3.2 ($0.28/MTok, self-hostable)
Balanced: Claude Sonnet 4.6 ($3.00/MTok) or GPT-5.4 mini ($0.75/MTok)
Best quality: Claude Opus 4.6 Thinking (79.2% SWE-bench) or GPT-5.4 (77.2% SWE-bench)

For complex reasoning (math, science, logic):

Budget: DeepSeek-R1 ($0.50/MTok, self-hostable)
Balanced: o4-mini ($1.10/MTok) or Qwen 3-Max ($1.20/MTok)
Best quality: o3 ($2.00/MTok) or GPT-5.4 Thinking ($2.50/MTok)

For long-context workloads (100K+ tokens):

Budget: Gemini 3.1 Flash-Lite ($0.25/MTok, 1M context)
Balanced: Grok 4.1 Fast ($0.20/MTok, 2M context) or Gemini 3 Flash ($0.50/MTok, 1M context)
Best quality: GPT-5.4 ($2.50/MTok, 1.05M) or Claude Opus 4.6 ($5.00/MTok, 1M GA)

For multimodal workloads (images, audio, video):

Budget: Gemini 3.1 Flash-Lite (text, image, audio, video input at $0.25/MTok)
Balanced: Gemini 3 Flash (same modalities at $0.50/MTok)
Best quality: Gemini 3.1 Pro ($2.00/MTok) or GPT-5.4 ($2.50/MTok, text + image only)
Open-weight: Qwen 3.5 (text, image, video) or LLaMA 4 Maverick (text, image)

For self-hosting and fine-tuning:

Smallest footprint: GPT-OSS 20B (16 GB RAM, Apache 2.0)
Best quality/cost: DeepSeek-V3.2 (685B total, MIT) or Qwen 3.5 397B (Apache 2.0)
Best for agentic workloads: Kimi K2.5 (1.04T total, MIT, Agent Swarm with up to 100 parallel sub-agents)
Best for edge/mobile: Qwen 3.5 small series (0.8B to 9B, Apache 2.0)
Best for fine-tuning: Qwen3-8B or Qwen3.5-9B with LoRA (Chapter 28)
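The first pass of this framework (filter by hard constraints, then sort by price) can be sketched in a few lines. The catalog below is a small subset of models with context windows and input prices copied from the tables in this appendix; the Model class and cheapest_fit function are hypothetical names, and the sketch deliberately ignores quality, output price, and long-context surcharges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    context: int          # maximum context window, in tokens
    input_price: float    # USD per million input tokens (lowest tier)
    open_weight: bool


CATALOG = [
    Model("GPT-5.4 nano", 128_000, 0.20, False),
    Model("Gemini 3.1 Flash-Lite", 1_000_000, 0.25, False),
    Model("DeepSeek-V3.2", 128_000, 0.28, True),
    Model("Gemini 3 Flash", 1_000_000, 0.50, False),
    Model("Kimi K2.5", 256_000, 0.60, True),
    Model("GPT-5.4", 1_050_000, 2.50, False),
    Model("Claude Opus 4.6", 1_000_000, 5.00, False),
]


def cheapest_fit(prompt_tokens: int,
                 require_open_weight: bool = False) -> Optional[Model]:
    """Return the cheapest catalog model whose context window fits the prompt."""
    candidates = [
        m for m in CATALOG
        if m.context >= prompt_tokens
        and (m.open_weight or not require_open_weight)
    ]
    return min(candidates, key=lambda m: m.input_price, default=None)


print(cheapest_fit(50_000))
print(cheapest_fit(500_000))
print(cheapest_fit(200_000, require_open_weight=True))
```

A filter like this only narrows the shortlist; the final choice still depends on evaluating the remaining candidates on the actual task.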

E.11 Key Takeaways

The price range spans 67x for input tokens alone, from $0.075 (GPT-OSS 20B via Groq) to $5.00 (Claude Opus 4.6) per million tokens. For most production workloads, the mid-tier and budget models offer sufficient quality at a fraction of the flagship cost.
Every major open-weight model uses MoE as of March 2026. The “total parameters / active parameters” ratio is the key metric: total parameters set the model’s knowledge capacity, while active parameters set its per-token inference cost. Ratios in the E.8 table range from about 6:1 (GPT-OSS 20B at 21B/3.6B) to about 32:1 (Kimi K2.5 at 1.04T/32B), with most models near 20:1 (Mistral Small 4 at 119B/6B).
Context windows have stratified into three tiers: budget models at 128-200K, standard models at 256K-1M, and long-context specialists at 1-10M. The 1M-token tier is now standard for flagship models from all major providers. As of March 13, 2026, Anthropic made its 1M context window generally available for Opus 4.6 and Sonnet 4.6 at standard pricing, with no surcharge at any length.
Open-weight models have reached parity with closed models on most benchmarks. DeepSeek-V3.2 matches GPT-5 performance at roughly 10x lower API cost (list prices of $0.28/$0.42 versus $1.25/$10.00 per million input/output tokens). Kimi K2.5 scored 76.8% on SWE-bench Verified at launch (80.9% with Agent Swarm orchestration), the highest open-weight score on that benchmark. Qwen 3.5 claims to beat GPT-5.2 and Claude Opus 4.5 across 80% of benchmark categories. The gap between open and closed models is the narrowest it has ever been. Note that OpenAI discontinued SWE-bench Verified reporting in February 2026 due to contamination concerns (see E.5 notes), so benchmark comparisons should be interpreted with caution.
Reasoning capabilities are now table stakes. Every flagship model supports some form of extended thinking or configurable reasoning effort. The dedicated reasoning models (o3, DeepSeek-R1) still lead on the hardest benchmarks, but the gap is closing as general models add reasoning modes.
Multimodal input is standard; multimodal output is not. Every model in this table accepts text input. Most accept images. Google’s Gemini models are the only ones that natively accept both audio and video; Qwen 3.5 accepts video but is not listed with audio input. Only a few models (not listed in this table) can generate images or audio natively; see Chapter 22 for details.
Licensing has shifted toward permissive open-source. Apache 2.0 and MIT now cover models from Alibaba (Qwen 3.5), DeepSeek (V3.2, R1), Mistral (Small 4), Moonshot AI (Kimi K2.5), and even OpenAI (GPT-OSS). Meta is the notable exception: LLaMA 4 ships under the custom Llama 4 Community License rather than a standard one.
Self-hosting economics favor open-weight MoE models. A model like Mistral Small 4 (119B total, 6B active, Apache 2.0) or GPT-OSS 120B (117B total, 5.1B active, Apache 2.0) can run on a single H100 GPU. At cloud GPU rates of $1.25-$3.00/hour, self-hosting becomes cheaper than API access once sustained usage exceeds roughly 1-5 million tokens per hour; the exact break-even is the hourly GPU cost divided by the API price per million tokens. Chapter 24 and Appendix B cover the hardware requirements in detail.
Agentic capabilities are now a key differentiator for open-weight models. Kimi K2.5 (January 2026) introduced Agent Swarm, enabling a single model to self-direct up to 100 sub-agents executing 1,500+ parallel tool calls. This represents a shift from models that merely support tool calling to models that natively orchestrate complex multi-agent workflows. Combined with its MIT license and 1.04-trillion-parameter scale, Kimi K2.5 demonstrates that open-weight models can compete with closed models not just on benchmarks but on agentic architecture.
Multi-agent architectures are emerging at the model level. Grok 4.20 Beta (February 2026) introduced a native 4-agent collaboration system where specialized sub-agents reason in parallel and debate before producing a response. This is distinct from application-level multi-agent frameworks (Chapter 23, Chapter 29); the collaboration happens inside the model itself. Whether this approach delivers consistent improvements over single-model reasoning remains to be validated as the model exits beta.
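
Several of the figures in these takeaways are simple ratios that readers may want to recompute as prices and models change. The sketch below reproduces three of them using only numbers quoted in this appendix. The function names are hypothetical, and the blended API prices passed to the break-even helper are illustrative assumptions, not figures from the table.

```python
"""Back-of-envelope arithmetic for the key takeaways (illustrative sketch)."""


def spread(high: float, low: float) -> float:
    """How many times larger `high` is than `low`."""
    return high / low


def total_to_active(total_b: float, active_b: float) -> float:
    """MoE total-to-active parameter ratio (both in billions)."""
    return total_b / active_b


def breakeven_mtok_per_hour(gpu_usd_per_hour: float, api_usd_per_mtok: float) -> float:
    """Millions of tokens per hour above which renting the GPU costs less than the API.

    Assumes the GPU can actually sustain that throughput, and ignores
    engineering time, idle capacity, and the input/output price split.
    """
    return gpu_usd_per_hour / api_usd_per_mtok


# (total, active) parameters in billions, as quoted in this appendix.
MOE_MODELS = {
    "GPT-OSS 20B": (21, 3.6),
    "DeepSeek-V3.2": (685, 37),
    "Mistral Small 4": (119, 6),
    "GPT-OSS 120B": (117, 5.1),
    "Kimi K2.5": (1040, 32),
}

if __name__ == "__main__":
    # Claude Opus 4.6 input price vs. GPT-OSS 20B via Groq, per million tokens.
    print(f"Input price spread: {spread(5.00, 0.075):.0f}x")

    for name, (total_b, active_b) in MOE_MODELS.items():
        print(f"{name}: {total_to_active(total_b, active_b):.1f}:1")

    # Assumption: hypothetical blended API prices, chosen only to illustrate
    # the low and high ends of the quoted GPU rate range.
    for gpu_rate, api_price in [(1.25, 1.25), (3.00, 0.60)]:
        mtok = breakeven_mtok_per_hour(gpu_rate, api_price)
        print(f"${gpu_rate:.2f}/h GPU vs ${api_price:.2f}/MTok API: "
              f"break-even at {mtok:.1f}M tokens/hour")
```

With these assumed inputs the break-even lands at 1 and 5 million tokens per hour. A cheaper API or a pricier GPU pushes the threshold higher, which is why the comparison is worth rerunning against current rates.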

Appendix F provides further reading for staying current as new models are released after this book’s publication date.

Sources: All specifications, pricing, and release dates in this appendix are verified via web search as of March 20, 2026, and cross-referenced with the source citations in each chapter. Key primary sources include: OpenAI official pricing page (openai.com/api/pricing) and GPT-5.4 deep dive (community.openai.com/t/gpt-5-4-deep-dive-pricing-context-limits-and-tool-search-explained/1375800) confirming $2.50/$15.00 per million tokens, 1.05M context window, 272K standard context, 2x/1.5x surcharge above 272K. OpenAI GPT-5.4 mini and nano announcement (openai.com/index/introducing-gpt-5-4-mini-and-nano) confirming mini at $0.75/$4.50 with 400K context, 72.1% OSWorld-Verified, 54.4% SWE-Bench Pro, and nano at $0.20/$1.25; mini pricing also confirmed by pulse24.ai, implicator.ai, and buildfastwithai.com; mini OSWorld and SWE-Bench Pro scores confirmed by awesomeagents.ai and innovation-village.com. OpenAI GPT-5 launch (openai.com/gpt-5) confirming $1.25/$10.00 per million tokens, 400K context, August 7, 2025 release. OpenAI o3 and o4-mini announcement (openai.com/index/introducing-o3-and-o4-mini) confirming o3 at $2.00/$8.00 and o4-mini at $1.10/$4.40, both with 200K context, April 16, 2025 release; pricing confirmed by simonwillison.net and langcopilot.com. OpenAI GPT-OSS announcement (openai.com/index/introducing-gpt-oss) confirming 120B (117B total, 5.1B active, 128 experts) and 20B (21B total, 3.6B active, 32 experts), both Apache 2.0, 128K context, August 5, 2025 release; architecture details confirmed from arxiv.org/html/2508.12461v1 and cometapi.com. OpenAI SWE-bench Verified discontinuation (openai.com/index/why-we-no-longer-evaluate-swe-bench-verified, blockchain.news, thenextgentechinsider.com) citing contamination and 59% flawed test cases, February 2026. Groq pricing page (groq.com/pricing) confirming LLaMA 4 Maverick at $0.20/$0.60, LLaMA 4 Scout at $0.11/$0.34, GPT-OSS 120B at $0.15/$0.60, GPT-OSS 20B at $0.075/$0.30 per million tokens. 
Anthropic Claude Opus 4.6 announcement (anthropic.com/research/claude-opus-4-6) confirming February 5, 2026 release, $5/$25 per million tokens, 128K max output, 80.8% SWE-bench Verified; pricing confirmed by curlscape.com, karangoyal.cc, gaga.art. Anthropic 1M context GA announcement (claude.com/blog/1m-context-ga) confirming March 13, 2026 general availability of 1M context for Opus 4.6 and Sonnet 4.6 at standard pricing with no surcharge at any length; confirmed by blockchain.news, the-decoder.com, karangoyal.cc, thenextgentechinsider.com, cursor.com. Claude Sonnet 4.6 released February 17, 2026 at $3/$15 with 79.6% SWE-bench Verified (some sources report 80.2% per bytebot.io) and 72.5% OSWorld-Verified (digitalapplied.com, nxcode.io, bytebot.io, claudefa.st, businessworld.in); developers preferred Sonnet 4.6 over Sonnet 4.5 70% of the time and over Opus 4.5 59% of the time (bytebot.io). Claude Haiku 4.5 at $1/$5 released October 15, 2025 (curlscape.com, pecollective.com, siliconrepublic.com, anthropic.com/news/claude-haiku-4-5). Google DeepMind Gemini 3.1 Pro model card (deepmind.google/models/model-cards/gemini-3-1-pro) confirming 1M context window and 64K token output; pricing $2/$12 confirmed by felloai.com, digitalapplied.com, thedeepview.com, theneuron.ai; max output 65,536 tokens (64K) confirmed by aifreeapi.com, apidog.com, replicate.com. Gemini 3 Flash at $0.50/$3.00 with 1M context and 64K output (langcopilot.com, spectrumailab.com, digitalapplied.com, aifreeapi.com); 90.4% GPQA Diamond and 78% SWE-bench Verified (spectrumailab.com, introl.com, gaga.art). Gemini 3.1 Flash-Lite at $0.25/$1.50 with 1M context, March 3, 2026 release (verdent.ai, venkatsoftware.com, launchberg.com, futunn.com, gaga.art, iweaver.ai). xAI Grok 4 at $3/$15 with 256K context in API, 128K in app (aifreeapi.com, costgoat.com, datastudios.org, apidog.com, thenextgentechinsider.com). 
Grok 4 Fast at $0.20/$0.50 with 2M context (x.ai/news/grok-4-fast, natural20.com, testingcatalog.com, langcopilot.com). Grok 4.1 Fast announcement (x.ai/news/grok-4-1-fast) confirming 2M context and agent tools API; pricing confirmed by ainvest.com, mem0.ai, costgoat.com. Grok 4.20 Beta launched February 17, 2026, Beta 2 on March 3, Beta 0309 reasoning variant March 9-10, with native 4-agent collaboration system, ~3T MoE backbone, 2M context, 256K max output, 65% hallucination reduction over Grok 4.1, 78% non-hallucination rate on Artificial Analysis Omniscience test (popularaitools.ai), Intelligence Index score 48 vs GPT-5.4/Gemini 3.1 Pro at 57 (the-decoder.com); pricing $2/$6 per million tokens (developer.puter.com, ai-primer.com, thenextgentechinsider.com, aibase.com); official xAI pricing page (docs.x.ai/docs/models) confirms Grok 4.20 as newest flagship with 2M context but did not render per-token rates in table at time of verification. DeepSeek API pricing March 2026: V3.2 at $0.28/$0.42, R1 at $0.50/$2.18 (tldl.io/resources/deepseek-api-pricing, deepseak.org). DeepSeek-V3.2 model card (huggingface.co/deepseek-ai/DeepSeek-V3.2) confirming 685B total parameters, 37B active, MIT license, 128K context; SWE-bench Verified 70% and AIME 2026 94.2% (telnyx.com); V3.2-Speciale variant 73.1% SWE-bench Verified and 96.0% AIME 2025 (beebom.com). DeepSeek-R1 paper (arxiv.org/abs/2501.12948) confirming 671B total, 37B active, January 20, 2025 release. Meta LLaMA 4 release (huggingface.co/blog/llama4-release) confirming Maverick 400B/17B/128 experts, Scout 109B/17B/16 experts, April 5, 2025 release, Llama 4 Community License. Llama 4 Community License text (ollama.com/library/llama4/blobs/399a8a5a36db). 
Alibaba Cloud Model Studio official pricing page (alibabacloud.com/help/en/model-studio/billing-for-model-studio, updated March 19, 2026) confirming qwen3-max tiered pricing: $1.20/$6.00 for 0-32K input tokens, $2.40/$12.00 for 32K-128K, $3.00/$15.00 for 128K-252K (International deployment). Alibaba Qwen 3.5 announcement (qwen-ai.com) confirming 397B/17B, February 16, 2026 release, Apache 2.0, 262K native context, 201 languages; pricing $0.60/$3.60 via third-party providers (anotherwrapper.com). Qwen 3-Max at $1.20/$6.00 base tier with 262K context (galaxy.ai/model/qwen3-max, qwen-ai.com, alibabacloud.com). Mistral Small 4 announcement (mistral.ai/news/mistral-small-4) confirming 119B total, 6B active, 128 experts, top-4, 256K context, Apache 2.0, March 16, 2026 release; HuggingFace model card (huggingface.co/mistralai/Mistral-Small-4-119B-2603) listing 6.5B active. Moonshot AI Kimi K2.5 released January 26-27, 2026, 1.04T total parameters (confirmed from huggingface.co/blog/mlabonne/kimik25 and ai-rockstars.com), 32B active, 384 experts with top-8 routing plus one shared expert (confirmed from huggingface.co/blog/mlabonne/kimik25: ‘384 experts with 8 activated per token’, scalebytech.com, ycombinator.com/item?id=44639828), 256K context, MIT license (modelslab.com, a2aprotocol.ai, cloudflare.com/workers-ai-large-models, wikipedia.org/wiki/Moonshot_AI, deeplearning.ai, llm-stats.com); 76.8% SWE-bench Verified at launch in standard mode (modelslab.com, recapio.com); 80.9% with Agent Swarm orchestration (winbuzzer.com); Agent Swarm with 100+ sub-agents and 1,500+ parallel tool calls (a2aprotocol.ai); MoonViT 400M-parameter vision encoder for native multimodal (wikipedia.org/wiki/Moonshot_AI); continued pre-training on ~15T mixed visual and text tokens (startuphub.ai, softmaxdata.com); pricing $0.60/$3.00 via Moonshot API (costgoat.com/pricing/kimi-api citing official Moonshot platform pricing, updated February 2026; codecademy.com reports $0.60/$2.50 for output tokens via some providers); KDA architecture and Kimi Linear hybrid disclosed at NVIDIA GTC 2026 March 18 (1ai.net, aibase.com); Cloudflare Workers AI hosting with 256K context (blog.cloudflare.com/workers-ai-large-models).
Claude Opus 4.5 at $5/$25 with 200K context, November 24, 2025 release (anthropic.com/news/claude-opus-4-5). Gemini 2.5 Deep Think August 1, 2025, 87.6% LiveCodeBench (9to5google.com, neowin.net). GPT-5.4 SWE-bench Verified 77.20% (vals.ai/benchmarks/swebench, March 2026). All benchmark scores cross-referenced with vals.ai/benchmarks where available (vals.ai/benchmarks/swebench confirming Claude Opus 4.6 Thinking 79.20%, Claude Opus 4.6 80.8%, GPT-5.4 77.20%, Gemini 3 Flash 76.20% on SWE-bench Verified as of March 2026); note that OpenAI discontinued SWE-bench Verified reporting in February 2026 due to contamination concerns. Model specifications, GPU hardware details, and pricing verified via web search as of March 20, 2026; see individual chapter source citations for complete references.
