Appendix E. Model Comparison Table (March 2026)

Choosing the right model in March 2026 means navigating a landscape of over two dozen frontier models from seven major providers, each with different architectures, context windows, pricing tiers, and licensing terms. This appendix organizes every frontier model referenced in this book into a single comparison table, with verified specifications as of March 20, 2026.

The table is designed for practical decision-making. If you need to pick a model for a specific use case, start here.


How to Read This Table

Each column captures a specific dimension of model capability:

  • Model: The official model name as used in API calls or documentation.
  • Provider: The company that created and serves the model.
  • Release Date: When the model became publicly available (API or weights).
  • Architecture: Dense or Mixture of Experts (MoE), with total and active parameter counts where known.
  • Context Window: Maximum input tokens supported via API. Some models have different limits for different tiers or require opt-in for extended context.
  • Max Output: Maximum tokens the model can generate in a single response.
  • API Pricing: Cost per million input and output tokens via the provider’s own API. Third-party providers (Groq, DeepInfra, Together AI) often offer lower prices for open-weight models.
  • License: Whether the model is closed (API-only), open-weight with a custom license, or open-weight under a standard permissive license (Apache 2.0, MIT).
  • Modalities: What input types the model accepts (text, images, audio, video) and what it can generate.
  • Reasoning: Whether the model supports extended thinking or chain-of-thought reasoning modes.
  • Chapter: Where this model is discussed in the book.

Pricing reflects the provider’s standard API rates. Cached input tokens, batch processing, and third-party hosting can reduce costs significantly (see Chapter 19 for caching and Chapter 24 for serving infrastructure).


E.1 Flagship Models (Highest Capability per Provider)

These are the most capable models from each major provider as of March 2026. Use these when you need the best possible performance and cost is secondary.

| Model | Provider | Release | Architecture | Context | Max Output | Input $/MTok | Output $/MTok | License | Modalities | Reasoning | Ch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | OpenAI | Mar 5, 2026 | Undisclosed | 1.05M (272K standard; 1.05M opt-in via API) | 128K | $2.50 | $15.00 | Closed | Text, image in; text out | Yes (5 effort levels incl. xhigh) | 11, 16, 20, 21, 23, 24 |
| Claude Opus 4.6 | Anthropic | Feb 5, 2026 | Undisclosed | 1M (GA since Mar 13, 2026) | 128K | $5.00 | $25.00 | Closed | Text, image in; text out | Yes (adaptive: low/med/high/max) | 11, 16, 17, 19, 20 |
| Gemini 3.1 Pro | Google | Feb 19, 2026 | Undisclosed | 1M | 64K | $2.00 | $12.00 | Closed | Text, image, audio, video in; text out | Yes (thinking levels) | 20, 24 |
| Grok 4 | xAI | Jul 9, 2025 | Undisclosed | 256K | 16K | $3.00 | $15.00 | Closed | Text, image in; text out | Yes (reasoning-only) | 11, 16 |
| Qwen 3-Max | Alibaba | Sep 2025 (preview); Jan 2026 (TTS upgrade) | MoE, 1T+ total | 262K | 32K | $1.20 | $6.00 | Closed (API only) | Text in; text out | Yes (test-time scaling) | 13, 16 |
| DeepSeek-V3.2 | DeepSeek | Dec 1, 2025 | MoE, 685B total / 37B active | 128K | 64K | $0.28 | $0.42 | MIT | Text in; text out | Yes (thinking mode) | 11, 12, 18, 24 |
| Kimi K2.5 | Moonshot AI | Jan 26, 2026 | MoE, 1.04T total / 32B active (384 experts, top-8) | 256K | 64K | $0.60 | $3.00 | MIT | Text, image in; text out | Yes (thinking mode) | N/A |

Notes on flagship models:

GPT-5.4 is OpenAI’s most capable model as of March 2026. It merges the coding capabilities of GPT-5.3-Codex with the general reasoning of GPT-5.2 into a single model. The 1.05M context window is available via the API but requires opt-in; the standard context window is 272K tokens. Requests exceeding 272K are billed at 2x the normal input rate and 1.5x the output rate. GPT-5.4 also introduces native computer-use capabilities, scoring 75% on OSWorld-Verified (above the 72.4% human baseline), and tool search, which reduced total token usage by 47% on the MCP Atlas benchmark.
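The long-context billing rule above can be sketched as a small cost function. This is a minimal illustration, assuming the 2x and 1.5x multipliers apply to the whole request once its input exceeds 272K tokens (the text does not say whether only the excess tokens are affected); the function and constant names are hypothetical.

```python
# Rates and threshold taken from the GPT-5.4 row and note above.
GPT54_INPUT_PER_MTOK = 2.50
GPT54_OUTPUT_PER_MTOK = 15.00
LONG_CONTEXT_THRESHOLD = 272_000  # tokens

# Assumption: multipliers apply to the entire request, not just the excess.
LONG_INPUT_MULTIPLIER = 2.0
LONG_OUTPUT_MULTIPLIER = 1.5


def gpt54_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a single GPT-5.4 request."""
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    in_rate = GPT54_INPUT_PER_MTOK * (LONG_INPUT_MULTIPLIER if long_context else 1.0)
    out_rate = GPT54_OUTPUT_PER_MTOK * (LONG_OUTPUT_MULTIPLIER if long_context else 1.0)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    print(f"200K in / 2K out: ${gpt54_request_cost(200_000, 2_000):.3f}")
    print(f"500K in / 2K out: ${gpt54_request_cost(500_000, 2_000):.3f}")
```

Under these assumptions, crossing the 272K threshold more than doubles the per-token cost of a request, so it can be worth trimming or chunking inputs that sit just above the limit.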

Claude Opus 4.6 is Anthropic’s flagship, released just three months after Opus 4.5. The 1M context window was initially in beta (requiring a specific API header), but Anthropic made it generally available on March 13, 2026, simultaneously removing the long-context surcharge. Requests of any length up to 1M tokens are now billed at the same per-token rate: $5/$25 per million tokens. Opus 4.6 achieves 80.8% on SWE-bench Verified (79.2% in Thinking mode per vals.ai).

Gemini 3.1 Pro is Google’s most capable Pro-tier model. The official DeepMind model card lists a 1M-token context window; some third-party sources claim 2M, but the model card figure is 1M. Pricing is $2/$12 per million tokens for contexts under 200K, rising to $4/$18 per million tokens above 200K.

Grok 4 is xAI’s reasoning-first model. It has no non-reasoning mode. The 256K context window is confirmed in official developer documentation. Requests exceeding 128K tokens are billed at a higher extended-context rate. Grok 4 is available via SuperGrok subscriptions starting at $30/month or via the API.

Qwen 3-Max is Alibaba’s largest model, exceeding one trillion total parameters. It is available only via Alibaba Cloud Model Studio (API), not as downloadable weights. The January 2026 upgrade added test-time scaling (TTS), achieving 100% on AIME 2025. Pricing uses tiered rates based on input length: $1.20/$6.00 per million tokens for requests up to 32K input tokens, $2.40/$12.00 for 32K-128K, and $3.00/$15.00 for 128K-252K (International deployment; per official Alibaba Cloud pricing page, updated March 19, 2026).
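The tiered Qwen 3-Max pricing described above can be expressed as a lookup over input-length tiers. This is a sketch under stated assumptions: the tier chosen by input length sets the rate for the entire request, upper bounds are inclusive, and "K" means 1,000 tokens. The names `QWEN3_MAX_TIERS` and `qwen3_max_cost` are illustrative, not part of any Alibaba SDK.

```python
# (max input tokens, input $/MTok, output $/MTok), per the tiers quoted above.
QWEN3_MAX_TIERS = [
    (32_000, 1.20, 6.00),
    (128_000, 2.40, 12.00),
    (252_000, 3.00, 15.00),
]


def qwen3_max_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request under tiered pricing."""
    for max_input, in_rate, out_rate in QWEN3_MAX_TIERS:
        if input_tokens <= max_input:
            return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    raise ValueError("input exceeds the largest priced tier (252K tokens)")


if __name__ == "__main__":
    for n in (10_000, 100_000, 200_000):
        print(f"{n:>7} input tokens, 1K output: ${qwen3_max_cost(n, 1_000):.3f}")
```

Because the rate for the whole request depends on the tier, a prompt just over 32K input tokens costs roughly twice as much per token as one just under it.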

DeepSeek-V3.2 is the most capable fully open-weight text-only model as of March 2026. It builds on the DeepSeek-V3 MoE architecture (671B base) with continued pre-training; the HuggingFace model card lists 685B parameters, a figure that, as with DeepSeek-V3, counts the multi-token prediction module weights alongside the 671B main model. The MIT license allows unrestricted commercial use. At $0.28/$0.42 per million tokens via the DeepSeek API, it is by far the cheapest frontier-class model. DeepSeek-V3.2 scores 70% on SWE-bench Verified and 94.2% on AIME 2026 (per telnyx.com); the V3.2-Speciale reasoning variant scores 73.1% on SWE-bench Verified and 96.0% on AIME 2025 (per beebom.com).

Kimi K2.5 is Moonshot AI’s flagship, released on January 26, 2026. It is the first open-weight native multimodal model at the trillion-parameter scale, built through continued pre-training on approximately 15 trillion mixed visual and text tokens atop the Kimi K2 base model. It uses a 384-expert MoE architecture with top-8 routing (plus one shared expert), activating approximately 32 billion parameters per token. Kimi K2.5 scored 76.8% on SWE-bench Verified at launch in standard mode, and 80.9% with Agent Swarm orchestration (per winbuzzer.com), making it the highest-scoring open-weight model on that benchmark. Its defining feature is Agent Swarm: the model can self-direct up to 100 sub-agents executing 1,500+ parallel tool calls. The MIT license allows unrestricted commercial use. Pricing via the Moonshot API is $0.60/$3.00 per million tokens (per costgoat.com citing official Moonshot platform pricing, updated February 2026); lower rates are available via third-party providers like Fireworks AI and DeepInfra. Cloudflare Workers AI also hosts Kimi K2.5 with a 256K context window.


E.2 Mid-Tier Models (Best Balance of Cost and Capability)

These models offer strong performance at significantly lower cost than the flagships. For most production workloads, one of these is the right choice.

| Model | Provider | Release | Architecture | Context | Max Output | Input $/MTok | Output $/MTok | License | Modalities | Reasoning | Ch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 mini | OpenAI | Mar 17, 2026 | Undisclosed | 400K | 128K | $0.75 | $4.50 | Closed | Text, image in; text out | Yes | 23, 24 |
| Claude Sonnet 4.6 | Anthropic | Feb 17, 2026 | Undisclosed | 1M (GA since Mar 13, 2026) | 128K | $3.00 | $15.00 | Closed | Text, image in; text out | Yes (hybrid) | 19 |
| Gemini 3 Flash | Google | Dec 17, 2025 | Undisclosed | 1M | 64K | $0.50 | $3.00 | Closed | Text, image, audio, video in; text out | Yes (thinking levels) | 24 |
| Grok 4 Fast / 4.1 Fast | xAI | Sep 19, 2025 / Nov 2025 | Undisclosed | 2M | 16K | $0.20 | $0.50 | Closed | Text, image in; text out | Yes (reasoning and non-reasoning SKUs) | 20 |
| Grok 4.20 Beta | xAI | Feb 17, 2026 (Beta 2: Mar 3) | MoE, ~3T total (estimated) | 2M | 256K | $2.00 | $6.00 | Closed | Text, image in; text, image out | Yes (multi-agent reasoning) | N/A |
| Qwen 3.5 (397B/17B) | Alibaba | Feb 16, 2026 | MoE, 397B total / 17B active | 262K (1M via Qwen3.5-Plus hosted API) | 32K | $0.60 | $3.60 | Apache 2.0 | Text, image, video in; text out | Yes (hybrid thinking) | 12, 17, 21, 22 |
| DeepSeek-R1 | DeepSeek | Jan 20, 2025 | MoE, 671B total / 37B active | 128K | 64K | $0.50 | $2.18 | MIT | Text in; text out | Yes (chain-of-thought) | 15, 16 |

Notes on mid-tier models:

GPT-5.4 mini runs over 2x faster than GPT-5 mini while approaching flagship-level accuracy. It supports native computer-use capabilities and scores 72.1% on OSWorld-Verified, just 2.9 points below the flagship GPT-5.4. On SWE-Bench Pro, it scores 54.4% versus the flagship’s 57.7%, a gap of 3.3 percentage points at 70% lower cost. The 400K context window is confirmed from the official OpenAI announcement.

Claude Sonnet 4.6 delivers what Anthropic describes as “Opus-level intelligence at Sonnet pricing.” It shares the 1M context window with Opus 4.6 (GA since March 13, 2026) and supports hybrid reasoning (combining instant responses with extended thinking). It scores 79.6% on SWE-bench Verified (some sources report 80.2%) and 72.5% on OSWorld-Verified for computer use, within 0.2 percentage points of Opus 4.6. In Claude Code testing, developers preferred Sonnet 4.6 over Sonnet 4.5 70% of the time and over Opus 4.5 59% of the time.

Gemini 3 Flash is the default model in the Gemini app and Google AI Mode in Search. At $0.50/$3.00 per million tokens with a 1M context window, it offers the best value among closed-source models for most use cases. It scores 90.4% on GPQA Diamond and 78% on SWE-bench Verified.

Grok 4 Fast was released in September 2025 with a 2M-token context window, the largest of any model in this table. Grok 4.1 Fast (November 2025) updated the model with improved tool-calling. Both are available at $0.20/$0.50 per million tokens for requests under 128K context, with tiered pricing above that threshold.

Grok 4.20 Beta is xAI’s newest flagship, launched on February 17, 2026, with a second beta iteration on March 3 and a “Beta 0309” reasoning variant on March 9-10. Its defining feature is a native 4-agent collaboration system: four specialized sub-agents (Grok, Harper, Benjamin, Lucas) reason in parallel and debate internally before delivering a unified response. It builds on a ~3T parameter MoE backbone with a 2M-token context window and 256K max output. xAI claims a 65% reduction in hallucinations over Grok 4.1, and Grok 4.20 achieved a 78% non-hallucination rate on the Artificial Analysis Omniscience test, the highest ever recorded by any AI model (per popularaitools.ai). However, on the Artificial Analysis Intelligence Index, Grok 4.20 Beta scores 48 with reasoning enabled, trailing Gemini 3.1 Pro and GPT-5.4 at 57 (per the-decoder.com). Pricing is $2.00/$6.00 per million tokens. Note: Grok 4.20 is still in beta as of March 20, 2026; the official xAI pricing page (docs.x.ai/docs/models) confirms Grok 4.20 as the newest flagship with 2M context but did not render full per-token rates in the table at the time of verification, so the $2/$6 figure is sourced from developer.puter.com and ai-primer.com.

Qwen 3.5 is the most capable open-weight model family as of March 2026. The flagship 397B/17B model uses a hybrid architecture combining Gated DeltaNet (linear attention) with standard attention, plus multi-token prediction. It supports 201 languages and is released under Apache 2.0. The hosted Qwen3.5-Plus API extends the context window to 1M tokens.

DeepSeek-R1 is the model that made open-source reasoning mainstream. Released in January 2025, it matches OpenAI o1 on many reasoning benchmarks while being fully open under the MIT license. The R1-0528 update (May 2025) further closed the gap with o3.


E.3 Budget Models (Lowest Cost for High-Volume Workloads)

When you need to process millions of requests per day, or when the task is simple enough that a smaller model suffices, these models offer the best economics.

| Model | Provider | Release | Architecture | Context | Max Output | Input $/MTok | Output $/MTok | License | Modalities | Reasoning | Ch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 nano | OpenAI | Mar 17, 2026 | Undisclosed | 128K | 32K | $0.20 | $1.25 | Closed | Text, image in; text out | Limited | 24 |
| Claude Haiku 4.5 | Anthropic | Oct 15, 2025 | Undisclosed | 200K | 8K | $1.00 | $5.00 | Closed | Text, image in; text out | No | 24 |
| Gemini 3.1 Flash-Lite | Google | Mar 3, 2026 | Undisclosed | 1M | 64K | $0.25 | $1.50 | Closed | Text, image, audio, video in; text out | No | 24 |
| GPT-OSS 120B | OpenAI | Aug 5, 2025 | MoE, 117B total / 5.1B active (128 experts, top-4) | 128K | 32K | $0.15 (Groq) | $0.60 (Groq) | Apache 2.0 | Text in; text out | Yes (chain-of-thought) | 25 |
| GPT-OSS 20B | OpenAI | Aug 5, 2025 | MoE, 21B total / 3.6B active (32 experts, top-4) | 128K | 32K | $0.075 (Groq) | $0.30 (Groq) | Apache 2.0 | Text in; text out | Yes (chain-of-thought) | 25 |

Notes on budget models:

GPT-5.4 nano is OpenAI’s cheapest model in the GPT-5 family. At $0.20 per million input tokens, it undercuts Google’s Gemini 3.1 Flash-Lite on price. OpenAI recommends it for classification, data extraction, ranking, and coding subagents.

Claude Haiku 4.5 is Anthropic’s speed tier. At $1/$5 per million tokens, it is more expensive than the budget options from OpenAI and Google, but it remains the fastest Claude model for latency-sensitive applications.

Gemini 3.1 Flash-Lite launched on March 3, 2026, as Google’s most cost-efficient model. At $0.25/$1.50 per million tokens with a full 1M context window, it offers the largest context window of any budget-tier model. It is 2.5x faster than its predecessor (Gemini 2.5 Flash-Lite).

GPT-OSS models are OpenAI’s first open-weight releases since GPT-2 in 2019. Both use MoE architectures with Apache 2.0 licensing. The 120B model fits on a single 80 GB GPU (H100 or MI300X) and matches o3-mini on many benchmarks. The 20B model runs on devices with just 16 GB of RAM. Pricing shown is via Groq; self-hosting eliminates per-token costs entirely.
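A rough way to sanity-check the hardware claims above is to estimate weight memory from parameter count and bits per parameter. The sketch below assumes about 4.25 bits per parameter, an illustrative figure for 4-bit block-quantized weights that is not stated in this appendix, and it ignores KV cache, activations, and any layers kept at higher precision. The function name is hypothetical.

```python
def weight_memory_gb(total_params_billion: float, bits_per_param: float) -> float:
    """Estimate memory for model weights alone, in gigabytes (10^9 bytes)."""
    # params (billions) * bits / 8 bits-per-byte gives billions of bytes, i.e. GB.
    return total_params_billion * bits_per_param / 8


# Assumed quantization width; change this to model other formats (e.g. 16 for BF16).
ASSUMED_BITS_PER_PARAM = 4.25

if __name__ == "__main__":
    for name, params_b in (("GPT-OSS 120B", 117), ("GPT-OSS 20B", 21)):
        gb = weight_memory_gb(params_b, ASSUMED_BITS_PER_PARAM)
        print(f"{name}: ~{gb:.1f} GB of weights")
```

Under that assumption the 120B model needs about 62 GB for weights and the 20B model about 11 GB, which is consistent with the 80 GB and 16 GB figures quoted above while leaving some headroom for runtime state. At 16 bits per parameter the same 117B model would need 234 GB, which is why the quantization width matters so much for single-GPU deployment.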


E.4 Open-Weight Models (Downloadable Weights)

These models can be downloaded, self-hosted, and fine-tuned. They are listed separately because their economics are fundamentally different: you pay for compute (GPU hours) rather than per-token API fees. For high-volume workloads, self-hosting can be dramatically cheaper than API access.

| Model | Provider | Release | Architecture | Context | License | Modalities | Reasoning | Ch |
|---|---|---|---|---|---|---|---|---|
| LLaMA 4 Maverick | Meta | Apr 5, 2025 | MoE, 400B total / 17B active (128 experts) | 1M | Llama 4 Community License | Text, image in; text out | No | 9, 11, 12, 21, 22 |
| LLaMA 4 Scout | Meta | Apr 5, 2025 | MoE, 109B total / 17B active (16 experts) | 10M | Llama 4 Community License | Text, image in; text out | No | 11, 12, 20 |
| Qwen 3.5 (397B/17B) | Alibaba | Feb 16, 2026 | MoE, 397B total / 17B active | 262K | Apache 2.0 | Text, image, video in; text out | Yes (hybrid thinking) | 12, 17, 21, 22 |
| Qwen 3.5 small series (0.8B to 35B) | Alibaba | Mar 2, 2026 | Dense and MoE variants | 262K | Apache 2.0 | Text, image, video in; text out | Yes | 21, 28 |
| DeepSeek-V3.2 | DeepSeek | Dec 1, 2025 | MoE, 685B total / 37B active | 128K | MIT | Text in; text out | Yes (thinking mode) | 11, 12, 18, 24 |
| DeepSeek-R1 | DeepSeek | Jan 20, 2025 | MoE, 671B total / 37B active | 128K | MIT | Text in; text out | Yes (chain-of-thought) | 15, 16 |
| Mistral Small 4 | Mistral AI | Mar 16, 2026 | MoE, 119B total / 6B active (128 experts, top-4) | 256K | Apache 2.0 | Text, image in; text out | Yes (configurable reasoning_effort) | 12, 21, 22 |
| GPT-OSS 120B | OpenAI | Aug 5, 2025 | MoE, 117B total / 5.1B active (128 experts, top-4) | 128K | Apache 2.0 | Text in; text out | Yes | 25 |
| GPT-OSS 20B | OpenAI | Aug 5, 2025 | MoE, 21B total / 3.6B active (32 experts, top-4) | 128K | Apache 2.0 | Text in; text out | Yes | 25 |
| Kimi K2.5 | Moonshot AI | Jan 26, 2026 | MoE, 1.04T total / 32B active (384 experts, top-8) | 256K | MIT | Text, image in; text out | Yes (thinking mode) | N/A |

Notes on open-weight models:

LLaMA 4 Maverick and Scout are Meta’s first MoE models. Both use early fusion with a MetaCLIP vision encoder for native multimodal capabilities. Maverick (400B total, 128 experts) is the higher-capability variant; Scout (109B total, 16 experts) is designed for single-node deployment with a 10M-token context window. The Llama 4 Community License allows commercial use with no revenue restrictions but is not a standard open-source license (it includes specific acceptable use restrictions).

Third-party API pricing for LLaMA 4 Maverick varies by provider: $0.20/$0.60 via Groq (down from $0.50/$0.77 at launch in April 2025), with other rates available via DeepInfra and Together AI. These prices change frequently; check provider pricing pages for current rates.

Qwen 3.5 is the most linguistically diverse open-weight model, supporting 201 languages. The flagship 397B/17B model introduces a hybrid architecture combining Gated DeltaNet (linear attention) with standard attention in a 3:1 ratio, plus multi-token prediction. The small series (0.8B to 35B) shares the same 262K context window and multimodal capabilities, making frontier-class features available on edge devices.

Mistral Small 4 is the newest open-weight model in this table, released on March 16, 2026. It unifies instruct, reasoning, and coding capabilities in a single model with configurable reasoning depth. The official blog lists 6B active parameters per token; the HuggingFace model card lists 6.5B (8B including embedding and output layers). It uses Multi-head Latent Attention (MLA), the same attention mechanism as DeepSeek-V3.

GPT-OSS models mark OpenAI’s return to open weights after six years. Both use SwiGLU activations, RMSNorm, Grouped-Query Attention with 8 KV heads, and RoPE for position encoding. The 120B model uses 36 layers with 128 experts (top-4 routing); the 20B model uses 24 layers with 32 experts (top-4 routing).

Kimi K2.5 is the largest open-weight model in this table by total parameter count (1.04 trillion). It adds native vision capabilities to the Kimi K2 base through a 400-million-parameter vision encoder called MoonViT, enabling multimodal tasks including replicating website user journeys from video demonstrations. The 384-expert MoE architecture activates 32 billion parameters per token (top-8 routing with one additional shared expert). At the NVIDIA GTC 2026 conference (March 18, 2026), Moonshot AI founder Yang Zhilin disclosed the core technology roadmap behind K2.5, including the KDA (Kimi Delta Attention) structure and the Kimi Linear hybrid architecture, which increases decoding speed 5-6x in 128K and 1M super-long context scenarios.


E.5 Reasoning Models (Specialized for Complex Problem-Solving)

These models are specifically designed or optimized for multi-step reasoning tasks: mathematics, coding, scientific analysis, and planning. They use extended thinking (generating internal reasoning tokens before producing a final answer) to improve accuracy on hard problems.

| Model | Provider | Release | Context | Input $/MTok | Output $/MTok | Key Benchmark | License | Ch |
|---|---|---|---|---|---|---|---|---|
| o3 | OpenAI | Apr 16, 2025 | 200K | $2.00 | $8.00 | 96.7% AIME 2024, 87.7% GPQA Diamond | Closed | 16 |
| o4-mini | OpenAI | Apr 16, 2025 | 200K | $1.10 | $4.40 | 20% better than o3-mini | Closed | 16 |
| GPT-5.4 (Thinking mode) | OpenAI | Mar 5, 2026 | 1.05M | $2.50 | $15.00 | 75% OSWorld-Verified, 77.2% SWE-bench Verified | Closed | 16, 23 |
| Claude Opus 4.6 (Thinking) | Anthropic | Feb 5, 2026 | 1M (GA) | $5.00 | $25.00 | 79.2% SWE-bench Verified (Thinking) | Closed | 16 |
| Gemini 2.5 Deep Think | Google | Aug 1, 2025 | 1M | Premium tier | Premium tier | 87.6% LiveCodeBench, Bronze IMO 2025 | Closed | 16 |
| DeepSeek-R1 | DeepSeek | Jan 20, 2025 | 128K | $0.50 | $2.18 | 79.8% AIME 2024 (pass@1) | MIT | 15, 16 |
| DeepSeek-R1-0528 | DeepSeek | May 28, 2025 | 128K | $0.50 | $2.18 | Matches o3 and Gemini 2.5 Pro on reasoning | MIT | Appendix D |
| Qwen 3-Max (TTS) | Alibaba | Jan 2026 (TTS upgrade) | 262K | $1.20 | $6.00 | 100% AIME 2025 | Closed (API) | 13, 16 |

Notes on reasoning models:

The distinction between “reasoning model” and “general model with reasoning mode” is blurring. GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro all support configurable reasoning effort levels, allowing the same model to operate as either a fast general model or a deliberate reasoning model depending on the task. The dedicated reasoning models (o3, o4-mini, DeepSeek-R1) always use extended thinking.

o3 and o4-mini were released together on April 16, 2025. o3 is the more capable model; o4-mini is faster and cheaper. Both support a 200K context window. On June 10, 2025, OpenAI reduced o3 pricing by approximately 80% and released o3-pro for Pro users.

Thinking tokens (the internal reasoning the model generates before its final answer) are billed as output tokens at the standard output rate. This means reasoning models can be significantly more expensive per request than their headline pricing suggests, because a single response might generate thousands of thinking tokens before producing a short answer. Chapter 16 discusses this tradeoff in detail.
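To make the effect of thinking tokens concrete, the sketch below prices a single request with and without internal reasoning. It uses the o3 rates from the table above ($2.00/$8.00 per million tokens); the token counts are hypothetical and chosen only for illustration.

```python
def reasoning_request_cost(
    input_tokens: int,
    thinking_tokens: int,
    answer_tokens: int,
    input_per_mtok: float,
    output_per_mtok: float,
) -> float:
    """Return the cost in USD, billing thinking tokens at the output rate."""
    billed_output = thinking_tokens + answer_tokens
    return (input_tokens * input_per_mtok + billed_output * output_per_mtok) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 1,000 input tokens and a 300-token visible answer.
    without_thinking = reasoning_request_cost(1_000, 0, 300, 2.00, 8.00)
    with_thinking = reasoning_request_cost(1_000, 8_000, 300, 2.00, 8.00)
    print(f"no thinking:   ${without_thinking:.4f}")
    print(f"8K thinking:   ${with_thinking:.4f}")
    print(f"cost multiple: {with_thinking / without_thinking:.1f}x")
```

In this example the visible answer is identical in both cases, but 8,000 thinking tokens raise the request cost from $0.0044 to $0.0684, roughly a 15x increase.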

Note on SWE-bench Verified: In February 2026, OpenAI announced it would discontinue reporting results on SWE-bench Verified, citing significant contamination and test case flaws. OpenAI’s analysis found evidence that all major frontier models had been trained on benchmark solutions, and that 59% of failed test cases contained flaws (per blockchain.news, thenextgentechinsider.com). The benchmark scores listed in this appendix reflect the last reported figures and should be interpreted with this caveat in mind. Chapter 23 discusses the broader challenge of benchmark reliability.


E.6 Pricing Comparison: Input Tokens

This table ranks models from the preceding sections by input token cost, from cheapest to most expensive, as a quick reference for cost-sensitive decisions. Models that share a price are listed in arbitrary order.

| Rank | Model | Input $/MTok | Provider | Notes |
|---|---|---|---|---|
| 1 | GPT-OSS 20B | $0.075 | OpenAI (via Groq) | Open-weight, self-hostable |
| 2 | GPT-OSS 120B | $0.15 | OpenAI (via Groq) | Open-weight, self-hostable |
| 3 | Grok 4.1 Fast | $0.20 | xAI | 2M context, under 128K pricing |
| 4 | GPT-5.4 nano | $0.20 | OpenAI | Cheapest GPT-5 family model |
| 5 | Gemini 3.1 Flash-Lite | $0.25 | Google | 1M context at budget pricing |
| 6 | DeepSeek-V3.2 | $0.28 | DeepSeek | Open-weight (MIT), self-hostable |
| 7 | Gemini 3 Flash | $0.50 | Google | 1M context, default Gemini model |
| 8 | DeepSeek-R1 | $0.50 | DeepSeek | Open-weight reasoning model |
| 9 | Kimi K2.5 | $0.60 | Moonshot AI | Open-weight (MIT), 1.04T MoE, self-hostable |
| 10 | Qwen 3.5 (397B/17B) | $0.60 | Alibaba (via providers) | Open-weight (Apache 2.0) |
| 11 | GPT-5.4 mini | $0.75 | OpenAI | 400K context |
| 12 | o4-mini | $1.10 | OpenAI | Reasoning model |
| 13 | Qwen 3-Max | $1.20 | Alibaba | 1T+ parameters, API only |
| 14 | Gemini 3.1 Pro | $2.00 | Google | Under 200K context |
| 15 | Grok 4.20 Beta | $2.00 | xAI | Beta; multi-agent; 2M context |
| 16 | o3 | $2.00 | OpenAI | Reasoning model |
| 17 | GPT-5.4 | $2.50 | OpenAI | Under 272K context |
| 18 | Grok 4 | $3.00 | xAI | Reasoning-only model |
| 19 | Claude Sonnet 4.6 | $3.00 | Anthropic | 1M context (GA) |
| 20 | Claude Opus 4.6 | $5.00 | Anthropic | 1M context (GA) |

The price range spans 67x, from $0.075 (GPT-OSS 20B via Groq) to $5.00 (Claude Opus 4.6) per million input tokens. Open-weight models that can be self-hosted (DeepSeek-V3.2, Qwen 3.5, GPT-OSS, Mistral Small 4, Kimi K2.5) eliminate per-token costs entirely, replacing them with fixed GPU infrastructure costs. Chapter 24 covers the economics of self-hosting in detail. Note that Grok 4.20 Beta ($2.00/MTok) and Qwen 3-Max ($1.20/MTok base tier) use tiered pricing that increases with input length; the table shows the lowest tier for each.


E.7 Context Window Comparison

Context window size determines how much information a model can process in a single request. This table ranks models by maximum supported context window.

| Rank | Model | Context Window | Notes |
|---|---|---|---|
| 1 | LLaMA 4 Scout | 10,000,000 | Open-weight; NIAH accuracy drops to 89% at 10M |
| 2 | Grok 4 Fast / 4.1 Fast | 2,000,000 | Tiered pricing above 128K |
| 3 | Grok 4.20 Beta | 2,000,000 | Beta; multi-agent architecture; 256K max output |
| 4 | GPT-5.4 | 1,050,000 | Opt-in via API; 272K standard; surcharge above 272K |
| 5 | LLaMA 4 Maverick | 1,000,000 | Open-weight |
| 6 | Claude Opus 4.6 | 1,000,000 | GA since March 13, 2026; standard pricing at all lengths |
| 7 | Claude Sonnet 4.6 | 1,000,000 | GA since March 13, 2026; standard pricing at all lengths |
| 8 | Gemini 3.1 Pro | 1,000,000 | Per official DeepMind model card |
| 9 | Gemini 3 Flash | 1,000,000 | Default Gemini model |
| 10 | Gemini 3.1 Flash-Lite | 1,000,000 | Budget tier with full context |
| 11 | Qwen 3.5 (via Qwen3.5-Plus API) | 1,000,000 | Hosted API extends from 262K native |
| 12 | GPT-5.4 mini | 400,000 | Confirmed from official announcement |
| 13 | Qwen 3-Max | 262,000 | API only |
| 14 | Qwen 3.5 (open-weight) | 262,000 | Native context; 1M via hosted API |
| 15 | Mistral Small 4 | 256,000 | Open-weight (Apache 2.0) |
| 16 | Kimi K2.5 | 256,000 | Open-weight (MIT), native multimodal, 1.04T total |
| 17 | Grok 4 | 256,000 | Extended-context pricing above 128K |
| 18 | Claude Haiku 4.5 | 200,000 | Speed tier |
| 19 | o3 / o4-mini | 200,000 | Reasoning models |
| 20 | DeepSeek-V3.2 | 128,000 | Open-weight (MIT) |
| 21 | DeepSeek-R1 | 128,000 | Open-weight (MIT) |
| 22 | GPT-5.4 nano | 128,000 | Budget tier |
| 23 | GPT-OSS 120B / 20B | 128,000 | Open-weight (Apache 2.0) |

The range spans 78x, from 128K tokens (DeepSeek, GPT-OSS, GPT-5.4 nano) to 10M tokens (LLaMA 4 Scout). However, raw context window size does not tell the full story. Chapter 20 discusses the “context rot” problem: model performance degrades on information placed in the middle of very long contexts. LLaMA 4 Scout’s NIAH (Needle-in-a-Haystack) accuracy drops from 95%+ at 8M tokens to 89% at 10M. GPT-5.4 shows accuracy degradation on high-complexity reasoning beyond 256K tokens. As of March 13, 2026, Anthropic made its 1M context window generally available for Opus 4.6 and Sonnet 4.6 at standard pricing, removing the previous beta header requirement and long-context surcharge entirely.


E.8 Architecture Comparison: Dense vs. MoE

As of March 2026, every major open-weight frontier model uses a Mixture of Experts architecture. This table highlights the architectural details of models where they are publicly known.

| Model | Total Params | Active Params | Experts | Top-K | Architecture Notes |
|---|---|---|---|---|---|
| Qwen 3-Max | 1T+ | Undisclosed | MoE | Undisclosed | Largest known model by total parameters |
| Kimi K2.5 | 1.04T | 32B | 384 | Top-8 | MoonViT vision encoder, Agent Swarm, KDA architecture |
| DeepSeek-V3.2 | 685B | 37B | 256 routed + 1 shared | Top-8 | MLA attention, DeepSeek Sparse Attention |
| DeepSeek-R1 | 671B | 37B | 256 routed + 1 shared | Top-8 | Same base as V3, reasoning-tuned |
| LLaMA 4 Maverick | 400B | 17B | 128 | Top-1 | Early fusion multimodal, MetaCLIP vision encoder |
| Qwen 3.5 (flagship) | 397B | 17B | MoE | Varies | Hybrid Gated DeltaNet + standard attention, multi-token prediction |
| Mistral Small 4 | 119B | 6B (6.5B per HF) | 128 | Top-4 | MLA attention, YaRN rope scaling |
| GPT-OSS 120B | 117B | 5.1B | 128 | Top-4 | GQA with 8 KV heads, SwiGLU, RMSNorm |
| LLaMA 4 Scout | 109B | 17B | 16 | Top-1 | Same architecture as Maverick, fewer experts |
| GPT-OSS 20B | 21B | 3.6B | 32 | Top-4 | Runs on 16 GB RAM |

Key patterns in March 2026 MoE architectures:

The “large total, small active” pattern is universal. Among the open-weight models in the table above, the share of total parameters activated per token ranges from roughly 3% (Kimi K2.5, 32B of 1.04T) to roughly 17% (GPT-OSS 20B, 3.6B of 21B), with most models between 4% and 6%. This gives models the knowledge capacity of a very large model (stored across all experts) with the inference speed of a much smaller one (only the active parameters are computed per token).
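The active-parameter share for each model can be computed directly from the total and active counts in the E.8 table. The snippet below is a minimal sketch; the dictionary simply transcribes those table values (in billions), and the names are illustrative.

```python
# (total params, active params) in billions, transcribed from the E.8 table.
MOE_MODELS = {
    "Kimi K2.5": (1040, 32),
    "DeepSeek-V3.2": (685, 37),
    "LLaMA 4 Maverick": (400, 17),
    "Qwen 3.5 (flagship)": (397, 17),
    "Mistral Small 4": (119, 6),
    "GPT-OSS 120B": (117, 5.1),
    "LLaMA 4 Scout": (109, 17),
    "GPT-OSS 20B": (21, 3.6),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of total parameters that are active for each token."""
    return active_b / total_b


if __name__ == "__main__":
    # Sort from sparsest (smallest active share) to densest.
    ranked = sorted(MOE_MODELS.items(), key=lambda kv: active_fraction(*kv[1]))
    for name, (total_b, active_b) in ranked:
        share = active_fraction(total_b, active_b)
        print(f"{name:<22} {share:6.1%}  (total:active = {total_b / active_b:.1f}:1)")
```

The same ratio drives both memory planning (all experts must be resident) and per-token compute (only the active parameters are evaluated), which is why the two numbers are usually quoted together.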

Three distinct MoE scales have emerged:

  1. Compact MoE (100-120B total, 5-17B active): LLaMA 4 Scout, GPT-OSS 120B, Mistral Small 4. These fit on a single high-end GPU or a small multi-GPU setup.
  2. Mid-range MoE (400-700B total, 17-37B active): LLaMA 4 Maverick, Qwen 3.5, DeepSeek-V3.2. These require multi-GPU setups but offer the best quality-to-cost ratio.
  3. Frontier MoE (1T+ total): Kimi K2.5 (1.04T, 32B active, MIT, open-weight), Qwen 3-Max (1T+, API-only), Grok 4.20 Beta (estimated ~3T, API-only). Kimi K2.5 is the only model in this tier available as downloadable weights.

Chapter 12 covers MoE architecture in detail, including routing mechanisms, load balancing, and the tradeoffs between expert count and active parameter count.


E.9 License Comparison

The licensing landscape for LLMs in March 2026 spans a wide spectrum, from fully closed APIs to permissive open-source licenses.

| License Type | Models | Key Terms |
|---|---|---|
| Closed (API only) | GPT-5.4, GPT-5.4 mini/nano, o3, o4-mini, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3 Flash, Gemini 3.1 Flash-Lite, Grok 4, Grok 4 Fast/4.1 Fast, Grok 4.20 Beta, Qwen 3-Max | No access to weights. Usage governed by provider terms of service. |
| Apache 2.0 | Qwen 3.5 (all sizes), Mistral Small 4, GPT-OSS 120B, GPT-OSS 20B | Standard permissive license. Commercial use, modification, and redistribution allowed with attribution. No restrictions on downstream use. |
| MIT | DeepSeek-V3.2, DeepSeek-R1, Kimi K2.5 | Most permissive standard license. Commercial use, modification, and redistribution allowed. |
| Llama 4 Community License | LLaMA 4 Maverick, LLaMA 4 Scout | Custom license from Meta. Commercial use allowed with no revenue restrictions. Includes acceptable use restrictions. Not a standard open-source license. |

The trend is clear: open-weight models are converging on standard permissive licenses. In 2023, Meta’s LLaMA 2 used a custom license with a 700M monthly active user threshold. By 2025, OpenAI released GPT-OSS under Apache 2.0, and DeepSeek uses MIT for all its models. In January 2026, Moonshot AI released Kimi K2.5 under MIT, making it the largest open-weight model under a standard permissive license. The practical difference between Apache 2.0 and MIT is minimal for most users; both allow unrestricted commercial use.


E.10 Choosing the Right Model: Decision Framework

With so many options, here is a practical decision framework based on common use cases:

For general-purpose chat and content generation:

  • Budget: Gemini 3.1 Flash-Lite ($0.25/MTok) or GPT-5.4 nano ($0.20/MTok)
  • Balanced: Gemini 3 Flash ($0.50/MTok) or GPT-5.4 mini ($0.75/MTok)
  • Best quality: GPT-5.4 ($2.50/MTok) or Claude Opus 4.6 ($5.00/MTok)

For coding and software engineering:

  • Budget: DeepSeek-V3.2 ($0.28/MTok, self-hostable)
  • Balanced: Claude Sonnet 4.6 ($3.00/MTok) or GPT-5.4 mini ($0.75/MTok)
  • Best quality: Claude Opus 4.6 Thinking (79.2% SWE-bench) or GPT-5.4 (77.2% SWE-bench)

For complex reasoning (math, science, logic):

  • Budget: DeepSeek-R1 ($0.50/MTok, self-hostable)
  • Balanced: o4-mini ($1.10/MTok) or Qwen 3-Max ($1.20/MTok)
  • Best quality: o3 ($2.00/MTok) or GPT-5.4 Thinking ($2.50/MTok)

For long-context workloads (100K+ tokens):

  • Budget: Gemini 3.1 Flash-Lite ($0.25/MTok, 1M context)
  • Balanced: Grok 4.1 Fast ($0.20/MTok, 2M context) or Gemini 3 Flash ($0.50/MTok, 1M context)
  • Best quality: GPT-5.4 ($2.50/MTok, 1.05M) or Claude Opus 4.6 ($5.00/MTok, 1M GA)

For multimodal workloads (images, audio, video):

  • Budget: Gemini 3.1 Flash-Lite (text, image, audio, video input at $0.25/MTok)
  • Balanced: Gemini 3 Flash (same modalities at $0.50/MTok)
  • Best quality: Gemini 3.1 Pro ($2.00/MTok) or GPT-5.4 ($2.50/MTok, text + image only)
  • Open-weight: Qwen 3.5 (text, image, video) or LLaMA 4 Maverick (text, image)

For self-hosting and fine-tuning:

  • Smallest footprint: GPT-OSS 20B (16 GB RAM, Apache 2.0)
  • Best quality/cost: DeepSeek-V3.2 (685B total, MIT) or Qwen 3.5 397B (Apache 2.0)
  • Best for agentic workloads: Kimi K2.5 (1.04T total, MIT, Agent Swarm with up to 100 parallel sub-agents)
  • Best for edge/mobile: Qwen 3.5 small series (0.8B to 9B, Apache 2.0)
  • Best for fine-tuning: Qwen3-8B or Qwen3.5-9B with LoRA (Chapter 28)
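The budget end of this framework amounts to filtering a catalog by hard requirements and then picking the cheapest survivor. The sketch below illustrates that idea with a handful of entries transcribed from the tables in this appendix; the `Model` class, `CATALOG` list, and `cheapest` function are hypothetical names, and the selection deliberately ignores quality, latency, and output pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    input_per_mtok: float  # USD per million input tokens (lowest tier)
    context_tokens: int    # maximum context window
    accepts_images: bool


# A small subset of the models above, for illustration only.
CATALOG = [
    Model("GPT-5.4 nano", 0.20, 128_000, True),
    Model("Gemini 3.1 Flash-Lite", 0.25, 1_000_000, True),
    Model("DeepSeek-V3.2", 0.28, 128_000, False),
    Model("Gemini 3 Flash", 0.50, 1_000_000, True),
    Model("GPT-5.4 mini", 0.75, 400_000, True),
    Model("Claude Sonnet 4.6", 3.00, 1_000_000, True),
]


def cheapest(min_context: int, needs_images: bool = False) -> Optional[Model]:
    """Return the lowest-input-price model meeting the hard requirements."""
    candidates = [
        m for m in CATALOG
        if m.context_tokens >= min_context and (m.accepts_images or not needs_images)
    ]
    return min(candidates, key=lambda m: m.input_per_mtok, default=None)


if __name__ == "__main__":
    for need in (100_000, 300_000, 500_000):
        pick = cheapest(need, needs_images=True)
        print(f"{need:>7} tokens of context -> {pick.name if pick else 'no match'}")
```

A production selector would add quality thresholds (for example, a minimum benchmark score for the task) before minimizing price; this version only shows the filtering step.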

E.11 Key Takeaways

  • The price range spans 67x for input tokens alone, from $0.075 (GPT-OSS 20B via Groq) to $5.00 (Claude Opus 4.6) per million tokens. For most production workloads, the mid-tier and budget models offer sufficient quality at a fraction of the flagship cost.

  • Every major open-weight model uses MoE as of March 2026. The “total parameters / active parameters” ratio is the key metric: it determines both the model’s knowledge capacity and its inference cost. Ratios range from roughly 6:1 (LLaMA 4 Scout at 109B/17B) to ~32:1 (Kimi K2.5 at 1.04T/32B), with Mistral Small 4 near 20:1 (119B/6B).

  • Context windows have stratified into three tiers: budget models at 128-200K, standard models at 256K-1M, and long-context specialists at 2-10M. The 1M-token tier is now standard for the flagship models from OpenAI, Anthropic, and Google. As of March 13, 2026, Anthropic made its 1M context window generally available for Opus 4.6 and Sonnet 4.6 at standard pricing, with no surcharge at any length.

  • Open-weight models have reached parity with closed models on most benchmarks. DeepSeek-V3.2 matches GPT-5 performance at roughly 10x lower API cost ($0.28/$0.42 versus $1.25/$10.00 per million input/output tokens). Kimi K2.5 scored 76.8% on SWE-bench Verified at launch (80.9% with Agent Swarm orchestration), the highest open-weight score on that benchmark. Alibaba claims that Qwen 3.5 beats GPT-5.2 and Claude Opus 4.5 across 80% of benchmark categories. The gap between open and closed models is the narrowest it has ever been. Note that OpenAI discontinued SWE-bench Verified reporting in February 2026 due to contamination concerns (see E.5 notes), so benchmark comparisons should be interpreted with caution.

  • Reasoning capabilities are now table stakes. Every flagship model supports some form of extended thinking or configurable reasoning effort. The dedicated reasoning models (o3, DeepSeek-R1) still lead on the hardest benchmarks, but the gap is closing as general models add reasoning modes.

  • Multimodal input is standard; multimodal output is not. Every model in this table accepts text input, and most accept images. Google’s Gemini models stand out for accepting both audio and video natively. Only a few models (not listed in this table) can generate images or audio natively; see Chapter 22 for details.

  • Licensing has shifted toward permissive open-source. Apache 2.0 and MIT now cover models from Alibaba (Qwen 3.5), DeepSeek (V3.2, R1), Mistral (Small 4), Moonshot AI (Kimi K2.5), and even OpenAI (GPT-OSS). Meta’s Llama 4 is the notable exception: it ships under the custom Llama 4 Community License rather than a standard permissive one.

  • Self-hosting economics favor open-weight MoE models. A model like Mistral Small 4 (119B total, 6B active, Apache 2.0) or GPT-OSS 120B (117B total, 5.1B active, Apache 2.0) can run on a single H100 GPU. At cloud GPU rates of $1.25-$3.00/hour, self-hosting becomes cheaper than API access once sustained usage exceeds roughly 1-5 million tokens per hour. Chapter 24 and Appendix B cover the hardware requirements in detail.

  • Agentic capabilities are now a key differentiator for open-weight models. Kimi K2.5 (January 2026) introduced Agent Swarm, enabling a single model to self-direct up to 100 sub-agents executing 1,500+ parallel tool calls. This represents a shift from models that merely support tool calling to models that natively orchestrate complex multi-agent workflows. Combined with its MIT license and 1.04-trillion-parameter scale, Kimi K2.5 demonstrates that open-weight models can compete with closed models not just on benchmarks but on agentic architecture.

  • Multi-agent architectures are emerging at the model level. Grok 4.20 Beta (February 2026) introduced a native 4-agent collaboration system where specialized sub-agents reason in parallel and debate before producing a response. This is distinct from application-level multi-agent frameworks (Chapter 23, Chapter 29); the collaboration happens inside the model itself. Whether this approach delivers consistent improvements over single-model reasoning remains to be validated as the model exits beta.
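
Several of the takeaways above rest on simple arithmetic that is easy to re-check as prices and parameter counts change. The sketch below recomputes the total-to-active parameter ratios, the input price spread, and a self-hosting break-even point. The function names are hypothetical, and the $0.60/MTok API rate used for the break-even is an illustrative assumption rather than a blended price quoted in this appendix. The break-even model also assumes that GPU rental is the only self-hosting cost and that the GPU can sustain the required throughput.

```python
# (total, active) parameter counts in billions, as quoted in this appendix.
MOE_MODELS = {
    "Mistral Small 4": (119.0, 6.0),
    "GPT-OSS 120B": (117.0, 5.1),
    "Kimi K2.5": (1040.0, 32.0),
}


def total_to_active_ratio(total_b: float, active_b: float) -> float:
    """Total parameters divided by parameters active per token."""
    return total_b / active_b


def price_spread(cheapest_usd_per_mtok: float, priciest_usd_per_mtok: float) -> float:
    """How many times more expensive the priciest rate is than the cheapest."""
    return priciest_usd_per_mtok / cheapest_usd_per_mtok


def break_even_mtok_per_hour(gpu_usd_per_hour: float, api_usd_per_mtok: float) -> float:
    """Millions of tokens per hour above which a rented GPU costs less than the API.

    Simplification: GPU rental is treated as the only self-hosting cost.
    """
    return gpu_usd_per_hour / api_usd_per_mtok


if __name__ == "__main__":
    for name, (total_b, active_b) in MOE_MODELS.items():
        print(f"{name:<16} {total_to_active_ratio(total_b, active_b):.1f}:1")

    # $0.075 and $5.00 are the cheapest and priciest input rates cited above.
    print(f"Input price spread: {price_spread(0.075, 5.00):.0f}x")

    ASSUMED_API_USD_PER_MTOK = 0.60  # illustrative assumption, not a quoted blended rate
    for gpu_rate in (1.25, 3.00):
        mtok = break_even_mtok_per_hour(gpu_rate, ASSUMED_API_USD_PER_MTOK)
        print(f"GPU at ${gpu_rate:.2f}/h breaks even near {mtok:.1f}M tokens/h")
```

The break-even point scales inversely with the API rate, so cheaper hosted endpoints push the threshold higher and make self-hosting harder to justify on cost alone.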

Appendix F provides further reading for staying current as new models are released after this book’s publication date.


Sources: All specifications, pricing, and release dates in this appendix are verified via web search as of March 20, 2026, and cross-referenced with the source citations in each chapter. Key primary sources include: OpenAI official pricing page (openai.com/api/pricing) and GPT-5.4 deep dive (community.openai.com/t/gpt-5-4-deep-dive-pricing-context-limits-and-tool-search-explained/1375800) confirming $2.50/$15.00 per million tokens, 1.05M context window, 272K standard context, 2x/1.5x surcharge above 272K. OpenAI GPT-5.4 mini and nano announcement (openai.com/index/introducing-gpt-5-4-mini-and-nano) confirming mini at $0.75/$4.50 with 400K context, 72.1% OSWorld-Verified, 54.4% SWE-Bench Pro, and nano at $0.20/$1.25; mini pricing also confirmed by pulse24.ai, implicator.ai, and buildfastwithai.com; mini OSWorld and SWE-Bench Pro scores confirmed by awesomeagents.ai and innovation-village.com. OpenAI GPT-5 launch (openai.com/gpt-5) confirming $1.25/$10.00 per million tokens, 400K context, August 7, 2025 release. OpenAI o3 and o4-mini announcement (openai.com/index/introducing-o3-and-o4-mini) confirming o3 at $2.00/$8.00 and o4-mini at $1.10/$4.40, both with 200K context, April 16, 2025 release; pricing confirmed by simonwillison.net and langcopilot.com. OpenAI GPT-OSS announcement (openai.com/index/introducing-gpt-oss) confirming 120B (117B total, 5.1B active, 128 experts) and 20B (21B total, 3.6B active, 32 experts), both Apache 2.0, 128K context, August 5, 2025 release; architecture details confirmed from arxiv.org/html/2508.12461v1 and cometapi.com. OpenAI SWE-bench Verified discontinuation (openai.com/index/why-we-no-longer-evaluate-swe-bench-verified, blockchain.news, thenextgentechinsider.com) citing contamination and 59% flawed test cases, February 2026. Groq pricing page (groq.com/pricing) confirming LLaMA 4 Maverick at $0.20/$0.60, LLaMA 4 Scout at $0.11/$0.34, GPT-OSS 120B at $0.15/$0.60, GPT-OSS 20B at $0.075/$0.30 per million tokens. 
Anthropic Claude Opus 4.6 announcement (anthropic.com/research/claude-opus-4-6) confirming February 5, 2026 release, $5/$25 per million tokens, 128K max output, 80.8% SWE-bench Verified; pricing confirmed by curlscape.com, karangoyal.cc, gaga.art. Anthropic 1M context GA announcement (claude.com/blog/1m-context-ga) confirming March 13, 2026 general availability of 1M context for Opus 4.6 and Sonnet 4.6 at standard pricing with no surcharge at any length; confirmed by blockchain.news, the-decoder.com, karangoyal.cc, thenextgentechinsider.com, cursor.com. Claude Sonnet 4.6 released February 17, 2026 at $3/$15 with 79.6% SWE-bench Verified (some sources report 80.2% per bytebot.io) and 72.5% OSWorld-Verified (digitalapplied.com, nxcode.io, bytebot.io, claudefa.st, businessworld.in); developers preferred Sonnet 4.6 over Sonnet 4.5 70% of the time and over Opus 4.5 59% of the time (bytebot.io). Claude Haiku 4.5 at $1/$5 released October 15, 2025 (curlscape.com, pecollective.com, siliconrepublic.com, anthropic.com/news/claude-haiku-4-5). Google DeepMind Gemini 3.1 Pro model card (deepmind.google/models/model-cards/gemini-3-1-pro) confirming 1M context window and 64K token output; pricing $2/$12 confirmed by felloai.com, digitalapplied.com, thedeepview.com, theneuron.ai; max output 65,536 tokens (64K) confirmed by aifreeapi.com, apidog.com, replicate.com. Gemini 3 Flash at $0.50/$3.00 with 1M context and 64K output (langcopilot.com, spectrumailab.com, digitalapplied.com, aifreeapi.com); 90.4% GPQA Diamond and 78% SWE-bench Verified (spectrumailab.com, introl.com, gaga.art). Gemini 3.1 Flash-Lite at $0.25/$1.50 with 1M context, March 3, 2026 release (verdent.ai, venkatsoftware.com, launchberg.com, futunn.com, gaga.art, iweaver.ai). xAI Grok 4 at $3/$15 with 256K context in API, 128K in app (aifreeapi.com, costgoat.com, datastudios.org, apidog.com, thenextgentechinsider.com). 
Grok 4 Fast at $0.20/$0.50 with 2M context (x.ai/news/grok-4-fast, natural20.com, testingcatalog.com, langcopilot.com). Grok 4.1 Fast announcement (x.ai/news/grok-4-1-fast) confirming 2M context and agent tools API; pricing confirmed by ainvest.com, mem0.ai, costgoat.com. Grok 4.20 Beta launched February 17, 2026, Beta 2 on March 3, Beta 0309 reasoning variant March 9-10, with native 4-agent collaboration system, ~3T MoE backbone, 2M context, 256K max output, 65% hallucination reduction over Grok 4.1, 78% non-hallucination rate on Artificial Analysis Omniscience test (popularaitools.ai), Intelligence Index score 48 vs GPT-5.4/Gemini 3.1 Pro at 57 (the-decoder.com); pricing $2/$6 per million tokens (developer.puter.com, ai-primer.com, thenextgentechinsider.com, aibase.com); official xAI pricing page (docs.x.ai/docs/models) confirms Grok 4.20 as newest flagship with 2M context but did not render per-token rates in table at time of verification. DeepSeek API pricing March 2026: V3.2 at $0.28/$0.42, R1 at $0.50/$2.18 (tldl.io/resources/deepseek-api-pricing, deepseak.org). DeepSeek-V3.2 model card (huggingface.co/deepseek-ai/DeepSeek-V3.2) confirming 685B total parameters, 37B active, MIT license, 128K context; SWE-bench Verified 70% and AIME 2026 94.2% (telnyx.com); V3.2-Speciale variant 73.1% SWE-bench Verified and 96.0% AIME 2025 (beebom.com). DeepSeek-R1 paper (arxiv.org/abs/2501.12948) confirming 671B total, 37B active, January 20, 2025 release. Meta LLaMA 4 release (huggingface.co/blog/llama4-release) confirming Maverick 400B/17B/128 experts, Scout 109B/17B/16 experts, April 5, 2025 release, Llama 4 Community License. Llama 4 Community License text (ollama.com/library/llama4/blobs/399a8a5a36db). 
Alibaba Cloud Model Studio official pricing page (alibabacloud.com/help/en/model-studio/billing-for-model-studio, updated March 19, 2026) confirming qwen3-max tiered pricing: $1.20/$6.00 for 0-32K input tokens, $2.40/$12.00 for 32K-128K, $3.00/$15.00 for 128K-252K (International deployment). Alibaba Qwen 3.5 announcement (qwen-ai.com) confirming 397B/17B, February 16, 2026 release, Apache 2.0, 262K native context, 201 languages; pricing $0.60/$3.60 via third-party providers (anotherwrapper.com). Qwen 3-Max at $1.20/$6.00 base tier with 262K context (galaxy.ai/model/qwen3-max, qwen-ai.com, alibabacloud.com). Mistral Small 4 announcement (mistral.ai/news/mistral-small-4) confirming 119B total, 6B active, 128 experts, top-4, 256K context, Apache 2.0, March 16, 2026 release; HuggingFace model card (huggingface.co/mistralai/Mistral-Small-4-119B-2603) listing 6.5B active. Moonshot AI Kimi K2.5 released January 26-27, 2026, 1.04T total parameters (confirmed from huggingface.co/blog/mlabonne/kimik25 and ai-rockstars.com), 32B active, 384 experts with top-8 routing plus one shared expert (confirmed from huggingface.co/blog/mlabonne/kimik25: ‘384 experts with 8 activated per token’, scalebytech.com, ycombinator.com/item?id=44639828), 256K context, MIT license (modelslab.com, a2aprotocol.ai, cloudflare.com/workers-ai-large-models, wikipedia.org/wiki/Moonshot_AI, deeplearning.ai, llm-stats.com); 76.8% SWE-bench Verified at launch in standard mode (modelslab.com, recapio.com); 80.9% with Agent Swarm orchestration (winbuzzer.com); Agent Swarm with 100+ sub-agents and 1,500+ parallel tool calls (a2aprotocol.ai); MoonViT 400M-parameter vision encoder for native multimodal (wikipedia.org/wiki/Moonshot_AI); continued pre-training on ~15T mixed visual and text tokens (startuphub.ai, softmaxdata.com); pricing $0.60/$3.00 via Moonshot API (costgoat.com/pricing/kimi-api citing official Moonshot platform pricing, updated February 2026; codecademy.com reports $0.60/$2.50 for output tokens 
via some providers); KDA architecture and Kimi Linear hybrid disclosed at NVIDIA GTC 2026 March 18 (1ai.net, aibase.com); Cloudflare Workers AI hosting with 256K context (blog.cloudflare.com/workers-ai-large-models). Claude Opus 4.5 at $5/$25 with 200K context, November 24, 2025 release (anthropic.com/news/claude-opus-4-5). Gemini 2.5 Deep Think August 1, 2025, 87.6% LiveCodeBench (9to5google.com, neowin.net). GPT-5.4 SWE-bench Verified 77.20% (vals.ai/benchmarks/swebench, March 2026). All benchmark scores cross-referenced with vals.ai/benchmarks where available (vals.ai/benchmarks/swebench confirming Claude Opus 4.6 Thinking 79.20%, Claude Opus 4.6 80.8%, GPT-5.4 77.20%, Gemini 3 Flash 76.20% on SWE-bench Verified as of March 2026); note that OpenAI discontinued SWE-bench Verified reporting in February 2026 due to contamination concerns. Model specifications, GPU hardware details, and pricing verified via web search as of March 20, 2026; see individual chapter source citations for complete references.