This page collects architecture figures and fact sheets from The Big LLM Architecture Comparison and A Dream of Spring for Open-Weight LLMs.
It focuses on the architecture panels only.
Click a figure to enlarge it and use the model title to jump to
the corresponding article section.
If you spot an inaccurate fact sheet, mislabeled architecture, or broken link, please file an issue here: Architecture Gallery issue tracker.
Upon popular request, you can now also get this as a physical poster via Zazzle.
The preview there may look a bit low-resolution, but the upload is based on a fresh high-resolution export at
14,570 × 12,490 pixels (a 56 MB PNG file with 182 megapixels).
I just ordered one myself, but please be aware
that I haven't been able to verify the print quality yet.
Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.
Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.
DeepSeek's flagship template kicked off the recent wave of large open MoE models.
Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.
Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.
Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.
Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.
Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.
Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.
Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.
Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.
Compact dense model that experiments with leaving out positional encodings in selected layers.
Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.
Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.
Larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.
OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.
Rare production-model release that shows an older MoE style with fewer, larger experts.
Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.
MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.
Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.
Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.
New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.
DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.
Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.
NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.
Large MoE model that pushes sliding-window attention harder than most contemporaries.
Immediate GLM predecessor that stays closer to the older GLM-4.5 style before the MLA shift.
Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.
Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.
The Super variant scales up Nano and adds both latent experts and native speculative decoding support.
Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.
Small on-device oriented model that stays close to Llama 3.2 while nudging the scaling choices.
Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.
Compact multilingual model from Cohere with a rare parallel transformer block.
Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.
Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.
Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.
Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.
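Several fact sheets above turn on the attention variant: OLMo 3 32B's move to grouped-query attention and Sarvam's GQA-to-MLA switch are both KV-cache plays. A minimal sketch of the GQA bookkeeping follows; the head counts and sizes are illustrative assumptions, not taken from any model in the gallery.

```python
# Hedged sketch: grouped-query attention (GQA) shares each KV head
# across a group of query heads, shrinking the KV cache versus MHA.
# All numbers below are illustrative, not from a specific model.

def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query-head index to the KV head its group attends with."""
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into KV groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_bytes(n_kv_heads: int, head_dim: int, seq_len: int,
                   n_layers: int, bytes_per_elem: int = 2) -> int:
    """Size of the K and V caches across all layers (factor 2 = K and V)."""
    return 2 * n_kv_heads * head_dim * seq_len * n_layers * bytes_per_elem

# MHA baseline: 32 query heads, each with its own KV head.
mha = kv_cache_bytes(n_kv_heads=32, head_dim=128, seq_len=4096, n_layers=32)
# GQA: the same 32 query heads share 8 KV heads (groups of 4).
gqa = kv_cache_bytes(n_kv_heads=8, head_dim=128, seq_len=4096, n_layers=32)

print(mha // gqa)  # cache shrinks by the group factor, here 4x
print(kv_head_for_query_head(5, n_q_heads=32, n_kv_heads=8))  # head 5 -> KV head 1
```

MLA goes a step further by projecting keys and values into a shared low-rank latent, which is why the fact sheets treat the GQA-to-MLA switch as a further cache reduction rather than a different attention pattern.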
The original comparison article that walks through the architecture figures in context and explains the key
design choices across dense, MoE, MLA, and hybrid decoder families.
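The MoE side of that comparison largely comes down to top-k routing: a router scores all experts per token and only the top few run. A minimal sketch with illustrative numbers (expert count, scores, and k are assumptions, not values from any model above):

```python
# Hedged sketch of top-k mixture-of-experts (MoE) routing, the pattern
# shared by the DeepSeek-style models in the gallery. DeepSeek-like
# designs also run a "shared expert" on every token; Qwen3's big MoE
# drops it. Scores and counts below are made up for illustration.

def top_k_route(scores: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)[:k]
    total = sum(w for _, w in ranked)
    return [(i, w / total) for i, w in ranked]

# One token's router scores over 8 experts; activate the top 2.
scores = [0.05, 0.30, 0.10, 0.25, 0.05, 0.10, 0.10, 0.05]
picked = top_k_route(scores, k=2)
print(picked)  # experts 1 and 3 carry this token, weights summing to 1
```

The "fewer, larger experts" versus "many small experts" distinction drawn in several fact sheets changes only the score vector's length and k, not this routing logic.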
Follow-up article covering the additional open-weight architecture releases from early 2026, including the
newer MiniMax, Qwen, Ling, and Sarvam families.