A comprehensive catalog of large language model architectures — decoder types, attention mechanisms, parameter counts, and context windows — curated for enterprise AI research and evaluation.
Colaberry AI catalogs 79+ large language model architectures from 23 organizations including Meta, Google, OpenAI, DeepSeek, Alibaba, and Mistral. The gallery covers Dense Transformers, Mixture-of-Experts (MoE), Hybrid SSM-Transformer, and Recurrent models with architecture specifications, attention mechanisms, and context window sizes.
Find architectures by decoder type, organization, parameters, and features.
Showing 24 of 79 architectures
Llama 3.2 3B decoder block architecture: Attention: GQA. Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 3B, 128K context, 24 layers. Decoder type: Dense.
Meta · Unknown
Scale
3B
Context
128K
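The Llama 3.2 3B card above names the four standard dense-decoder components: GQA attention, RMSNorm, a SwiGLU feed-forward network, and RoPE position encoding. The sketch below shows how they compose into one pre-norm decoder block. It is an illustrative sketch, not Meta's implementation; the dimensions, head counts, and FFN width are assumed values, not the model's actual configuration.

```python
# Minimal sketch of a dense pre-norm decoder block (GQA + RMSNorm + SwiGLU + RoPE).
# Hyperparameters are illustrative placeholders, not Llama 3.2 3B's real config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # Scale by the root-mean-square of the features (no mean centering).
        return x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt() * self.weight

def rope(x, base=10000.0):
    # Rotary position embedding over the head dimension (rotate-half convention).
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device, dtype=torch.float32) / half)
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQABlock(nn.Module):
    def __init__(self, dim=3072, n_heads=24, n_kv_heads=8, ffn_dim=8192):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        # SwiGLU FFN: gate and up projections, SiLU-gated, then down projection.
        self.w_gate = nn.Linear(dim, ffn_dim, bias=False)
        self.w_up = nn.Linear(dim, ffn_dim, bias=False)
        self.w_down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # GQA: each group of query heads shares one KV head.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        # SwiGLU feed-forward with the second residual connection.
        h = self.ffn_norm(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))
```

A full model of the kind listed above would stack this block (24 layers in the card), with token embeddings before the first block and a final RMSNorm plus output projection after the last.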
Gemma 4 26B-A4B decoder block architecture: Attention: GQA with QK-Norm and Sliding Window Attention (SWA). Normalization: RMSNorm. FFN: Mixture of Experts (3.8B active parameters). Position encoding: RoPE. Scale: 25.2B, 256K context, 40 layers. Decoder type: MoE.
Google · 2026-04
Scale
3.8B / 25.2B
Context
256K
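Cards like the Gemma 4 26B-A4B above list two parameter counts (active / total) because the FFN is a Mixture of Experts: each token is routed to only a few expert FFNs, so only a fraction of the total weights participate per token. The routing sketch below is a generic top-k illustration under assumed expert counts and dimensions, not any vendor's actual router.

```python
# Minimal sketch of a top-k routed Mixture-of-Experts FFN.
# Expert count, dimensions, and top_k are assumed for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    def __init__(self, dim=2048, ffn_dim=8192, n_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, ffn_dim, bias=False),
                          nn.SiLU(),
                          nn.Linear(ffn_dim, dim, bias=False))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, dim)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e        # tokens sent to expert e in this slot
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out
```

With 32 experts and top_k = 2, only a small fraction of the expert weights run per token, which is the intuition behind scale entries such as 3.8B / 25.2B (attention and embedding weights are always active, so the exact ratio differs per model).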
Gemma 4 31B decoder block architecture: Attention: GQA with QK-Norm and Sliding Window Attention (SWA). Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 30.7B, 256K context, 64 layers. Decoder type: Dense.
Google · 2026-04
Scale
30.7B
Context
256K
Gemma 4 (E2B) decoder block architecture: Attention: MQA with QK-Norm and Sliding Window Attention (SWA). Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 5.1B, 128K context, 24 layers. Decoder type: Dense.
Google · 2026-04
Scale
5.1B
Context
128K
Gemma 4 (E4B) decoder block architecture: Attention: GQA with QK-Norm and Sliding Window Attention (SWA). Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 8B, 128K context, 32 layers. Decoder type: Dense.
Google · 2026-04
Scale
8B
Context
128K
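The four Gemma 4 cards above add QK-Norm and Sliding Window Attention (SWA) on top of GQA or MQA. A minimal sketch of both modifications follows: queries and keys are RMS-normalized before the dot product, and each token attends only within a fixed backward window. The window size and the dense boolean mask are illustrative simplifications, not the models' actual settings.

```python
# Sketch of QK-Norm plus a sliding-window causal mask; window size is assumed.
import torch
import torch.nn.functional as F

def qk_norm(q, k, eps=1e-6):
    # RMS-normalize queries and keys per head before the dot product,
    # which bounds attention logits and stabilizes training.
    q = q * q.pow(2).mean(-1, keepdim=True).add(eps).rsqrt()
    k = k * k.pow(2).mean(-1, keepdim=True).add(eps).rsqrt()
    return q, k

def sliding_window_mask(seq_len, window, device=None):
    # Each position may attend only to itself and the `window - 1` previous tokens.
    i = torch.arange(seq_len, device=device)
    causal = i[None, :] <= i[:, None]
    near = (i[:, None] - i[None, :]) < window
    return causal & near

def swa_attention(q, k, v, window=1024):
    q, k = qk_norm(q, k)
    mask = sliding_window_mask(q.shape[-2], window, device=q.device)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```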
GLM-5.1 decoder block architecture: Attention: MLA + Sparse Attention. Normalization: RMSNorm. FFN: Mixture of Experts (40B active parameters). Position encoding: RoPE. Scale: 744B, 202K context, 78 layers. Decoder type: MoE.
Zhipu AI · 2026-04
Scale
40B / 744B
Context
202K
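Several of the larger MoE entries (GLM-5.1 above; Mistral Small 4, Sarvam 105B, Ling 2.5, and GLM-5 below) list MLA, Multi-head Latent Attention. The sketch below shows the core idea under assumed dimensions: keys and values are down-projected into a small shared latent, which is what the KV cache stores, then up-projected per head when attention is computed. DeepSeek-style decoupled RoPE and other refinements are omitted.

```python
# Minimal sketch of MLA-style KV compression; dimensions are assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    def __init__(self, dim=2048, n_heads=16, head_dim=128, kv_latent_dim=256):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.w_down_kv = nn.Linear(dim, kv_latent_dim, bias=False)  # cached per token
        self.w_up_k = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.w_up_v = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # The KV cache stores only this latent (kv_latent_dim values per token),
        # far smaller than full per-head keys plus values.
        c_kv = self.w_down_kv(x)
        k = self.w_up_k(c_kv).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.w_up_v(c_kv).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```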
Mistral Small 4 decoder block architecture: Attention: MLA. Normalization: RMSNorm. FFN: Mixture of Experts (6.63B active parameters). Position encoding: RoPE. Scale: 119B, 256K context, 96 layers. Decoder type: MoE.
Mistral · 2026-03
Scale
6.63B / 119B
Context
256K
Nemotron 3 Nano 4B decoder block architecture: Attention: GQA (only 4 attention layers). Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 4B, 262K context, 42 layers. Decoder type: Hybrid.
NVIDIA · 2026-03
Scale
4B
Context
262K
Nemotron 3 Super 120B-A12B decoder block architecture: Attention: Mostly Mamba-2 + a few GQA layers. Normalization: RMSNorm. FFN: Mixture of Experts (12B active parameters). Position encoding: RoPE. Scale: 120B, 1M context, 88 layers. Decoder type: MoE.
NVIDIA · 2026-03
Scale
12B / 120B
Context
1M
Sarvam 105B decoder block architecture: Attention: MLA + KV LayerNorm + NoPE + RoPE. Normalization: RMSNorm. FFN: Mixture of Experts (10.3B active parameters). Position encoding: RoPE. Scale: 105B, 131K context, 32 layers. Decoder type: MoE.
Unknown · 2026-03
Scale
10.3B / 105B
Context
131K
Sarvam 30B decoder block architecture: Attention: GQA with QK-Norm. Normalization: RMSNorm. FFN: Mixture of Experts (2.4B active parameters). Position encoding: RoPE. Scale: 30B, 131K context, 19 layers. Decoder type: MoE.
Unknown · 2026-03
Scale
2.4B / 30B
Context
131K
Nemotron 3 Super decoder block architecture: Attention: Mostly Mamba-2 + GQA. Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 120B, 1M context, 96 layers. Decoder type: Hybrid.
NVIDIA · 2026-03
Scale
12B / 120B
Context
1M
GLM-5 744B decoder block architecture: Attention: MLA + DeepSeek Sparse Attention. Normalization: RMSNorm. FFN: Mixture of Experts (40B active parameters). Position encoding: RoPE. Scale: 744B, 203K context, 78 layers. Decoder type: MoE.
Zhipu AI · 2026-02
Scale
40B / 744B
Context
203K
Ling 2.5 1T decoder block architecture: Attention: Lightning Attention + MLA. Normalization: RMSNorm. FFN: Mixture of Experts (63B active parameters). Position encoding: RoPE. Scale: 1T, 256K context, 80 layers. Decoder type: MoE.
Unknown · 2026-02
Scale
63B / 1T
Context
256K
MiniMax M2.5 230B decoder block architecture: Attention: GQA with QK-Norm. Normalization: RMSNorm. FFN: Mixture of Experts (10B active parameters). Position encoding: RoPE. Scale: 230B, 197K context, 62 layers. Decoder type: MoE.
MiniMax · 2026-02
Scale
10B / 230B
Context
197K
Nanbeige 4.1 3B decoder block architecture: Attention: GQA. Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 3B, 262K context, 32 layers. Decoder type: Dense.
Nanbeige · 2026-02
Scale
3B
Context
262K
Qwen3.5 397B decoder block architecture: Attention: 3:1 Gated DeltaNet + Gated Attention. Normalization: RMSNorm. FFN: Mixture of Experts (17B active parameters). Position encoding: RoPE. Scale: 397B, 262K context, 128 layers. Decoder type: MoE.
Alibaba · 2026-02
Scale
17B / 397B
Context
262K
Step 3.5 Flash 196B decoder block architecture: Attention: GQA with 3:1 Sliding Window Attention (SWA). Normalization: RMSNorm. FFN: Mixture of Experts (11B active parameters). Position encoding: RoPE. Scale: 196B, 262K context, 45 layers. Decoder type: MoE.
StepFun · 2026-02
Scale
11B / 196B
Context
262K
Tiny Aya 3.35B decoder block architecture: Attention: GQA with 3:1 Sliding Window Attention (SWA). Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 3.35B, 8K context, 24 layers. Decoder type: Dense.
Cohere · 2026-02
Scale
3.35B
Context
8K
GLM-5 decoder block architecture: Attention: MLA + Sparse Attention. Normalization: RMSNorm. FFN: Mixture of Experts (40B active parameters). Position encoding: RoPE. Scale: 744B, 202K context, 128 layers. Decoder type: MoE.
Zhipu AI · 2026-02
Scale
40B / 744B
Context
202K
Step 3.5 Flash decoder block architecture: Attention: GQA with Sliding Window Attention (SWA). Normalization: RMSNorm. FFN: Mixture of Experts (11B active parameters). Position encoding: RoPE. Scale: 196B, 262K context, 96 layers. Decoder type: MoE.
StepFun · 2026-02
Scale
11B / 196B
Context
262K
Nanbeige 4.1 decoder block architecture: Attention: GQA. Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 3B, 262K context, 24 layers. Decoder type: Dense.
Nanbeige · 2026-02
Scale
3B
Context
262K
MiniMax-M2.5 decoder block architecture: Attention: GQA with QK-Norm. Normalization: RMSNorm. FFN: Mixture of Experts (10B active parameters). Position encoding: RoPE. Scale: 230B, 196K context, 128 layers. Decoder type: MoE.
MiniMax · 2026-02
Scale
10B / 230B
Context
196K
Tiny Aya decoder block architecture: Attention: GQA with Sliding Window Attention (SWA) and NoPE. Normalization: RMSNorm. FFN: SwiGLU. Position encoding: NoPE. Scale: 3.35B, 8K context, 24 layers. Decoder type: Dense.
Cohere · 2026-02
Scale
3.35B
Context
8K
Colaberry AI provides architecture specifications, benchmark comparisons, and deployment guidance across dense transformers, MoE, hybrid, and recurrent models.