Gemma 4 31B
Dense Gemma 4 scales the family to a 256K-context multimodal checkpoint without changing the core local-global recipe much.
Gemma 4 31B decoder block architecture: Attention: GQA with QK-Norm and Sliding Window Attention (SWA). Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 30.7B parameters, 256K context, 60 layers. Decoder type: Dense.
Overview
Gemma 4 31B is the dense variant of the Gemma 4 family, released April 2026 alongside the Gemma 4 26B-A4B MoE. At 30.7 B dense parameters it is a direct successor to Gemma 3 27B, with the same local-global attention interleave, QK-Norm, and soft-cap heritage but a doubled context window (128 K → 256 K). For teams that cannot or will not adopt MoE serving, Gemma 4 31B is the dense continuation of the Gemma 3 lineage.
Architecture at a Glance
| Parameter | Value | Notes |
|---|---|---|
| Total parameters | ≈ 30.7 B | dense |
| Layers | 60 | 50 sliding-window + 10 global (5:1) |
| Attention | GQA + QK-Norm + SWA | inherited from Gemma 3 |
| KV cache | ≈ 840 KiB/token | large — dense attention on 60 layers |
| Max position | 262,144 | 256 K native |
| Precision | bfloat16 | |
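The table above can be captured as a minimal config record. The field names below mirror common Hugging Face-style config conventions but are assumptions of this sketch, not the official Gemma 4 config schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gemma4DenseConfig:
    """Spec sheet for Gemma 4 31B, transcribed from the table above."""
    total_params: float = 30.7e9            # dense
    num_layers: int = 60                    # 50 sliding-window + 10 global (5:1)
    max_position_embeddings: int = 262_144  # 256 K native
    kv_cache_kib_per_token: int = 840       # approximate, all layers combined
    torch_dtype: str = "bfloat16"


cfg = Gemma4DenseConfig()
print(cfg.num_layers, cfg.max_position_embeddings)
```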
840 KiB/Token KV Cache
Gemma 4 31B's KV cache at ≈ 840 KiB per token is the largest per-token footprint in this gallery, a direct consequence of keeping GQA attention active across all 60 layers of a dense stack. At the full 256 K native context this comes to ≈ 210 GiB of KV cache per sequence, which makes long-context serving expensive. The Gemma 4 31B dense variant is therefore best matched to shorter-context workloads, where the per-token serving cost is amortized across reasonable sequence lengths; for 128 K+ context workloads, the Gemma 4 26B-A4B MoE variant is a much better economic fit.
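The sizing argument is simple arithmetic. A back-of-envelope sketch using the ≈ 840 KiB/token figure quoted above (taken as given, not derived from head counts here):

```python
# Back-of-envelope KV-cache sizing for Gemma 4 31B.
KV_BYTES_PER_TOKEN = 840 * 1024  # ≈ 840 KiB/token across all 60 layers
NATIVE_CONTEXT = 262_144         # 256 K max position


def kv_cache_gib(tokens: int, batch: int = 1) -> float:
    """Total KV-cache size in GiB for `batch` sequences of `tokens` tokens."""
    return batch * tokens * KV_BYTES_PER_TOKEN / 1024**3


print(f"{kv_cache_gib(NATIVE_CONTEXT):.0f} GiB")  # 210 GiB per full-context sequence
print(f"{kv_cache_gib(8_192):.1f} GiB")           # an 8 K sequence is far cheaper
```

At batch size 1 and full context the cache alone exceeds the model weights (≈ 61 GiB in bfloat16), which is why the text steers long-context workloads toward the MoE variant.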
5:1 Local-Global Preserved
Like Gemma 3 27B and Gemma 4 26B-A4B, Gemma 4 31B uses a 5:1 sliding-window to global ratio — 50 local layers + 10 global layers. This is the signature Gemma attention structure, chosen in Gemma 2 and scaled up with each release. Gemma 3 27B's dual-RoPE frequencies (local θ=10K, global θ=1M) and logit soft-capping (attention ≈ 50.0, output ≈ 30.0) both carry over to Gemma 4.
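The interleave and the carried-over constants can be sketched as follows. The "every sixth layer is global" placement and window handling are illustrative assumptions; only the 50-local / 10-global split, the dual RoPE frequencies, and the soft-cap values come from the text.

```python
import math

N_LAYERS = 60


def layer_kind(i: int) -> str:
    # One global layer per block of six: 5 sliding-window + 1 global (assumed placement).
    return "global" if (i + 1) % 6 == 0 else "sliding_window"


kinds = [layer_kind(i) for i in range(N_LAYERS)]

# Dual RoPE base frequencies carried over from Gemma 3 27B.
ROPE_THETA = {"sliding_window": 10_000.0, "global": 1_000_000.0}


def soft_cap(logit: float, cap: float) -> float:
    """Logit soft-capping: squashes logits smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)


ATTN_SOFTCAP, OUTPUT_SOFTCAP = 50.0, 30.0
print(kinds.count("sliding_window"), kinds.count("global"))  # 50 10
```

Soft-capping keeps extreme attention and output logits bounded without a hard clip, which is the property Gemma 2 introduced it for.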
Verdict: The Dense Gemma Continuation
Gemma 4 31B is for teams that cannot adopt MoE serving: fine-tuning researchers working with standard dense-optimized frameworks, hardware environments without MoE kernel support, or anyone whose production pipeline is already tuned for Gemma 3 dense serving and does not want to migrate the expert-routing infrastructure. Architecturally it is a point upgrade to Gemma 3 27B with a doubled context window. For new deployments where MoE is on the table, Gemma 4 26B-A4B is the better economic choice.