Dense · Verified
Google · 2026-04

Gemma 4 31B

Dense Gemma 4 scales the family to a 256K-context multimodal checkpoint without changing the core local-global recipe much.

Gemma 4 31B decoder block architecture: Attention: GQA with QK-Norm and Sliding Window Attention. Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 30.7B parameters, 256K context, 60 layers. Decoder type: Dense.


Architecture Specifications

Parameters: 30.7B
Context Window: 256K
Decoder Type: Dense
Attention: GQA + QK-Norm + SWA
Vocabulary Size: 262K
Release Date: 2026-04
Category: Long Context
Organization: Google

Key Features

Grouped Query Attention
Sliding Window Attention
QK normalization
Layer mix: 50 sliding-window + 10 global
KV cache: 840 KiB/token

Deep Dive

Overview

Gemma 4 31B is the dense variant of the Gemma 4 family, released April 2026 alongside the Gemma 4 26B-A4B MoE. At 30.7 B dense parameters it is a direct successor to Gemma 3 27B, with the same local-global attention interleave, QK-Norm, and soft-cap heritage but a doubled context window (128 K → 256 K). For teams that cannot or will not adopt MoE serving, Gemma 4 31B is the dense continuation of the Gemma 3 lineage.

Architecture at a Glance

Parameter | Value | Notes
Total parameters | ≈ 30.7 B | dense
Layers | 60 | 50 sliding-window + 10 global (5:1)
Attention | GQA + QK-Norm + SWA | inherited from Gemma 3
KV cache | ≈ 840 KiB/token | large; dense attention on 60 layers
Max position | 262,144 | 256K native
Precision | bfloat16 |

Gemma 4 31B configuration (source: HuggingFace config.json)
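The table above can be sketched as a Gemma 3-style Hugging Face config fragment. Field names follow Gemma 3's published config.json (`num_hidden_layers`, `sliding_window_pattern`, `rope_local_base_freq`); whether Gemma 4 reuses these exact fields is an assumption.

```python
# Illustrative config fragment mirroring the spec table above.
# Values come from the table; field names are borrowed from Gemma 3's
# config.json and are assumed, not confirmed, for Gemma 4.
config = {
    "num_hidden_layers": 60,              # 50 sliding-window + 10 global
    "max_position_embeddings": 262_144,   # 256K native context
    "vocab_size": 262_144,                # 262K vocabulary
    "torch_dtype": "bfloat16",
    "sliding_window_pattern": 6,          # every 6th layer is global (5:1)
    "rope_theta": 1_000_000.0,            # global layers
    "rope_local_base_freq": 10_000.0,     # sliding-window layers
}

print(config["num_hidden_layers"], config["max_position_embeddings"])
```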

840 KiB/Token KV Cache

Gemma 4 31B's KV cache at ≈ 840 KiB per token is the largest per-token footprint in this gallery, a direct consequence of keeping dense attention KV caches across all 60 layers. At the full 256 K native context this works out to ≈ 210 GiB of KV cache per sequence (ignoring window capping on the local layers), which makes long-context serving expensive. The Gemma 4 31B dense variant is therefore best matched to shorter-context workloads, where the per-token serving cost is amortized across reasonable sequence lengths; for 128 K+ context workloads, the Gemma 4 26B-A4B MoE variant is a much better economic fit.
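The sizing above is simple arithmetic, sketched below. The 60-layer count and the ≈ 840 KiB/token figure come from the spec table; the implied per-layer KV width is inferred (bfloat16, K and V tensors per layer), not taken from a published config, and the full-context total ignores window capping on the sliding-window layers.

```python
# Back-of-envelope KV-cache sizing for Gemma 4 31B (naive, no SWA capping).
N_LAYERS = 60
BYTES_PER_TOKEN = 840 * 1024   # ~840 KiB/token, from the spec table
CONTEXT = 256 * 1024           # 256K native context

# Total KV cache for one full-context sequence.
total_gib = BYTES_PER_TOKEN * CONTEXT / 1024**3
print(f"KV cache at full context: {total_gib:.0f} GiB")   # ~210 GiB

# Working backwards: bfloat16 = 2 bytes, two tensors (K and V) per layer,
# so per-layer KV width = n_kv_heads * head_dim.
kv_width = BYTES_PER_TOKEN // (N_LAYERS * 2 * 2)
print(f"implied n_kv_heads * head_dim per layer = {kv_width}")  # 3584
```

The 3584-wide KV projection is consistent with a GQA layout (e.g. a modest number of KV heads times the head dimension), which is why the per-token cost, while large, is still well below what full multi-head attention would require.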

5:1 Local-Global Preserved

Like Gemma 3 27B and Gemma 4 26B-A4B, Gemma 4 31B uses a 5:1 sliding-window-to-global ratio: 50 local layers and 10 global layers. This is the signature Gemma attention structure, introduced in Gemma 2 and scaled up with each release. Gemma 3 27B's dual RoPE frequencies (local θ = 10K, global θ = 1M) and logit soft-capping (attention ≈ 50.0, output ≈ 30.0) both carry over to Gemma 4.
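The interleave and the soft-cap can be sketched as follows, assuming the Gemma 3-style pattern of five sliding-window layers followed by one global layer; the exact layer ordering in the real checkpoint is an assumption, and `layer_kind`/`rope_theta`/`soft_cap` are illustrative helper names, not library APIs.

```python
# Hedged sketch of the 5:1 local/global layer mix, per-layer RoPE base
# frequencies, and logit soft-capping described above.
import math

N_LAYERS = 60
LOCAL_THETA, GLOBAL_THETA = 10_000.0, 1_000_000.0

def layer_kind(i: int) -> str:
    # Assumed ordering: every 6th layer (indices 5, 11, ..., 59) is global.
    return "global" if (i + 1) % 6 == 0 else "sliding"

def rope_theta(i: int) -> float:
    # Dual RoPE frequencies: 1M base on global layers, 10K on local ones.
    return GLOBAL_THETA if layer_kind(i) == "global" else LOCAL_THETA

def soft_cap(logit: float, cap: float = 50.0) -> float:
    # Attention logit soft-capping carried over from Gemma 2:
    # smoothly bounds logits to (-cap, cap) instead of hard clipping.
    return cap * math.tanh(logit / cap)

pattern = [layer_kind(i) for i in range(N_LAYERS)]
print(pattern[:6])   # five 'sliding' layers, then one 'global'
print(pattern.count("sliding"), pattern.count("global"))  # 50 10
print(soft_cap(120.0))   # compressed toward, but below, the 50.0 cap
```

The tanh form is what distinguishes soft-capping from plain clipping: gradients stay nonzero even for very large logits, which is the usual motivation cited for it.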

Verdict: The Dense Gemma Continuation

Gemma 4 31B is for teams that cannot adopt MoE serving: fine-tuning researchers working with standard dense-optimized frameworks, hardware environments without MoE kernel support, or anyone whose production pipeline is already tuned for Gemma 3 dense serving and does not want to migrate the expert-routing infrastructure. Architecturally it is a point upgrade to Gemma 3 27B with a doubled context window. For new deployments where MoE is on the table, Gemma 4 26B-A4B is the better economic choice.
