Google · 2026-04

Gemma 4 26B-A4B

Sparse Gemma 4 variant that keeps the local:global attention backbone while swapping dense FFNs for MoE layers.

Gemma 4 26B-A4B decoder block architecture: Attention: GQA with QK-Norm and Sliding Window Attention. Normalization: RMSNorm. FFN: Mixture of Experts (3.8B active parameters). Position encoding: RoPE. Scale: 25.2B total, 256K context, 30 layers. Decoder type: MoE.

GQA + QK-Norm + SWA · MoE · 3.8B active / 25.2B total · 256K context

Architecture Specifications

Parameters: 3.8B active / 25.2B total
Context Window: 256K
Decoder Type: MoE
Attention: GQA + QK-Norm + SWA
Active Parameters: 3.8B
Vocabulary Size: 262K
Release Date: 2026-04
Category: Mixture of Experts
Organization: Google

Key Features

- Grouped Query Attention
- Sliding Window Attention
- QK normalization
- Expert routing
- Layer mix: 25 sliding-window + 5 global
- KV cache: 210 KiB/token
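The 210 KiB/token figure can be reproduced from first principles: every layer stores one K and one V vector per KV head, in bfloat16 (2 bytes per element). The card does not publish per-layer head counts, so the head configuration below is a hypothetical one chosen only to show the arithmetic.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV cache size: each layer stores one K and one V vector
    per KV head, head_dim elements each (bfloat16 = 2 bytes)."""
    return n_layers * 2 * n_kv_heads * head_dim * dtype_bytes

# Hypothetical head configuration -- the card lists only the 210 KiB/token
# total, not n_kv_heads or head_dim.
kib = kv_bytes_per_token(n_layers=30, n_kv_heads=8, head_dim=224) / 1024
print(f"{kib:.0f} KiB/token")  # 210 KiB/token with these assumed dims
```

Any head configuration whose product n_kv_heads × head_dim equals 1,792 yields the same total; the real split may differ.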

Deep Dive

Overview

Gemma 4 26B-A4B is Google's first MoE release in the Gemma line, published April 2026. At 25.2 B total / 3.8 B active it is a sparse MoE in the same band as Qwen3-30B-A3B. The Gemma 3 → Gemma 4 transition is therefore a dense → MoE generational shift that mirrors what Mistral did with Small 4. Structural inheritances from Gemma 3 — local-global attention interleave, QK-Norm, large vocabulary — are preserved, making this a natural A/B partner for the Gemma 3 27B deep dive.

Architecture at a Glance

| Parameter | Value | Notes |
|---|---|---|
| Total parameters | ≈ 25.2 B | MoE (Gemma's first) |
| Active parameters | ≈ 3.8 B | per token |
| Layers | 30 | 25 sliding-window + 5 global (5:1) |
| Attention | GQA + QK-Norm + SWA | inherited from Gemma 3 |
| KV cache | ≈ 210 KiB/token | |
| Max position | 262,144 | 256 K native, double Gemma 3's 128 K |
| Precision | bfloat16 | |
Gemma 4 26B-A4B configuration (source: HuggingFace config.json)
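The 25:5 split implies a repeating pattern of five sliding-window layers followed by one global layer. The exact placement of the global layers is an assumption modeled on Gemma 3's interleave; the config's per-layer ordering may differ.

```python
def layer_types(n_layers=30, ratio=5):
    """Build the local:global interleave: every (ratio+1)-th layer is
    global attention, the rest use sliding-window attention. Placement
    of global layers is assumed to follow Gemma 3's pattern."""
    return ["global" if (i + 1) % (ratio + 1) == 0 else "sliding"
            for i in range(n_layers)]

types = layer_types()
print(types.count("sliding"), types.count("global"))  # 25 5
```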

Google's First Gemma MoE

Gemma 1, Gemma 2, and Gemma 3 were all dense. Gemma 4 26B-A4B is the first MoE release in the Gemma family, alongside a dense Gemma 4 31B variant. Per-token compute at 3.8 B active is lower than Gemma 3 27B's 27 B dense compute by nearly 7×, so Gemma 4 26B-A4B serves dramatically faster than Gemma 3 27B while carrying slightly less total capacity. This is a deliberate tradeoff: Google is betting that for Gemma's typical deployment context (single-GPU inference on consumer hardware), serving speed matters more than peak quality.
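The dense-to-MoE swap described above replaces each FFN with a routed bank of experts, of which only a few run per token. Below is a minimal top-k routing sketch; the expert count and k are illustrative, since the card does not publish Gemma 4's routing configuration.

```python
import numpy as np

def moe_ffn(x, gate_w, experts, k=2):
    """Minimal top-k MoE layer for a single token vector x.
    The router scores all experts, keeps the top-k, and mixes their
    outputs with softmax-renormalized weights. n_experts and k here
    are assumptions, not Gemma 4's published values."""
    logits = x @ gate_w                      # (n_experts,) router scores
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # renormalize over selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" is just a random linear map for illustration.
experts = [lambda x, W=rng.normal(size=(d, d)) / d: x @ W for _ in range(n_experts)]
out = moe_ffn(rng.normal(size=d), gate_w, experts)
print(out.shape)  # (16,)
```

Only k of the expert matrices are multiplied per token, which is exactly why 3.8 B active parameters can sit inside a 25.2 B total budget.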

Local-Global 5:1 Preserved

The 5:1 sliding-window to global attention ratio — Gemma 3's signature choice — carries over unchanged to Gemma 4. Only 5 of 30 layers attend to the full 256 K context, which keeps per-token KV cost manageable even as the context window doubled from 128 K to 256 K. QK-Norm on queries and keys is also preserved, as is Gemma's distinctive logit soft-capping.
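The KV savings from the 5:1 interleave can be sketched by counting cached token positions per layer type at full context. The sliding-window width is an assumption here (Gemma 3 used a 1,024-token window); the card does not state Gemma 4's value.

```python
def kv_tokens_cached(context, n_sliding, n_global, window=1024):
    """Total cached token positions across layers at a given context
    length: sliding-window layers cap their cache at `window` tokens,
    global layers cache every position. window=1024 is an assumption
    carried over from Gemma 3, not a published Gemma 4 value."""
    return n_sliding * min(context, window) + n_global * context

ctx = 256 * 1024
mixed = kv_tokens_cached(ctx, n_sliding=25, n_global=5)
all_global = kv_tokens_cached(ctx, n_sliding=0, n_global=30)
print(f"{all_global / mixed:.1f}x fewer cached positions")  # 5.9x
```

At 256 K context the five global layers dominate the cache; the 25 sliding-window layers contribute almost nothing, which is why doubling the window from 128 K barely moves per-token KV cost.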

Verdict: Gemma Goes MoE

Gemma 4 26B-A4B is architecturally conservative within the Gemma family — it inherits every Gemma 3 design choice and adds MoE on top — but strategically significant as Google's signal that MoE is now the default for Gemma deployments. For teams running Gemma 3 27B in production, Gemma 4 26B-A4B is a drop-in upgrade that roughly doubles serving throughput and doubles context window while preserving the same local-global attention semantics.
