GLM-5 744B
A major GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.
GLM-5 744B decoder block architecture:
- Attention: MLA + DeepSeek Sparse Attention
- Normalization: RMSNorm
- FFN: Mixture of Experts (40 B active parameters)
- Position encoding: RoPE
- Scale: 744 B total, 203 K context, 78 layers
- Decoder type: MoE
Deep Dive
Overview
GLM-5 744B is Zhipu AI's flagship 2026 MoE — 744 B total parameters with ≈ 40 B active per token — and the clearest demonstration that DeepSeek V3's architectural template has become the default for frontier open-weight MoE. GLM-5 builds directly on V3's two most important ideas (Multi-head Latent Attention and fine-grained sparse MoE) and adds a DeepSeek-style sparse attention kernel on top of MLA to push context cost down further. The result is a 78-layer MLA stack with a KV cache of ≈ 87.8 KiB per token and a native 203 K context window.
This deep dive reads best after the DeepSeek V3 entry. If V3 is the template, GLM-5 is what you get when a different team scales the template roughly 1.1× in total parameters, 1.08× in active parameters, and wraps a sparse-attention optimization around the MLA path. Very little of the delta is novel; the interesting question is how the scaling choices shift the engineering trade-offs.
Architecture at a Glance
| Parameter | Value | Notes |
|---|---|---|
| Total parameters | 744 B | Zhipu release notes |
| Active parameters / token | 40 B | top-k sparse routing on fine-grained experts |
| Layers | 78 | all MLA — uniform stack (no hybrid) |
| Attention | MLA + DeepSeek Sparse Attention | MLA compresses KV, DSA sparsifies the softmax |
| Vocabulary | 155,136 | config.json |
| Context window | 203 K | post-trained extension |
| Decoder type | MoE | fine-grained + shared experts |
| KV cache / token | ≈ 87.8 KiB | 78 MLA layers dominate at this depth |
Attention: MLA + DeepSeek Sparse Attention
GLM-5's attention path combines two layers of optimization that operate on different dimensions of the cost function. Multi-head Latent Attention (MLA, introduced in DeepSeek-V2) compresses the key-value cache through a learned low-rank latent bottleneck, attacking the memory side of the attention bill. On top of that, DeepSeek Sparse Attention (DSA) sparsifies the softmax itself, attacking the compute side: instead of letting every query attend to every key in the window, DSA selects a top-k subset of keys per query and computes the softmax only over that subset. At a 203 K context, this matters because even with MLA's cache compression, the full softmax compute would still be quadratic in sequence length.
The 78-layer all-MLA depth is striking — deeper than DeepSeek V3's 61 layers, deeper than most contemporary MoEs. The extra depth is what produces the 87.8 KiB/token KV footprint even with MLA's latent compression: MLA is a per-layer saving, so a deeper model pays more per token even if each layer is cheap. This is the first place the GLM-5 / DeepSeek V3 comparison becomes non-trivial — GLM-5 trades a thicker stack for a slightly less extreme KV compression.
**Insight: MLA compresses memory, DSA compresses compute.** Think of the attention bill as a 2D grid: memory cost × compute cost. MLA pushes the memory axis down through latent compression. DSA pushes the compute axis down through top-k softmax sparsification. GLM-5 is the first major release to combine both — earlier MLA-only models (DeepSeek V2/V3) still pay full quadratic attention compute inside each layer.
Block Structure & MoE FFN
Each of the 78 blocks pairs an MLA + DSA attention module with a fine-grained sparse MoE feed-forward block. The MoE follows the DeepSeek recipe: many small routed experts, a small number of shared experts that fire unconditionally, and auxiliary-loss-free load balancing via online bias updates. The 40 B active / 744 B total sparsity ratio (≈ 5.4% active) is tighter than DeepSeek V3's ≈ 5.5%, meaning most of the scaling relative to V3 went into wider expert banks, not wider active compute.
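The routed-plus-shared expert pattern described above can be sketched as follows. This is an illustrative single-token forward pass, not GLM-5's actual configuration: expert counts, dimensions, the sigmoid router, and the ReLU activation are all simplifying assumptions. Note that the load-balancing bias influences *which* experts are selected but not how their outputs are weighted.

```python
import numpy as np

def moe_ffn(x, routed_experts, shared_experts, gate_w, bias, top_k=2):
    """Fine-grained MoE forward for one token (illustrative sizes).

    x              : (d_model,) token activation
    routed_experts : list of (W_in, W_out) small expert FFNs
    shared_experts : experts applied to every token unconditionally
    gate_w         : (n_experts, d_model) router weights
    bias           : (n_experts,) load-balancing bias (selection only)
    """
    # Router affinities; the bias steers selection but is NOT used
    # to weight expert outputs.
    scores = 1.0 / (1.0 + np.exp(-(gate_w @ x)))      # sigmoid affinities
    topk_idx = np.argsort(scores + bias)[-top_k:]     # bias-adjusted top-k

    # Gate weights come from the un-biased scores, renormalized over the top-k.
    gates = scores[topk_idx] / scores[topk_idx].sum()

    out = np.zeros_like(x)
    for g, i in zip(gates, topk_idx):
        W_in, W_out = routed_experts[i]
        out += g * (W_out @ np.maximum(W_in @ x, 0.0))  # routed experts
    for W_in, W_out in shared_experts:                  # always-on shared experts
        out += W_out @ np.maximum(W_in @ x, 0.0)
    return out
```

The key structural point the sketch captures: shared experts fire on every token regardless of the router, which is what lets the routed experts specialize without any one of them having to carry common knowledge.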
```python
def glm5_block(x, kv_cache):
    # Attention path: MLA (cache compression) + DSA (softmax sparsification)
    h = rms_norm(x)
    h, kv_cache = mla_attention(
        h,
        kv_cache,
        kv_lora_rank=...,   # MLA: low-rank KV cache compression
        sparse_topk=...,    # DSA: attend to only the top-k keys per query
    )
    x = x + h

    # Sparse MoE FFN (fine-grained routed + shared experts)
    h = rms_norm(x)
    x = x + deepseek_moe(h, top_k=..., shared_experts=...)
    return x, kv_cache
```

Embeddings and Tokenizer
GLM-5 uses a 155,136-entry vocabulary — about 20% larger than DeepSeek V3's ≈ 129 K, reflecting Zhipu's stronger Chinese coverage. Position information is encoded through RoPE, using the MLA split-head pattern: a small number of head dimensions carry explicit rotary encoding while the rest operate on the compressed latent without RoPE. This is identical to DeepSeek V2/V3's approach and is what allows the MLA bottleneck to coexist with position-sensitive attention.
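The split-head pattern can be sketched on a single head vector: rotate only a small rotary slice and pass the remaining (latent-derived) dimensions through untouched. The slice width and frequency base here are illustrative, not GLM-5's published values.

```python
import numpy as np

def apply_decoupled_rope(q, pos, rope_dims=8, base=10000.0):
    """Rotate only the first `rope_dims` of a head vector; leave the rest as-is.

    q: (head_dim,) query/key vector; pos: integer token position.
    In MLA the non-rotary dims come from the compressed latent and carry
    no explicit position signal.
    """
    rot, rest = q[:rope_dims], q[rope_dims:]
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:half], rot[half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,   # standard 2D rotations
                              x1 * sin + x2 * cos])
    return np.concatenate([rotated, rest])
```

Because the rotation is a pure 2D rotation per dimension pair, the rotary slice keeps its norm at every position, and at position 0 the vector passes through unchanged.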
Context Window: 203 K
The 203 K context window is achieved through a combination of native long-context pretraining and post-hoc RoPE extrapolation. At that window and 87.8 KiB/token, the full KV cache is ≈ 17.4 GiB per sequence — larger than DeepSeek V3 at 128 K (≈ 8.8 GiB) because of both the longer context and the deeper stack. DSA's compute sparsification partially offsets the cost but does not reduce the cache itself.
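Both figures can be reproduced from the layer count with one assumption: that GLM-5 reuses DeepSeek-V3-style MLA dimensions (a 512-dim compressed KV latent plus 64 rotary dims, cached at 2 bytes each) — which is not confirmed for GLM-5 but happens to match the quoted numbers exactly.

```python
# Back-of-envelope KV cache for GLM-5's MLA stack.
# Assumption (not confirmed for GLM-5): DeepSeek-V3-style MLA dims,
# i.e. a 512-dim compressed KV latent + 64 rotary dims, cached in bf16.
KV_LATENT, ROPE_DIMS, BYTES = 512, 64, 2
LAYERS, CONTEXT = 78, 203 * 1024

per_layer = (KV_LATENT + ROPE_DIMS) * BYTES            # bytes per layer per token
per_token_kib = per_layer * LAYERS / 1024              # ≈ 87.8 KiB/token
full_cache_gib = per_layer * LAYERS * CONTEXT / 2**30  # ≈ 17.4 GiB/sequence
print(f"{per_token_kib:.1f} KiB/token, {full_cache_gib:.1f} GiB at 203K")
```

Swapping in V3's 61 layers gives ≈ 68.6 KiB/token, matching the comparison in the table above — the per-layer cost is identical under this assumption; only the depth differs.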
How DeepSeek Sparse Attention Actually Works
DeepSeek Sparse Attention (DSA) is not new mathematics — it is a productized version of top-k attention, which the research community has been trying to make work since the original Reformer (Kitaev et al., 2020) and Longformer (Beltagy et al., 2020) papers. The trick that makes DSA different from those earlier attempts is the learned query-side index that decides which keys are worth attending to, rather than a fixed sliding window or locality-sensitive hash. For each query token, DSA scores every candidate key through a lightweight scoring head, selects the top-k scoring keys, and then runs the standard softmax attention only over that subset. This keeps the softmax numerically stable (you only normalize over tokens you actually attend to) while dropping the attention compute from quadratic to linear-in-k per query.
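The mechanism can be sketched in a few lines for a single query. The single linear map standing in for DSA's learned scoring head is a deliberate simplification — the real indexer is a small trained module — but the shape of the computation (cheap scoring, top-k selection, exact softmax over the subset) is the point.

```python
import numpy as np

def dsa_topk_attention(q, keys, values, scorer_w, k=4):
    """Top-k sparse attention in the style DSA productizes (illustrative).

    q        : (d,) query vector
    keys     : (n, d) candidate keys; values: (n, d)
    scorer_w : (d, d) stand-in for the lightweight learned scoring head
    """
    # 1. Cheap relevance scores for every candidate key.
    coarse = keys @ (scorer_w @ q)

    # 2. Keep only the top-k keys; nothing else is touched by the
    #    exact attention below.
    idx = np.argsort(coarse)[-k:]

    # 3. Exact softmax attention over the selected subset only —
    #    normalization covers just the tokens actually attended to.
    logits = keys[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[idx]
```

Setting `k = n` recovers dense softmax attention exactly, which is a useful sanity check: the sparsification only changes *which* keys participate, never the attention arithmetic itself.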
GLM-5's use of DSA on top of MLA's latent-compressed KV cache is the interesting composition. The scoring head reads the full-rank reconstructed keys (not the compressed latent) before the top-k selection, so the selection quality is not degraded by MLA's bottleneck. This is subtle but important: a naive implementation that scored the compressed latent directly would lose roughly half the effectiveness of DSA at 200 K+ contexts because the latent's low-rank projection discards exactly the fine-grained distinctions that top-k needs to make good selections.
Training
GLM-5 follows the Zhipu GLM family recipe updated for the V3-era MoE stack. Pretraining is on a bilingual Chinese/English corpus with heavy code, math, and long-context representation, using FP8-class mixed precision for the main compute path. Post-training retains Zhipu's preference for tool-use and agent supervision in the SFT stage, followed by a DPO alignment phase — the same broad recipe as DeepSeek V3 with Zhipu-specific data mixtures.
The 744 B scale pushes the pretraining curriculum harder than V3's 671 B in two specific places. First, the routing warmup has to be extended: with more total experts per layer and a deeper 78-layer stack, the auxiliary-loss-free load balancer takes longer to stabilize, and Zhipu reports using a longer warmup schedule than DeepSeek did for V3. Second, the long-context extension stage has to account for DSA's learned scoring head — the scoring head has to see long sequences during training for its top-k selection to generalize, so GLM-5's context expansion is staged more aggressively than a pure MLA model would need.
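The auxiliary-loss-free balancer mentioned above works by nudging a per-expert selection bias after each batch rather than adding a balancing term to the loss. A minimal sketch, assuming DeepSeek-V3's sign-based update rule (the update speed `gamma` is illustrative):

```python
import numpy as np

def update_routing_bias(bias, tokens_per_expert, gamma=0.001):
    """One auxiliary-loss-free balancing step (DeepSeek-V3-style sketch).

    Experts that received more than their fair share of tokens in the
    last batch get their selection bias nudged down; underloaded experts
    get nudged up. The bias affects routing only, never the gate weights.
    """
    mean_load = tokens_per_expert.mean()
    # sign(+) for underloaded -> raise bias; sign(-) for overloaded -> lower it
    return bias + gamma * np.sign(mean_load - tokens_per_expert)
```

Because the correction is applied through selection rather than the loss, it leaves gradients untouched — which is also why it converges slowly and why a deeper, wider-expert stack like GLM-5's plausibly needs the longer warmup the text describes.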
Post-training contributes Zhipu's signature strengths: heavy tool-use supervision, long-document summarization and extraction, and a strong Chinese-language evaluation focus. The released Instruct variant is what practitioners actually deploy — the base model is rarely used directly, consistent with the pattern established by DeepSeek V3 and the Qwen3 family.
The 78-layer depth choice is worth a close look because it is the single most consequential divergence from the DeepSeek V3 template. A deeper stack at the same hidden size increases per-token compute and memory cost roughly linearly — 78 / 61 ≈ 1.28× more work per token, which is not free. The trade Zhipu is making is that extra depth improves reasoning quality more than extra width at these scales, a hypothesis the GLM family has been pursuing since GLM-4. The cost shows up directly in the KV cache: at 87.8 KiB/token, GLM-5 pays roughly 28% more memory per token than V3's 68.6 KiB/token, with DSA's compute savings partially (but not fully) offsetting the memory hit. Whether this is a good trade depends on the workload: reasoning-heavy tasks benefit from the depth, while raw throughput-bound serving slightly prefers V3's shallower stack.
Verdict: DeepSeek V3, Scaled and Sharpened
GLM-5 is the clearest evidence that frontier open-weight MoE has converged on a narrow architectural ridge. Strip away the branding and what remains is: MLA for KV cache compression, fine-grained sparse MoE with a shared expert, auxiliary-loss-free routing, a DeepSeek-family tokenizer, and FP8 pretraining. GLM-5's single notable deviation is adding DeepSeek Sparse Attention on top of MLA, which attacks the compute side of the attention bill that MLA alone leaves untouched. If DeepSeek V3 defined the template, GLM-5 is evidence that the template is now a reusable platform — one that successive labs can drop their own scaling and optimization choices into without reinventing the base recipe.
NoteCompareRead GLM-5 against DeepSeek V3 directly. The architecture table and MoE recipe are nearly identical. The meaningful delta is depth (78 vs 61), scale (744 B vs 671 B), and the addition of DSA. Everything else is lineage.