Alibaba · 2026-02

Qwen3.5 397B

Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.

Qwen3.5 397B decoder block architecture: Attention: 3:1 Gated DeltaNet + Gated Attn. Normalization: RMSNorm. FFN: Mixture of Experts (17B active parameters). Position encoding: RoPE. Scale: 397B, 262K context, 60 layers. Decoder type: MoE.

17B active / 397B total | 262K context | 3:1 Gated DeltaNet + Gated Attn | MoE

Architecture Specifications

Parameters: 17B active / 397B total
Context Window: 262K
Decoder Type: MoE
Attention: 3:1 Gated DeltaNet + Gated Attn
Active Parameters: 17B
Release Date: 2026-02
Category: Hybrid Architecture
Organization: Alibaba

Key Features

Expert routing
Layer mix: 15 gated attention + 45 DeltaNet
KV cache: ≈ 30 KiB/token

Deep Dive

Overview

Qwen3.5 397B-A17B is Alibaba's February 2026 follow-up to the Qwen3 series. At 397 B total / 17 B active parameters it sits between Qwen3-235B-A22B (the Qwen3 flagship) and frontier MoEs like DeepSeek V3.2 (671 B). The architectural leap is in the attention stack: Qwen3.5 drops standard GQA entirely in favor of a 3:1 Gated DeltaNet + Gated Attention hybrid. This is the same direction Qwen3-Next 80B-A3B took, now scaled to near-frontier total parameters.

Architecture at a Glance

Parameter | Value | Notes
Total parameters | ≈ 397 B | MoE
Active parameters | ≈ 17 B | per token
Layer mix | 15 gated attention + 45 DeltaNet | 60 total layers
Attention | Gated DeltaNet + Gated Attention | 3:1 DeltaNet:attention ratio
KV cache | ≈ 30 KiB/token | very small thanks to DeltaNet dominance
Max position | 262,144 | 256 K native
Precision | bfloat16 |
Qwen3.5 397B configuration (source: HuggingFace config.json)
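The 3:1 interleave above can be sketched as a repeating layer-type pattern. This is an illustrative reconstruction, not the actual config.json schema: the exact placement of the attention layers and the type names are assumptions; only the 45/15 split and the 60-layer total come from the table.

```python
# Hypothetical sketch of the 3:1 layer interleaving: every 4th layer is
# gated full attention, the other three are Gated DeltaNet. The placement
# is an assumption; the 45/15 counts match the published layer mix.
NUM_LAYERS = 60  # 45 DeltaNet + 15 gated attention

layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

assert layer_types.count("linear_attention") == 45
assert layer_types.count("full_attention") == 15
```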

DeltaNet: Linear Attention with a Delta Rule

DeltaNet is a linear-attention variant in the same research family as Mamba-2, RWKV-7, and Lightning Attention. Unlike classic linear attention, DeltaNet uses a 'delta rule' state update — at each token, the model computes a small correction to the current state rather than a full rewrite. This gives the model tighter control over how its recurrent memory state evolves over long sequences. Qwen3.5's 3:1 ratio means 75% of layers are DeltaNet, paying only linear-in-sequence cost, while the remaining 25% pay full quadratic attention cost.
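The delta-rule update can be sketched in a few lines. This is a minimal single-head version assuming the standard DeltaNet recurrence (S_t = S_{t-1} + β_t (v_t − S_{t-1} k_t) k_tᵀ); the dimensions and function name are illustrative, not Qwen3.5's actual implementation.

```python
import numpy as np

def delta_rule_step(S, k, v, q, beta):
    """One DeltaNet recurrence step (simplified, single head).

    S: (d_v, d_k) recurrent state; k, q: (d_k,) unit-norm key/query;
    v: (d_v,) value; beta: scalar write strength in [0, 1].
    """
    v_old = S @ k                          # value the state currently stores for k
    S = S + beta * np.outer(v - v_old, k)  # rank-1 correction, not a full rewrite
    return S, S @ q                        # updated state, attention output
```

With beta = 1 and a unit-norm key, the step fully overwrites the value stored under that key; smaller beta blends old and new, which is the "tighter control" over state evolution described above.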

This is architecturally the same direction as Kimi Linear 48B-A3B, scaled up to near-frontier parameter count. Read the Kimi Linear deep dive for a longer explanation of the linear-attention / full-attention hybrid tradeoff. The KV cache drops to ≈ 30 KiB/token, which is 3–10× smaller than similarly-sized pure-attention MoEs.
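The ≈ 30 KiB/token figure is easy to sanity-check: only the 15 full-attention layers cache K/V, since DeltaNet layers keep a fixed-size state instead. The KV-head count and head dimension below are illustrative assumptions, not published numbers, chosen to show one combination consistent with the figure.

```python
# Back-of-envelope check on the ~30 KiB/token KV-cache figure.
ATTN_LAYERS = 15   # only full-attention layers cache K/V
KV_HEADS = 4       # assumed GQA key/value heads (illustrative)
HEAD_DIM = 128     # assumed head dimension (illustrative)
BYTES = 2          # bfloat16

kv_bytes_per_token = ATTN_LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES  # 2 = K and V
print(kv_bytes_per_token / 1024)  # → 30.0 (KiB/token)
```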

Gated Attention

The 15 full-attention layers are gated attention: standard attention output multiplied elementwise by a sigmoid-gated projection of the input. This acts as a soft gate that can suppress the attention contribution per-token. Combined with DeltaNet, the full-attention layers can focus their (limited) budget on the most globally-relevant patterns without wasting capacity on easy cases.
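A minimal sketch of that output gate, assuming an elementwise sigmoid gate projected from the layer input; shapes and names are illustrative, not Qwen3.5's actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_output(attn_out, x, W_g):
    """Gated attention output (simplified).

    attn_out: (seq, d) standard attention output
    x: (seq, d) layer input
    W_g: (d, d) gate projection weights (assumed shape)
    """
    gate = sigmoid(x @ W_g)  # per-token, per-channel gate in (0, 1)
    return attn_out * gate   # gate near 0 suppresses the attention contribution
```

Because the gate is computed from the input token, the layer can cheaply zero out its attention contribution on easy tokens and spend its quadratic-attention budget where global context actually matters.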

Verdict: Qwen Goes Linear at Frontier Scale

Qwen3.5 397B is the largest public linear-attention hybrid as of early 2026, nearly an order of magnitude larger than Kimi Linear (48 B total). It is the clearest signal yet that Alibaba is committing to linear attention as its scaling path for serving efficiency at frontier scale, rather than betting on MLA (DeepSeek) or sparse attention (DeepSeek V3.2). For long-context serving economics, it should be benchmarked directly against Ling 2.5 1T and Kimi Linear.
