MoE · Verified
Mistral · 2026-03

Mistral Small 4

Multimodal Mistral Small refresh that jumps from the older dense 24B stack to an MLA-based sparse MoE design.

Mistral Small 4 decoder block architecture: Attention: MLA. Normalization: RMSNorm. FFN: Mixture of Experts (6.63B active parameters). Position encoding: RoPE. Scale: 119B, 256K context, 96 layers. Decoder type: MoE.

6.63B active / 119B total · 256K context · MLA · MoE

Architecture Specifications

Parameters: 6.63B active / 119B total
Context Window: 256K
Decoder Type: MoE
Attention: MLA
Active Parameters: 6.63B
Release Date: 2026-03
Category: Mixture of Experts
Organization: Mistral

Key Features

Multi-head Latent Attention · Expert routing · Layer mix: 36 MLA · KV cache: 22.5 KiB/token

Deep Dive

Overview

Mistral Small 4 is Mistral's March 2026 major-version bump to its 'small' tier. The name is misleading: at 119 B total / 6.63 B active parameters it is no longer 'small' in any literal sense — it is now a full sparse MoE, a dramatic departure from Mistral Small 3.1 24B's dense architecture. The other architectural shift is MLA: Mistral Small 4 drops GQA in favor of Multi-head Latent Attention, joining Mistral Large 3 in Mistral's MLA transition.

Architecture at a Glance

Parameter | Value | Notes
Total parameters | ≈ 119 B | MoE; dense → MoE transition
Active parameters | ≈ 6.63 B | per token
Layers | 36 | all MLA
Attention | MLA | Mistral's second MLA release after Mistral Large 3
KV cache | ≈ 22.5 KiB/token | dramatic drop vs Mistral Small 3.1's 160 KiB
Max position | 262,144 | 256 K native
Precision | bfloat16 |
Mistral Small 4 configuration (source: HuggingFace config.json)
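
Since the table is sourced from the model's config.json, the same numbers can be pulled programmatically. A minimal sketch, assuming the model is published under a Hugging Face repo id like mistralai/Mistral-Small-4 and uses conventional config field names (both are assumptions, not confirmed for this release):

```python
import json

from huggingface_hub import hf_hub_download

# Assumed repo id -- substitute the actual Mistral Small 4 repository name.
REPO_ID = "mistralai/Mistral-Small-4"

# Download the config.json cited as the source of the table above.
config_path = hf_hub_download(repo_id=REPO_ID, filename="config.json")
with open(config_path) as f:
    cfg = json.load(f)

# Field names follow common Hugging Face conventions (e.g. Mixtral-style MoE
# configs); the actual release may name them differently.
for key in ("num_hidden_layers", "max_position_embeddings",
            "num_experts_per_tok", "torch_dtype"):
    print(f"{key}: {cfg.get(key)}")
```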

Dense → MoE Transition

The Mistral Small tier started at 7 B dense (Mistral 7B), went to 12 B dense (Nemo), then 24 B dense (Small 3 / 3.1), and now jumps to 119 B total / 6.63 B active MoE. Per-token compute at 6.63 B is roughly a quarter of Mistral Small 3.1's 24 B dense active compute — so Small 4 actually serves faster than its predecessor while carrying 5× the total parameter count. This is the standard 'scale total parameters while keeping or shrinking active compute' MoE playbook.
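
To make the trade concrete, here is a quick back-of-the-envelope calculation using the figures above, treating active parameter count as a rough proxy for per-token compute (which ignores attention and embedding costs):

```python
# Figures from the comparison above (approximate).
small_3_1_dense = 24e9    # Mistral Small 3.1: every parameter is active per token
small_4_active = 6.63e9   # Mistral Small 4: routed (active) parameters per token
small_4_total = 119e9     # Mistral Small 4: total parameters held in memory

compute_ratio = small_4_active / small_3_1_dense    # ~0.28, roughly a quarter
capacity_ratio = small_4_total / small_3_1_dense    # ~4.96, roughly 5x

print(f"Per-token compute vs Small 3.1:  {compute_ratio:.2f}x")
print(f"Total parameter capacity vs 3.1: {capacity_ratio:.1f}x")
```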

MLA at Mid-Size

The adoption of MLA cuts KV-cache footprint from Mistral Small 3.1's 160 KiB/token to just 22.5 KiB/token, a roughly 7× reduction. This is what makes the 256 K native context window economically servable at this parameter count. With 22.5 KiB × 256 K ≈ 5.6 GiB of KV cache per sequence, Small 4 fits full-context workloads into single-GPU serving in a way Mistral Small 3.1 could only dream of.
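
The serving math behind that claim, as a minimal sketch using the per-token figures quoted above (single sequence, no batching):

```python
# Per-token KV-cache footprints quoted above.
MLA_KIB_PER_TOKEN = 22.5    # Mistral Small 4 (MLA latent cache)
GQA_KIB_PER_TOKEN = 160.0   # Mistral Small 3.1 (GQA cache); hypothetical at 256 K,
                            # since Small 3.1's native window is 128 K
CONTEXT_TOKENS = 262_144    # 256 K native max position

def kv_cache_gib(kib_per_token: float, tokens: int) -> float:
    """Total KV cache for one sequence, in GiB."""
    return kib_per_token * tokens / (1024 ** 2)

print(f"Small 4 at 256K:   {kv_cache_gib(MLA_KIB_PER_TOKEN, CONTEXT_TOKENS):.2f} GiB")  # ~5.63
print(f"Small 3.1 at 256K: {kv_cache_gib(GQA_KIB_PER_TOKEN, CONTEXT_TOKENS):.1f} GiB")  # 40.0
```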

Verdict: Small in Name Only

Mistral Small 4 is architecturally a new generation, not a point release. It abandons Mistral's dense-small heritage in favor of MoE serving economics, and adopts MLA to make long-context workloads practical. For teams whose production workload was Mistral Small 3.1 with 128 K context, Small 4 is a drop-in upgrade that roughly doubles the context window, cuts per-token serving cost, and raises total knowledge capacity 5×. The license should be checked — Small 4's release terms differ from Small 3.1's Apache 2.0.
