
MiniMax M2.5 230B

Popular 230B coding model that opts for a classic attention architecture instead of newer hybrid-attention designs.

MiniMax M2.5 230B decoder block architecture: Attention: GQA with QK-Norm. Normalization: RMSNorm. FFN: Mixture of Experts (10B active parameters). Position encoding: RoPE. Scale: 230B parameters, 197K context, 62 layers. Decoder type: MoE.

10B active / 230B total · 197K context · GQA + QK-Norm · MoE

Architecture Specifications

Parameters: 10B active / 230B total
Context Window: 197K
Decoder Type: MoE
Attention: GQA + QK-Norm
Active Parameters: 10B
Layers: 62
Hidden Size: 3,072
Vocabulary Size: 200K
Release Date: 2026-02
Category: Mixture of Experts
Organization: Unknown
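
The same figures, expressed as the kind of HuggingFace-style config dictionary the deep dive below refers to. Key names follow common transformers conventions and are assumptions, not copied from MiniMax's actual config.json; only the values listed on this page are used.

    # Spec-table values rendered as an illustrative HuggingFace-style config dict.
    # Key names are assumed conventions, NOT read from MiniMax's config.json.
    minimax_m2_5_specs = {
        "num_hidden_layers": 62,
        "hidden_size": 3072,
        "vocab_size": 200_000,               # listed as "200K"; exact value not given
        "max_position_embeddings": 201_728,  # ~197K native context (see table below)
        "torch_dtype": "bfloat16",
        # MoE: ~230B total parameters, ~10B active per token (expert count not listed)
    }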

Key Features

Grouped Query Attention · QK normalization · Layer mix: 62 GQA · KV cache: 248 KiB/token

Deep Dive

Overview

MiniMax M2.5 is MiniMax's February 2026 refresh of MiniMax M2 — same 230 B total / 10 B active MoE architecture, same 62-layer GQA stack, same QK-Norm attention stability trick. The primary delta versus M2 is in post-training and the data mix, not architecture. Read the MiniMax M2 230B deep dive first — this one covers only the deltas.
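
For orientation, the block structure reads roughly like the sketch below: pre-norm RMSNorm, GQA attention with per-head QK-Norm and full RoPE, then a routed MoE FFN. Head counts, expert counts, and expert width are illustrative guesses; only the 3,072 hidden size and the overall layout come from this page.

    # Minimal PyTorch-style sketch of the decoder block described above:
    # RMSNorm -> GQA attention with QK-Norm and full RoPE -> RMSNorm -> MoE FFN.
    # Head/expert counts and expert width are ASSUMED, not published figures.
    # Requires PyTorch >= 2.4 for nn.RMSNorm.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    HIDDEN, N_HEADS, N_KV_HEADS, HEAD_DIM = 3072, 24, 8, 128   # head shape assumed
    N_EXPERTS, TOP_K = 8, 2                                     # MoE shape assumed

    def rope(x, pos):
        # Full RoPE: every head dimension is rotated (M2.5 drops M2's partial RoPE).
        half = x.shape[-1] // 2
        inv_freq = 1.0 / (10000 ** (torch.arange(half, dtype=x.dtype) / half))
        ang = pos[:, None] * inv_freq[None, :]                   # [seq, half]
        cos, sin = ang.cos()[:, None, :], ang.sin()[:, None, :]  # broadcast over heads
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    class DecoderBlock(nn.Module):
        def __init__(self):
            super().__init__()
            self.attn_norm, self.ffn_norm = nn.RMSNorm(HIDDEN), nn.RMSNorm(HIDDEN)
            self.q_norm, self.k_norm = nn.RMSNorm(HEAD_DIM), nn.RMSNorm(HEAD_DIM)  # QK-Norm
            self.q_proj = nn.Linear(HIDDEN, N_HEADS * HEAD_DIM, bias=False)
            self.k_proj = nn.Linear(HIDDEN, N_KV_HEADS * HEAD_DIM, bias=False)
            self.v_proj = nn.Linear(HIDDEN, N_KV_HEADS * HEAD_DIM, bias=False)
            self.o_proj = nn.Linear(N_HEADS * HEAD_DIM, HIDDEN, bias=False)
            self.router = nn.Linear(HIDDEN, N_EXPERTS, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(HIDDEN, 2 * HIDDEN), nn.SiLU(),
                              nn.Linear(2 * HIDDEN, HIDDEN))     # expert width assumed
                for _ in range(N_EXPERTS))

        def forward(self, x):                                    # x: [seq, HIDDEN]
            seq = x.shape[0]
            pos = torch.arange(seq, dtype=x.dtype)
            h = self.attn_norm(x)
            q = self.q_norm(self.q_proj(h).view(seq, N_HEADS, HEAD_DIM))
            k = self.k_norm(self.k_proj(h).view(seq, N_KV_HEADS, HEAD_DIM))
            v = self.v_proj(h).view(seq, N_KV_HEADS, HEAD_DIM)
            q, k = rope(q, pos), rope(k, pos)
            # GQA: each KV head serves N_HEADS // N_KV_HEADS query heads.
            k = k.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
            v = v.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
            attn = F.scaled_dot_product_attention(
                q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1), is_causal=True)
            x = x + self.o_proj(attn.transpose(0, 1).reshape(seq, -1))
            # MoE FFN: route each token to its TOP_K experts, mix outputs by router weight.
            h = self.ffn_norm(x)
            weights, idx = self.router(h).softmax(dim=-1).topk(TOP_K, dim=-1)
            out = torch.zeros_like(h)
            for t in range(seq):
                for w, e in zip(weights[t], idx[t]):
                    out[t] = out[t] + w * self.experts[int(e)](h[t])
            return x + out

    y = DecoderBlock()(torch.randn(16, HIDDEN))   # 16 tokens -> [16, 3072]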

What Changed from M2

  • Partial RoPE removed: M2.5's config.json applies full RoPE to every query/key dimension, dropping M2's partial-RoPE experiment (see the sketch after this list). MiniMax apparently found the position-free, content-only dimensions weren't pulling their weight at scale.
  • Post-training refresh: updated agentic and reasoning SFT mixes.
  • Architecture otherwise unchanged: 62 layers, 10 B active, ≈ 248 KiB/token KV cache, 197 K native context.
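
The sketch below contrasts what that config-level change amounts to on a single attention head: partial RoPE rotates only the first rotary_dim dimensions and leaves the rest position-free, while full RoPE (M2.5) rotates all of them. The rotary_dim value and function shape are illustrative, not read from MiniMax's config.json.

    # Sketch of partial vs. full RoPE on one attention head.
    # `rotary_dim` and the 64/128 split are illustrative assumptions.
    import torch

    def apply_rope(x, pos, rotary_dim=None):
        # x: [seq, head_dim]; rotate the first `rotary_dim` dims, pass the rest through.
        head_dim = x.shape[-1]
        rotary_dim = head_dim if rotary_dim is None else rotary_dim   # None => full RoPE
        half = rotary_dim // 2
        inv_freq = 1.0 / (10000 ** (torch.arange(half) / half))
        ang = pos[:, None].float() * inv_freq[None, :]
        x1, x2 = x[:, :half], x[:, half:rotary_dim]
        rotated = torch.cat([x1 * ang.cos() - x2 * ang.sin(),
                             x1 * ang.sin() + x2 * ang.cos()], dim=-1)
        return torch.cat([rotated, x[:, rotary_dim:]], dim=-1)       # tail left unrotated

    q = torch.randn(4, 128)                            # 4 positions, head_dim 128 (illustrative)
    pos = torch.arange(4)
    q_partial = apply_rope(q, pos, rotary_dim=64)      # M2-style: dims 64..127 carry no position info
    q_full    = apply_rope(q, pos)                     # M2.5-style: every dim is rotated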

Architecture at a Glance

Parameter | Value | Notes
Total parameters | ≈ 230 B | MoE, same as M2
Active parameters | ≈ 10 B | per token
Layers | 62 | all GQA
Attention | GQA + QK-Norm | partial RoPE dropped
KV cache | ≈ 248 KiB/token
Max position | ≈ 201,728 | 197 K native
Precision | bfloat16
MiniMax M2.5 230B configuration (source: HuggingFace config.json)
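
As a sanity check, the ≈ 248 KiB/token figure is consistent with a plain bf16 GQA cache. The arithmetic below assumes 8 KV heads of dimension 128 per layer, which this page does not state; it is simply one shape that reproduces the number.

    # Back-of-envelope check of the ~248 KiB/token KV-cache figure.
    # 8 KV heads x 128 head_dim are ASSUMED; the page only gives layers, dtype, and the total.
    layers, kv_heads, head_dim, bytes_bf16 = 62, 8, 128, 2
    per_token = layers * 2 * kv_heads * head_dim * bytes_bf16   # 2 = one K + one V entry per layer
    print(per_token / 1024)              # 248.0 KiB/token
    print(per_token * 197_000 / 2**30)   # ~46.6 GiB for a full 197K-token context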

Verdict: M2 + Better Post-Training

MiniMax M2.5 is a point release targeted at improving downstream quality rather than architectural efficiency. For teams already running M2, the upgrade is a drop-in. The quietly interesting signal is the removal of partial RoPE: MiniMax was one of the few labs experimenting with partial positional encoding at frontier scale, and rolling it back suggests the research community's 'full RoPE is enough' consensus was correct for this scale band.
