
Nemotron 3 Super 120B-A12B

The Super variant scales up Nano and adds both latent experts and native speculative decoding support.

[Figure] Nemotron 3 Super 120B-A12B decoder block: mostly Mamba-2 with a few GQA attention layers, RMSNorm, MoE FFN (12B active parameters), RoPE position encoding; 120B total parameters, 1M context, 88 layers.


Architecture Specifications

Parameters: 12B active / 120B total
Context Window: 1M
Decoder Type: MoE
Attention: Mostly Mamba-2 + a few GQA layers
Active Parameters: 12B
Layers: 88
Hidden Size: 4,096
Vocabulary Size: 131K
Release Date: 2026-03
Category: Hybrid Architecture
Organization: NVIDIA

Key Features

- Grouped Query Attention
- Expert routing
- Layer mix: 8 GQA + 40 Mamba-2 + 40 MoE
- KV cache: 8 KiB/token

Deep Dive

Overview

Nemotron 3 Super 120B-A12B is the flagship of NVIDIA's Nemotron 3 family, released March 2026. At 120 B total / 12 B active parameters it is the largest hybrid-Mamba-2 model in this gallery. The Nemotron 3 family argument — that SSM layers should dominate the sequence-mixing stack — scales all the way up to 120 B total parameters here without reverting to attention. Read the Nemotron 3 Nano 30B-A3B deep dive first for context on the family's core bet.
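
The memory argument behind that bet is easiest to see in the recurrence itself. Below is a toy, single-channel sketch of a selective-state-space step in the spirit of Mamba-2 (illustrative only, not NVIDIA's implementation; the fixed decay `A` and the state size are assumptions). The point: each SSM layer carries a fixed-size state instead of a KV cache that grows with context.

```python
import numpy as np

def ssm_step(h, x, A, B, C):
    """One recurrence step: h_t = A * h_{t-1} + B * x_t;  y_t = C . h_t."""
    h = A * h + B * x   # elementwise decay plus input write; state size is fixed
    y = C @ h           # readout; cost and memory independent of tokens seen so far
    return h, y

d_state = 128                       # assumed per-channel state size
h = np.zeros(d_state)               # the ENTIRE per-layer memory for this channel
A = np.full(d_state, 0.9)           # decay (input-dependent in real Mamba-2)
B = np.random.randn(d_state)
C = np.random.randn(d_state)

for x in np.random.randn(10_000):   # stream tokens; memory never grows with length
    h, y = ssm_step(h, x, A, B, C)
```

This fixed-size state is why only the 8 GQA layers contribute to the KV-cache figure discussed below.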

Architecture at a Glance

| Parameter | Value | Notes |
| --- | --- | --- |
| Total parameters | ≈ 120 B | MoE (hybrid) |
| Active parameters | ≈ 12 B | per token |
| Layer mix | 8 GQA + 40 Mamba-2 + 40 MoE | 88 layers total |
| Attention layers | 8 | only ~17% of sequence-mixing layers |
| KV cache | ≈ 8 KiB/token | tiny relative to any pure-attention peer |
| Max position | 1,048,576 | 1 M native |
| Vocabulary | ≈ 131,000 | |
| Precision | bfloat16 | |

Nemotron 3 Super 120B-A12B configuration (source: HuggingFace config.json)
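
The table pins the layer counts but not their order, which is not spelled out here. Below is a sketch of one plausible 88-layer schedule, assuming evenly spaced attention layers and strict Mamba-2 / MoE-FFN alternation elsewhere; the interleaving pattern is an assumption, only the counts come from the configuration.

```python
from collections import Counter

# Hypothetical positions for the 8 GQA layers, spread evenly across the stack.
GQA_AT = {5, 16, 27, 38, 49, 60, 71, 82}

def layer_kind(i: int) -> str:
    if i in GQA_AT:
        return "gqa"
    return "mamba2" if i % 2 == 0 else "moe_ffn"  # alternate mixer / expert FFN

schedule = [layer_kind(i) for i in range(88)]
# The counts are forced by the config even though the order is guessed:
assert Counter(schedule) == {"gqa": 8, "mamba2": 40, "moe_ffn": 40}
```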

1M Context at 8 KiB KV per Token

The headline number: Nemotron 3 Super's KV cache footprint at 1 M native context is roughly 8 KiB per token, producing ≈ 8 GiB of KV cache at full context. Compare this to Llama 4 Maverick at 1 M context (~192 GiB of KV cache per the Maverick deep dive) and the scale of the savings is obvious. This is not a marginal optimization — it is a categorically different serving economics story.
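
A minimal sketch of that arithmetic: the per-layer KV shape for Nemotron 3 Super is not published here, so the 2 KV heads at head dimension 128 in bf16 below are assumptions chosen to reproduce the stated 8 KiB/token, and the Maverick parameters are likewise back-solved from the ~192 GiB figure above.

```python
def kv_bytes_per_token(n_attn_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Each attention layer stores one K and one V vector per KV head per token.
    return n_attn_layers * 2 * n_kv_heads * head_dim * dtype_bytes

CTX = 1_048_576  # 1M-token native context

# Nemotron 3 Super: only the 8 GQA layers hold KV state (Mamba-2 layers do not).
nemotron = kv_bytes_per_token(n_attn_layers=8, n_kv_heads=2, head_dim=128)

# Llama 4 Maverick: attention in all 48 layers (values assumed to match ~192 GiB).
maverick = kv_bytes_per_token(n_attn_layers=48, n_kv_heads=8, head_dim=128)

for name, b in [("Nemotron 3 Super", nemotron), ("Llama 4 Maverick", maverick)]:
    print(f"{name}: {b / 1024:.0f} KiB/token -> {b * CTX / 2**30:.0f} GiB at 1M ctx")
```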

The tradeoff is empirical: Mamba-2 hybrids have not yet proven to match full-attention transformers on every capability axis, particularly on pinpoint long-range retrieval (the classic needle-in-haystack test). NVIDIA's tech report argues the gap is small and closing at the 120 B scale, but benchmark your specific workload before committing.

MoE: 40 Routed Expert Layers

Of the 88 total layers, 40 are MoE feed-forward layers (interleaved with the Mamba-2 + GQA stack), giving Nemotron 3 Super its 120 B total / 12 B active budget. The MoE design is otherwise conventional; NVIDIA's architectural innovation is concentrated in the sequence-mixing stack, not the FFN layer. Active compute per token at ≈ 12 B puts per-token serving cost in the same band as Llama 4 Scout (17 B active) rather than a 120 B-class dense model.
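
For intuition about what a conventional MoE layer does per token, here is a generic top-k softmax routing sketch. The page gives only the active/total budget, so the 64-expert count and k = 2 below are assumptions, not Nemotron 3 Super's actual router configuration.

```python
import numpy as np

def route_topk(hidden, w_router, k=2):
    """Generic top-k MoE routing: pick k experts per token, softmax their logits."""
    logits = hidden @ w_router                         # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]         # indices of the k best experts
    sel = np.take_along_axis(logits, topk, axis=-1)    # their logits
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over selected only
    w /= w.sum(axis=-1, keepdims=True)
    return topk, w   # each token runs through k expert FFNs, mixed by weights w

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 4096), dtype=np.float32)     # hidden size per spec table
w_router = rng.standard_normal((4096, 64), dtype=np.float32)  # 64 experts: assumed
experts, weights = route_topk(hidden, w_router)
```

Only the chosen experts' FFN weights are read per token, which is how 120 B total parameters compress to ≈ 12 B of active compute.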

Verdict: The SSM Frontier

Nemotron 3 Super 120B-A12B is the most aggressive SSM-first frontier model in the open ecosystem. For workloads where 1 M context serving cost is the dominant constraint — long-document analysis, whole-codebase agentic tasks, multi-hour streaming inference — it is the default choice. For standard short-context reasoning it is competitive with pure-attention peers but not obviously ahead, so the case for adoption rests on the serving-cost story.

