Nemotron 3 Super 120B-A12B
The Super variant scales up Nano and adds both latent experts and native speculative decoding support.
Nemotron 3 Super 120B-A12B decoder block architecture:
- Sequence mixing: mostly Mamba-2, plus a few GQA attention layers
- Normalization: RMSNorm
- FFN: Mixture of Experts (12B active parameters)
- Position encoding: RoPE
- Scale: 120B parameters, 1M context, 88 layers
- Decoder type: MoE
Overview
Nemotron 3 Super 120B-A12B is the flagship of NVIDIA's Nemotron 3 family, released March 2026. At 120 B total / 12 B active parameters it is the largest hybrid-Mamba-2 model in this gallery. The Nemotron 3 family argument — that SSM layers should dominate the sequence-mixing stack — scales all the way up to 120 B total parameters here without reverting to attention. Read the Nemotron 3 Nano 30B-A3B deep dive first for context on the family's core bet.
Architecture at a Glance
| Parameter | Value | Notes |
|---|---|---|
| Total parameters | ≈ 120 B | MoE (hybrid) |
| Active parameters | ≈ 12 B | per token |
| Layer mix | 8 GQA + 40 Mamba-2 + 40 MoE | 88 layers total (sketched below) |
| Attention layers | 8 | only ~17% of sequence-mixing layers |
| KV cache | ≈ 8 KiB/token | dramatically smaller than any pure-attention peer |
| Max position | 1,048,576 | 1 M native |
| Vocabulary | ≈ 131,000 | |
| Precision | bfloat16 | |
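To make the layer-mix row concrete, here is a minimal sketch that rebuilds an 88-layer schedule from the counts in the table. Only the counts (8 GQA + 40 Mamba-2 + 40 MoE) come from the spec; the even spacing of the GQA layers and the pairing of an MoE FFN with the first 40 sequence-mixing layers are hypothetical placeholders, since the exact interleaving pattern is not reproduced here.

```python
# Hypothetical reconstruction of the 88-layer decoder stack.
# Only the layer counts are from the spec table; the interleaving pattern
# (one GQA layer every 6th sequence-mixing slot, an MoE FFN paired with
# the first 40 mixing layers) is an illustrative guess.

N_GQA, N_MAMBA2, N_MOE = 8, 40, 40

def build_layer_schedule() -> list[str]:
    # 48 sequence-mixing layers: GQA at every 6th slot (assumption), Mamba-2 elsewhere.
    mixing = ["gqa" if (i + 1) % 6 == 0 else "mamba2" for i in range(N_GQA + N_MAMBA2)]
    schedule, moe_left = [], N_MOE
    for layer in mixing:
        schedule.append(layer)
        if moe_left > 0:              # attach a routed-MoE FFN until all 40 are placed
            schedule.append("moe_ffn")
            moe_left -= 1
    return schedule

sched = build_layer_schedule()
assert len(sched) == 88
assert sched.count("gqa") == 8 and sched.count("mamba2") == 40 and sched.count("moe_ffn") == 40
print(f"attention share of sequence-mixing layers: {sched.count('gqa') / 48:.0%}")  # ~17%
```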
1M Context at 8 KiB KV per Token
The headline number: Nemotron 3 Super's KV cache footprint at 1 M native context is roughly 8 KiB per token, producing ≈ 8 GiB of KV cache at full context. Compare this to Llama 4 Maverick at 1 M context (~192 GiB of KV cache per the Maverick deep dive) and the scale of the savings is obvious. This is not a marginal optimization — it is a categorically different serving economics story.
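The arithmetic behind these figures is worth spelling out. In the sketch below, the 8 KiB/token, 1 M context, and 8 GQA-layer numbers come from the spec above; the per-layer breakdown (2 KV heads of dimension 128 in bfloat16) is an assumption chosen to reproduce the quoted figure, not a published configuration.

```python
# Back-of-the-envelope KV-cache math for the figures quoted above.
# Published: ~8 KiB/token, 1M context, 8 GQA layers. Assumed (to make the
# numbers work out): 2 KV heads per layer, head_dim 128, bfloat16 values.

KIB, GIB = 1024, 1024**3
CONTEXT = 1_048_576                       # 1M native context

n_kv_layers   = 8                         # only the GQA layers keep a KV cache
n_kv_heads    = 2                         # assumption
head_dim      = 128                       # assumption
bytes_per_val = 2                         # bfloat16

kv_per_token = n_kv_layers * 2 * n_kv_heads * head_dim * bytes_per_val   # K and V
total_kv     = kv_per_token * CONTEXT

print(f"{kv_per_token / KIB:.0f} KiB per token")                          # 8 KiB
print(f"{total_kv / GIB:.0f} GiB at full 1M context")                     # 8 GiB
print(f"~{192 / (total_kv / GIB):.0f}x smaller than the ~192 GiB quoted for Llama 4 Maverick")
```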
The tradeoff is empirical: Mamba-2 hybrids have not yet proven to match full-attention transformers on every capability axis, particularly on pinpoint long-range retrieval (the classic needle-in-haystack test). NVIDIA's tech report argues the gap is small and closing at the 120 B scale, but benchmark your specific workload before committing.
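If pinpoint retrieval matters for your workload, a quick probe along these lines is cheap to run. The harness below is a minimal, hypothetical sketch: `generate` is whatever completion callable your serving stack exposes, and the filler text, needle phrasing, and word-for-token approximation are placeholders rather than a standardized benchmark.

```python
# Minimal needle-in-a-haystack probe. `generate(prompt) -> str` is any
# completion callable you supply (local runtime or API client); everything
# else here is an illustrative placeholder.

import random

def needle_test(generate, context_words: int = 100_000, trials: int = 10) -> float:
    """Return the fraction of trials in which the model recalls the planted code."""
    filler = "The sky was clear and the market stayed quiet all afternoon. "
    words_per_filler = len(filler.split())
    hits = 0
    for _ in range(trials):
        secret = str(random.randint(100_000, 999_999))
        needle = f"The vault access code is {secret}. "
        n_filler = max(context_words // words_per_filler, 1)
        depth = random.randint(0, n_filler)            # vary where the needle is buried
        haystack = filler * depth + needle + filler * (n_filler - depth)
        prompt = haystack + "\nQuestion: What is the vault access code? Answer:"
        if secret in generate(prompt):
            hits += 1
    return hits / trials
```

Sweeping `context_words` from short prompts up toward the 1 M-scale limit, and comparing hit rates against a pure-attention baseline, is a quick way to see whether the retrieval gap the tech report describes shows up on your data.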
MoE: 40 Routed Expert Layers
Of the 88 total layers, 40 are MoE feed-forward layers (interleaved with the Mamba-2 + GQA stack), giving Nemotron 3 Super its 120 B total / 12 B active budget. The MoE design is otherwise conventional; NVIDIA's architectural innovation is concentrated in the sequence-mixing stack, not the FFN layer. At ≈ 12 B active parameters per token, per-token serving cost lands in the same band as Llama 4 Scout (17 B active) and a few times that of Qwen3-30B-A3B (3 B active).
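Since the design is described as conventional, a standard top-k routed FFN is a reasonable mental model. The NumPy sketch below illustrates that pattern for a single token; the expert count, top-k value, activation, and dimensions are hypothetical toy values, not Nemotron 3 Super's actual configuration.

```python
# Conventional top-k routed MoE FFN for a single token, in NumPy.
# Expert count, top_k, and dimensions are toy values for illustration only.

import numpy as np

def moe_ffn(x, router_w, experts, top_k=2):
    """x: (d_model,) activation; experts: list of (w_in, w_out) weight pairs."""
    logits = router_w @ x                           # one routing logit per expert
    top = np.argsort(logits)[-top_k:]               # indices of the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                        # softmax over the selected experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (w_out @ np.maximum(w_in @ x, 0.0))   # ReLU expert FFN (assumption)
    return out

# Toy shapes: 8 experts, d_model=16, expert hidden size 64.
rng = np.random.default_rng(0)
d_model, hidden, n_experts = 16, 64, 8
experts = [(rng.standard_normal((hidden, d_model)), rng.standard_normal((d_model, hidden)))
           for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, d_model))
print(moe_ffn(rng.standard_normal(d_model), router_w, experts).shape)   # (16,)
```

Only `top_k` of the experts execute for each token, which is the mechanism that keeps active compute near 12 B despite the 120 B total.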
Verdict: The SSM Frontier
Nemotron 3 Super 120B-A12B is the most aggressive SSM-first frontier model in the open ecosystem. For workloads where 1 M context serving cost is the dominant constraint — long-document analysis, whole-codebase agentic tasks, multi-hour streaming inference — it is the default choice. For standard short-context reasoning it is competitive with pure-attention peers but not obviously ahead, so the case for adoption rests on the serving-cost story.