NVIDIA · 2026-03

Nemotron 3 Nano 4B

Compact on-device hybrid that compresses Nemotron Nano 9B v2 into a mostly Mamba-2 stack with only four attention layers.

Nemotron 3 Nano 4B decoder block architecture: hybrid decoder, 42 layers, 4B parameters, 262K context; GQA attention (4 attention layers total), RMSNorm, SwiGLU FFN, RoPE position encoding.


Architecture Specifications

Parameters: 4B
Context Window: 262K
Decoder Type: Hybrid
Attention: GQA (4 attention layers total)
Layers: 42
Hidden Size: 3,136
Vocabulary Size: 131K
Release Date: 2026-03
Category: Hybrid Architecture
Organization: NVIDIA

Key Features

Grouped Query Attention
Layer mix: 4 GQA + 21 Mamba-2 + 17 FFN
KV cache: 16 KiB/token

Deep Dive

Overview

Nemotron 3 Nano 4B is NVIDIA's 4B dense hybrid Mamba-2 + transformer decoder, released in March 2026 as the smallest member of the Nemotron 3 family. Unlike the larger Nemotron 3 Nano 30B-A3B (a mixture-of-experts model), the 4B is dense, but it keeps the family's signature architectural bet: state-space-model layers dominate the sequence-mixing stack, with attention reserved for a handful of key positions.

Architecture at a Glance

Total parameters: ≈ 4B (dense hybrid)
Layer mix: 4 GQA + 21 Mamba-2 + 17 FFN (attention is a tiny fraction)
Attention layers: 4 (only 4 in the entire model)
KV cache: ≈ 16 KiB/token (tiny; only the 4 GQA layers contribute)
Max position: 262,144 (256K native)
Vocabulary: ≈ 131,000
Precision: bfloat16
Nemotron 3 Nano 4B configuration (source: HuggingFace config.json)
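The 16 KiB/token figure follows directly from having only 4 KV-bearing layers. A minimal sketch of the arithmetic, assuming a hypothetical head shape of 8 KV heads with head dimension 128 (the page does not publish these values):

```python
def kv_cache_bytes_per_token(n_attn_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV-cache size: each attention layer stores one K and one V
    vector per token, in bf16 (2 bytes per element)."""
    return n_attn_layers * 2 * n_kv_heads * head_dim * dtype_bytes

# Assumed head shape (8 KV heads x 128 dim); only n_attn_layers=4 is from the page.
per_token = kv_cache_bytes_per_token(4, 8, 128)
print(per_token // 1024)  # 16 (KiB per token)
```

Any head configuration with n_kv_heads x head_dim = 1024 per layer reproduces the quoted 16 KiB/token; the specific split above is illustrative.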

Only 4 Attention Layers

Nemotron 3 Nano 4B is the clearest expression of NVIDIA's 'attention is a seasoning, not a staple' bet. Only 4 of 25 sequence-mixing layers use attention — the other 21 are Mamba-2 state-space model layers, each with a fixed recurrent state instead of a growing KV cache. The 4 attention layers are placed at strategic positions in the stack (early for tokenization resolution, periodically throughout for cross-token information routing) rather than uniformly interleaved.
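The layer mix above can be sketched as a schedule builder. Everything here is illustrative: the page gives the counts (4 GQA, 21 Mamba-2, 17 FFN across 42 layers) but not the exact attention slot indices, so the positions below are assumptions:

```python
# Hypothetical schedule for the 42-layer hybrid stack. Attention slot
# indices are assumptions (early + periodic), not the published layout.
def build_schedule(n_layers=42, attn_slots=(2, 13, 26, 39), n_ssm=21, n_ffn=17):
    schedule = []
    ssm, ffn = n_ssm, n_ffn
    for i in range(n_layers):
        if i in attn_slots:
            schedule.append("gqa")      # one of the 4 attention layers
        elif ssm >= ffn:
            schedule.append("mamba2")   # spend the larger Mamba-2 budget first
            ssm -= 1
        else:
            schedule.append("ffn")
            ffn -= 1
    return schedule

sched = build_schedule()
print(sched.count("gqa"), sched.count("mamba2"), sched.count("ffn"))  # 4 21 17
```

The interleaving rule is a placeholder; the real design point is simply that Mamba-2 and FFN layers fill every position the 4 attention layers do not occupy.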

The serving consequence: at 256K context, this 4B model uses roughly the same long-context memory as a 500M pure-attention model would. That makes it by far the cheapest long-context option in the 3B–4B band. The research risk is the same as for the 30B-A3B variant: Mamba-2 inference kernels need NVIDIA's own runtime to hit peak throughput.
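Scaling the fixed 16 KiB/token figure to the full window gives a concrete budget. The all-attention comparison point is a hypothetical variant of my own construction (all 25 sequence-mixing layers as GQA), not a model the page describes:

```python
KIB = 1024
TOKENS = 262_144                  # 256K native context
per_token = 16 * KIB              # from the spec table

total = TOKENS * per_token
print(total / 2**30)              # 4.0 GiB of KV cache at full context

# Hypothetical all-attention variant: all 25 sequence-mixing layers hold KV
# instead of just 4 (assumption for comparison only).
all_attn = total * 25 // 4
print(all_attn / 2**30)           # 25.0 GiB
```

Under these assumptions the hybrid layout cuts full-context KV memory by about 6x versus an otherwise-equal all-attention stack, which is the entire edge-serving argument.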

Verdict: SSM-First 4B

Nemotron 3 Nano 4B is the most SSM-heavy small model in the gallery. For edge deployments that need long-document processing (codebases, documents, agentic workloads) on hardware that cannot afford per-token KV-cache growth, this is the default research pick. For general-purpose 4B use where ecosystem support matters more than long-context serving cost, Llama 3.2 3B or Qwen3-4B are safer bets.

