NVIDIA · 2026-03

Nemotron 3 Nano 4B

Compact on-device hybrid that compresses Nemotron Nano 9B v2 into a mostly Mamba-2 stack with only four attention layers.

Nemotron 3 Nano 4B decoder block architecture: hybrid decoder, 42 layers, 4B parameters, 262K context; GQA attention (4 attention layers total), RMSNorm, SwiGLU FFN, RoPE position encoding.


Architecture Specifications

Parameters: 4B
Context Window: 262K
Decoder Type: Hybrid
Attention: GQA (4 attention layers total)
Layers: 42
Hidden Size: 3,136
Vocabulary Size: 131K
Release Date: 2026-03
Category: Hybrid Architecture
Organization: NVIDIA

Key Features

Grouped Query Attention
Layer mix: 4 GQA + 21 Mamba-2 + 17 FFN
KV cache: 16 KiB/token

Deep Dive

Overview

Nemotron 3 Nano 4B is NVIDIA's 4B dense hybrid Mamba-2 + transformer decoder, released in March 2026 as the smallest member of the Nemotron 3 family. Unlike the larger Nemotron 3 Nano 30B-A3B (a mixture-of-experts model), the 4B is dense, but it keeps the family's signature architectural bet: state-space-model layers dominate the sequence-mixing stack, with attention reserved for a handful of key positions.

Architecture at a Glance

Total parameters: ≈ 4B (dense hybrid)
Layer mix: 4 GQA + 21 Mamba-2 + 17 FFN (attention is a tiny fraction)
Attention layers: 4 (only 4 in the entire model)
KV cache: ≈ 16 KiB/token (tiny; only the 4 GQA layers contribute)
Max position: 262,144 (256K native)
Vocabulary: ≈ 131,000
Precision: bfloat16
Nemotron 3 Nano 4B configuration (source: HuggingFace config.json)
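The 16 KiB/token figure follows directly from having only 4 KV-bearing layers. A minimal sketch of the arithmetic, assuming a hypothetical head shape of 8 KV heads with head dimension 128 (the page does not publish these values):

```python
def kv_cache_bytes_per_token(n_attn_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV-cache size: each attention layer stores one K and one V
    vector per token, in bf16 (2 bytes per element)."""
    return n_attn_layers * 2 * n_kv_heads * head_dim * dtype_bytes

# Assumed head shape (8 KV heads x 128 dim); only n_attn_layers=4 is from the page.
per_token = kv_cache_bytes_per_token(4, 8, 128)
print(per_token // 1024)  # 16 (KiB per token)
```

Any head configuration with n_kv_heads x head_dim = 1024 per layer reproduces the quoted 16 KiB/token; the specific split above is illustrative.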

Only 4 Attention Layers

Nemotron 3 Nano 4B is the clearest expression of NVIDIA's 'attention is a seasoning, not a staple' bet. Only 4 of 25 sequence-mixing layers use attention — the other 21 are Mamba-2 state-space model layers, each with a fixed recurrent state instead of a growing KV cache. The 4 attention layers are placed at strategic positions in the stack (early for tokenization resolution, periodically throughout for cross-token information routing) rather than uniformly interleaved.
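The layer mix above can be sketched as a schedule builder. Everything here is illustrative: the page gives the counts (4 GQA, 21 Mamba-2, 17 FFN across 42 layers) but not the exact attention slot indices, so the positions below are assumptions:

```python
# Hypothetical schedule for the 42-layer hybrid stack. Attention slot
# indices are assumptions (early + periodic), not the published layout.
def build_schedule(n_layers=42, attn_slots=(2, 13, 26, 39), n_ssm=21, n_ffn=17):
    schedule = []
    ssm, ffn = n_ssm, n_ffn
    for i in range(n_layers):
        if i in attn_slots:
            schedule.append("gqa")      # one of the 4 attention layers
        elif ssm >= ffn:
            schedule.append("mamba2")   # spend the larger Mamba-2 budget first
            ssm -= 1
        else:
            schedule.append("ffn")
            ffn -= 1
    return schedule

sched = build_schedule()
print(sched.count("gqa"), sched.count("mamba2"), sched.count("ffn"))  # 4 21 17
```

The interleaving rule is a placeholder; the real design point is simply that Mamba-2 and FFN layers fill every position the 4 attention layers do not occupy.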

The serving consequence: at 256K context, this 4B model uses roughly the same long-context memory as a 500M pure-attention model would. That makes it by far the cheapest long-context option in the 3B–4B band. The research risk is the same as for the 30B-A3B variant: Mamba-2 inference kernels need NVIDIA's own runtime to hit peak throughput.
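Scaling the fixed 16 KiB/token figure to the full window gives a concrete budget. The all-attention comparison point is a hypothetical variant of my own construction (all 25 sequence-mixing layers as GQA), not a model the page describes:

```python
KIB = 1024
TOKENS = 262_144                  # 256K native context
per_token = 16 * KIB              # from the spec table

total = TOKENS * per_token
print(total / 2**30)              # 4.0 GiB of KV cache at full context

# Hypothetical all-attention variant: all 25 sequence-mixing layers hold KV
# instead of just 4 (assumption for comparison only).
all_attn = total * 25 // 4
print(all_attn / 2**30)           # 25.0 GiB
```

Under these assumptions the hybrid layout cuts full-context KV memory by about 6x versus an otherwise-equal all-attention stack, which is the entire edge-serving argument.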

Verdict: SSM-First 4B

Nemotron 3 Nano 4B is the most SSM-heavy small model in the gallery. For edge deployments that need long-document processing (codebases, documents, agentic workloads) on hardware that cannot afford per-token KV-cache growth, this is the default research pick. For general-purpose 4B use where ecosystem support matters more than long-context serving cost, Llama 3.2 3B or Qwen3-4B are safer bets.

