
Nanbeige 4.1 3B

Small on-device oriented model that stays close to Llama 3.2 while nudging the scaling choices.

Nanbeige 4.1 3B decoder block architecture: Attention: GQA. Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 3B, 262K context, 32 layers. Decoder type: Dense.


Architecture Specifications

Parameters: 3B
Context Window: 262K
Decoder Type: Dense
Attention: GQA
Layers: 32
Hidden Size: 2,560
Vocabulary Size: 166K
Release Date: 2026-02
Category: Efficient & Small
Organization: Unknown

Key Features

Grouped Query Attention. Layer mix: 32 GQA layers. KV cache: 64 KiB/token.
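The 64 KiB/token figure is consistent with a bf16 GQA cache across 32 layers. A minimal sketch of the arithmetic, assuming 4 KV heads with head dimension 128 (head counts are not stated on this page, so these are illustrative values):

```python
# Back-of-envelope KV-cache size per token for a GQA decoder.
# Assumed values (NOT from the model card): 4 KV heads, head_dim 128, bf16.
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # K and V are each cached per layer:
    # 2 (K+V) * layers * kv_heads * head_dim elements
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=4, head_dim=128)
print(per_token // 1024, "KiB/token")  # 64 KiB/token
```

Any (kv_heads, head_dim) pair whose product is 512 yields the same 64 KiB/token under bf16.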

Deep Dive

Overview

Nanbeige 4.1 3B is a 3B dense decoder from the Nanbeige team, released in February 2026. At this size it competes with Llama 3.2 3B, SmolLM3 3B, and Qwen3-4B in the edge-deployment tier. Its distinguishing feature, per the config.json, is the 262K native context window, dramatically larger than that of any other 3B-class model in this gallery (Llama 3.2 3B and Qwen3-4B both ship 128K).

Architecture at a Glance

Parameter | Value | Notes
Total parameters | ≈ 3 B | dense
Layers | 32 | all GQA
Attention | GQA | standard grouped-query
KV cache | ≈ 64 KiB/token |
Max position | 262,144 | 256 K native, unusually long for a 3B model
Vocabulary | ≈ 166,000 |
Normalization | RMSNorm | pre-norm
Activation | SiLU (SwiGLU) |
Precision | bfloat16 |

Nanbeige 4.1 3B configuration (source: HuggingFace config.json)

256K Context at 3B

Serving a 256 K context window on a 3 B model is unusual because the KV-cache cost scales linearly with context length while the weight cost stays fixed. At 64 KiB/token, a full 262,144-token context needs ≈ 16 GiB of KV cache, nearly 3× the ≈ 6 GB bf16 weight footprint. The Nanbeige team's bet is that edge deployments with abundant GPU RAM but tight per-token inference cost (e.g. long-document summarization on a workstation GPU) benefit from this tradeoff.
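The memory split above can be checked directly from the card's own figures (64 KiB/token KV cache, ≈ 3B parameters in bf16); this is a single-sequence estimate that ignores activations and runtime overhead:

```python
# Serving-memory estimate at the full 262,144-token context, one sequence.
# Figures from the model card: 64 KiB/token KV cache, ~3B bf16 parameters.
KIB = 1024
GIB = 1024**3

kv_cache_bytes = 64 * KIB * 262_144   # KV cache at max context
weight_bytes = 3_000_000_000 * 2      # bf16 stores 2 bytes per parameter

print(f"KV cache: {kv_cache_bytes / GIB:.1f} GiB")   # 16.0 GiB
print(f"weights:  {weight_bytes / 1e9:.1f} GB")      # 6.0 GB
print(f"ratio:    {kv_cache_bytes / weight_bytes:.1f}x")
```

At full context the cache, not the weights, dominates the memory budget, which is the tradeoff the section describes.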

Verdict: The Long-Context Edge Model

Nanbeige 4.1 3B is the longest-context 3B-class open-weight model in the gallery. Architecturally it is conservative (plain GQA, no SWA, no thinking mode). The value is the context length: if your workload is "summarize very long documents cheaply on a single GPU", this is the default pick. For general-purpose 3B use, Llama 3.2 3B and Qwen3-4B are stronger and have broader tooling support.
