MoE · Verified
Unknown · 2026-03

Sarvam 105B

Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.

Sarvam 105B decoder block architecture. Attention: MLA with KV LayerNorm and a hybrid NoPE + RoPE position scheme. Normalization: RMSNorm. FFN: Mixture of Experts (10.3B active parameters). Scale: 105B total parameters, 131K context, 32 layers. Decoder type: MoE.


Architecture Specifications

Parameters: 10.3B active / 105B total
Context Window: 131K
Decoder Type: MoE
Attention: MLA + KV LayerNorm + NoPE + RoPE
Active Parameters: 10.3B
Layers: 32
Hidden Size: 4,096
Vocabulary Size: 262K
Release Date: 2026-03
Category: Mixture of Experts
Organization: Unknown

Key Features

Multi-head Latent Attention · Expert routing · Layer mix: 32 MLA · KV cache: 36 KiB/token

Deep Dive

Overview

Sarvam 105B is the larger of Sarvam AI's March 2026 pair of open-weight MoEs, targeted at Indic-language workloads. At 105B total / 10.3B active, it is a mid-tier MoE with one of the more unusual attention stacks in this gallery: MLA with KV LayerNorm and a mixed NoPE + RoPE position scheme. The config.json shows 32 MLA layers, a 262K vocabulary (one of the largest in this gallery, explicitly sized for Indic scripts), and a 131K context window.

Architecture at a Glance

| Parameter | Value | Notes |
|---|---|---|
| Total parameters | ≈ 105 B | MoE |
| Active parameters | ≈ 10.3 B | per token |
| Layers | 32 | all MLA |
| Attention | MLA + KV LayerNorm + NoPE + RoPE | hybrid position scheme |
| KV cache | ≈ 36 KiB/token | MLA compression |
| Max position | 131,072 | 128 K native |
| Vocabulary | ≈ 262,000 | large, sized for Indic scripts |
| Precision | bfloat16 | |

Sarvam 105B configuration (source: HuggingFace config.json)
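The ≈ 36 KiB/token figure can be sanity-checked with simple arithmetic. MLA caches one compressed KV latent plus one shared RoPE key per layer rather than full per-head keys and values. The exact latent dimensions are not listed on this page; the sketch below assumes DeepSeek-style values (`kv_latent_dim = 512`, `rope_head_dim = 64`), which happen to reproduce the quoted number:

```python
# Back-of-the-envelope check of the ~36 KiB/token KV-cache figure.
# kv_latent_dim=512 and rope_head_dim=64 are ASSUMED DeepSeek-style
# dimensions, not values confirmed by Sarvam's config.

def mla_kv_cache_bytes_per_token(layers: int, kv_latent_dim: int,
                                 rope_head_dim: int,
                                 bytes_per_value: int = 2) -> int:
    """Per-token KV cache for MLA: one compressed latent plus one shared
    RoPE key per layer, stored in bf16 (2 bytes per value)."""
    per_layer = (kv_latent_dim + rope_head_dim) * bytes_per_value
    return layers * per_layer

size = mla_kv_cache_bytes_per_token(layers=32, kv_latent_dim=512,
                                    rope_head_dim=64)
print(size / 1024)  # 36.0 KiB/token, matching the table above
```

For comparison, a full-KV GQA cache at the same scale would store keys and values for every KV head per layer, typically an order of magnitude more.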

NoPE + RoPE Hybrid

Sarvam 105B is one of the few models in this gallery to ship a hybrid NoPE + RoPE positional scheme: some layers use no positional information at all, others use standard RoPE. 'NoPE' (no positional encoding) layers let the model rely purely on content-based attention patterns, which research in 2024 showed can be surprisingly effective for in-context learning tasks. Mixing NoPE with RoPE layers is an attempt to get the best of both: content-addressable memory on some layers, relative-position awareness on others.
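The mechanics are simple: a NoPE layer passes queries and keys to attention unchanged, while a RoPE layer rotates them by position-dependent angles first. The sketch below illustrates this with a minimal rotary implementation; which layers skip RoPE in Sarvam 105B is not documented here, so the alternating schedule is a hypothetical example:

```python
import numpy as np

# Sketch of a hybrid NoPE + RoPE stack. The alternating layer schedule
# below is HYPOTHETICAL; Sarvam's actual NoPE/RoPE layer assignment is
# not documented on this page.

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq, dim)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def queries_for_layer(q: np.ndarray, layer: int) -> np.ndarray:
    # NoPE layers attend purely on content: q/k pass through unchanged.
    # RoPE layers inject relative position by rotating q/k pairs.
    use_rope = layer % 2 == 1  # hypothetical alternating schedule
    return rope(q) if use_rope else q
```

In a real stack the same choice is applied to keys, so attention scores on NoPE layers depend only on content similarity while RoPE layers see relative offsets.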

The config also adds KV LayerNorm — a LayerNorm applied to keys and values before the attention operation. This is an attention-stability trick in the same family as QK-Norm, pioneered in a handful of open-weight models but not yet standard.
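A minimal sketch of the idea, normalizing keys and values along the head dimension before computing attention. The placement and whether Sarvam uses learnable scales are assumptions here, not details confirmed by the config:

```python
import numpy as np

# Minimal KV-LayerNorm sketch: normalize K and V along the feature axis
# before attention, analogous to QK-Norm. Placement and learnable-scale
# details are ASSUMED, not taken from Sarvam's config.

def layernorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_with_kv_norm(q: np.ndarray, k: np.ndarray,
                           v: np.ndarray) -> np.ndarray:
    k, v = layernorm(k), layernorm(v)  # the KV-LayerNorm step
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v
```

The motivation is the same as for QK-Norm: bounding the magnitude of attention inputs keeps logits from blowing up during long training runs.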

262K Vocabulary for Indic Coverage

The 262 K vocabulary is twice the size of most peer models (Llama 3 ships 128 K, Qwen3 ships 152 K). Sarvam's pitch: Indic scripts (Devanagari, Tamil, Bengali, etc.) need dense tokenizer coverage to avoid the byte-per-token penalty that hits English-optimized BPE tokenizers on non-Latin scripts. At 262 K tokens, a Sarvam tokenization of Hindi or Tamil text is roughly 2–3× more compressive than the same text under Llama 3's tokenizer, which directly translates to better effective context and lower inference cost on Indic workloads.
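The effective-context claim follows from the compression ratio directly: a fixed 131K-token window holds more raw text when each token covers more characters. Illustrative arithmetic only; the characters-per-token figures below are hypothetical stand-ins, not measured tokenizer output:

```python
# Illustrative arithmetic for the effective-context claim. The 2-3x
# compression figure is the source's; the chars-per-token values here
# are HYPOTHETICAL stand-ins, not measured tokenizer output.

def effective_context_chars(context_tokens: int,
                            chars_per_token: float) -> float:
    """How much raw text fits in a fixed token window."""
    return context_tokens * chars_per_token

# Hypothetical: Hindi at ~1.0 chars/token under a byte-heavy
# English-optimized BPE vs ~2.5 chars/token under a dense 262K vocab.
baseline = effective_context_chars(131_072, 1.0)
sarvam = effective_context_chars(131_072, 2.5)
print(sarvam / baseline)  # 2.5x more text per context window
```

The same ratio applies to inference cost: the denser tokenization emits proportionally fewer tokens for the same Indic text.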

Verdict: The Indic-First Frontier

Sarvam 105B is the default pick for Indic-language production workloads at the 100 B-class tier. Its architectural novelty is in the position-encoding hybrid and the large vocabulary, both directly serving the Indic-coverage mission. For English-only use cases, Qwen3-235B-A22B or DeepSeek-V3 benchmark higher. For Indic-first teams that cannot use closed Indian models (e.g., Krutrim, Bhashini), Sarvam 105B is the strongest open-weight option.
