Sarvam 105B
The larger Sarvam variant keeps the sparse MoE layout but switches from GQA to Multi-head Latent Attention (MLA).
Sarvam 105B decoder block architecture:
- Attention: MLA + KV LayerNorm + NoPE + RoPE
- Normalization: RMSNorm
- FFN: Mixture of Experts (10.3B active parameters)
- Position encoding: RoPE (interleaved with NoPE layers)
- Scale: 105B parameters, 131K context, 32 layers
- Decoder type: MoE
Overview
Sarvam 105B is the larger of Sarvam AI's March 2026 pair of open-weight MoEs, targeted at Indic-language workloads. At 105 B total / 10.3 B active, it is a mid-tier MoE with one of the more unusual attention stacks in this gallery: MLA with KV LayerNorm and interleaved NoPE/RoPE layers. The config.json shows 32 MLA layers, a 262 K vocabulary (one of the largest in this gallery, explicitly sized for Indic scripts), and a 131 K context window.
Architecture at a Glance
| Parameter | Value | Notes |
|---|---|---|
| Total parameters | ≈ 105 B | MoE |
| Active parameters | ≈ 10.3 B | per token |
| Layers | 32 | all MLA |
| Attention | MLA + KV LayerNorm + NoPE + RoPE | hybrid position scheme |
| KV cache | ≈ 36 KiB/token | MLA compression |
| Max position | 131,072 | 128 K native |
| Vocabulary | ≈ 262,000 | large — sized for Indic scripts |
| Precision | bfloat16 | |
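The ≈ 36 KiB/token figure is consistent with MLA caching one compressed KV latent plus one decoupled RoPE key per layer. The sketch below works that arithmetic through; the `kv_lora_rank` and `rope_head_dim` values are assumptions chosen to match the figure (they mirror common MLA configs), not values confirmed from Sarvam's config.

```python
# Back-of-envelope MLA KV-cache size. The dimensions below are
# hypothetical but reproduce the ~36 KiB/token figure in the table.

def mla_kv_cache_per_token(num_layers: int,
                           kv_lora_rank: int,
                           rope_head_dim: int,
                           bytes_per_elem: int = 2) -> int:
    """Bytes cached per token: MLA stores one compressed KV latent
    (kv_lora_rank dims) plus one shared RoPE key (rope_head_dim dims)
    per layer, instead of full per-head K/V tensors."""
    per_layer = kv_lora_rank + rope_head_dim
    return num_layers * per_layer * bytes_per_elem  # bf16 = 2 bytes/elem

size = mla_kv_cache_per_token(num_layers=32, kv_lora_rank=512, rope_head_dim=64)
print(size, size / 1024)  # 36864 bytes = 36.0 KiB per token
```

For comparison, a GQA model with 8 KV heads of dim 128 at the same depth would cache 32 × 8 × 128 × 2 × 2 bytes ≈ 128 KiB/token, which is the gap MLA compression is buying here.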
NoPE + RoPE Hybrid
Sarvam 105B is one of the few models in this gallery to ship a hybrid NoPE + RoPE positional scheme: some layers use no positional information at all, others use standard RoPE. 'NoPE' (no positional encoding) layers let the model rely purely on content-based attention patterns, which research in 2024 showed can be surprisingly effective for in-context learning tasks. Mixing NoPE with RoPE layers is an attempt to get the best of both: content-addressable memory on some layers, relative-position awareness on others.
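A minimal NumPy sketch of how such a hybrid schedule operates: RoPE layers rotate queries and keys by position before attention, while NoPE layers pass them through untouched. The interleaving pattern and RoPE base below are illustrative assumptions, not Sarvam's actual configuration.

```python
import numpy as np

def rope(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Standard rotary embedding over the last dim (must be even)."""
    d = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)
    ang = pos[:, None] * inv_freq[None, :]          # (seq, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]             # paired dims to rotate
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_position(q, k, pos, layer_uses_rope: bool):
    """RoPE layers inject relative position; NoPE layers leave q/k
    unrotated, so attention there is purely content-based."""
    if layer_uses_rope:
        return rope(q, pos), rope(k, pos)
    return q, k

# Hypothetical schedule: every 4th layer is NoPE (the real pattern
# would come from the model config).
schedule = [(i % 4 != 3) for i in range(32)]  # True = RoPE layer
```

Because rotation is norm-preserving and the position-0 rotation is the identity, a NoPE layer and a RoPE layer see identical q/k for the first token; they diverge only as positions grow.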
The config also adds KV LayerNorm — a LayerNorm applied to keys and values before the attention operation. This is an attention-stability trick in the same family as QK-Norm, pioneered in a handful of open-weight models but not yet standard.
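A toy single-head version of attention with KV LayerNorm, to make the placement concrete: the norm is applied to K and V before the score computation (whereas QK-Norm normalizes Q and K). This is a bare-bones sketch without the learned scale/bias a real LayerNorm carries.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """LayerNorm over the last dim, without learned gain/bias."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_with_kv_norm(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # KV LayerNorm: normalize keys and values *before* attention,
    # bounding the scale of attention logits and value magnitudes.
    k, v = layer_norm(k), layer_norm(v)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v
```

The stabilization argument is the same as for QK-Norm: normalized keys keep logit magnitudes bounded regardless of how large the pre-norm activations grow during training.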
262K Vocabulary for Indic Coverage
The 262 K vocabulary is twice the size of most peer models (Llama 3 ships 128 K, Qwen3 ships 152 K). Sarvam's pitch: Indic scripts (Devanagari, Tamil, Bengali, etc.) need dense tokenizer coverage to avoid the byte-per-token penalty that hits English-optimized BPE tokenizers on non-Latin scripts. At 262 K tokens, a Sarvam tokenization of Hindi or Tamil text is roughly 2–3× more compressive than the same text under Llama 3's tokenizer, which directly translates to better effective context and lower inference cost on Indic workloads.
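The effective-context arithmetic can be made concrete. The bytes-per-token figures below are illustrative assumptions, not measured values: an English-optimized BPE falling back to near byte-level on Devanagari versus a script-aware vocabulary with whole-word Indic tokens.

```python
# Back-of-envelope: tokenizer efficiency -> effective context.
# The bytes-per-token numbers are hypothetical, chosen only to
# illustrate the claimed 2-3x compression gap on Indic text.

def effective_context_bytes(context_tokens: int, bytes_per_token: float) -> float:
    """Approximate volume of raw text that fits in the context window."""
    return context_tokens * bytes_per_token

ctx = 131_072  # Sarvam 105B context window, in tokens

english_bpe = effective_context_bytes(ctx, 1.2)   # assumed: near byte-level fallback
indic_vocab = effective_context_bytes(ctx, 3.5)   # assumed: script-aware 262K vocab

print(f"ratio: {indic_vocab / english_bpe:.1f}x")  # ≈ 2.9x more text per window
```

The same ratio applies to inference cost: if the same Hindi document needs ~3× fewer tokens, it also needs ~3× fewer forward passes to generate and ~3× less KV cache to hold.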
Verdict: The Indic-First Frontier
Sarvam 105B is the default pick for Indic-language production workloads at the 100 B-class tier. Its architectural novelty is in the position-encoding hybrid and the large vocabulary, both directly serving the Indic-coverage mission. For English-only use cases, Qwen3-235B-A22B or DeepSeek-V3 benchmark higher. For Indic-first teams that cannot use closed Indian models (e.g., Krutrim, Bhashini), Sarvam 105B is the strongest open-weight option.