Sarvam 30B
A reasoning-oriented, Indian-language sparse MoE that retains GQA attention at this smaller tier.
Sarvam 30B decoder block architecture: Attention: GQA with QK-Norm. Normalization: RMSNorm. FFN: Mixture of Experts (2.4B active parameters). Position encoding: RoPE. Scale: 30B, 131K context, 19 layers. Decoder type: MoE.
Overview
Sarvam 30B is the smaller sibling to Sarvam 105B in Sarvam AI's March 2026 Indic-first release. At 30 B total / 2.4 B active, it is an unusually sparse MoE — roughly an 8% active-to-total ratio, sparser than Qwen3-30B-A3B (10% active) and GPT-OSS 20B (18% active). The intent: Indic-language serving at mobile / edge GPU scale, leveraging the MoE's small active compute to keep per-token cost cheap.
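The sparsity comparison above is simple arithmetic over active and total parameter counts. A minimal sketch, using the approximate figures quoted in this article:

```python
# Active-to-total sparsity ratios for the MoE models compared above.
# Parameter counts are the rounded figures quoted in this article,
# not exact published numbers.
models = {
    "Sarvam 30B":    (2.4e9, 30e9),
    "Qwen3-30B-A3B": (3.0e9, 30e9),
    "GPT-OSS 20B":   (3.6e9, 20e9),
}

for name, (active, total) in models.items():
    print(f"{name:14s} {active / total:5.1%} active")
# Sarvam 30B      8.0% active
# Qwen3-30B-A3B  10.0% active
# GPT-OSS 20B    18.0% active
```

Lower ratios mean less compute per token relative to the knowledge held in the expert pool, which is the axis Sarvam is pushing.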
Architecture at a Glance
| Parameter | Value | Notes |
|---|---|---|
| Total parameters | ≈ 30 B | MoE |
| Active parameters | ≈ 2.4 B | per token — very sparse |
| Layers | 19 | shallow for this param count |
| Attention | GQA + QK-Norm | no MLA at this tier |
| KV cache | ≈ 19 KiB/token | |
| Max position | 131,072 | 128 K native |
| Vocabulary | ≈ 262,000 | same large Indic-coverage vocab as Sarvam 105B |
| Precision | bfloat16 | |
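The ≈ 19 KiB/token KV-cache figure in the table can be sanity-checked with the standard GQA formula. The layer count (19) and bf16 precision come from the table; the KV-head count and head dimension below are assumptions chosen to be consistent with the quoted figure, not confirmed specs:

```python
# KV-cache bytes per token for a GQA transformer:
#   bytes/token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
layers = 19           # from the table above
kv_heads = 2          # ASSUMED -- not published
head_dim = 128        # ASSUMED -- not published
bytes_per_value = 2   # bfloat16

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token / 1024, "KiB/token")          # 19.0 KiB/token

# At the full 131,072-token context, the cache per sequence is:
print(kv_bytes_per_token * 131072 / 2**30, "GiB")      # ~2.4 GiB
```

Any (kv_heads × head_dim) product of 256 yields the same 19 KiB/token, so this only pins down the aggregate KV width, not the split.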
2.4B Active: The Mobile-MoE Bet
Per-token compute cost at 2.4 B active puts Sarvam 30B in the same serving-cost band as small dense models such as Llama 3.2 1B or SmolLM3 3B. But Sarvam 30B carries roughly 12× the total parameters in its expert pool, which means roughly 12× the knowledge capacity. This is the "small active, large total" MoE bet taken to an unusually sparse ratio: Sarvam is wagering that at edge deployment scale, where per-token inference cost is the binding constraint, serving a sparse MoE beats serving a dense model of comparable compute cost.
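The per-token economics of the bet reduce to two numbers: forward-pass FLOPs scale with active parameters (roughly 2 FLOPs per active parameter per token), while knowledge capacity scales with total parameters. A minimal sketch using the article's figures:

```python
# The 'small active, large total' trade-off in numbers.
# Rule of thumb: a forward pass costs ~2 FLOPs per ACTIVE parameter
# per token, while the expert pool's TOTAL parameters set capacity.
active, total = 2.4e9, 30e9

flops_per_token = 2 * active      # same band as a dense ~2.4B model
capacity_ratio = total / active   # parameters the router can draw on

print(f"{flops_per_token:.2e} FLOPs/token")      # 4.80e+09
print(f"{capacity_ratio:.1f}x total-to-active")  # 12.5x
```

The "12×" in the prose above is this 12.5× ratio rounded down; the 2-FLOPs-per-parameter rule is the usual decoder-inference approximation, not a Sarvam-specific measurement.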
Same Indic Vocab as Sarvam 105B
The 262 K vocabulary is shared across the Sarvam family and is the key to the serving-cost story on Indic workloads. See the Sarvam 105B deep dive for a longer explanation of why large Indic vocabularies matter.
Verdict: Indic Edge MoE
Sarvam 30B is Sarvam AI's edge-tier pick for Indic-language workloads that cannot afford frontier serving cost. At 2.4 B active compute per token, it should fit into consumer-GPU serving budgets while providing meaningful knowledge capacity in long-tail Indic languages. For general-purpose 30B use, Qwen3-30B-A3B is stronger; for Indic workloads at mobile scale, Sarvam 30B is the default.