
Sarvam 30B

Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller tier.

Sarvam 30B decoder block architecture. Attention: GQA with QK-Norm. Normalization: RMSNorm. FFN: Mixture of Experts (2.4B active parameters). Position encoding: RoPE. Scale: 30B total parameters, 131K context, 19 layers. Decoder type: MoE.

GQA + QK-Norm · MoE · 2.4B active / 30B total · 131K context

Architecture Specifications

Parameters: 2.4B active / 30B total
Context Window: 131K
Decoder Type: MoE
Attention: GQA + QK-Norm
Active Parameters: 2.4B
Layers: 19
Hidden Size: 4,096
Vocabulary Size: 262K
Release Date: 2026-03
Category: Mixture of Experts
Organization: Unknown

Key Features

Grouped Query Attention · QK normalization · Expert routing · Layer mix: 19 GQA · KV cache: ≈19 KiB/token

Deep Dive

Overview

Sarvam 30B is the smaller sibling to Sarvam 105B in Sarvam AI's March 2026 Indic-first release. At 30 B total / 2.4 B active, it is an unusually sparse MoE: roughly an 8% active-to-total ratio, sparser than Qwen3-30B-A3B (10% active) and GPT-OSS 20B (18% active). The intent is Indic-language serving at mobile / edge GPU scale, using the MoE's small active compute to keep per-token inference cost low.
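A quick sanity check on that ratio (a minimal sketch; the comparison percentages are as quoted above, not independently verified):

```python
# Active-to-total sparsity ratio: the figure that drives per-token cost.
active_params = 2.4e9
total_params = 30e9
print(f"Sarvam 30B: {active_params / total_params:.0%} active")  # -> 8%
# For comparison, the text above quotes ~10% active for Qwen3-30B-A3B
# and ~18% active for GPT-OSS 20B.
```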

Architecture at a Glance

| Parameter | Value | Notes |
| --- | --- | --- |
| Total parameters | ≈ 30 B | MoE |
| Active parameters | ≈ 2.4 B | per token; very sparse |
| Layers | 19 | shallow for this parameter count |
| Attention | GQA + QK-Norm | no MLA at this tier |
| KV cache | ≈ 19 KiB/token | |
| Max position | 131,072 | 128 K native |
| Vocabulary | ≈ 262,000 | same large Indic-coverage vocab as Sarvam 105B |
| Precision | bfloat16 | |

Sarvam 30B configuration (source: HuggingFace config.json)
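The ≈19 KiB/token figure is consistent with bf16 K and V caches across 19 layers. A back-of-the-envelope sketch (the KV head count and head dimension below are assumptions chosen to match the published figure, not values from the config):

```python
# KV-cache bytes per token = layers x 2 (K and V) x KV heads x head dim x bytes/elem.
# n_kv_heads and head_dim are assumptions; only n_layers, bfloat16, and the
# ~19 KiB/token target come from the page above.
n_layers = 19
n_kv_heads = 2        # assumption
head_dim = 128        # assumption
bytes_per_elem = 2    # bfloat16

kv_bytes = n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes / 1024:.1f} KiB/token")  # -> 19.0 KiB/token
```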

2.4B Active: The Mobile-MoE Bet

Per-token compute at 2.4 B active parameters puts Sarvam 30B in the same serving-cost band as small dense models in the Llama 3.2 1B to SmolLM3 3B range. But Sarvam 30B carries roughly 12× the parameters in its expert pool, which means roughly 12× the knowledge capacity at similar per-token cost. This is the "small active, large total" MoE bet taken to an unusually sparse ratio: Sarvam is betting that at edge deployment scale, where per-token inference cost is the binding constraint, serving a sparse MoE beats serving a dense model of comparable compute cost.
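The mechanism behind the small active footprint is top-k expert routing: each token's FFN pass touches only a handful of experts from the full pool, so compute scales with the active count while capacity scales with the total. A minimal PyTorch sketch (expert count, top-k, and FFN width are illustrative assumptions; Sarvam's actual router configuration is not given on this page):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k routed MoE FFN layer. All sizes are illustrative."""
    def __init__(self, d_model=4096, d_ff=1024, n_experts=64, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: [tokens, d_model]
        scores = self.router(x)                    # [tokens, n_experts]
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)        # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # only k experts run per token
            for e in top_idx[:, slot].unique().tolist():
                sel = top_idx[:, slot] == e
                out[sel] += gates[sel, slot].unsqueeze(-1) * self.experts[e](x[sel])
        return out

moe = TopKMoE()
print(moe(torch.randn(8, 4096)).shape)  # torch.Size([8, 4096])
```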

Same Indic Vocab as Sarvam 105B

The 262 K vocabulary is shared across the Sarvam family and is the key to the serving-cost story on Indic workloads. See the Sarvam 105B deep dive for a longer explanation of why large Indic vocabularies matter.
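One concrete cost of that vocabulary: at the 4,096 hidden size listed above, each vocabulary matrix alone exceeds a billion parameters. A quick check (262,144 is an assumed exact value behind the ≈262 K figure; whether input and output embeddings are tied is not stated here):

```python
vocab_size = 262_144   # assumption: exact value behind the ~262 K figure
hidden_size = 4_096    # from the spec table
emb_params = vocab_size * hidden_size
print(f"{emb_params / 1e9:.2f} B parameters per embedding matrix")  # -> 1.07 B
# Untied input + output embeddings would put ~2.1 B of the 30 B total
# in vocabulary matrices alone.
```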

Verdict: Indic Edge MoE

Sarvam 30B is Sarvam AI's edge-tier pick for Indic-language workloads that cannot afford frontier serving cost. At 2.4 B active compute per token, it should fit into consumer-GPU serving budgets while providing meaningful knowledge capacity in long-tail Indic languages. For general-purpose 30B use, Qwen3-30B-A3B is stronger; for Indic workloads at mobile scale, Sarvam 30B is the default.

