Dense · Verified
Cohere · 2026-02

Tiny Aya 3.35B

Compact multilingual model from Cohere with a rare parallel transformer block.

Tiny Aya 3.35B decoder block architecture. Attention: GQA with a 3:1 sliding-window-to-global layer mix. Normalization: RMSNorm. FFN: SwiGLU. Position encoding: RoPE. Scale: 3.35B parameters, 8K context, 36 layers. Decoder type: Dense.
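The block layout above is the now-standard pre-norm recipe: RMSNorm into attention, residual add, RMSNorm into a SwiGLU FFN, residual add. A minimal PyTorch sketch of one such block follows; the hidden size, head counts, and FFN width are illustrative assumptions (this page does not list them), GQA is shown as an explicit key/value repeat, and RoPE and the sliding-window mask are omitted for brevity.

```python
# Minimal pre-norm decoder block in the style described above.
# Dimensions are assumptions for illustration, not values from config.json.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the last dimension.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # Gated FFN: silu(gate(x)) * up(x), projected back down.
        return self.down(F.silu(self.gate(x)) * self.up(x))


class DecoderBlock(nn.Module):
    def __init__(self, dim=2048, n_heads=16, n_kv_heads=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.ffn = SwiGLU(dim, hidden=3 * dim)

    def forward(self, x):  # x: (batch, seq, dim)
        b, s, _ = x.shape
        h = self.attn_norm(x)
        q = self.q_proj(h).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # RoPE would rotate q and k here; omitted to keep the sketch short.
        # GQA: each K/V head serves a group of query heads, expanded by repeating.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, s, -1))
        return x + self.ffn(self.ffn_norm(x))


if __name__ == "__main__":
    out = DecoderBlock()(torch.randn(2, 16, 2048))
    print(out.shape)  # torch.Size([2, 16, 2048])
```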


Architecture Specifications

Parameters: 3.35B
Context Window: 8K
Decoder Type: Dense
Attention: GQA + 3:1 SWA
Release Date: 2026-02
Category: Efficient & Small
Organization: Cohere

Key Features

Grouped Query Attention · Sliding Window Attention · RoPE embeddings · Layer mix: 27 sliding-window + 9 global (interleave sketched below) · KV cache: ≈72 KiB/token
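The 27/9 split is stated on this page, but not the exact per-layer arrangement; a common choice is three sliding-window layers followed by one global layer, repeated across the stack. A minimal sketch under that assumption:

```python
# Hypothetical 3:1 interleave: three sliding-window layers, then one global
# layer, repeated over all 36 layers. Only the 27/9 split comes from this
# page; the actual pattern in the shipped config may differ.
N_LAYERS = 36

def layer_type(layer_idx: int) -> str:
    """Return 'sliding' or 'global' for a given layer index."""
    return "global" if (layer_idx + 1) % 4 == 0 else "sliding"

pattern = [layer_type(i) for i in range(N_LAYERS)]
assert pattern.count("sliding") == 27 and pattern.count("global") == 9
print(pattern[:8])  # ['sliding', 'sliding', 'sliding', 'global', ...]
```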

Deep Dive

Overview

Tiny Aya 3.35B is Cohere's February 2026 multilingual-first small model, part of the Aya research line that has historically focused on coverage of long-tail languages. At 3.35B dense parameters, it competes with Llama 3.2 3B, SmolLM3 3B, and Qwen3-4B in the edge-tier band. The distinctive design choices, per the shipped config.json, are a 3:1 sliding-window-to-global attention ratio and a very tight 8K context window, one of the smallest in any 2026 release.

Architecture at a Glance

| Parameter | Value | Notes |
|---|---|---|
| Total parameters | ≈3.35B | dense |
| Layers | 36 | 27 sliding-window + 9 global (3:1) |
| Attention | GQA + 3:1 SWA + RoPE | |
| KV cache | ≈72 KiB/token | |
| Max position | 8,192 | 8K native, intentionally small |
| Precision | bfloat16 | |
Tiny Aya 3.35B configuration (source: HuggingFace config.json)
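The ≈72 KiB/token KV-cache figure is easy to sanity-check. Only the 36-layer count and bfloat16 precision come from the table above; the KV-head count and head dimension below are assumptions picked to be consistent with it, and the real config.json values may differ.

```python
# Back-of-envelope check of the ~72 KiB/token KV-cache figure.
N_LAYERS = 36         # from the table above
BYTES = 2             # bfloat16, from the table above
N_KV_HEADS = 4        # assumed (GQA shares K/V across query-head groups)
HEAD_DIM = 128        # assumed

per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES   # factor 2 = K and V
print(per_token / 1024)  # 72.0 KiB per token
```

If the sliding window is shorter than the full 8,192-token context, the 27 SWA layers retain keys and values only for tokens inside the window, so peak cache memory lands below the naive 72 KiB × 8,192 ≈ 576 MiB worst case.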

Why 8K Context?

8K is an unusually short context window by 2026 standards; nearly every comparable small model ships at 128K or longer. The Aya team's focus is multilingual coverage, not long-context retrieval: the budget that would have gone into long-context pretraining instead goes into broader language-mix coverage (the Aya line covers 100+ languages) and higher-quality per-language data. For chat, translation, and summarization tasks in languages that competing 3B models handle poorly, this is a better budget allocation than an unused 128K window.

Verdict: The Multilingual Small Model

Tiny Aya 3.35B is the default 3B-class pick for multilingual workloads, especially any language outside the top 10 by training-data volume. Architecturally it is conservative and nothing in the config is novel, but Cohere's Aya line has consistently out-benchmarked general 3B models on long-tail languages, and the tight 8K context is a feature for the specific workload Aya targets, not a limitation.

