Ling 2.5 1T
Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.
Ling 2.5 1T decoder block architecture:
- Attention: Lightning Attention + MLA
- Normalization: RMSNorm
- FFN: Mixture of Experts (63 B active parameters)
- Position encoding: RoPE
- Scale: 1 T parameters, 256 K context, 80 layers
- Decoder type: MoE
Overview
Ling 2.5 1T is a 1 T total / 63 B active parameter sparse MoE from Ant Group's inclusionAI research team, released February 2026. It is one of a handful of open-weight trillion-parameter models, alongside Kimi K2 and Kimi K2.5. What makes Ling 2.5 distinctive is the attention stack: the shipped config.json specifies a 10-layer MLA + 70-layer Lightning Attention interleave, an aggressive hybrid that most labs have not yet committed to at frontier scale.
Architecture at a Glance
| Parameter | Value | Notes |
|---|---|---|
| Total parameters | ≈ 1 T | MoE |
| Active parameters | ≈ 63 B | per token — much denser than Kimi K2's 32B |
| Layers | 80 | 10 MLA + 70 Lightning Attention |
| Attention | MLA + Lightning Attention | hybrid |
| KV cache | ≈ 11.2 KiB/token | tiny — thanks to Lightning Attention |
| Max position | 262,144 | 256 K native |
| Vocabulary | ≈ 157,000 | |
| Precision | bfloat16 | |
Lightning Attention Dominates
Lightning Attention is a linear-attention variant — subquadratic attention cost in sequence length, with a fixed-size 'memory state' per layer rather than a growing KV cache. It is in the same family as Mamba-2, RWKV-7, and Kimi Linear's attention primitive (see the Kimi Linear deep dive for a longer explanation of how linear attention works).
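The fixed-size-state idea can be shown in a minimal causal linear-attention sketch (a toy illustration of the family, not the actual Lightning Attention kernel, which adds blocking and decay terms): each layer keeps one `(d_k, d_v)` state matrix that is updated per token, instead of appending keys and values to a growing cache.

```python
import numpy as np

def linear_attention(qs, ks, vs):
    """Causal linear attention with a fixed-size state.

    Per token t:
        S_t   = S_{t-1} + outer(k_t, v_t)   # state update, O(d_k * d_v)
        out_t = q_t @ S_t                    # read, independent of seq len
    """
    d_k, d_v = ks.shape[1], vs.shape[1]
    state = np.zeros((d_k, d_v))
    outs = []
    for q, k, v in zip(qs, ks, vs):
        state += np.outer(k, v)   # memory cost is fixed, not O(T)
        outs.append(q @ state)
    return np.stack(outs)

T, d = 6, 4
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, T, d))
out = linear_attention(q, k, v)
print(out.shape)  # (6, 4)
```

Because `out_t = q_t @ sum_{s<=t} outer(k_s, v_s)`, this is algebraically the same as softmax-free causal attention, but the per-token cost and memory no longer grow with sequence length.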
Ling 2.5's 10 MLA + 70 Lightning ratio is unusually heavy on linear attention: only ~12% of layers pay quadratic attention cost. This is why the KV cache is only ≈ 11.2 KiB/token at 1 T total parameters, roughly 6× smaller than Kimi K2's 68.6 KiB/token despite the comparable total parameter count. For very long context workloads (256 K native), Ling 2.5 should serve at much higher throughput than any pure-MLA trillion model.
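The ≈ 11.2 KiB/token figure is consistent with a back-of-envelope model in which only the 10 MLA layers cache anything per token. The latent dimensions below are assumptions (DeepSeek-style MLA values, not taken from the Ling config):

```python
# Hypothetical per-token KV-cache size for the hybrid stack.
# Assumptions: the 70 Lightning layers keep a fixed-size state (zero
# per-token cache); each MLA layer caches a compressed KV latent of
# rank 512 plus a 64-dim decoupled RoPE key, in bfloat16 (2 bytes).
mla_layers = 10
latent_dim = 512 + 64          # assumed compressed-KV + RoPE dims
bytes_per_elem = 2             # bfloat16
per_token_bytes = mla_layers * latent_dim * bytes_per_elem
print(per_token_bytes / 1024)  # 11.25 KiB/token
```

Under these assumed dimensions the arithmetic lands almost exactly on the published ≈ 11.2 KiB/token, which is at least suggestive of how the number was derived.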
63B Active: Denser than Peer MoEs
Most trillion-class MoEs keep active compute low — Kimi K2 and K2.5 both run 32 B active. Ling 2.5 runs 63 B active, nearly double, which means its per-token compute cost is closer to that of a dense 63 B decoder than a dense 32 B one. The tradeoff: Ling 2.5 is more expensive to serve per token, but each token gets more expert compute, which shows up in reasoning benchmarks. This is a deliberately different bet from Kimi's 'scale total parameters, keep active tight' posture.
Verdict: The Linear-Attention Trillion
Ling 2.5 1T is the most aggressive linear-attention / MLA hybrid at trillion-parameter scale. For teams willing to adopt a non-standard attention kernel (Lightning Attention needs custom CUDA to hit its theoretical throughput), the serving economics at 256 K context are unmatched by any pure-transformer peer. Read together with the Kimi Linear and xLSTM 7B deep dives for the broader 'post-transformer' research landscape.