MoE · Verified
inclusionAI (Ant Group) · 2026-02

Ling 2.5 1T

Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.

Ling 2.5 1T decoder block architecture: Lightning Attention + MLA attention, RMSNorm normalization, Mixture of Experts FFN (63B active parameters), RoPE position encoding; 1T total parameters, 256K context, 80 layers.


Architecture Specifications

Parameters: 63B active / 1T total
Context Window: 256K
Decoder Type: MoE
Attention: Lightning Attention plus MLA
Active Parameters: 63B
Layers: 80
Hidden Size: 8,192
Vocabulary Size: 157K
Release Date: 2026-02
Category: Mixture of Experts
Organization: inclusionAI (Ant Group)

Key Features

Multi-head Latent Attention
Layer mix: 10 MLA + 70 Lightning Attention
KV cache: 11.2 KiB/token

Deep Dive

Overview

Ling 2.5 1T is a 1 T total / 63 B active parameter sparse MoE from Ant Group's inclusionAI research team, released February 2026. It is one of the handful of open-weight trillion-parameter models alongside Kimi K2 and Kimi K2.5. What makes Ling 2.5 distinctive is the attention stack: the shipped config.json shows a 10-layer MLA + 70-layer Lightning Attention interleave, an aggressive hybrid that most labs have not yet committed to at frontier scale.
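The config.json layer schedule is not reproduced here, but one full-attention layer every eighth position reproduces the 10/70 split exactly over 80 layers. A minimal sketch of such an interleave, where the period and the field names are illustrative assumptions rather than the actual config keys:

```python
# Sketch: deriving a per-layer attention schedule for an 80-layer hybrid stack.
# Assumption: one MLA (full-attention) layer every 8 layers; the shipped
# config.json may encode this differently (e.g., an explicit per-layer type list).

NUM_LAYERS = 80
FULL_ATTN_EVERY = 8  # hypothetical interleave period

layer_types = [
    "mla" if (i % FULL_ATTN_EVERY) == (FULL_ATTN_EVERY - 1) else "lightning"
    for i in range(NUM_LAYERS)
]

assert layer_types.count("mla") == 10
assert layer_types.count("lightning") == 70
```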

Architecture at a Glance

Total parameters: ≈ 1T (MoE)
Active parameters: ≈ 63B per token, much denser than Kimi K2's 32B
Layers: 80 (10 MLA + 70 Lightning Attention)
Attention: MLA + Lightning Attention hybrid
KV cache: ≈ 11.2 KiB/token, kept small by the Lightning Attention layers
Max position: 262,144 (256K native)
Vocabulary: ≈ 157,000
Precision: bfloat16

Source: Ling 2.5 1T configuration (HuggingFace config.json)

Lightning Attention Dominates

Lightning Attention is a linear-attention variant: attention cost grows subquadratically with sequence length, and each layer keeps a fixed-size 'memory state' rather than a growing KV cache. It is in the same family as Mamba-2, RWKV-7, and Kimi Linear's attention primitive (see the Kimi Linear deep dive for a longer explanation of how linear attention works).
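To make the fixed-size state concrete, here is a generic linear-attention recurrence in plain NumPy. This is a sketch of the family's basic update, not the production Lightning Attention kernel (which tiles the computation for hardware efficiency); the decay value is an arbitrary placeholder:

```python
import numpy as np

def linear_attention(q, k, v, decay=0.99):
    """Generic linear-attention recurrence (illustrative, not the Lightning kernel).

    q, k, v: (seq_len, d) arrays. The running state S is (d, d) and does not
    grow with sequence length, unlike a softmax-attention KV cache.
    """
    seq_len, d = q.shape
    S = np.zeros((d, d))          # fixed-size memory state
    out = np.empty_like(v)
    for t in range(seq_len):
        S = decay * S + np.outer(k[t], v[t])   # fold token t into the state
        out[t] = q[t] @ S                      # read out with the query
    return out
```

The per-token cost and memory stay constant as context grows, which is what lets 70 of the 80 layers avoid a KV cache entirely.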

Ling 2.5's 10 MLA + 70 Lightning ratio is unusually heavy on linear attention: only ~12% of layers pay quadratic attention cost. This is why the KV cache is only ≈ 11.2 KiB/token, roughly 6× smaller than Kimi K2's 68.6 KiB despite identical total parameter count. For very long-context workloads (256K native), Ling 2.5 should serve at much higher throughput than any pure-MLA trillion model.
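The ≈ 11.2 KiB figure is consistent with only the 10 MLA layers caching anything. A back-of-envelope check, assuming DeepSeek-style MLA caching with a 512-dim compressed KV latent plus a 64-dim decoupled RoPE key per layer in bfloat16 (these widths are assumptions, not values stated above):

```python
# Back-of-envelope KV cache per token: only the MLA layers cache state, and each
# stores a compressed KV latent plus a RoPE key component per token (assumed dims).
MLA_LAYERS = 10          # Lightning Attention layers keep no growing cache
KV_LATENT_DIM = 512      # assumed compressed KV latent width
ROPE_KEY_DIM = 64        # assumed decoupled RoPE key width
BYTES_PER_ELEM = 2       # bfloat16

kv_bytes_per_token = MLA_LAYERS * (KV_LATENT_DIM + ROPE_KEY_DIM) * BYTES_PER_ELEM
print(kv_bytes_per_token / 1024)   # ≈ 11.25 KiB, close to the quoted 11.2 KiB/token
```

At the full 256K context that works out to roughly 2.8 GiB of KV cache per sequence, versus about 17 GiB at Kimi K2's 68.6 KiB/token, which is where the serving-throughput argument comes from.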

63B Active: Denser than Peer MoEs

Most trillion-class MoEs keep active compute low; Kimi K2 and K2.5 both run 32 B active. Ling 2.5 runs 63 B active, nearly double, which means per-token compute cost is closer to that of a dense 63 B decoder than a dense 32 B one. The tradeoff: Ling 2.5 is more expensive to serve per token, but each token gets more expert compute, which shows up in reasoning benchmarks. This is a deliberately different bet from Kimi's 'scale total parameters, keep active tight' posture.
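Using the common rule of thumb that decode compute is about two FLOPs per active parameter per token (an approximation that ignores attention cost, which the Lightning-heavy stack keeps small anyway), the gap looks like this:

```python
# Rough per-token decode compute: ~2 FLOPs per active parameter (rule of thumb).
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

ling = flops_per_token(63e9)    # ≈ 1.26e11 FLOPs/token
kimi = flops_per_token(32e9)    # ≈ 6.4e10 FLOPs/token
print(ling / kimi)              # ≈ 1.97x more compute spent per token
```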

Verdict: The Linear-Attention Trillion

Ling 2.5 1T is the most aggressive linear-attention / MLA hybrid at trillion-parameter scale. For teams willing to adopt a non-standard attention kernel (Lightning Attention needs custom CUDA to hit its theoretical throughput), the serving economics at 256 K context are unmatched by any pure-transformer peer. Read together with the Kimi Linear and xLSTM 7B deep dives for the broader 'post-transformer' research landscape.

