Kavli Affiliate: Feng Wang
| First 5 Authors: Kimi Team
| Summary:
We introduce Kimi Linear, a hybrid linear attention architecture that, for
the first time, outperforms full attention under fair comparisons across
various scenarios, including short-context, long-context, and reinforcement
learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an
expressive linear attention module that extends Gated DeltaNet with a
finer-grained gating mechanism, enabling more effective use of limited
finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware
efficiency through a specialized variant of Diagonal-Plus-Low-Rank (DPLR)
transition matrices, which substantially reduces computation compared to the
general DPLR formulation while remaining more consistent with the classical
delta rule.
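
As a rough illustration of what "finer-grained gating" on top of the delta rule can look like, the sketch below runs one recurrent step of a gated delta-rule memory in NumPy. It is a minimal sketch under stated assumptions, not the released KDA kernel: the function name `gated_delta_step`, the state layout `(d_k, d_v)`, and the exact update form (per-channel decay followed by a delta-rule correction) are assumptions made here for illustration, and the chunkwise algorithm described above is not reproduced.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta-rule memory (illustrative only).

    S     : (d_k, d_v) fixed-size state (the "finite-state RNN memory")
    q, k  : (d_k,) query and key
    v     : (d_v,) value
    alpha : (d_k,) per-channel decay gate in [0, 1]; a scalar gate is the
            special case where every entry of alpha is equal (assumption)
    beta  : scalar write strength in [0, 1]
    """
    # Per-channel decay: each key channel forgets at its own rate.
    S = alpha[:, None] * S
    # What the decayed memory currently returns for key k.
    pred = S.T @ k
    # Delta rule: write only the error between the target v and the prediction.
    S = S + beta * np.outer(k, v - pred)
    # Read out with the query.
    return S, S.T @ q

# Usage: write one (k, v) pair into an empty memory, then read it back.
d_k, d_v = 4, 3
k = np.array([1.0, 0.0, 0.0, 0.0])   # unit-norm key
v = np.array([0.5, -1.0, 2.0])
S0 = np.zeros((d_k, d_v))
S1, out = gated_delta_step(S0, q=k, k=k, v=v, alpha=np.ones(d_k), beta=1.0)
```

With no decay (`alpha` all ones), full write strength, and a unit-norm key, querying with the same key returns the stored value. Setting individual entries of `alpha` below one decays only the corresponding key channels, which is the sense in which a channel-wise gate is finer-grained than a single scalar gate.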
We pretrain a Kimi Linear model with 3B activated parameters and 48B total
parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention
(MLA). Our experiments show that with an identical training recipe, Kimi Linear
outperforms full MLA by a sizeable margin across all evaluated tasks, while
reducing KV cache usage by up to 75% and achieving up to 6 times the decoding
throughput for a 1M context. These results demonstrate that Kimi Linear can be
a drop-in replacement for full attention architectures with superior
performance and efficiency, including on tasks with longer input and output
lengths.
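
The "up to 75%" KV cache figure can be sanity-checked with simple arithmetic. The sketch below assumes, hypothetically, that three of every four layers are KDA layers with a constant-size state and one is a full-attention layer whose cache grows with context; this ratio is not stated in the summary above and is chosen only because it is consistent with the reported figure. The function name and the `state_tokens_equiv` parameter are likewise illustrative.

```python
def kv_cache_reduction(n_linear, n_full, context_len, state_tokens_equiv=0.0):
    """Fraction of cache saved versus a stack of only full-attention layers.

    n_linear           : number of linear-attention layers (constant-size state)
    n_full             : number of full-attention layers (cache grows with context)
    context_len        : context length in tokens
    state_tokens_equiv : size of one linear layer's state, expressed as an
                         equivalent number of cached tokens (hypothetical knob)
    """
    baseline = (n_linear + n_full) * context_len
    hybrid = n_full * context_len + n_linear * state_tokens_equiv
    return 1.0 - hybrid / baseline

# Assumed 3:1 layer ratio, ignoring the constant-size state.
long_ctx = kv_cache_reduction(3, 1, 1_000_000)

# Same ratio at a short context where the constant state is not negligible.
short_ctx = kv_cache_reduction(3, 1, 4_000, state_tokens_equiv=1_000)
```

Under these assumptions the saving approaches 75% only as the context grows, since the constant-size state stops mattering relative to the per-token cache. That would explain the "up to" qualifier.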
To support further research, we open-source the KDA kernel and vLLM
implementations, and release the pre-trained and instruction-tuned model
checkpoints.
| Search Query: ArXiv Query: search_query=au:"Feng Wang"&id_list=&start=0&max_results=3