SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction

Kavli Affiliate: Wei Gao

| First 5 Authors: Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao

| Summary:

Recent advancements in large language models (LLMs) have extended their
capabilities to handle long contexts. However, increasing the number of model
layers and the length of input sequences significantly escalates the memory
required to store the key-value (KV) cache, posing challenges for efficient
inference. To mitigate this issue, we present SimLayerKV, a simple yet
effective method that reduces inter-layer KV cache redundancies by selectively
dropping cache in identified lazy layers. Our approach is based on the
observation that certain layers in long-context LLMs exhibit "lazy" behavior,
contributing less to modeling long-range dependencies compared to non-lazy
layers. By analyzing attention weight patterns, we find that the behavior of
these lazy layers is consistent across tokens during generation for a given
input. This insight motivates our SimLayerKV, which identifies lazy layers and
reduces their KV cache accordingly. SimLayerKV is training-free, generalizable,
and can be implemented with only seven lines of code. We conduct extensive
experiments on three representative LLMs, i.e., LLaMA2-7B, LLaMA3-8B, and
Mistral-7B across 16 tasks from the LongBench benchmark. The results
demonstrate that SimLayerKV achieves a KV cache compression ratio of 5$\times$
with only a 1.2% performance drop when combined with 4-bit quantization. Our
code is available at https://github.com/sail-sg/SimLayerKV.
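The abstract does not spell out the lazy-layer criterion, but the layer-level idea can be illustrated with a minimal PyTorch sketch: treat a layer as "lazy" when the current token's attention mass concentrates on the initial and most recent positions, and if so keep only those positions in that layer's KV cache. The function names, thresholds (`LAZY_THRESHOLD`, `NUM_INITIAL`, `NUM_RECENT`), and the exact classification rule below are illustrative assumptions, not the paper's implementation; see the linked repository for the authors' code.

```python
import torch

# Hypothetical values for illustration; the paper's criterion and settings may differ.
LAZY_THRESHOLD = 0.9   # fraction of attention mass on initial + recent positions
NUM_INITIAL = 4        # number of initial (sink) tokens to keep
NUM_RECENT = 1024      # number of most recent tokens to keep


def is_lazy_layer(attn_weights: torch.Tensor) -> bool:
    """Classify a layer as lazy if, for the current query token, most of its
    attention mass falls on the initial and most recent positions.

    attn_weights: [num_heads, seq_len] attention of the last query over the context.
    """
    seq_len = attn_weights.shape[-1]
    initial = attn_weights[:, :NUM_INITIAL].sum(dim=-1)
    # Start the recent window after the initial tokens to avoid double counting.
    recent = attn_weights[:, max(seq_len - NUM_RECENT, NUM_INITIAL):].sum(dim=-1)
    # Average over heads; lazy when the combined mass exceeds the threshold.
    return bool((initial + recent).mean() >= LAZY_THRESHOLD)


def trim_kv_cache(key: torch.Tensor, value: torch.Tensor):
    """Reduce a lazy layer's KV cache to the initial and most recent tokens.

    key, value: [batch, num_heads, seq_len, head_dim]
    """
    seq_len = key.shape[2]
    if seq_len <= NUM_INITIAL + NUM_RECENT:
        return key, value
    keep = torch.cat([
        torch.arange(NUM_INITIAL),
        torch.arange(seq_len - NUM_RECENT, seq_len),
    ])
    return key[:, :, keep, :], value[:, :, keep, :]
```

In a decoding loop, each layer would be classified once per input (the abstract notes that lazy behavior is consistent across generated tokens for a given input), lazy layers would have their caches trimmed, and non-lazy layers would keep their full KV cache.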

| Search Query: ArXiv Query: search_query=au:"Wei Gao"&id_list=&start=0&max_results=3
