Kavli Affiliate: Wei Gao
| First 5 Authors: Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang
| Summary:
Due to the auto-regressive nature of current video large language models
(Video-LLMs), the inference latency increases as the input sequence length
grows, posing challenges for the efficient processing of video sequences that
are usually very long. We observe that during decoding, the attention scores of
most tokens in Video-LLMs tend to be sparse and concentrated, with only certain
tokens requiring comprehensive full attention. Based on this insight, we
introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two
distinct modules: one leveraging sparse top-K attention and the other employing
dense full attention. These modules collaborate to accelerate Video-LLMs
with no loss in output quality. The fast (sparse) model speculatively decodes multiple tokens,
while the slow (dense) model verifies them in parallel. StD is a tuning-free,
plug-and-play solution that achieves up to a 1.94$\times$ wall-time speedup in
video processing. It maintains model performance while enabling a seamless
transition from a standard Video-LLM to a sparse Video-LLM with minimal code
modifications.
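
The draft-then-verify loop described in the summary can be sketched in a few lines. In the toy model below, `dense_next` stands in for the slow full-attention model and `sparse_next` for the fast top-K-attention drafter; both function names, the drift rule, and `gamma` are illustrative assumptions, not the paper's implementation. The key property of this style of speculative decoding is that the output matches pure dense greedy decoding exactly.

```python
def dense_next(ctx):
    # Stand-in for the slow dense (full-attention) model: a deterministic
    # toy next-token rule over integer "tokens".
    return (sum(ctx) * 31 + 7) % 100

def sparse_next(ctx):
    # Stand-in for the fast sparse (top-K attention) drafter: mostly agrees
    # with the dense model, with occasional simulated drift.
    return dense_next(ctx) if sum(ctx) % 5 else (dense_next(ctx) + 1) % 100

def dense_decode(ctx, num_tokens):
    # Baseline: pure dense greedy decoding.
    out = list(ctx)
    for _ in range(num_tokens):
        out.append(dense_next(out))
    return out[len(ctx):]

def speculative_decode(ctx, num_tokens, gamma=4):
    # StD-style loop: draft gamma tokens with the sparse model, then let
    # the dense model verify them and accept the longest matching prefix.
    out = list(ctx)
    while len(out) - len(ctx) < num_tokens:
        draft, tmp = [], list(out)
        for _ in range(gamma):
            t = sparse_next(tmp)
            draft.append(t)
            tmp.append(t)
        # Verification: here a loop, but on a GPU all gamma positions are
        # checked in one batched forward pass of the dense model.
        accepted = 0
        for i, t in enumerate(draft):
            if dense_next(out + draft[:i]) == t:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        if accepted < gamma:
            # On the first mismatch, emit the dense model's token instead.
            out.append(dense_next(out))
    return out[len(ctx) : len(ctx) + num_tokens]
```

Because every accepted draft token equals the dense model's greedy choice at that position, `speculative_decode` reproduces `dense_decode` token for token; the speedup comes from amortizing dense forward passes over several drafted tokens.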
| Search Query: ArXiv Query: search_query=au:"Wei Gao"&id_list=&start=0&max_results=3