Dynamic Vision Mamba

Kavli Affiliate: Zheng Zhu

| First 5 Authors: Mengxuan Wu, Zekai Li, Zhiyuan Liang, Moyang Li, Xuanlei Zhao

| Summary:

Mamba-based vision models have attracted extensive attention because they are
computationally more efficient than attention-based models. However, spatial
redundancy still exists in these models, in the form of token and block
redundancy. For token redundancy, we analytically find that early token pruning
methods either cause inconsistency between training and inference or
introduce extra computation at inference time. Therefore, we customize token
pruning to fit the Mamba structure by rearranging the pruned sequence before
feeding it into the next Mamba block. For block redundancy, we allow each image
to select SSM blocks dynamically based on an empirical observation that the
inference speed of Mamba-based vision models is largely affected by the number
of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively
reduces FLOPs with only minor performance drops. On Vim-S, we achieve a 35.2%
FLOPs reduction with only a 1.7% accuracy drop. DyVM also generalizes well
across different Mamba vision model architectures and different vision tasks.
Our code will be made public.
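The token-pruning idea above can be illustrated with a minimal sketch: keep only the highest-scoring tokens, then re-sort the kept indices so the surviving tokens enter the next Mamba block in their original sequence order (preserving the scan order the SSM relies on). The function name, scoring interface, and `keep_ratio` parameter are illustrative assumptions, not DyVM's actual API.

```python
import numpy as np

def prune_and_rearrange(tokens, scores, keep_ratio=0.7):
    """Illustrative sketch (not DyVM's real implementation).

    tokens: (L, D) token sequence; scores: (L,) importance scores.
    Keeps the top-k tokens by score, then restores their original
    sequence order before they are fed to the next Mamba block.
    """
    L = tokens.shape[0]
    k = max(1, int(L * keep_ratio))
    kept = np.argsort(scores)[-k:]  # indices of the k highest-scoring tokens
    kept = np.sort(kept)            # rearrange: back to original scan order
    return tokens[kept], kept

# Toy example: 6 tokens of dimension 2
tokens = np.arange(12, dtype=float).reshape(6, 2)
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
pruned, idx = prune_and_rearrange(tokens, scores, keep_ratio=0.5)
print(idx)  # kept positions in original order: [0 2 4]
```

The final sort is the key step: a plain top-k gather would emit tokens in score order, which scrambles the sequential scan that state-space blocks depend on.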

