Autoregressive Pretraining with Mamba in Vision

Kavli Affiliate: Feng Wang

| First 5 Authors: Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu

| Summary:

The vision community has started to adopt the recently developed state
space model, Mamba, as a new backbone for a range of tasks. This paper shows
that Mamba’s visual capability can be significantly enhanced through
autoregressive pretraining, a direction not previously explored.
Efficiency-wise, the autoregressive objective capitalizes well on Mamba's
unidirectional recurrent structure, enabling faster overall training than
other strategies such as masked modeling. Performance-wise,
autoregressive pretraining equips the Mamba architecture with markedly higher
accuracy than its supervised-trained counterparts and, more importantly,
successfully unlocks its scaling potential to large and even huge model sizes.
For example, with autoregressive pretraining, a base-size Mamba attains 83.2%
ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our
huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet
accuracy (85.5% when finetuned with $384\times384$ inputs), notably surpassing
all other Mamba variants in vision. The code is available at
https://github.com/OliverRensu/ARM.
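
To make the idea concrete, here is a minimal sketch of autoregressive next-patch pretraining in Python/PyTorch. It is not the paper's implementation: the names (`patchify`, `ARPretrainer`), the pixel-regression loss, and the use of `nn.GRU` as a stand-in for a stack of unidirectional Mamba blocks are all illustrative assumptions; the point is only that a causal, unidirectional sequence model can be trained to predict each image patch from the patches before it in raster-scan order.

```python
# Minimal sketch of autoregressive next-patch pretraining (illustrative only).
# nn.GRU is a stand-in for a unidirectional Mamba-style backbone.
import torch
import torch.nn as nn


def patchify(images, patch_size=16):
    """Split (B, C, H, W) images into a (B, N, C*p*p) patch sequence in raster-scan order."""
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/p, W/p, p, p) -> (B, N, C*p*p)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)


class ARPretrainer(nn.Module):
    """Unidirectional encoder trained to predict the next patch from all previous ones."""

    def __init__(self, patch_dim=768, embed_dim=512):
        super().__init__()
        self.embed = nn.Linear(patch_dim, embed_dim)
        # Any causal/unidirectional sequence model fits here; the paper uses Mamba blocks.
        self.backbone = nn.GRU(embed_dim, embed_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(embed_dim, patch_dim)

    def forward(self, patches):
        # Inputs are patches 0..N-2, targets are patches 1..N-1 (next-patch prediction).
        inputs, targets = patches[:, :-1], patches[:, 1:]
        hidden, _ = self.backbone(self.embed(inputs))
        return nn.functional.mse_loss(self.head(hidden), targets)


if __name__ == "__main__":
    model = ARPretrainer()
    images = torch.randn(4, 3, 224, 224)   # dummy batch; 224/16 = 14 -> 196 patches per image
    loss = model(patchify(images))
    loss.backward()
    print(f"next-patch MSE loss: {loss.item():.4f}")
```

Because the backbone only ever attends to past patches, each training step is a single left-to-right pass, which is what lets the autoregressive objective exploit Mamba's recurrent structure more efficiently than bidirectional objectives like masked modeling.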

| Search Query: ArXiv Query: search_query=au:"Feng Wang"&id_list=&start=0&max_results=3
