Causal Image Modeling for Efficient Visual Understanding – Kavli Institute Pre-Print Publications

Kavli Affiliate: Feng Wang

| First 5 Authors: Feng Wang, Timing Yang, Yaodong Yu, Sucheng Ren, Guoyizhe Wei

| Summary:

In this work, we present a comprehensive analysis of causal image modeling
and introduce the Adventurer series of models, which treat images as sequences
of patch tokens and employ uni-directional language models to learn visual
representations. This modeling paradigm allows us to process images in a
recurrent formulation with linear complexity relative to the sequence length,
which can effectively address the memory and computation explosion issues posed
by high-resolution and fine-grained images. In detail, we introduce two simple
designs that seamlessly integrate image inputs into the causal inference
framework: a global pooling token placed at the beginning of the sequence and a
flipping operation between every two layers. Extensive empirical studies
demonstrate the significant efficiency and effectiveness of this causal image
modeling paradigm. For example, our base-sized Adventurer model attains a
competitive test accuracy of 84.0% on the standard ImageNet-1k benchmark with
216 images/s training throughput, making it 5.3 times more efficient than vision
transformers at reaching the same result.
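
The two designs named in the summary can be illustrated with a minimal, dependency-free sketch. It is not the Adventurer implementation, and everything beyond the summary's wording is an assumption made for illustration: the global pooling token is taken to be the mean of the patch tokens, the flip is applied between each pair of adjacent layers and keeps the global token in front, and `causal_mixer` is a hypothetical stand-in (a running mean) for a real uni-directional layer. All function names are hypothetical.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into row-major, flattened patch tokens."""
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def add_global_pooling_token(tokens):
    """Prepend a global token; mean pooling is an assumed choice."""
    n = len(tokens)
    pooled = [sum(column) / n for column in zip(*tokens)]
    return [pooled] + tokens


def causal_mixer(tokens):
    """Stand-in uni-directional layer: output i is the mean of inputs 0..i.

    One left-to-right pass with a fixed-size running state, so the cost
    grows linearly with the sequence length.
    """
    running = [0.0] * len(tokens[0])
    out = []
    for count, token in enumerate(tokens, start=1):
        running = [r + t for r, t in zip(running, token)]
        out.append([r / count for r in running])
    return out


def flip_patches(tokens):
    """Reverse the patch order while keeping the first (global) token in front.

    Keeping the global token fixed is an assumption of this sketch.
    """
    return [tokens[0]] + tokens[:0:-1]


def forward(image, patch=2, num_layers=4):
    """Run the stand-in layers, flipping the patch order between adjacent layers."""
    tokens = add_global_pooling_token(patchify(image, patch))
    for layer in range(num_layers):
        tokens = causal_mixer(tokens)
        if layer < num_layers - 1:
            tokens = flip_patches(tokens)
    return tokens
```

In this sketch the prepended token gives every position a summary of the whole image, which a strictly left-to-right layer would otherwise not see, and the flip lets alternating layers scan the patches in opposite directions.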

| Search Query: ArXiv Query: search_query=au:"Feng Wang"&id_list=&start=0&max_results=3