ViT-P: Rethinking Data-efficient Vision Transformers from Locality

Kavli Affiliate: Ran Wang

| First 5 Authors: Bin Chen, Ran Wang, Di Ming, Xin Feng

| Summary:

Recent advances in Transformers have brought new momentum to computer vision
tasks. On small datasets, however, Transformers are hard to train and
underperform convolutional neural networks. We make vision transformers as
data-efficient as convolutional neural networks by introducing a multi-focal
attention bias. Inspired by the attention distances observed in a well-trained
ViT, we constrain the self-attention of ViT to multi-scale localized receptive
fields. The size of each receptive field is adaptable during training, so the
optimal configuration can be learned. We provide empirical evidence that
properly constraining the receptive field reduces the amount of training data
required by vision transformers. On CIFAR-100, our ViT-P Base model achieves
state-of-the-art accuracy (83.16%) when trained from scratch. We also perform
an analysis on ImageNet to show that our method does not lose accuracy on
large datasets.
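The mechanism described above amounts to an additive bias on the self-attention logits that restricts each head to a local neighborhood of patches, with different heads covering different scales. The PyTorch sketch below is an illustrative reconstruction under stated assumptions, not the authors' implementation: the per-head window sizes, the hard masking (the paper's adaptable receptive field would require a learnable, soft variant), and the omission of the class token are all simplifications introduced here.

# Illustrative sketch of a multi-scale local attention bias for ViT patches.
# Assumptions (not from the paper): hard -inf masking, example window sizes,
# class token left out for simplicity.
import torch

def local_attention_bias(num_patches_side: int, window: int) -> torch.Tensor:
    """Additive bias that blocks attention between patches farther apart
    than `window` in patch-grid (Chebyshev) distance."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(num_patches_side),
        torch.arange(num_patches_side),
        indexing="ij"), dim=-1).reshape(-1, 2)            # (N, 2) patch coordinates
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)  # (N, N) distances
    bias = torch.zeros(dist.shape)
    bias[dist > window] = float("-inf")                   # mask pairs outside the window
    return bias                                           # (N, N)

def multi_focal_bias(num_patches_side: int, windows_per_head: list[int]) -> torch.Tensor:
    """Stack one bias per head so each head gets its own receptive-field scale."""
    return torch.stack([local_attention_bias(num_patches_side, w)
                        for w in windows_per_head])       # (heads, N, N)

# Example: 14x14 patch grid (ViT-B/16 at 224px), heads with growing windows.
bias = multi_focal_bias(14, windows_per_head=[1, 2, 3, 5, 7, 14, 14, 14])
# Inside attention: logits = q @ k.transpose(-2, -1) / scale + bias  (broadcast over batch)

In this sketch the bias is fixed; making the window sizes learnable during training, as the summary describes, would replace the hard mask with a differentiable decay over distance.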

| Search Query: ArXiv Query: search_query=au:"Ran Wang"&id_list=&start=0&max_results=10
