Token-Label Alignment for Vision Transformers – Kavli Institute Pre-Print Publications

Kavli Affiliate: Zheng Zhu

| First 5 Authors: Han Xiao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu

| Summary:

Data mixing strategies (e.g., CutMix) have shown the ability to greatly
improve the performance of convolutional neural networks (CNNs). They mix two
images into one training input and assign it a mixed label with the same
ratio. Although they are also effective for vision transformers (ViTs), we
identify a token fluctuation phenomenon that has suppressed the potential of
data mixing strategies. We empirically observe that the contributions of input
tokens fluctuate during forward propagation, which might induce a different mixing
ratio in the output tokens. The training target computed by the original data
mixing strategy can thus be inaccurate, resulting in less effective training.
To address this, we propose a token-label alignment (TL-Align) method to trace
the correspondence between transformed tokens and the original tokens to
maintain a label for each token. We reuse the computed attention at each layer
for efficient token-label alignment, introducing only negligible additional
training costs. Extensive experiments demonstrate that our method improves the
performance of ViTs on image classification, semantic segmentation, object
detection, and transfer learning tasks. Code is available at:
https://github.com/Euphoria16/TL-Align.
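
The idea of keeping a label for each token can be illustrated with a small toy
example. This is only a minimal sketch and not the authors' implementation: it
assumes that each input token starts with the label of its source image and
that one attention layer propagates labels using the same row-normalized
attention weights it applies to the tokens. The function names and the
attention values are hypothetical.

```python
def mix_labels(label_a, label_b, lam):
    """CutMix-style image-level target: lam * label_a + (1 - lam) * label_b."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]


def align_token_labels(attention, token_labels):
    """Propagate per-token labels through one attention layer (assumed rule).

    attention:    N x N matrix whose rows each sum to 1.
    token_labels: N x C matrix, one label distribution per input token.
    Output token i receives sum_j attention[i][j] * token_labels[j].
    """
    num_classes = len(token_labels[0])
    return [
        [sum(w * lab[c] for w, lab in zip(row, token_labels))
         for c in range(num_classes)]
        for row in attention
    ]


# Toy setup: 2 classes, 4 tokens; tokens 0-2 come from image A, token 3 from B.
label_a, label_b = [1.0, 0.0], [0.0, 1.0]
token_labels = [label_a, label_a, label_a, label_b]

# Area-based target used by plain data mixing: 3 of 4 tokens belong to image A.
image_label = mix_labels(label_a, label_b, lam=0.75)

# Hypothetical attention in which every token attends mostly to token 3.
attention = [[0.1, 0.1, 0.1, 0.7] for _ in range(4)]
aligned = align_token_labels(attention, token_labels)

print("area-based target:", image_label)
print("aligned label of token 0:", aligned[0])
```

In this toy case the area-based target favors image A, while the propagated
token labels favor image B, which is the kind of mismatch in mixing ratio the
summary describes.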

| Search Query: ArXiv Query: search_query=au:"Zheng Zhu"&id_list=&start=0&max_results=10