ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Kavli Affiliate: Li Xin Li

| First 5 Authors: Yifan Li, Xin Li, Tianqin Li, Wenbin He, Yu Kong

| Summary:

Vision foundation models (VFMs) have demonstrated remarkable performance
across a wide range of downstream tasks. While several VFM adapters have shown
promising results by leveraging the prior knowledge of VFMs, we identify two
inefficiencies in these approaches. First, the interaction between the
convolutional neural network (CNN) branch and the VFM backbone forces gradients
to backpropagate through the VFM's early layers. Second, existing methods
require tuning all components, adding complexity. Moreover, these adapters
alter the VFM features, underutilizing the prior knowledge. To tackle these
challenges, we propose a new approach called ViT-Split, based on a key
observation: the layers of several VFMs, such as DINOv2, can be divided into
two distinct components, an extractor that learns low-level features and an
adapter that learns task-specific features. Leveraging this insight, we
eliminate the CNN branch and attach two heads, a task head and a prior head, to
the frozen VFM. The task head learns task-specific features, mitigating the
early gradient propagation issue. The prior head leverages multi-scale prior
features from the frozen VFM, reducing the number of tuned parameters and the
risk of overfitting.
Extensive experiments on various tasks (e.g., segmentation, detection, depth
estimation, and visual question answering) validate the effectiveness and
efficiency of ViT-Split. Specifically, ViT-Split reduces training time by up to
$4\times$ while achieving comparable or even better results on ADE20K, compared
to other VFM adapters.
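
The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the two-head idea, assuming a timm-style ViT that exposes `patch_embed` and `blocks`; the split depth, head designs, and feature fusion are illustrative placeholders, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class ViTSplitSketch(nn.Module):
    """Illustrative two-head design: a frozen ViT backbone feeds a trainable
    task head and a lightweight prior head that fuses multi-scale frozen
    features. Layer indices and head sizes are placeholders, not the paper's
    exact configuration."""

    def __init__(self, frozen_vit, embed_dim, num_classes,
                 split_at=8, prior_layers=(3, 6, 9, 12)):
        super().__init__()
        self.backbone = frozen_vit
        for p in self.backbone.parameters():
            p.requires_grad = False          # freeze the VFM entirely
        self.split_at = split_at             # depth where extractor features are tapped
        self.prior_layers = set(prior_layers)

        # Task head: a few trainable transformer blocks on top of the frozen extractor.
        block = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.task_head = nn.TransformerEncoder(block, num_layers=2)

        # Prior head: light fusion of multi-scale features from the frozen backbone.
        self.prior_head = nn.Linear(embed_dim * len(prior_layers), embed_dim)
        self.classifier = nn.Linear(embed_dim * 2, num_classes)

    def forward(self, x):
        prior_feats = []
        with torch.no_grad():                       # no gradients reach early VFM layers
            tokens = self.backbone.patch_embed(x)   # cls-token / pos-embed handling omitted
            extractor_tokens = tokens
            for i, blk in enumerate(self.backbone.blocks, start=1):
                tokens = blk(tokens)
                if i == self.split_at:
                    extractor_tokens = tokens       # extractor output for the task head
                if i in self.prior_layers:
                    prior_feats.append(tokens)      # multi-scale frozen features

        task_feat = self.task_head(extractor_tokens)                  # task-specific features
        prior_feat = self.prior_head(torch.cat(prior_feats, dim=-1))  # fused prior features
        fused = torch.cat([task_feat.mean(1), prior_feat.mean(1)], dim=-1)
        return self.classifier(fused)
```

Because the backbone forward runs under `torch.no_grad()`, only the two heads receive gradients, which reflects the abstract's claims about avoiding early-layer backpropagation and reducing the number of tuned parameters.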

| Search Query: ArXiv Query: search_query=au:"Li Xin Li"&id_list=&start=0&max_results=3