Kavli Affiliate: Jing Wang
| First 5 Authors: Haokun Chen, Xu Yang, Yuhang Huang, Zihan Wu, Jing Wang
| Summary:
After pre-training by generating the next word conditioned on the previous
words, a Language Model (LM) acquires the ability of In-Context Learning
(ICL): it can learn a new task conditioned on the context provided by a few
in-context examples (ICEs). Similarly, visually-conditioned language
modelling is used to train Vision-Language Models (VLMs) with ICL ability.
However, such VLMs typically exhibit weaker classification abilities than
contrastive-learning-based models like CLIP, since the language modelling
objective does not directly contrast whether an image is paired with a text.
A straightforward way to improve in-context classification is to use more
ICEs to provide more knowledge. However, this can greatly increase the
selection time and, more importantly, the additional in-context images tend
to extend the in-context sequence beyond the processing capacity of a VLM.
To alleviate these limitations, we propose to manipulate the label space of
each ICE to increase its knowledge density, allowing fewer ICEs to convey as
much information as a larger set would. Specifically, we propose two
strategies, Label Distribution Enhancement and Visual Descriptions
Enhancement, to improve in-context classification performance on diverse
datasets, including the classic ImageNet and more fine-grained datasets like
CUB-200. On ImageNet, our approach increases accuracy from 74.70% in the
4-shot setting to 76.21% with just 2 shots, surpassing CLIP by 0.67%. On
CUB-200, our method raises 1-shot accuracy from 48.86% to 69.05%, 12.15%
higher than CLIP. The code is available at
https://anonymous.4open.science/r/MLS_ICC.
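
The summary does not spell out how the enhanced label space is attached to
each ICE; the sketch below is a minimal, hypothetical illustration of the
general idea, assuming every in-context image is augmented with a top-k label
distribution and a short visual description before being serialized into the
VLM prompt. The class names, helper functions, and prompt format are
illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch: build "knowledge-dense" in-context examples (ICEs)
# that carry a top-k label distribution and a visual description instead of
# a single hard label. Names and prompt format are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EnhancedICE:
    image_path: str
    # e.g. [("indigo bunting", 0.72), ("blue grosbeak", 0.18), ...]
    label_distribution: List[Tuple[str, float]]
    # e.g. "small songbird with vivid blue plumage"
    visual_description: str


def format_ice(ice: EnhancedICE) -> str:
    """Render one enhanced ICE as a text segment of the in-context prompt."""
    labels = ", ".join(f"{name} ({p:.0%})" for name, p in ice.label_distribution)
    return (
        f"<image:{ice.image_path}>\n"
        f"Possible labels: {labels}\n"
        f"Visual cues: {ice.visual_description}\n"
    )


def build_prompt(ices: List[EnhancedICE], query_image: str) -> str:
    """Concatenate the enhanced ICEs with the query image for classification."""
    context = "\n".join(format_ice(ice) for ice in ices)
    return context + f"\n<image:{query_image}>\nLabel:"


if __name__ == "__main__":
    demo_ice = EnhancedICE(
        image_path="cub200/indigo_bunting_001.jpg",
        label_distribution=[("indigo bunting", 0.72),
                            ("blue grosbeak", 0.18),
                            ("lazuli bunting", 0.10)],
        visual_description="small songbird with uniformly vivid blue plumage",
    )
    print(build_prompt([demo_ice], query_image="cub200/query.jpg"))

In this hypothetical rendering, a single 1-shot ICE already exposes several
candidate classes and discriminative visual cues, which is the sense in which
fewer ICEs can convey as much information as a larger plain-label set.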
| Search Query: ArXiv Query: search_query=au:”Jing Wang”&id_list=&start=0&max_results=3