SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Kavli Affiliate: Feng Wang

| First 5 Authors: Feng Wang, Jieru Mei, Alan Yuille

| Summary:

Recent advances in contrastive language-image pretraining (CLIP) have
demonstrated strong capabilities in zero-shot classification by aligning visual
representations with target text embeddings at the image level. However, in
dense prediction tasks, CLIP often struggles to localize visual features within
an image and fails to give accurate pixel-level predictions, which prevents it
from functioning as a generalized visual foundation model. In this work, we aim
to enhance CLIP’s potential for semantic segmentation with minimal
modifications to its pretrained models. By rethinking self-attention, we
surprisingly find that CLIP can adapt to dense prediction tasks by simply
introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically,
we replace the traditional self-attention block in the last layer of CLIP’s
vision encoder with our CSA module and reuse its pretrained query, key, and
value projection matrices, yielding a training-free adaptation approach for CLIP’s
zero-shot semantic segmentation. Extensive experiments show the advantage of
CSA: we obtain a 38.2% average zero-shot mIoU across eight semantic
segmentation benchmarks highlighted in this paper, significantly outperforming
the existing SoTA’s 33.9% and the vanilla CLIP’s 14.1%.
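As a rough illustration of what such a training-free swap could look like, below is a minimal PyTorch-style sketch. The `correlative_self_attention` function, the query-query plus key-key correlation formulation, and the assumption that the pretrained projections are packed as in `torch.nn.MultiheadAttention` (as in the OpenAI CLIP ViT) are illustrative assumptions drawn from the summary above, not the paper’s verbatim implementation.

```python
import torch
import torch.nn.functional as F


def correlative_self_attention(x, attn_layer, scale=None):
    """Hypothetical sketch of a Correlative Self-Attention (CSA) forward pass.

    x:          (batch, tokens, dim) token embeddings entering the last block
                of the CLIP vision encoder.
    attn_layer: the pretrained torch.nn.MultiheadAttention module whose
                query/key/value projections are reused without retraining.
    """
    d = x.shape[-1]

    # Reuse the pretrained projection weights. This assumes they are packed
    # as in torch.nn.MultiheadAttention: in_proj_weight stacks W_q, W_k, W_v.
    w_q, w_k, w_v = attn_layer.in_proj_weight.chunk(3, dim=0)
    b_q, b_k, b_v = attn_layer.in_proj_bias.chunk(3, dim=0)

    q = F.linear(x, w_q, b_q)
    k = F.linear(x, w_k, b_k)
    v = F.linear(x, w_v, b_v)

    scale = scale or d ** -0.5

    # Assumed CSA formulation: attention is built from query-query and
    # key-key correlations rather than the usual query-key product, so each
    # token attends to tokens with similar features, which helps localize
    # features for dense prediction.
    attn = F.softmax(q @ q.transpose(-2, -1) * scale, dim=-1) \
         + F.softmax(k @ k.transpose(-2, -1) * scale, dim=-1)

    out = attn @ v
    return attn_layer.out_proj(out)
```

Under these assumptions, the sketch could be dropped into the last transformer block of a pretrained CLIP ViT in place of its standard attention call, with no additional training, before decoding patch tokens into per-pixel class predictions against the text embeddings.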

| Search Query: ArXiv Query: search_query=au:”Feng Wang”&id_list=&start=0&max_results=3
