Kavli Affiliate: Feng Wang | First 5 Authors: Feng Wang, Jieru Mei, Alan Yuille | Summary: Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features […]
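The image-level alignment the summary describes can be sketched as follows: an image embedding is compared against one text embedding per class name, and the most similar class wins. This is only an illustration with random placeholder vectors standing in for real CLIP encoder outputs; the embedding dimension (512) and the class count are assumptions for the sketch, not details from the paper.

```python
import numpy as np

# Hypothetical stand-ins for CLIP encoder outputs: one image embedding and
# three class-name text embeddings (512-d, matching common CLIP variants).
rng = np.random.default_rng(0)

def normalize(x):
    # L2-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

image_emb = normalize(rng.normal(size=(1, 512)))
text_embs = normalize(rng.normal(size=(3, 512)))  # e.g. "cat", "dog", "car"

# Zero-shot prediction: choose the class whose text embedding is most
# similar (highest cosine similarity) to the image embedding.
logits = image_emb @ text_embs.T          # shape (1, 3)
pred = int(np.argmax(logits, axis=-1)[0])  # index of best-matching class
```

Dense prediction, where the paper locates CLIP's weakness, would instead compare every spatial patch embedding against the text embeddings rather than a single pooled image vector.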
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference