Kavli Affiliate: Wei Gao | Summary: Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable proficiency in acquiring general multi-modal knowledge from web-scale image-text data. Despite the impressive performance of image-language models on various image tasks, how to effectively extend them to general video understanding remains an […]
Continue reading: Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
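
For context on the adaptation problem the summary raises, here is a minimal sketch of the naive frame-pooling baseline for applying an image-language model such as CLIP to video: each frame is encoded independently, the frame embeddings are mean-pooled into one video embedding, and that is matched against text prompts. This is not the Mug-STAN method; the checkpoint name, frame count, and label prompts below are illustrative assumptions, and the random frames stand in for a real video decoder.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice, not one specified by the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Placeholder "video": 8 random RGB frames; in practice these come
# from sampling a decoded video clip.
frames = [
    Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
    for _ in range(8)
]
labels = ["a person playing guitar", "a dog running on grass"]  # hypothetical prompts

with torch.no_grad():
    # Encode every frame independently with the image tower.
    image_inputs = processor(images=frames, return_tensors="pt")
    frame_feats = model.get_image_features(**image_inputs)        # (T, D)

    # Temporal mean-pooling: collapse frames into one video embedding.
    video_feat = frame_feats.mean(dim=0, keepdim=True)            # (1, D)

    # Encode the candidate label prompts with the text tower.
    text_inputs = processor(text=labels, padding=True, return_tensors="pt")
    text_feats = model.get_text_features(**text_inputs)           # (K, D)

    # Cosine similarity, scaled by CLIP's learned temperature.
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * (video_feat @ text_feats.T)
    probs = logits.softmax(dim=-1)                                # (1, K)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Because mean-pooling discards frame order and inter-frame dynamics entirely, this baseline captures no temporal structure, which is the gap that adaptation methods like the one in this paper aim to close.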