Kavli Affiliate: Xiang Zhang
| First 5 Authors: Shijun Yang, , , ,
| Summary:
Prompt learning facilitates the efficient adaptation of Vision-Language
Models (VLMs) to various downstream tasks. However, it faces two significant
challenges: (1) inadequate modeling of class embedding distributions for unseen
instances, leading to suboptimal generalization to novel classes; and (2)
confinement of cross-modal alignment to the final output layer of the vision
and text encoders, which fundamentally limits the capacity to preserve
topological consistency with the pre-trained multi-modal embedding space. To
address these challenges, we introduce MuGCP (Multi-modal Mutual-Guidance
Conditional Prompt Learning), a novel paradigm designed for conditional prompt
generation. MuGCP leverages Multi-modal Large Language Models (MLLMs) as
conditional prompt learners to adaptively generate Semantic Conditional Prompts
(SCP) that incorporate rich, fine-grained high-level semantic knowledge for
image instances. To ensure effective alignment and interaction across the
multi-modal space of VLMs, we propose the Attention
Mutual-Guidance (AMG) module, which facilitates interactions between visual and
semantic information. Through mutual guidance, the AMG module generates Visual
Conditional Prompts (VCP), enhancing the model’s performance in multi-modal
tasks. Additionally, we present a Multi-Prompt Fusion (MPF) mechanism that
integrates SCP and VCP with contextual prompts, ensuring seamless coordination
among the different prompts and enhancing the modeling of class embeddings and
instance-specific knowledge. Our MuGCP outperforms existing state-of-the-art
methods on 14 different datasets. The code will be made available after
publication.
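| Sketch: The abstract describes a pipeline in which an MLLM produces Semantic
Conditional Prompts (SCP), the AMG module derives Visual Conditional Prompts
(VCP) through mutual guidance between visual and semantic features, and the
MPF mechanism fuses both with contextual prompts. Since the official code is
not yet released, the following is a minimal, hypothetical PyTorch sketch of
that flow; it assumes cross-attention for the mutual guidance and simple
concatenation for the fusion, and all module names, dimensions, and the random
stand-ins for MLLM outputs and visual features are illustrative assumptions
rather than the authors' implementation.

import torch
import torch.nn as nn

class AttentionMutualGuidance(nn.Module):
    # Assumed AMG: visual tokens and semantic prompts attend to each other;
    # the semantic prompts conditioned on refined visual features are taken
    # here as the Visual Conditional Prompts (VCP).
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.vis_to_sem = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens, scp):
        # Visual tokens attend to the SCP (guidance from semantics).
        vis_refined, _ = self.vis_to_sem(visual_tokens, scp, scp)
        # SCP attends back to the refined visual tokens (guidance from vision).
        vcp, _ = self.sem_to_vis(scp, vis_refined, vis_refined)
        return vcp

class MultiPromptFusion(nn.Module):
    # Assumed MPF: concatenate learnable contextual prompts with SCP and VCP.
    def __init__(self, n_ctx: int = 4, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, scp, vcp):
        ctx = self.ctx.unsqueeze(0).expand(scp.size(0), -1, -1)
        return torch.cat([ctx, scp, vcp], dim=1)  # fused prompt sequence

if __name__ == "__main__":
    B, dim = 2, 512
    visual_tokens = torch.randn(B, 49, dim)  # stand-in for ViT patch features
    scp = torch.randn(B, 4, dim)             # stand-in for MLLM-generated SCP
    amg, mpf = AttentionMutualGuidance(dim), MultiPromptFusion(4, dim)
    vcp = amg(visual_tokens, scp)            # shape (2, 4, 512)
    prompts = mpf(scp, vcp)                  # shape (2, 12, 512)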
| Search Query: ArXiv Query: search_query=au:”Xiang Zhang”&id_list=&start=0&max_results=3