Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge

Kavli Affiliate: Jing Wang

| First 5 Authors: Ruiming Chen, Junming Yang, Shiyu Xia, Xu Yang, Jing Wang

| Summary:

CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of its large number of parameters and of large-scale pre-training makes it challenging to pre-train CLIP at different scales. Learngene extracts generalizable components, termed learngene, from an ancestry model and uses them to initialize diverse descendant models. Previous Learngene paradigms fail to handle generalizable knowledge in multimodal scenarios. In this paper, we put forward the idea of using a multimodal block to extract multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Extensive experiments demonstrate MM-LG's effectiveness: it achieves performance gains over existing learngene approaches (e.g., +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g., +1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Notably, MM-LG requires only around 25% of the parameter storage and reduces pre-training costs by around 2.8 times across diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly well suited for efficient deployment across diverse downstream tasks.
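
The abstract describes the method only at a high level: learngene blocks are combined in a weighted-sum manner and then used to numerically initialize descendant blocks. Below is a minimal sketch of what such an initialization could look like, assuming PyTorch, identically shaped multimodal and unimodal learngene blocks, and a single mixing weight `alpha`; the block class, function names, and the descendant depth are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: block names, the mixing weight `alpha`, and the
# layer mapping are illustrative assumptions, not MM-LG's actual code.

class TransformerBlock(nn.Module):
    """Stand-in for one CLIP encoder block (pre-norm attention + MLP)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


def weighted_sum_init(descendant: nn.Module, multimodal: nn.Module,
                      unimodal: nn.Module, alpha: float = 0.5) -> None:
    """Numerically initialize a descendant block as a weighted sum of the
    multimodal and unimodal learngene blocks (shapes assumed to match)."""
    with torch.no_grad():
        for p_d, p_m, p_u in zip(descendant.parameters(),
                                 multimodal.parameters(),
                                 unimodal.parameters()):
            p_d.copy_(alpha * p_m + (1.0 - alpha) * p_u)


# Example: build a small descendant image encoder from shared learngene blocks.
multimodal_block = TransformerBlock()   # assumed: extracted multimodal learngene
unimodal_block = TransformerBlock()     # assumed: extracted image-side learngene
descendant_depth = 6                    # descendant models may vary in scale

descendant_encoder = nn.ModuleList(TransformerBlock()
                                   for _ in range(descendant_depth))
for block in descendant_encoder:
    weighted_sum_init(block, multimodal_block, unimodal_block, alpha=0.5)
```

In practice the mixing weights would presumably be learned during extraction and could differ per layer and per modality; the fixed `alpha` above is only for illustration.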

| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3
