Kavli Affiliate: Yi Zhou
| First 5 Authors: Pengbo Hu, Xingyu Li, Yi Zhou
| Summary:
As deep learning advances, there is an ever-growing demand for models capable
of synthesizing information from multi-modal resources to address the complex
tasks arising in real-life applications. Recently, many large multi-modal
datasets have been collected, on which researchers actively explore different
methods of fusing multi-modal information. However, little attention has been
paid to quantifying the contribution of different modalities within the
proposed models. In this paper, we propose the SHapley vAlue-based
PErceptual (SHAPE) scores that measure the marginal contribution of
individual modalities and the degree of cooperation across modalities. Using
these scores, we systematically evaluate different fusion methods on different
multi-modal datasets for different tasks. Our experiments suggest that for some
tasks where different modalities are complementary, the multi-modal models
still tend to use the dominant modality alone and ignore the cooperation across
modalities. On the other hand, models learn to exploit cross-modal cooperation
when different modalities are indispensable for the task. In this case, the
scores indicate it is better to fuse different modalities at relatively early
stages. We hope our scores can help improve the understanding of how
current multi-modal models operate on different modalities and encourage more
sophisticated methods of integrating multiple modalities.
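The quantity underlying these scores is the Shapley value of each modality with respect to a performance function evaluated on modality subsets. Below is a minimal sketch of that computation in Python, assuming a hypothetical evaluate(subset) callback that returns task performance (e.g. accuracy) when only the modalities in the subset are visible to the model; the paper's actual SHAPE definitions, masking strategy, and normalization are not reproduced here and may differ.

# Sketch: exact Shapley values over modality subsets (assumed evaluate() callback)
from itertools import combinations
from math import factorial

def shapley_scores(modalities, evaluate):
    """Return the Shapley value of each modality.

    modalities: list of modality names, e.g. ["text", "image"]
    evaluate:   callable mapping a frozenset of modalities to a scalar score
    """
    n = len(modalities)
    scores = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        value = 0.0
        # Weighted marginal contributions of m over all coalitions S not containing m.
        for k in range(n):
            for subset in combinations(others, k):
                S = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                value += weight * (evaluate(S | {m}) - evaluate(S))
        scores[m] = value
    return scores

# Hypothetical usage: performance of a bimodal model under modality ablation.
perf = {
    frozenset(): 0.50,                      # no modality visible (chance level)
    frozenset({"text"}): 0.78,
    frozenset({"image"}): 0.62,
    frozenset({"text", "image"}): 0.85,
}
print(shapley_scores(["text", "image"], lambda s: perf[frozenset(s)]))

By the efficiency property of Shapley values, the per-modality scores sum to the gain of the full multi-modal model over the empty baseline, which is what makes them usable as contribution measures in this setting.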
| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=10