Kavli Affiliate: Yi Zhou
| First 5 Authors: Weichen Dai, Xingyu Li, Pengbo Hu, Zeyu Wang, Ji Qi
| Summary:
Learning effective joint representations has been a central task in
multimodal sentiment analysis. Previous methods focus on leveraging the
correlations between different modalities and enhancing performance through
sophisticated fusion techniques. However, challenges remain due to the
inherent heterogeneity of distinct modalities, which may lead to a distributional
gap that impedes the full exploitation of inter-modal information and results in
redundant and impure information being extracted from the features. To address
this problem, we introduce the Multimodal Information Disentanglement (MInD)
approach. MInD decomposes the multimodal inputs into a modality-invariant
component, a modality-specific component, and a remnant noise component for
each modality through a shared encoder and multiple private encoders. The
shared encoder aims to explore the shared information and commonality across
modalities, while the private encoders are deployed to capture the distinctive
information and characteristic features. Together, these representations furnish a
comprehensive view of the multimodal data, facilitating a fusion process that is
instrumental for subsequent prediction tasks. Furthermore, MInD
improves the learned representations by explicitly modeling the task-irrelevant
noise in an adversarial manner. Experimental evaluations conducted on benchmark
datasets, including CMU-MOSI, CMU-MOSEI, and UR-Funny, demonstrate MInD’s
superior performance over existing state-of-the-art methods in both multimodal
emotion recognition and multimodal humor detection tasks.
| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3
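The abstract describes a shared-encoder/private-encoder decomposition with an adversarially modeled noise component. The paper's exact architecture, losses, and fusion scheme are not given here, so the following PyTorch sketch is only a rough illustration of that decomposition: the module names, dimensions, discriminator formulation, and simple concatenation fusion are all assumptions, not the authors' implementation.

```python
# Minimal sketch of a MInD-style decomposition (assumed details, not the
# authors' code): each modality is split into a modality-invariant part
# (shared encoder), a modality-specific part (private encoder), and a
# remnant-noise part intended to be trained adversarially.
import torch
import torch.nn as nn


class MInDSketch(nn.Module):
    def __init__(self, input_dims, hidden_dim=128, num_classes=1):
        super().__init__()
        self.modalities = list(input_dims.keys())  # e.g. ["text", "audio", "vision"]
        # Project each modality to a common dimensionality first (assumption).
        self.projectors = nn.ModuleDict(
            {m: nn.Linear(d, hidden_dim) for m, d in input_dims.items()})
        # One shared encoder captures modality-invariant information.
        self.shared_encoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim))
        # One private encoder per modality captures modality-specific information.
        self.private_encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                              nn.Linear(hidden_dim, hidden_dim)) for m in self.modalities})
        # One noise encoder per modality models task-irrelevant remnants.
        self.noise_encoders = nn.ModuleDict(
            {m: nn.Linear(hidden_dim, hidden_dim) for m in self.modalities})
        # Discriminator for the adversarial noise objective (one plausible way to
        # "explicitly model task-irrelevant noise"; an assumption).
        self.noise_discriminator = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        # Simple concatenation fusion of shared + private codes for prediction (assumption).
        fused_dim = hidden_dim * 2 * len(self.modalities)
        self.predictor = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, num_classes))

    def forward(self, inputs):
        shared, private, noise = {}, {}, {}
        for m, x in inputs.items():
            h = self.projectors[m](x)
            shared[m] = self.shared_encoder(h)      # commonality across modalities
            private[m] = self.private_encoders[m](h)  # distinctive per-modality features
            noise[m] = self.noise_encoders[m](h)      # remnant, task-irrelevant component
        fused = torch.cat([shared[m] for m in self.modalities] +
                          [private[m] for m in self.modalities], dim=-1)
        return self.predictor(fused), shared, private, noise

    def discriminate_noise(self, codes):
        # Logits for the adversarial objective: the discriminator would be trained to
        # distinguish encoded noise from sampled Gaussian noise, while the noise
        # encoders try to fool it (assumed formulation).
        return self.noise_discriminator(codes)


# Usage example with random features (all dimensions are placeholders).
model = MInDSketch({"text": 768, "audio": 74, "vision": 35})
batch = {"text": torch.randn(8, 768), "audio": torch.randn(8, 74), "vision": torch.randn(8, 35)}
pred, shared, private, noise = model(batch)
print(pred.shape)  # torch.Size([8, 1])
```

In the full method, the shared/private split and the noise codes would additionally be shaped by dedicated objectives (e.g., similarity, difference, and adversarial losses); the sketch above only fixes the data flow implied by the abstract.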