Kavli Affiliate: Yi Zhou
| First 5 Authors: Liwen Tan, Yin Cao, Yi Zhou
| Summary:
Modality discrepancies pose a persistent challenge in Automated Audio Captioning
(AAC) and across multi-modal domains more broadly. Helping models comprehend text
information is pivotal to establishing a seamless connection between the text and
audio modalities. While recent research has focused on closing the gap between
these two modalities through contrastive learning, a simple contrastive loss alone
struggles to bridge the difference between them. This paper introduces Enhance
Depth of Text Comprehension (EDTC), which enhances the model's understanding of
text information from three perspectives. First, we propose FUSER, a novel fusion
module that extracts shared semantic information from different audio features
through feature fusion. Second, we introduce TRANSLATOR, a novel alignment module
that aligns audio features and text features at the tensor level. Finally, the
weights of the twin structure are updated with momentum so that the model can
learn information from both modalities simultaneously. The resulting method
achieves state-of-the-art performance on the AudioCaps dataset and results
comparable to the state of the art on the Clotho dataset.
| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3
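
The summary above mentions that the twin structure's weights are "updated with momentum" so both modalities are learned at once. The paper's exact formulation is not given here, so the following is only a minimal PyTorch sketch of one common reading of that idea: a twin (online/target) encoder pair where the target branch is an exponential moving average of the online branch. The class name MomentumTwin, the momentum value, and the toy encoder are all illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch (not the authors' implementation): a twin encoder pair
# whose second branch is refreshed as an exponential moving average (EMA)
# of the gradient-trained branch after each optimizer step.
import copy
import torch
import torch.nn as nn

class MomentumTwin(nn.Module):
    def __init__(self, encoder: nn.Module, momentum: float = 0.995):
        super().__init__()
        self.online = encoder                 # trained by backpropagation
        self.target = copy.deepcopy(encoder)  # twin branch, momentum-updated only
        self.momentum = momentum
        for p in self.target.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def momentum_update(self):
        # target <- m * target + (1 - m) * online
        for p_t, p_o in zip(self.target.parameters(), self.online.parameters()):
            p_t.data.mul_(self.momentum).add_(p_o.data, alpha=1.0 - self.momentum)

    def forward(self, x):
        return self.online(x), self.target(x)

# Usage: run a forward pass, step the optimizer on the online branch,
# then refresh the twin with the momentum update.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
twin = MomentumTwin(encoder)
features = torch.randn(4, 128)           # stand-in for a batch of audio/text features
online_out, target_out = twin(features)
twin.momentum_update()
```

The EMA form keeps the target branch changing slowly, which is why such twin structures can provide stable representations while the online branch adapts to both modalities; whether EDTC uses exactly this update rule would need to be confirmed against the paper itself.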