Kavli Affiliate: Xiang Zhang
| First 5 Authors: Yihang Yin, Siyu Huang, Xiang Zhang
| Summary:
Deep neural networks (DNNs) have shown superior performance on various
multimodal learning problems. However, adapting DNNs to individual multimodal
tasks often requires substantial effort in manually engineering unimodal
features and designing multimodal feature fusion strategies. This paper
proposes the Bilevel Multimodal Neural Architecture Search (BM-NAS) framework,
which makes the architecture of multimodal fusion models fully searchable via a
bilevel search scheme. At the upper level, BM-NAS selects the inter/intra-modal
feature pairs from the pretrained unimodal backbones. At the lower level,
BM-NAS learns the fusion strategy for each feature pair as a combination of
predefined primitive operations. The primitive operations are carefully
designed and can be flexibly combined to express various effective feature
fusion modules such as multi-head attention (Transformer) and Attention on
Attention (AoA). Experimental results on three multimodal tasks demonstrate the
effectiveness and efficiency of the proposed BM-NAS framework. BM-NAS achieves
competitive performance with much less search time and fewer model parameters
than existing generalized multimodal NAS methods.
| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=10
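
The bilevel search described in the summary can be pictured with a minimal, hypothetical sketch (not the authors' code): a DARTS-style continuous relaxation in which "alpha" weights softly select inter/intra-modal feature pairs at the upper level, and "beta" weights softly mix a few illustrative primitive operations at the lower level. The class names, the primitive set, and the simple additive pairwise merge below are assumptions for illustration only.

# Hypothetical sketch of the bilevel idea; not the BM-NAS implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionCell(nn.Module):
    """Lower level: softly combines primitive ops applied to a feature pair."""
    def __init__(self, dim):
        super().__init__()
        # Illustrative primitives only; the paper's actual operation set differs.
        self.primitives = nn.ModuleList([
            nn.Identity(),
            nn.Linear(dim, dim),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
        ])
        self.beta = nn.Parameter(torch.zeros(len(self.primitives)))

    def forward(self, x, y):
        z = x + y                        # simple pairwise merge, assumed for illustration
        w = F.softmax(self.beta, dim=0)  # relaxed choice over primitive operations
        return sum(wi * op(z) for wi, op in zip(w, self.primitives))

class BilevelFusion(nn.Module):
    """Upper level: softly selects which inter/intra-modal feature pair to fuse."""
    def __init__(self, dim, num_features):
        super().__init__()
        # All unordered pairs, including intra-modal (i == j) ones.
        self.pairs = [(i, j) for i in range(num_features) for j in range(i, num_features)]
        self.alpha = nn.Parameter(torch.zeros(len(self.pairs)))
        self.cell = FusionCell(dim)

    def forward(self, feats):            # feats: list of [batch, dim] backbone features
        w = F.softmax(self.alpha, dim=0) # relaxed choice over feature pairs
        return sum(wi * self.cell(feats[i], feats[j])
                   for wi, (i, j) in zip(w, self.pairs))

# Toy usage: two "modalities", each contributing one backbone feature.
feats = [torch.randn(4, 64), torch.randn(4, 64)]
out = BilevelFusion(dim=64, num_features=2)(feats)
print(out.shape)  # torch.Size([4, 64])

In an actual bilevel NAS setup, the alpha/beta parameters would be optimized on validation data while the network weights are trained on training data, and the final architecture would be discretized by keeping the highest-weighted pairs and operations.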