Kavli Affiliate: Xiang Zhang
| First 5 Authors: Xiang Zhang, Senyu Li, Ning Shi, Bradley Hauer, Zijun Wu
| Summary:
Recent developments in multimodal methodologies have marked the beginning of
an exciting era for models adept at processing diverse data types, encompassing
text, audio, and visual content. Models like GPT-4V, which merge computer
vision with advanced language processing, exhibit extraordinary proficiency in
handling intricate tasks that require a simultaneous understanding of both
textual and visual information. Prior research efforts have meticulously
evaluated the efficacy of these Vision Large Language Models (VLLMs) in various
domains, including object detection, image captioning, and other related
fields. However, existing analyses have often suffered from limitations,
primarily centering on the isolated evaluation of each modality’s performance
while neglecting to explore their intricate cross-modal interactions.
Specifically, the question of whether these models achieve the same level of
accuracy when confronted with identical task instances across different
modalities remains unanswered. In this study, we examine the interaction and
comparison between these modalities by introducing a novel concept termed
cross-modal consistency. Furthermore, we
propose a quantitative evaluation framework founded on this concept. Our
experimental findings, drawn from a curated collection of parallel
vision-language datasets developed by us, unveil a pronounced inconsistency
between the vision and language modalities within GPT-4V, despite its portrayal
as a unified multimodal model. Our research yields insights into the
appropriate utilization of such models and hints at potential avenues for
enhancing their design.
| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3
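
As a rough illustration of the cross-modal consistency idea described in the summary, the sketch below scores agreement between a model's answers to parallel text and image versions of the same task instance. This is a minimal sketch, not the paper's actual evaluation framework: the helper callables answer_from_text and answer_from_image and the exact-match comparison are hypothetical assumptions for illustration.

```python
# Minimal sketch: cross-modal consistency as the fraction of parallel
# (text, image) task instances on which the model's answers agree.
# answer_from_text / answer_from_image are hypothetical callables that
# query the same multimodal model through each modality.

from typing import Callable, List, Tuple


def cross_modal_consistency(
    instances: List[Tuple[str, bytes]],
    answer_from_text: Callable[[str], str],
    answer_from_image: Callable[[bytes], str],
) -> float:
    """Return the fraction of parallel instances with matching answers."""
    if not instances:
        return 0.0
    agreements = 0
    for text_version, image_version in instances:
        # Normalize lightly before comparing; a real framework would likely
        # use a task-specific answer-matching criterion instead.
        text_answer = answer_from_text(text_version).strip().lower()
        image_answer = answer_from_image(image_version).strip().lower()
        if text_answer == image_answer:
            agreements += 1
    return agreements / len(instances)
```

A score near 1.0 would indicate that the model answers the same task instance consistently regardless of whether it is posed as text or as an image; the pronounced inconsistency reported in the summary corresponds to a score well below that.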