Kavli Affiliate: Xiang Zhang
| First 5 Authors: Xiang Zhang, Senyu Li, Zijun Wu, Ning Shi,
| Summary:
Recent advances in multimodal techniques open exciting possibilities for
models that excel at diverse tasks involving text, audio, and image processing.
Models like GPT-4V, blending computer vision and language modeling, excel in
complex text and image tasks. Numerous prior studies have examined the
performance of these Vision Large Language Models (VLLMs) across tasks such
as object detection and image captioning. However, these
analyses often focus on evaluating the performance of each modality in
isolation, lacking insights into their cross-modal interactions. Specifically,
questions concerning whether these vision-language models execute vision and
language tasks consistently or independently have remained unanswered. In this
study, we draw inspiration from recent investigations into multilingualism and
conduct a comprehensive analysis of models’ cross-modal interactions. We
introduce a systematic framework that quantifies the capability disparities
between different modalities in the multi-modal setting and provide a set of
datasets designed for these evaluations. Our findings reveal that models like
GPT-4V tend to perform consistently across modalities when the tasks are relatively
simple. However, the trustworthiness of results derived from the vision
modality diminishes as the tasks become more challenging. Expanding on our
findings, we introduce "Vision Description Prompting," a method that
effectively improves performance in challenging vision-related tasks.
| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3
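
The abstract describes, but does not spell out, a framework that quantifies capability disparities between modalities. A minimal sketch of the underlying idea, assuming a hypothetical query_model callable and a dataset in which each task exists both as plain text and as a rendered image, might look like the following; the actual metric and data format in the paper may differ.

from typing import Callable, Sequence


def consistency_score(
    tasks: Sequence[dict],
    query_model: Callable[..., str],
) -> float:
    """Fraction of tasks where the text-modality and vision-modality
    answers from the same model agree (a simple proxy for cross-modal
    consistency). `tasks` and `query_model` are assumed interfaces."""
    agreements = 0
    for task in tasks:
        # Text modality: the problem statement is given directly as text.
        text_answer = query_model(prompt=task["question_text"])
        # Vision modality: the same problem is rendered as an image.
        vision_answer = query_model(
            prompt="Answer the question shown in the image.",
            image=task["question_image"],
        )
        agreements += int(text_answer.strip() == vision_answer.strip())
    return agreements / len(tasks)

A low score on harder tasks would reflect the abstract's finding that vision-modality results become less trustworthy as task difficulty increases.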
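Likewise, "Vision Description Prompting" is only named in the abstract. The sketch below illustrates one plausible two-stage reading of it (first have the model verbalize the image, then answer from that textual description); the prompts and the query_model interface are assumptions, not the paper's implementation.

from typing import Callable


def vision_description_prompting(
    question: str,
    image: bytes,
    query_model: Callable[..., str],
) -> str:
    # Stage 1: convert the visual input into an explicit textual description.
    description = query_model(
        prompt="Describe the content of this image in detail.",
        image=image,
    )
    # Stage 2: answer the original question from the description alone,
    # shifting the reasoning burden to the text modality.
    return query_model(
        prompt=f"Image description: {description}\n\nQuestion: {question}\nAnswer:"
    )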