Kavli Affiliate: Xiang Zhang | First 5 Authors: Xiang Zhang, Senyu Li, Zijun Wu, Ning Shi, | Summary: Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex text and image tasks. Numerous […]
Lost in Translation: When GPT-4V(ision) Can’t See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond