LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Zongyu Wu, Yuwei Niu, Hongcheng Gao, Minhua Lin, Zhiwei Zhang

| Summary:

Large Vision-Language Models (LVLMs) have shown impressive performance on
various tasks. However, LVLMs suffer from hallucination, which hinders their
adoption in the real world. Existing studies have emphasized that the strong
language priors of LVLMs can overpower visual information and cause
hallucinations. Yet the positive role of language priors is also key to a
powerful LVLM: if the language priors are too weak, LVLMs will struggle to
leverage their rich parametric knowledge and instruction-understanding
abilities to complete tasks in challenging visual scenarios where visual
information alone is insufficient. We therefore propose a benchmark called
LanP to rethink the impact of Language Priors in LVLMs. It is designed to
investigate how strong language priors are in current LVLMs. LanP consists of
170 images and 340 corresponding, carefully designed questions. Extensive
experiments on 25 popular LVLMs reveal that the language priors of many LVLMs
are not strong enough to effectively aid question answering when objects are
partially hidden; many models, including GPT-4 Turbo, exhibit an accuracy
below 0.5 in this scenario.
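
To make the reported metric concrete, below is a minimal sketch of how a
per-model accuracy over such image-question pairs might be computed. The
dataset layout, the `model_answer` callable, and the case-insensitive
exact-match scoring are illustrative assumptions; the abstract does not
describe LanP's actual evaluation pipeline.

```python
from typing import Callable, Iterable, Tuple

# A hypothetical benchmark item: (image_path, question, gold_answer).
# LanP's real data format is not specified in the abstract.
BenchmarkItem = Tuple[str, str, str]

def evaluate_accuracy(
    items: Iterable[BenchmarkItem],
    model_answer: Callable[[str, str], str],
) -> float:
    """Return exact-match accuracy of an LVLM over benchmark items.

    `model_answer(image_path, question)` is assumed to return a short
    textual answer; matching here is case-insensitive exact match.
    """
    correct = 0
    total = 0
    for image_path, question, gold in items:
        prediction = model_answer(image_path, question)
        correct += int(prediction.strip().lower() == gold.strip().lower())
        total += 1
    return correct / total if total else 0.0

# Example usage with a stub model that always answers "yes":
if __name__ == "__main__":
    demo_items = [
        ("img_001.jpg", "Is the cat partially hidden behind the sofa?", "yes"),
        ("img_002.jpg", "Is there a dog in the image?", "no"),
    ]
    stub = lambda image_path, question: "yes"
    print(f"accuracy = {evaluate_accuracy(demo_items, stub):.2f}")  # 0.50
```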

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3
