Kavli Affiliate: Jing Wang
| First 5 Authors: Meiyue Song, Zhihua Yu, Jiaxin Wang, Jiarui Wang, Yuting Lu
| Summary:
The conventional pretraining-and-finetuning paradigm, while effective for
common diseases with ample data, faces challenges in diagnosing data-scarce
occupational diseases like pneumoconiosis. Recently, large language models
(LLMs) have exhibited unprecedented abilities across multiple tasks conducted
in dialogue, bringing new opportunities to diagnosis. A common strategy might involve
using adapter layers for vision-language alignment and diagnosis in a dialogic
manner. Yet, this approach often requires optimizing a large number of
learnable parameters in the text branch and the dialogue head, potentially diminishing
the LLMs’ efficacy, especially with limited training data. In our work, we
eliminate the text branch and substitute the dialogue head with a
classification head, a more effective way to harness LLMs for diagnosis with
fewer learnable parameters. Furthermore, to
balance the retention of detailed image information with progression towards
accurate diagnosis, we introduce a contextual multi-token engine that
adaptively generates diagnostic tokens. Additionally, we
propose the information emitter module, which unidirectionally emits
information from image tokens to diagnosis tokens; minimal sketches of both
design ideas follow this record. Comprehensive experiments validate the
superiority of our method and the effectiveness of the proposed modules. Our
code is available at
https://github.com/CodeMonsterPHD/PneumoLLM/tree/main.
| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3
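
To make the first idea concrete, below is a minimal PyTorch sketch of
replacing the dialogue (language-modeling) head with a classification head
over an LLM's hidden states. The class name, pooling choice, and dimensions
are illustrative assumptions, not the authors' implementation; the real code
is in the repository linked above.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Small task head: pools the LLM's output tokens and maps them to
    class logits, so no language-modeling (dialogue) head is trained."""
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model) hidden states from a frozen LLM
        pooled = self.norm(tokens.mean(dim=1))  # mean-pool over the sequence
        return self.proj(pooled)                # (batch, num_classes)

# Usage: only the head's parameters are learnable; the LLM stays frozen.
head = ClassificationHead(d_model=4096, num_classes=2)
hidden = torch.randn(8, 16, 4096)  # stand-in for frozen-LLM hidden states
logits = head(hidden)              # (8, 2): e.g. normal vs. pneumoconiosis
```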
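
The information emitter module can be read as an attention-masking scheme:
diagnosis tokens attend to image tokens, but not the reverse. Here is a
hedged sketch under that assumption, using a standard multi-head attention
layer; the function and variable names are hypothetical.

```python
import torch
import torch.nn as nn

def one_way_mask(n_img: int, n_diag: int) -> torch.Tensor:
    """Boolean attention mask (True = attention blocked) for a sequence of
    n_img image tokens followed by n_diag diagnosis tokens. Diagnosis tokens
    may attend to image tokens, but image tokens cannot attend back, so
    information flows image -> diagnosis only."""
    n = n_img + n_diag
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_img, n_img:] = True  # block image queries from diagnosis keys
    return mask

# Usage with a standard attention layer (batch_first tensors).
n_img, n_diag, d_model = 196, 4, 768
tokens = torch.randn(2, n_img + n_diag, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens, attn_mask=one_way_mask(n_img, n_diag))
diag_tokens = out[:, n_img:, :]  # updated diagnosis tokens, (2, n_diag, d_model)
```

The updated diagnosis tokens would then feed a head like the
ClassificationHead sketched above.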