Kavli Affiliate: Cheng Peng
| First 5 Authors: Cheng Peng, Kai Zhang, Mengxian Lyu, Hongfang Liu, Lichao Sun
| Summary:
Our objectives were to advance biomedical vision-language model capabilities
through scaling up, fine-tuning, and instruction tuning; to develop
vision-language models with improved performance in handling long text; to
explore strategies to efficiently adapt vision-language models to diverse
multi-modal biomedical tasks; and to examine their zero-shot learning
performance.
We developed two biomedical vision-language models, BiomedGPT-Large and
BiomedGPT-XLarge, built on an encoder-decoder transformer architecture.
We fine-tuned the two models on 23 benchmark datasets spanning 6 multi-modal
biomedical tasks: one image-only task (image classification), three
language-only tasks (text understanding, text summarization, and question
answering), and two vision-language tasks (visual question answering and image
captioning). We compared the scaled models with our previous BiomedGPT-Base
model and with leading models reported in the literature. We then
instruction-tuned the two models on a large-scale multi-modal biomedical
instruction-tuning dataset and assessed their zero-shot learning performance
and alignment accuracy.
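
The fine-tuning stage described above is standard supervised
sequence-to-sequence training. Below is a minimal sketch, assuming a Hugging
Face-style encoder-decoder model whose forward pass returns a cross-entropy
loss over target tokens when labels are supplied; the model object, batch
fields, and hyperparameters are illustrative assumptions, not the authors'
actual training code.

```python
# Minimal sketch of one fine-tuning run on a single benchmark dataset.
# Assumes each batch pairs pixel inputs (or a placeholder tensor for
# language-only tasks) with input and target token ids.
import torch
from torch.utils.data import DataLoader

def fine_tune(model, dataset, epochs=3, lr=1e-5, batch_size=8, device="cuda"):
    """Supervised fine-tuning loop for an encoder-decoder VLM."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            # batch: dict with e.g. "pixel_values", "input_ids", "labels"
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # seq2seq cross-entropy over targets
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```

Instruction tuning follows the same loop; only the data changes, with each
example reformatted as an instruction-response pair so that a single model can
be evaluated zero-shot across tasks.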
| Search Query: ArXiv Query: search_query=au:"Cheng Peng"&id_list=&start=0&max_results=3