Kavli Affiliate: Cheng Peng
| First 5 Authors: Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu
| Summary:
Recent large language models (LLMs) such as ChatGPT and LLaMA have shown
great promise in many AI applications. However, their performance on medical
tasks is suboptimal and can be improved by training on extensive
domain-specific datasets. This study introduces Me LLaMA, a medical LLM family
that includes the foundation models Me LLaMA 13/70B and their chat-enhanced
versions, Me LLaMA 13/70B-chat, developed through continual pre-training and
instruction tuning of LLaMA2 on large medical datasets. Our
domain-specific data suite for training and evaluation includes a large-scale,
continual pre-training dataset with 129B tokens, an instruction tuning dataset
with 214k samples, and a new medical evaluation benchmark (MIBE) across six
tasks with 12 datasets. Our extensive evaluation using the MIBE shows that Me
LLaMA models achieve overall better performance than existing open-source
medical LLMs in zero-shot, few-shot, and supervised learning settings. Their
zero-shot performance is comparable to ChatGPT on 7 out of 8 datasets, within
a 3% margin, but still falls short of GPT-4. In addition, we investigated the
catastrophic forgetting problem, and
our results show that Me LLaMA models outperform other open-source medical LLMs
in mitigating this issue. Me LLaMA is one of the largest open-source medical
foundation LLMs that use both biomedical and clinical data. It exhibits
superior performance across both general and medical tasks compared to other
open-source medical LLMs, rendering it an attractive choice for medical AI
applications. We release our models, datasets, and evaluation scripts at:
https://github.com/BIDS-Xu-Lab/Me-LLaMA.
| Search Query: ArXiv Query: search_query=au:"Cheng Peng"&id_list=&start=0&max_results=3