Kavli Affiliate: Ran Wang
| First 5 Authors: Brian Shing-Hei Wong, Brian Shing-Hei Wong, , ,
| Summary:
Allergens, typically proteins capable of triggering adverse immune responses,
represent a significant public health challenge. To accurately identify
allergen proteins, we introduce Applm (Allergen Prediction with Protein
Language Models), a computational framework that leverages the 100-billion
parameter xTrimoPGLM protein language model. We show that Applm consistently
outperforms seven state-of-the-art methods in a diverse set of tasks that
closely resemble difficult real-world scenarios. These include identifying
novel allergens that lack similar examples in the training set, differentiating
between allergens and non-allergens among homologs with high sequence
similarity, and assessing functional consequences of mutations that create few
changes to the protein sequences. Our analysis confirms that xTrimoPGLM,
originally trained on one trillion tokens to capture general protein sequence
characteristics, is crucial for Applm’s performance by detecting important
differences among protein sequences. In addition to providing Applm as
open-source software, we also provide our carefully curated benchmark datasets
to facilitate future research.
| Search Query: ArXiv Query: search_query=au:”Ran Wang”&id_list=&start=0&max_results=3