Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus

Kavli Affiliate: Li Zhao

| Authors: Yuan-Fei Pan, Yong He, Yu-Qi Liu, Yong-Tao Shan, Shu-Ning Liu, Xue Liu, Xiaoyun Pan, Yinqi Bai, Zan Xu, Zheng Wang, Jieping Ye, Edward C. Holmes, Bo Li, Yao-Qing Chen, Zhao-Rong Li and Mang Shi

| Summary:

Predicting the evolution and function of viruses is a fundamental biological challenge, largely due to high levels of sequence divergence and the limited knowledge available in comparison to cellular organisms. To address this, we present LucaVirus, a unified, multi-modal foundation model specifically designed for viruses. Trained on 25.4 billion nucleotide and amino acid tokens encompassing nearly all known viruses, LucaVirus learns biologically meaningful representations that capture the relationships between nucleotide and amino acid sequences, protein/gene homology, and evolutionary divergence. Building on these interpretable embeddings, we developed specialized downstream models to address key challenges in virology: (i) identifying viruses hidden within genomic “dark matter”, (ii) characterizing enzymatic activities of unknown proteins, (iii) predicting viral evolvability, and (iv) discovering antibody drugs for emerging viruses. LucaVirus achieves state-of-the-art performance in tasks (i), (iii), and (iv), and matches the leading models in task (ii) with one-third of their parameter count. These findings demonstrate the power of a unified foundation model to comprehensively decode the viral world. LucaVirus is a new tool in AI-driven virology, offering an efficient and versatile platform for broad applications from virus discovery to functional prediction.
