Artificial intelligence redefines RNA virus discovery

Kavli Affiliate: Li Zhao

Authors: Xin Hou, Yong He, Pan Fang, Shi-Qiang Mei, Zan Xu, Wei-Chen Wu, Jun-Hua Tian, Shun Zhang, Zhen-Yu Zeng, Qin-Yu Gou, Gen-Yang Xin, Shi-Jia Le, Yin-Yue Xia, Yu-Lan Zhou, Feng-Ming Hui, Yuan-Fei Pan, John-Sebastian Eden, Zhao-Hui Yang, Chong Han, Yue-Long Shu, Deyin Guo, Jun Li, Edward C Holmes, Zhao-Rong Li and Mang Shi

Summary:

RNA viruses are diverse components of global ecosystems. The metagenomic identification of RNA viruses is currently limited to those with sequence similarity to known viruses, such that highly divergent viruses that comprise the “dark matter” of the virosphere remain challenging to detect. We developed a deep learning algorithm – LucaProt – to search for highly divergent RNA-dependent RNA polymerase (RdRP) sequences in 10,487 global meta-transcriptomes. LucaProt integrates both sequence and structural information to accurately and efficiently detect RdRP sequences. With this approach we identified 180,571 RNA viral species and 180 superclades (viral phyla/classes). This is the broadest diversity of RNA viruses described to date, including many viruses undetectable using BLAST or HMM approaches. The newly identified RNA viruses were present in diverse ecological niches, including the air, hot springs and hydrothermal vents, and both virus diversity and abundance varied substantially among ecological types. We also identified the longest RNA virus genome (nido-like) observed so far, at 47,250 nucleotides, and expanded the diversity of RNA bacteriophage to more than ten phyla/classes. This study marks the beginning of a new era of virus discovery, with the potential to redefine our understanding of the global virosphere and reshape our understanding of virus evolutionary history.
