Using artificial intelligence to document the hidden RNA virosphere

Kavli Affiliate: Li Zhao

Authors: Xin Hou, Yong He, Pan Fang, Shi-Qiang Mei, Zan Xu, Wei-Chen Wu, Jun-Hua Tian, Shun Zhang, Zhen-Yu Zeng, Qin-Yu Gou, Gen-Yang Xin, Shi-Jia Le, Yin-Yue Xia, Yu-Lan Zhou, Feng-Ming Hui, Yuan-Fei Pan, John-Sebastian Eden, Zhao-Hui Yang, Chong Han, Yue-Long Shu, Deyin Guo, Jun Li, Edward C Holmes, Zhao-Rong Li and Mang Shi

Summary:

RNA viruses are diverse and abundant components of global ecosystems. The metagenomic identification of RNA viruses is currently limited to those that exhibit sequence similarity to known viruses, so detecting highly divergent viruses remains challenging. We developed a deep learning algorithm, termed LucaProt, to identify highly divergent RNA-dependent RNA polymerase (RdRP) sequences in 10,487 metatranscriptomes from diverse global ecosystems. LucaProt integrates both sequence and structural information to accurately and efficiently detect RdRP sequences. With this approach we identified 161,979 putative RNA virus species and 180 RNA virus supergroups, of which only 21 contained members of phyla or classes currently defined by the International Committee on Taxonomy of Viruses. These supergroups include many groups that were either undescribed or poorly characterized in previous studies. The newly identified RNA viruses were present in diverse ecological settings, including the air, hot springs and hydrothermal vents, and both virus diversity and abundance varied substantially among ecosystems. We also identified the longest RNA virus genome documented to date, that of a nido-like virus, at 47,250 nucleotides. This study marks the beginning of a new era of virus discovery, providing computational tools that will help expand our understanding of the global RNA virosphere and of virus evolution.
