Kavli Affiliate: Cheng Peng
| First 5 Authors: Fuchuan Qu, Cheng Peng, Jiaojiao Guan, Donglin Wang, Yanni Sun
| Summary:
Motivation: Nucleocytoplasmic large DNA viruses (NCLDVs) are notable for
their large genomes and extensive gene repertoires, which contribute to their
widespread environmental presence and critical roles in processes such as host
metabolic reprogramming and nutrient cycling. Metagenomic sequencing has
emerged as a powerful tool for uncovering novel NCLDVs in environmental
samples. However, identifying NCLDV sequences in metagenomic data remains
challenging due to their high genomic diversity, limited reference genomes, and
shared regions with other microbes. Existing alignment-based and
machine-learning methods struggle to balance sensitivity
and precision. Results: In this work, we present GiantHunter, a reinforcement
learning-based tool for identifying NCLDVs from metagenomic data. By employing
a Monte Carlo tree search strategy, GiantHunter dynamically selects
representative non-NCLDV sequences as the negative training data, enabling the
model to establish a robust decision boundary. Benchmarking on rigorously
designed experiments shows that GiantHunter achieves high precision while
maintaining competitive sensitivity, improving the F1-score by 10% and reducing
computational cost by 90% compared to the second-best method. To demonstrate
its real-world utility, we applied GiantHunter to 60 metagenomic datasets
collected from six cities along the Yangtze River, located both upstream and
downstream of the Three Gorges Dam. The results reveal significant differences
in NCLDV diversity that correlate with proximity to the dam, likely reflecting
the reduced flow velocity it causes. These findings highlight the potential
of GiantHunter to advance our understanding of NCLDVs and their ecological
roles in diverse environments.
| Search Query: ArXiv Query: search_query=au:"Cheng Peng"&id_list=&start=0&max_results=3