ViralQC: A Tool for Assessing Completeness and Contamination of Predicted Viral Contigs

Kavli Affiliate: Cheng Peng

| First 5 Authors: Cheng Peng, Jiayu Shang, Jiaojiao Guan, Yanni Sun

| Summary:

Motivation: Viruses represent the most abundant biological entities on the
planet and play vital roles in diverse ecosystems. Cataloging viruses across
various environments is essential for understanding their properties and
functions. Metagenomic sequencing has emerged as the most comprehensive method
for virus discovery, enabling the sequencing of all genetic materials,
including viruses, from host or environmental samples. However, distinguishing
viral sequences from the vast background of cellular organism-derived reads in
metagenomic data remains a significant challenge. While several learning-based
tools, such as VirSorter2 and geNomad, have shown promise in identifying viral
contigs, they still produce false positives at varying rates because of
noise in sequencing and assembly, genes shared between viruses and their hosts,
and the formation of proviruses within host genomes. This highlights the urgent
need for an accurate and efficient method to evaluate the quality of viral
contigs.

Results: To address these challenges, we introduce ViralQC, a tool
designed to assess the quality of reported viral contigs or bins. ViralQC
identifies contamination regions within putative viral sequences using
foundation models trained on viral and cellular genomes and estimates viral
completeness through protein organization alignment. We evaluate ViralQC on
multiple datasets and compare its performance against CheckV, the
state-of-the-art in virus quality assessment. Notably, ViralQC correctly
identifies 38% more contamination than CheckV, while maintaining a median
absolute error of only 3%. In addition, ViralQC delivers more accurate results
for medium- to high-quality (>50% completeness) contigs, demonstrating its
superior performance in completeness estimation.
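
The summary reports a median absolute error but does not spell out how it is computed. As a minimal sketch, assuming the usual definition (the median of the absolute differences between estimated and reference percentages), the calculation might look like the following. The function names and the toy values are hypothetical illustrations; they are not taken from ViralQC or from the paper's benchmark data.

```python
from statistics import median


def median_absolute_error(reference_pct, estimated_pct):
    """Median of |reference - estimate| over paired percentage values.

    Assumed definition for illustration; the paper may define it differently.
    """
    if len(reference_pct) != len(estimated_pct) or not reference_pct:
        raise ValueError("inputs must be non-empty and of equal length")
    return median(abs(r - e) for r, e in zip(reference_pct, estimated_pct))


def is_medium_to_high_quality(completeness_pct):
    """Apply the >50% completeness threshold mentioned in the summary."""
    return completeness_pct > 50


# Hypothetical toy values (percent), NOT results from the paper.
reference = [95.0, 80.0, 60.0, 40.0, 20.0]
estimated = [93.0, 84.0, 59.0, 46.0, 20.0]

overall_error = median_absolute_error(reference, estimated)

# Restrict the comparison to contigs whose reference completeness exceeds 50%.
pairs = [(r, e) for r, e in zip(reference, estimated) if is_medium_to_high_quality(r)]
subset_error = median_absolute_error([r for r, _ in pairs], [e for _, e in pairs])

print(f"median absolute error, all contigs: {overall_error}")
print(f"median absolute error, >50% complete: {subset_error}")
```

Using the median rather than the mean keeps a few badly misestimated contigs from dominating the summary figure, which is one plausible reason to report it for this kind of benchmark.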

| Search Query: ArXiv Query: search_query=au:"Cheng Peng"&id_list=&start=0&max_results=3
