Automated assembly of high-quality diploid human reference genomes

Kavli Affiliate: Erich Jarvis

| Authors: Erich D. Jarvis, Giulio Formenti, Arang Rhie, Andrea Guarracino, Chentao Yang, Jonathan Wood, Alan Tracey, Francoise Thibaud-Nissen, Mitchell R Vollger, David Porubsky, Haoyu Cheng, Mobin Asri, Glennis A Logsdon, Paolo Carnevali, Mark Chaisson, Chen-Shan Chin, Sarah Cody, Joanna Collins, Peter Ebert, Merly Escalona, Olivier Fedrigo, Robert S Fulton, Lucinda L Fulton, Shilpa Garg, Jay Ghurye, Edward Green, Ira M Hall, William H Harvey, Patrick Hasenfeld, Alex Hastie, Marina Haukness, Miten Jain, Melanie Kirsche, Mikhail Kolmogorov, Jan O Korbel, Sergey Koren, Jonas Korlach, Joyce Lee, Daofeng Li, Tina Lindsay, Julian Lucas, Feng Luo, Tobias Marschall, Jennifer McDaniel, Fan Nie, Hugh E Olsen, Nathan Olson, Trevor Pesout, Daniela Puiu, Allison Regier, Jue Ruan, Steven L Salzberg, Ashley D Sanders, Michael C Schatz, Anthony Schmitt, Valerie A Schneider, Siddarth Selvaraj, Kishwar Shafin, Alaina Shumate, Catherine Stober, James Torrance, Justin Wagner, Jianxin Wang, Aaron Wenger, Chuanle Xiao, Aleksey V Zimin, Guojie Zhang, Ting Wang, Heng Li, Erik Garrison, David Haussler, Justin M Zook, Evan E Eichler, Adam M Phillippy, Benedict Paten, Kerstin Howe, Karen H Miga and Human Pangenome Reference Consortium

| Summary:

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has greatly benefited society. However, it still has many gaps and errors, and it does not represent a biological human genome because it is a blend of multiple individuals. Recently, a high-quality telomere-to-telomere reference genome, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a duplicate genome and is thus nearly homozygous. To address these limitations, the Human Pangenome Reference Consortium (HPRC) was recently formed with the goal of creating a collection of high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity. Here, in our first scientific report, we determined which combination of current genome sequencing and automated assembly approaches yields the most complete, accurate, and cost-effective diploid genome assemblies with minimal manual curation. Approaches that used highly accurate long reads and parent-child data to sort haplotypes during assembly outperformed those that did not. Developing a combination of all the top-performing methods, we generated our first high-quality diploid reference assembly, containing only ~4 gaps (range 0-12) per chromosome, with most chromosomes within ±1% of CHM13 length. Nearly 1/4th of protein-coding genes have synonymous amino acid changes between haplotypes, and centromeric regions showed the highest density of variation. Our findings serve as a foundation for assembling near-complete diploid human genomes at the scale required for constructing a human pangenome reference that captures all genetic variation from single nucleotides to large structural rearrangements.
