Search | VHL Regional Portal

1.

The complete sequence and comparative analysis of ape sex chromosomes.

Makova, Kateryna D; Pickett, Brandon D; Harris, Robert S; Hartley, Gabrielle A; Cechova, Monika; Pal, Karol; Nurk, Sergey; Yoo, DongAhn; Li, Qiuhui; Hebbar, Prajna; McGrath, Barbara C; Antonacci, Francesca; Aubel, Margaux; Biddanda, Arjun; Borchers, Matthew; Bornberg-Bauer, Erich; Bouffard, Gerard G; Brooks, Shelise Y; Carbone, Lucia; Carrel, Laura; Carroll, Andrew; Chang, Pi-Chuan; Chin, Chen-Shan; Cook, Daniel E; Craig, Sarah J C; de Gennaro, Luciana; Diekhans, Mark; Dutra, Amalia; Garcia, Gage H; Grady, Patrick G S; Green, Richard E; Haddad, Diana; Hallast, Pille; Harvey, William T; Hickey, Glenn; Hillis, David A; Hoyt, Savannah J; Jeong, Hyeonsoo; Kamali, Kaivan; Pond, Sergei L Kosakovsky; LaPolice, Troy M; Lee, Charles; Lewis, Alexandra P; Loh, Yong-Hwee E; Masterson, Patrick; McGarvey, Kelly M; McCoy, Rajiv C; Medvedev, Paul; Miga, Karen H; Munson, Katherine M.

Nature ; 2024 May 29.

Article in English | MEDLINE | ID: mdl-38811727

ABSTRACT

Apes possess two sex chromosomes: the male-specific Y chromosome and the X chromosome, which is present in both males and females. The Y chromosome is crucial for male reproduction, with deletions being linked to infertility1. The X chromosome is vital for reproduction and cognition2. Variation in mating patterns and brain function among apes suggests corresponding differences in their sex chromosomes. However, owing to their repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the methodology developed for the telomere-to-telomere (T2T) human genome, we produced gapless assemblies of the X and Y chromosomes for five great apes (bonobo (Pan paniscus), chimpanzee (Pan troglodytes), western lowland gorilla (Gorilla gorilla gorilla), Bornean orangutan (Pongo pygmaeus) and Sumatran orangutan (Pongo abelii)) and a lesser ape (the siamang gibbon (Symphalangus syndactylus)), and untangled the intricacies of their evolution. Compared with the X chromosomes, the ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements, owing to the accumulation of lineage-specific ampliconic regions, palindromes, transposable elements and satellites. Many Y chromosome genes expand in multi-copy families and some evolve under purifying selection. Thus, the Y chromosome exhibits dynamic evolution, whereas the X chromosome is more stable. Mapping short-read sequencing data to these assemblies revealed diversity and selection patterns on sex chromosomes of more than 100 individual great apes. These reference assemblies are expected to inform human evolution and conservation genetics of non-human apes, all of which are endangered species.

2.

Structural and genetic diversity in the secreted mucins, MUC5AC and MUC5B.

Plender, Elizabeth G; Prodanov, Timofey; Hsieh, PingHsun; Nizamis, Evangelos; Harvey, William T; Sulovari, Arvis; Munson, Katherine M; Kaufman, Eli J; O'Neal, Wanda K; Valdmanis, Paul N; Marschall, Tobias; Bloom, Jesse D; Eichler, Evan E.

bioRxiv ; 2024 Mar 20.

Article in English | MEDLINE | ID: mdl-38562829

ABSTRACT

The secreted mucins MUC5AC and MUC5B play critical defensive roles in airway pathogen entrapment and mucociliary clearance by encoding large glycoproteins with variable number tandem repeats (VNTRs). These polymorphic and degenerate protein coding VNTRs make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5761-5762aa); however, seven haplotypes have expanded VNTRs (6291-7019aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5249-6325aa) with cysteine-rich domain and VNTR copy number variation. We grouped MUC5AC alleles into three phylogenetic clades: H1 (46%, ~5654aa), H2 (33%, ~5742aa), and H3 (7%, ~6325aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium (LD) and Tajima's D analyses reveal that East Asians carry exceptionally large MUC5AC LD blocks with an excess of rare variation (p<0.05). To validate this result, we used Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observed signatures of positive selection in H1 and H2 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Africans and Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium, consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein coding VNTRs for improved disease associations.
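As an illustration of the Hardy-Weinberg comparison mentioned in this abstract, the sketch below computes expected genotype counts from observed counts at a biallelic locus and a chi-square statistic for the deviation. It is a minimal, generic example and not the authors' analysis; the function name and the genotype counts are hypothetical and are not taken from the study.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Return (chi2, expected_counts) for a biallelic locus.

    n_aa, n_ab, n_bb are observed genotype counts (hypothetical inputs).
    """
    n = n_aa + n_ab + n_bb
    # Allele frequency of A estimated from the observed genotypes.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    # Expected counts under Hardy-Weinberg proportions p^2 : 2pq : q^2.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, expected


# Hypothetical counts with more heterozygotes than expected.
chi2, expected = hwe_chi_square(30, 60, 10)
print(round(chi2, 2), [round(e, 1) for e in expected])
```

With one degree of freedom, a chi-square value above 3.84 corresponds to p < 0.05. A heterozygote count above its expected value, as in these made-up numbers, is the direction of deviation that would be consistent with heterozygote advantage.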

3.

The variation and evolution of complete human centromeres.

Logsdon, Glennis A; Rozanski, Allison N; Ryabov, Fedor; Potapova, Tamara; Shepelev, Valery A; Catacchio, Claudia R; Porubsky, David; Mao, Yafei; Yoo, DongAhn; Rautiainen, Mikko; Koren, Sergey; Nurk, Sergey; Lucas, Julian K; Hoekzema, Kendra; Munson, Katherine M; Gerton, Jennifer L; Phillippy, Adam M; Ventura, Mario; Alexandrov, Ivan A; Eichler, Evan E.

Nature ; 629(8010): 136-145, 2024 May.

Article in English | MEDLINE | ID: mdl-38570684

ABSTRACT

Human centromeres have been traditionally very difficult to sequence and assemble owing to their repetitive nature and large size1. As a result, patterns of human centromeric variation and models for their evolution and function remain incomplete, despite centromeres being among the most rapidly mutating regions2,3. Here, using long-read sequencing, we completely sequenced and assembled all centromeres from a second human genome and compared it to the finished reference genome4,5. We find that the two sets of centromeres show at least a 4.1-fold increase in single-nucleotide variation when compared with their unique flanks and vary up to 3-fold in size. Moreover, we find that 45.8% of centromeric sequence cannot be reliably aligned using standard methods owing to the emergence of new α-satellite higher-order repeats (HORs). DNA methylation and CENP-A chromatin immunoprecipitation experiments show that 26% of the centromeres differ in their kinetochore position by >500 kb. To understand evolutionary change, we selected six chromosomes and sequenced and assembled 31 orthologous centromeres from the common chimpanzee, orangutan and macaque genomes. Comparative analyses reveal a nearly complete turnover of α-satellite HORs, with characteristic idiosyncratic changes in α-satellite HORs for each species. Phylogenetic reconstruction of human haplotypes supports limited to no recombination between the short (p) and long (q) arms across centromeres and reveals that novel α-satellite HORs share a monophyletic origin, providing a strategy to estimate the rate of saltatory amplification and mutation of human centromeric DNA.

Subject(s)

Centromere , Evolution, Molecular , Genetic Variation , Animals , Humans , Centromere/genetics , Centromere/metabolism , Centromere Protein A/metabolism , DNA Methylation/genetics , DNA, Satellite/genetics , Kinetochores/metabolism , Macaca/genetics , Pan troglodytes/genetics , Polymorphism, Single Nucleotide/genetics , Pongo/genetics , Male , Female , Reference Standards , Chromatin Immunoprecipitation , Haplotypes , Mutation , Gene Amplification , Sequence Alignment , Chromatin/genetics , Chromatin/metabolism , Species Specificity

4.

Independent expansion, selection and hypervariability of the TBC1D3 gene family in humans.

Guitart, Xavi; Porubsky, David; Yoo, DongAhn; Dougherty, Max L; Dishuck, Philip C; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Knuth, Jordan; Chang, Stephen; Pastinen, Tomi; Eichler, Evan E.

bioRxiv ; 2024 Mar 13.

Article in English | MEDLINE | ID: mdl-38654825

ABSTRACT

TBC1D3 is a primate-specific gene family that has expanded in the human lineage and has been implicated in neuronal progenitor proliferation and expansion of the frontal cortex. The gene family and its expression have been challenging to investigate because it is embedded in high-identity and highly variable segmental duplications. We sequenced and assembled the gene family using long-read sequencing data from 34 humans and 11 nonhuman primate species. Our analysis shows that this particular gene family has independently duplicated in at least five primate lineages, and the duplicated loci are enriched at sites of large-scale chromosomal rearrangements on chromosome 17. We find that most humans vary along two TBC1D3 clusters where human haplotypes are highly variable in copy number, differing by as many as 20 copies, and structure (structural heterozygosity 90%). We also show evidence of positive selection, as well as a significant change in the predicted human TBC1D3 protein sequence. Lastly, we find that, despite multiple duplications, human TBC1D3 expression is limited to a subset of copies and, most notably, from a single paralog group: TBC1D3-CDKL. These observations may help explain why a gene potentially important in cortical development can be so variable in the human population.

5.

Structurally divergent and recurrently mutated regions of primate genomes.

Mao, Yafei; Harvey, William T; Porubsky, David; Munson, Katherine M; Hoekzema, Kendra; Lewis, Alexandra P; Audano, Peter A; Rozanski, Allison; Yang, Xiangyu; Zhang, Shilong; Yoo, DongAhn; Gordon, David S; Fair, Tyler; Wei, Xiaoxi; Logsdon, Glennis A; Haukness, Marina; Dishuck, Philip C; Jeong, Hyeonsoo; Del Rosario, Ricardo; Bauer, Vanessa L; Fattor, Will T; Wilkerson, Gregory K; Mao, Yuxiang; Shi, Yongyong; Sun, Qiang; Lu, Qing; Paten, Benedict; Bakken, Trygve E; Pollen, Alex A; Feng, Guoping; Sawyer, Sara L; Warren, Wesley C; Carbone, Lucia; Eichler, Evan E.

Cell ; 187(6): 1547-1562.e13, 2024 Mar 14.

Article in English | MEDLINE | ID: mdl-38428424

ABSTRACT

We sequenced and assembled using multiple long-read sequencing technologies the genomes of chimpanzee, bonobo, gorilla, orangutan, gibbon, macaque, owl monkey, and marmoset. We identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. We estimate that 819.47 Mbp or ~27% of the genome has been affected by SVs across primate evolution. We identify 1,607 structurally divergent regions wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (e.g., CARD, C4, and OLAH gene families) and additional lineage-specific genes are generated (e.g., CKAP2, VPS36, ACBD7, and NEK5 paralogs), becoming targets of rapid chromosomal diversification and positive selection (e.g., RGPD gene family). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species.

Subject(s)

Genome , Primates , Animals , Humans , Base Sequence , Primates/classification , Primates/genetics , Biological Evolution , Sequence Analysis, DNA , Genomic Structural Variation

6.

Complete chromosome 21 centromere sequences from a Down syndrome family reveal size asymmetry and differences in kinetochore attachment.

Mastrorosa, F Kumara; Rozanski, Allison N; Harvey, William T; Knuth, Jordan; Garcia, Gage; Munson, Katherine M; Hoekzema, Kendra; Logsdon, Glennis A; Eichler, Evan E.

bioRxiv ; 2024 Feb 26.

Article in English | MEDLINE | ID: mdl-38464314

ABSTRACT

Down syndrome is the most common form of human intellectual disability caused by precocious segregation and nondisjunction of chromosome 21. Differences in centromere structure have been hypothesized to play a potential role in this process in addition to the well-established risk of advancing maternal age. Using long-read sequencing, we completely sequenced and assembled the centromeres from a parent-child trio where Trisomy 21 arose in the child as a result of a meiosis I error. The proband carries three distinct chromosome 21 centromere haplotypes that vary by 11-fold in length--both the largest (H1) and smallest (H2) originating from the mother. The longest H1 allele harbors a less clearly defined centromere dip region (CDR) as defined by CpG methylation and a significantly reduced signal by CENP-A chromatin immunoprecipitation sequencing when compared to H2 or paternal H3 centromeres. These epigenetic signatures suggest less competent kinetochore attachment for the maternally transmitted H1. Analysis of H1 in the mother indicates that the reduced CENP-A ChIP-seq signal, but not the CDR profile, pre-existed the meiotic nondisjunction event. A comparison of the three proband centromeres to a population sampling of 35 completely sequenced chromosome 21 centromeres shows that H2 is the smallest centromere sequenced to date and all three haplotypes (H1-H3) share a common origin of ~15 thousand years ago. These results suggest that recent asymmetry in size and epigenetic differences of chromosome 21 centromeres may contribute to nondisjunction risk.

7.

The Complete Sequence and Comparative Analysis of Ape Sex Chromosomes.

Makova, Kateryna D; Pickett, Brandon D; Harris, Robert S; Hartley, Gabrielle A; Cechova, Monika; Pal, Karol; Nurk, Sergey; Yoo, DongAhn; Li, Qiuhui; Hebbar, Prajna; McGrath, Barbara C; Antonacci, Francesca; Aubel, Margaux; Biddanda, Arjun; Borchers, Matthew; Bornberg-Bauer, Erich; Bouffard, Gerard G; Brooks, Shelise Y; Carbone, Lucia; Carrel, Laura; Carroll, Andrew; Chang, Pi-Chuan; Chin, Chen-Shan; Cook, Daniel E; Craig, Sarah J C; de Gennaro, Luciana; Diekhans, Mark; Dutra, Amalia; Garcia, Gage H; Grady, Patrick G S; Green, Richard E; Haddad, Diana; Hallast, Pille; Harvey, William T; Hickey, Glenn; Hillis, David A; Hoyt, Savannah J; Jeong, Hyeonsoo; Kamali, Kaivan; Kosakovsky Pond, Sergei L; LaPolice, Troy M; Lee, Charles; Lewis, Alexandra P; Loh, Yong-Hwee E; Masterson, Patrick; McCoy, Rajiv C; Medvedev, Paul; Miga, Karen H; Munson, Katherine M; Pak, Evgenia.

bioRxiv ; 2023 Dec 01.

Article in English | MEDLINE | ID: mdl-38077089

ABSTRACT

Apes possess two sex chromosomes: the male-specific Y and the X shared by males and females. The Y chromosome is crucial for male reproduction, with deletions linked to infertility. The X chromosome carries genes vital for reproduction and cognition. Variation in mating patterns and brain function among great apes suggests corresponding differences in their sex chromosome structure and evolution. However, due to their highly repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the state-of-the-art experimental and computational methods developed for the telomere-to-telomere (T2T) human genome, we produced gapless, complete assemblies of the X and Y chromosomes for five great apes (chimpanzee, bonobo, gorilla, Bornean and Sumatran orangutans) and a lesser ape, the siamang gibbon. These assemblies completely resolved ampliconic, palindromic, and satellite sequences, including the entire centromeres, allowing us to untangle the intricacies of ape sex chromosome evolution. We found that, compared to the X, ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements. This divergence on the Y arises from the accumulation of lineage-specific ampliconic regions and palindromes (which are shared more broadly among species on the X) and from the abundance of transposable elements and satellites (which have a lower representation on the X). Our analysis of Y chromosome genes revealed lineage-specific expansions of multi-copy gene families and signatures of purifying selection. In summary, the Y exhibits dynamic evolution, while the X is more stable. Finally, mapping short-read sequencing data from >100 great ape individuals revealed the patterns of diversity and selection on their sex chromosomes, demonstrating the utility of these reference assemblies for studies of great ape evolution. These complete sex chromosome assemblies are expected to further inform conservation genetics of nonhuman apes, all of which are endangered species.

8.

Synchronized long-read genome, methylome, epigenome, and transcriptome for resolving a Mendelian condition.

Vollger, Mitchell R; Korlach, Jonas; Eldred, Kiara C; Swanson, Elliott; Underwood, Jason G; Cheng, Yong-Han H; Ranchalis, Jane; Mao, Yizi; Blue, Elizabeth E; Schwarze, Ulrike; Munson, Katherine M; Saunders, Christopher T; Wenger, Aaron M; Allworth, Aimee; Chanprasert, Sirisak; Duerden, Brittney L; Glass, Ian; Horike-Pyne, Martha; Kim, Michelle; Leppig, Kathleen A; McLaughlin, Ian J; Ogawa, Jessica; Rosenthal, Elisabeth A; Sheppeard, Sam; Sherman, Stephanie M; Strohbehn, Samuel; Yuen, Amy L; Reh, Thomas A; Byers, Peter H; Bamshad, Michael J; Hisama, Fuki M; Jarvik, Gail P; Sancak, Yasemin; Dipple, Katrina M; Stergachis, Andrew B.

bioRxiv ; 2023 Sep 27.

Article in English | MEDLINE | ID: mdl-37808736

ABSTRACT

Resolving the molecular basis of a Mendelian condition (MC) remains challenging owing to the diverse mechanisms by which genetic variants cause disease. To address this, we developed a synchronized long-read genome, methylome, epigenome, and transcriptome sequencing approach, which enables accurate single-nucleotide, insertion-deletion, and structural variant calling and diploid de novo genome assembly, and permits the simultaneous elucidation of haplotype-resolved CpG methylation, chromatin accessibility, and full-length transcript information in a single long-read sequencing run. Application of this approach to an Undiagnosed Diseases Network (UDN) participant with a chromosome X;13 balanced translocation of uncertain significance revealed that this translocation disrupted the functioning of four separate genes (NBEA, PDK3, MAB21L1, and RB1) previously associated with single-gene MCs. Notably, the function of each gene was disrupted via a distinct mechanism that required integration of the four 'omes' to resolve. These included nonsense-mediated decay, fusion transcript formation, enhancer adoption, transcriptional readthrough silencing, and inappropriate X chromosome inactivation of autosomal genes. Overall, this highlights the utility of synchronized long-read multi-omic profiling for mechanistically resolving complex phenotypes.

9.

Assembly of 43 human Y chromosomes reveals extensive complexity and variation.

Hallast, Pille; Ebert, Peter; Loftus, Mark; Yilmaz, Feyza; Audano, Peter A; Logsdon, Glennis A; Bonder, Marc Jan; Zhou, Weichen; Höps, Wolfram; Kim, Kwondo; Li, Chong; Hoyt, Savannah J; Dishuck, Philip C; Porubsky, David; Tsetsos, Fotios; Kwon, Jee Young; Zhu, Qihui; Munson, Katherine M; Hasenfeld, Patrick; Harvey, William T; Lewis, Alexandra P; Kordosky, Jennifer; Hoekzema, Kendra; O'Neill, Rachel J; Korbel, Jan O; Tyler-Smith, Chris; Eichler, Evan E; Shi, Xinghua; Beck, Christine R; Marschall, Tobias; Konkel, Miriam K; Lee, Charles.

Nature ; 621(7978): 355-364, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37612510

ABSTRACT

The prevalence of highly repetitive sequences within the human Y chromosome has prevented its complete assembly to date1 and led to its systematic omission from genomic analyses. Here we present de novo assemblies of 43 Y chromosomes spanning 182,900 years of human evolution and report considerable diversity in size and structure. Half of the male-specific euchromatic region is subject to large inversions with a greater than twofold higher recurrence rate compared with all other chromosomes2. Ampliconic sequences associated with these inversions show differing mutation rates that are sequence context dependent, and some ampliconic genes exhibit evidence for concerted evolution with the acquisition and purging of lineage-specific pseudogenes. The largest heterochromatic region in the human genome, Yq12, is composed of alternating repeat arrays that show extensive variation in the number, size and distribution, but retain a 1:1 copy-number ratio. Finally, our data suggest that the boundary between the recombining pseudoautosomal region 1 and the non-recombining portions of the X and Y chromosomes lies 500 kb away from the currently established1 boundary. The availability of fully sequence-resolved Y chromosomes from multiple individuals provides a unique opportunity for identifying new associations of traits with specific Y-chromosomal variants and garnering insights into the evolution and function of complex regions of the human genome.

Subject(s)

Chromosomes, Human, Y , Evolution, Molecular , Humans , Male , Chromosomes, Human, Y/genetics , Genome, Human/genetics , Genomics , Mutation Rate , Phenotype , Euchromatin/genetics , Pseudogenes , Genetic Variation/genetics , Chromosomes, Human, X/genetics , Pseudoautosomal Regions/genetics

10.

The variation and evolution of complete human centromeres.

Logsdon, Glennis A; Rozanski, Allison N; Ryabov, Fedor; Potapova, Tamara; Shepelev, Valery A; Mao, Yafei; Rautiainen, Mikko; Koren, Sergey; Nurk, Sergey; Porubsky, David; Lucas, Julian K; Hoekzema, Kendra; Munson, Katherine M; Gerton, Jennifer L; Phillippy, Adam M; Alexandrov, Ivan A; Eichler, Evan E.

bioRxiv ; 2023 May 30.

Article in English | MEDLINE | ID: mdl-37398417

ABSTRACT

We completely sequenced and assembled all centromeres from a second human genome and used two reference sets to benchmark genetic, epigenetic, and evolutionary variation within centromeres from a diversity panel of humans and apes. We find that centromere single-nucleotide variation can increase by up to 4.1-fold relative to other genomic regions, with the caveat that up to 45.8% of centromeric sequence, on average, cannot be reliably aligned with current methods due to the emergence of new α-satellite higher-order repeat (HOR) structures and two to threefold differences in the length of the centromeres. The extent to which this occurs differs depending on the chromosome and haplotype. Comparing the two sets of complete human centromeres, we find that eight harbor distinctly different α-satellite HOR array structures and four contain novel α-satellite HOR variants in high abundance. DNA methylation and CENP-A chromatin immunoprecipitation experiments show that 26% of the centromeres differ in their kinetochore position by at least 500 kbp, a property not readily associated with novel α-satellite HORs. To understand evolutionary change, we selected six chromosomes and sequenced and assembled 31 orthologous centromeres from the common chimpanzee, orangutan, and macaque genomes. Comparative analyses reveal nearly complete turnover of α-satellite HORs, but with idiosyncratic changes in structure characteristic to each species. Phylogenetic reconstruction of human haplotypes supports limited to no recombination between the p- and q-arms of human chromosomes and reveals that novel α-satellite HORs share a monophyletic origin, providing a strategy to estimate the rate of saltatory amplification and mutation of human centromeric DNA.

11.

Increased mutation and gene conversion within human segmental duplications.

Vollger, Mitchell R; Dishuck, Philip C; Harvey, William T; DeWitt, William S; Guitart, Xavi; Goldberg, Michael E; Rozanski, Allison N; Lucas, Julian; Asri, Mobin; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Logsdon, Glennis A; Porubsky, David; Paten, Benedict; Harris, Kelley; Hsieh, PingHsun; Eichler, Evan E.

Nature ; 617(7960): 325-334, 2023 May.

Article in English | MEDLINE | ID: mdl-37165237

ABSTRACT

Single-nucleotide variants (SNVs) in segmental duplications (SDs) have not been systematically assessed because of the limitations of mapping short-read sequencing data1,2. Here we constructed 1:1 unambiguous alignments spanning high-identity SDs across 102 human haplotypes and compared the pattern of SNVs between unique and duplicated regions3,4. We find that human SNVs are elevated 60% in SDs compared to unique regions and estimate that at least 23% of this increase is due to interlocus gene conversion (IGC) with up to 4.3 megabase pairs of SD sequence converted on average per human haplotype. We develop a genome-wide map of IGC donors and acceptors, including 498 acceptor and 454 donor hotspots affecting the exons of about 800 protein-coding genes. These include 171 genes that have 'relocated' on average 1.61 megabase pairs in a subset of human haplotypes. Using a coalescent framework, we show that SD regions are slightly evolutionarily older when compared to unique sequences, probably owing to IGC. SNVs in SDs, however, show a distinct mutational spectrum: a 27.1% increase in transversions that convert cytosine to guanine or the reverse across all triplet contexts and a 7.6% reduction in the frequency of CpG-associated mutations when compared to unique DNA. We reason that these distinct mutational properties help to maintain an overall higher GC content of SD DNA compared to that of unique DNA, probably driven by GC-biased conversion between paralogous sequences5,6.

Subject(s)

Gene Conversion , Mutation , Segmental Duplications, Genomic , Humans , Gene Conversion/genetics , Genome, Human/genetics , Polymorphism, Single Nucleotide/genetics , Haplotypes/genetics , Exons/genetics , Cytosine/chemistry , Guanine/chemistry , CpG Islands/genetics

12.

A draft human pangenome reference.

Liao, Wen-Wei; Asri, Mobin; Ebler, Jana; Doerr, Daniel; Haukness, Marina; Hickey, Glenn; Lu, Shuangjia; Lucas, Julian K; Monlong, Jean; Abel, Haley J; Buonaiuto, Silvia; Chang, Xian H; Cheng, Haoyu; Chu, Justin; Colonna, Vincenza; Eizenga, Jordan M; Feng, Xiaowen; Fischer, Christian; Fulton, Robert S; Garg, Shilpa; Groza, Cristian; Guarracino, Andrea; Harvey, William T; Heumos, Simon; Howe, Kerstin; Jain, Miten; Lu, Tsung-Yu; Markello, Charles; Martin, Fergal J; Mitchell, Matthew W; Munson, Katherine M; Mwaniki, Moses Njagi; Novak, Adam M; Olsen, Hugh E; Pesout, Trevor; Porubsky, David; Prins, Pjotr; Sibbesen, Jonas A; Sirén, Jouni; Tomlinson, Chad; Villani, Flavia; Vollger, Mitchell R; Antonacci-Fulton, Lucinda L; Baid, Gunjan; Baker, Carl A; Belyaeva, Anastasiya; Billis, Konstantinos; Carroll, Andrew; Chang, Pi-Chuan; Cody, Sarah.

Nature ; 617(7960): 312-324, 2023 May.

Article in English | MEDLINE | ID: mdl-37165242

ABSTRACT

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals1. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.

Subject(s)

Genome, Human , Genomics , Humans , Diploidy , Genome, Human/genetics , Haplotypes/genetics , Sequence Analysis, DNA , Genomics/standards , Reference Standards , Cohort Studies , Alleles , Genetic Variation

13.

Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall.

Harvey, William T; Ebert, Peter; Ebler, Jana; Audano, Peter A; Munson, Katherine M; Hoekzema, Kendra; Porubsky, David; Beck, Christine R; Marschall, Tobias; Garimella, Kiran; Eichler, Evan E.

bioRxiv ; 2023 May 04.

Article in English | MEDLINE | ID: mdl-37205567

ABSTRACT

Advances in long-read sequencing (LRS) technology continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant calling precision and recall of Oxford Nanopore Technologies (ONT) and PacBio HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant calling precision and recall of SVs and indels in HiFi datasets with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant callsets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.
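The F1 score used as the accuracy measure in this abstract is the harmonic mean of precision and recall. The short sketch below shows how it is derived from true-positive, false-positive and false-negative counts against a truth set. The function names and the variant counts are hypothetical and chosen only for illustration; they are not results from the study.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from variant-call counts against a truth set."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical low-coverage callset: few false calls, many missed variants.
p, r = precision_recall(tp=80, fp=20, fn=120)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, a callset with high precision but low recall still scores only modestly, which is why F1 is a convenient single number for comparing coverage levels.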

14.

Structurally divergent and recurrently mutated regions of primate genomes.

Mao, Yafei; Harvey, William T; Porubsky, David; Munson, Katherine M; Hoekzema, Kendra; Lewis, Alexandra P; Audano, Peter A; Rozanski, Allison; Yang, Xiangyu; Zhang, Shilong; Gordon, David S; Wei, Xiaoxi; Logsdon, Glennis A; Haukness, Marina; Dishuck, Philip C; Jeong, Hyeonsoo; Del Rosario, Ricardo; Bauer, Vanessa L; Fattor, Will T; Wilkerson, Gregory K; Lu, Qing; Paten, Benedict; Feng, Guoping; Sawyer, Sara L; Warren, Wesley C; Carbone, Lucia; Eichler, Evan E.

bioRxiv ; 2023 Mar 07.

Article in English | MEDLINE | ID: mdl-36945442

ABSTRACT

To better understand the pattern of primate genome structural variation, we sequenced and assembled using multiple long-read sequencing technologies the genomes of eight nonhuman primate species, including New World monkeys (owl monkey and marmoset), Old World monkey (macaque), Asian apes (orangutan and gibbon), and African ape lineages (gorilla, bonobo, and chimpanzee). Compared to the human genome, we identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. Across 50 million years of primate evolution, we estimate that 819.47 Mbp or ~27% of the genome has been affected by SVs based on analysis of these primate lineages. We identify 1,607 structurally divergent regions (SDRs) wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (CARDs, ABCD7, OLAH) and new lineage-specific genes are generated (e.g., CKAP2, NEK5) and have become targets of rapid chromosomal diversification and positive selection (e.g., RGPDs). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species for the first time.

15.

Whole-genome long-read sequencing downsampling and its effect on variant-calling precision and recall.

Harvey, William T; Ebert, Peter; Ebler, Jana; Audano, Peter A; Munson, Katherine M; Hoekzema, Kendra; Porubsky, David; Beck, Christine R; Marschall, Tobias; Garimella, Kiran; Eichler, Evan E.

Genome Res ; 33(12): 2029-2040, 2023 Dec 27.

Article in English | MEDLINE | ID: mdl-38190646

ABSTRACT

Advances in long-read sequencing (LRS) technologies continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant-calling precision and recall of Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant-calling precision and recall of SVs and indels in HiFi data sets with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant call sets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.

Subject(s)

Genomics , Nanopores , INDEL Mutation , Whole Genome Sequencing

16.

Familial long-read sequencing increases yield of de novo mutations.

Noyes, Michelle D; Harvey, William T; Porubsky, David; Sulovari, Arvis; Li, Ruiyang; Rose, Nicholas R; Audano, Peter A; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Mantere, Tuomo; Graves-Lindsay, Tina A; Sanders, Ashley D; Goodwin, Sara; Kramer, Melissa; Mokrab, Younes; Zody, Michael C; Hoischen, Alexander; Korbel, Jan O; McCombie, W Richard; Eichler, Evan E.

Am J Hum Genet ; 109(4): 631-646, 2022 04 07.

Article in English | MEDLINE | ID: mdl-35290762

ABSTRACT

Studies of de novo mutation (DNM) have typically excluded some of the most repetitive and complex regions of the genome because these regions cannot be unambiguously mapped with short-read sequencing data. To better understand the genome-wide pattern of DNM, we generated long-read sequence data from an autism parent-child quad with an affected female where no pathogenic variant had been discovered in short-read Illumina sequence data. We deeply sequenced all four individuals by using three sequencing platforms (Illumina, Oxford Nanopore, and Pacific Biosciences) and three complementary technologies (Strand-seq, optical mapping, and 10X Genomics). Using long-read sequencing, we initially discovered and validated 171 DNMs across two children, a 20% increase in the number of de novo single-nucleotide variants (SNVs) and indels when compared to short-read callsets. The number of DNMs further increased by 5% when considering a more complete human reference (T2T-CHM13) because of the recovery of events in regions absent from GRCh38 (e.g., three DNMs in heterochromatic satellites). In total, we validated 195 de novo germline mutations and 23 potential post-zygotic mosaic mutations across both children; the overall true substitution rate based on this integrated callset is at least 1.41 × 10⁻⁸ substitutions per nucleotide per generation. We also identified six de novo insertions and deletions in tandem repeats, two of which represent structural variants. We demonstrate that long-read sequencing and assembly, especially when combined with a more complete reference genome, increases the number of DNMs by >25% compared to previous studies, providing a more complete catalog of DNM compared to short-read data alone.

Subject(s)

Genomics , High-Throughput Nucleotide Sequencing , Female , Humans , Mutation/genetics , Nucleotides , Sequence Analysis, DNA , Software

17.

Segmental duplications and their variation in a complete human genome.

Vollger, Mitchell R; Guitart, Xavi; Dishuck, Philip C; Mercuri, Ludovica; Harvey, William T; Gershman, Ariel; Diekhans, Mark; Sulovari, Arvis; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Porubsky, David; Li, Ruiyang; Nurk, Sergey; Koren, Sergey; Miga, Karen H; Phillippy, Adam M; Timp, Winston; Ventura, Mario; Eichler, Evan E.

Science ; 376(6588): eabj6965, 2022 04.

Article in English | MEDLINE | ID: mdl-35357917

ABSTRACT

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.

Subject(s)

DNA Copy Number Variations , Gene Duplication , Genome, Human , Segmental Duplications, Genomic , Evolution, Molecular , GTPase-Activating Proteins/genetics , Humans , Polymorphism, Single Nucleotide , Proto-Oncogene Proteins/genetics

18.

Evidence for opposing selective forces operating on human-specific duplicated TCAF genes in Neanderthals and humans.

Hsieh, PingHsun; Dang, Vy; Vollger, Mitchell R; Mao, Yafei; Huang, Tzu-Hsueh; Dishuck, Philip C; Baker, Carl; Cantsilieris, Stuart; Lewis, Alexandra P; Munson, Katherine M; Sorensen, Melanie; Welch, AnneMarie E; Underwood, Jason G; Eichler, Evan E.

Nat Commun ; 12(1): 5118, 2021 08 25.

Article in English | MEDLINE | ID: mdl-34433829

ABSTRACT

TRP channel-associated factor 1/2 (TCAF1/TCAF2) proteins antagonistically regulate the cold-sensor protein TRPM8 in multiple human tissues. Understanding their significance has been complicated given the locus spans a gap-ridden region with complex segmental duplications in GRCh38. Using long-read sequencing, we sequence-resolve the locus, annotate full-length TCAF models in primate genomes, and show substantial human-specific TCAF copy number variation. We identify two human super haplogroups, H4 and H5, and establish that TCAF duplications originated ~1.7 million years ago but diversified only in Homo sapiens by recurrent structural mutations. Conversely, in all archaic-hominin samples the fixation for a specific H4 haplotype without duplication is likely due to positive selection. Here, our results of TCAF copy number expansion, selection signals in hominins, and differential TCAF2 expression between haplogroups and high TCAF2 and TRPM8 expression in liver and prostate in modern-day humans imply TCAF diversification among hominins potentially in response to cold or dietary adaptations.

Subject(s)

Gene Duplication , Hominidae/genetics , Membrane Proteins/genetics , Selection, Genetic , Animals , DNA Copy Number Variations , Evolution, Molecular , Genome, Human , Haplotypes , Humans , Neanderthals , Phylogeny

19.

Alpha Satellite Insertion Close to an Ancestral Centromeric Region.

Giannuzzi, Giuliana; Logsdon, Glennis A; Chatron, Nicolas; Miller, Danny E; Reversat, Julie; Munson, Katherine M; Hoekzema, Kendra; Bonnet-Dupeyron, Marie-Noëlle; Rollat-Farnier, Pierre-Antoine; Baker, Carl A; Sanlaville, Damien; Eichler, Evan E; Schluth-Bolard, Caroline; Reymond, Alexandre.

Mol Biol Evol ; 38(12): 5576-5587, 2021 12 09.

Article in English | MEDLINE | ID: mdl-34464971

ABSTRACT

Human centromeres are mainly composed of alpha satellite DNA hierarchically organized as higher-order repeats (HORs). Alpha satellite dynamics is shown by sequence homogenization in centromeric arrays and by its transfer to other centromeric locations, for example, during the maturation of new centromeres. We identified during prenatal aneuploidy diagnosis by fluorescent in situ hybridization a de novo insertion of alpha satellite DNA from the centromere of chromosome 18 (D18Z1) into cytoband 15q26. Although bound by CENP-B, this locus did not acquire centromeric functionality as demonstrated by the lack of constriction and the absence of CENP-A binding. The insertion was associated with a 2.8-kbp deletion and likely occurred in the paternal germline. The site was enriched in long terminal repeats and located ~10 Mbp from the location where a centromere was ancestrally seeded and became inactive in the common ancestor of humans and apes 20-25 million years ago. Long-read mapping to the T2T-CHM13 human genome assembly revealed that the insertion derives from a specific region of chromosome 18 centromeric 12-mer HOR array in which the monomer size follows a regular pattern. The rearrangement did not directly disrupt any gene or predicted regulatory element and did not alter the methylation status of the surrounding region, consistent with the absence of phenotypic consequences in the carrier. This case demonstrates a likely rare but new class of structural variation that we name "alpha satellite insertion." It also expands our knowledge on alphoid DNA dynamics and conveys the possibility that alphoid arrays can relocate near vestigial centromeric sites.

Subject(s)

Centromere , Chromosomal Proteins, Non-Histone , Centromere/genetics , Centromere/metabolism , Centromere Protein B/genetics , Centromere Protein B/metabolism , Chromosomal Proteins, Non-Histone/genetics , DNA, Satellite/genetics , Humans , In Situ Hybridization, Fluorescence

20.

Targeted long-read sequencing identifies missing disease-causing variation.

Miller, Danny E; Sulovari, Arvis; Wang, Tianyun; Loucks, Hailey; Hoekzema, Kendra; Munson, Katherine M; Lewis, Alexandra P; Fuerte, Edith P Almanza; Paschal, Catherine R; Walsh, Tom; Thies, Jenny; Bennett, James T; Glass, Ian; Dipple, Katrina M; Patterson, Karynne; Bonkowski, Emily S; Nelson, Zoe; Squire, Audrey; Sikes, Megan; Beckman, Erika; Bennett, Robin L; Earl, Dawn; Lee, Winston; Allikmets, Rando; Perlman, Seth J; Chow, Penny; Hing, Anne V; Wenger, Tara L; Adam, Margaret P; Sun, Angela; Lam, Christina; Chang, Irene; Zou, Xue; Austin, Stephanie L; Huggins, Erin; Safi, Alexias; Iyengar, Apoorva K; Reddy, Timothy E; Majoros, William H; Allen, Andrew S; Crawford, Gregory E; Kishnani, Priya S; King, Mary-Claire; Cherry, Tim; Chong, Jessica X; Bamshad, Michael J; Nickerson, Deborah A; Mefford, Heather C; Doherty, Dan; Eichler, Evan E.

Am J Hum Genet ; 108(8): 1436-1449, 2021 08 05.

Article in English | MEDLINE | ID: mdl-34216551

ABSTRACT

Despite widespread clinical genetic testing, many individuals with suspected genetic conditions lack a precise diagnosis, limiting their opportunity to take advantage of state-of-the-art treatments. In some cases, testing reveals difficult-to-evaluate structural differences, candidate variants that do not fully explain the phenotype, single pathogenic variants in recessive disorders, or no variants in genes of interest. Thus, there is a need for better tools to identify a precise genetic diagnosis in individuals when conventional testing approaches have been exhausted. We performed targeted long-read sequencing (T-LRS) using adaptive sampling on the Oxford Nanopore platform on 40 individuals, 10 of whom lacked a complete molecular diagnosis. We computationally targeted up to 151 Mbp of sequence per individual and searched for pathogenic substitutions, structural variants, and methylation differences using a single data source. We detected all genomic aberrations (including single-nucleotide variants, copy number changes, repeat expansions, and methylation differences) identified by prior clinical testing. In 8/8 individuals with complex structural rearrangements, T-LRS enabled more precise resolution of the mutation, leading to changes in clinical management in one case. In ten individuals with suspected Mendelian conditions lacking a precise genetic diagnosis, T-LRS identified pathogenic or likely pathogenic variants in six and variants of uncertain significance in two others. T-LRS accurately identifies pathogenic structural variants, resolves complex rearrangements, and identifies Mendelian variants not detected by other technologies. T-LRS represents an efficient and cost-effective strategy to evaluate high-priority genes and regions or complex clinical testing results.

Subject(s)

Chromosome Aberrations , Cytogenetic Analysis/methods , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/genetics , Genetic Predisposition to Disease , Genome, Human , Mutation , DNA Copy Number Variations , Female , Genetic Testing , High-Throughput Nucleotide Sequencing , Humans , Karyotyping , Male , Sequence Analysis, DNA
