Results 1 - 20 of 62
1.
bioRxiv ; 2024 May 18.
Article in English | MEDLINE | ID: mdl-38798674

ABSTRACT

Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein coding region. PSAURON scores can be used for genome-wide protein annotation assessment as well as the rapid identification of potentially spurious annotated proteins. Validation against established benchmarks demonstrates PSAURON's effectiveness and correlation with recognized measures of protein quality, highlighting its potential use as a general-purpose method to evaluate gene annotation. PSAURON is open source and freely available at https://github.com/salzberg-lab/PSAURON. One-Sentence Summary: PSAURON is a machine learning-based tool for rapid assessment of protein coding gene annotation.

2.
G3 (Bethesda) ; 14(5), 2024 05 07.
Article in English | MEDLINE | ID: mdl-38526344

ABSTRACT

Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the Western contiguous United States and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gb of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gb). Approximately 87.2% (24.0 Gb) of total sequence was placed on the 12 WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the 3 subclasses of NLRs. Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo-assembled transcriptomes.


Subjects
Plant Genome , Molecular Sequence Annotation , Pinus , Pinus/genetics , Pinus/parasitology , Genomics/methods , Endangered Species , High-Throughput Nucleotide Sequencing
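
The contig and scaffold N50 values quoted in the abstract above are a standard contiguity statistic: the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because the statistic depends only on the sorted lengths, the same function applies to contigs and to scaffolds alike.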
3.
Curr Opin Insect Sci ; 61: 101135, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37926187

ABSTRACT

Insect symbionts can alter their host phenotype and their effects can range from beneficial to pathogenic. Moreover, many insects exhibit co-infections, making their study more challenging. Less than 1% of insect species have high-quality reference genomes available and fewer still also have their symbionts sequenced. Two methods are commonly used to sequence symbionts: whole-genome sequencing to concomitantly capture the host and bacterial genomes, or isolation of the symbiont's genome before sequencing. These methods are limited when dealing with rare or poorly characterized symbionts. Long-read technologies are important tools for generating high-quality genomes, as they can overcome high levels of heterozygosity, repeat content, and transposable elements that confound short-read methods. Oxford Nanopore (ONT) adaptive sampling allows a sequencing instrument to select or reject sequences in real time. We describe a method based on a subtractive ONT adaptive sampling approach that readily permitted the sequencing of the complete genomes of mitochondria, Buchnera and its plasmids (pLeu, pTrp), and Wolbachia in two aphid species, Aphis glycines and Pentalonia nigronervosa. Adaptive sampling is able to retrieve organelles such as mitochondria and symbionts that have high representation in their hosts such as Buchnera and Wolbachia, but is less successful at retrieving symbionts in low concentrations.


Subjects
Buchnera , Nanopores , Animals , Buchnera/genetics , DNA Transposable Elements , Insects/genetics
4.
bioRxiv ; 2023 Nov 17.
Article in English | MEDLINE | ID: mdl-38014212

ABSTRACT

Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in western contiguous US and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gbp of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gbp). Approximately 87.2% (24.0 Gbp) of total sequence was placed on the twelve WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich-repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the three subclasses of NLRs (TNL, CNL, RNL). Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo assembled transcriptomes.

5.
Genome Biol Evol ; 15(7), 2023 07 03.
Article in English | MEDLINE | ID: mdl-37364298

ABSTRACT

Stalk-eyed flies in the genus Teleopsis carry selfish genetic elements that induce sex ratio (SR) meiotic drive and impact the fitness of male and female carriers. Here, we assemble and describe a chromosome-level genome assembly of the stalk-eyed fly, Teleopsis dalmanni, to elucidate patterns of divergence associated with SR. The genome contains tens of thousands of transposable element (TE) insertions and hundreds of transcriptionally and insertionally active TE families. By resequencing pools of SR and ST males using short and long reads, we find widespread differentiation and divergence between XSR and XST associated with multiple nested inversions involving most of the SR haplotype. Examination of genomic coverage and gene expression data revealed seven X-linked genes with elevated expression and coverage in SR males. The most extreme and likely drive candidate involves an XSR-specific expansion of an array of partial copies of JASPer, a gene necessary for maintenance of euchromatin and associated with regulation of TE expression. In addition, we find evidence for rapid protein evolution between XSR and XST for testis expressed and novel genes, that is, either recent duplicates or lacking a Dipteran ortholog, including an X-linked duplicate of maelstrom, which is also involved in TE silencing. Overall, the evidence suggests that this ancient XSR polymorphism has had a variety of impacts on repetitive DNA and its regulation in this species.


Subjects
Diptera , X Chromosome , Animals , Female , Male , X Chromosome/genetics , Diptera/genetics , Sex Ratio , Eye , Testis
6.
PLoS Comput Biol ; 19(3): e1011032, 2023 03.
Article in English | MEDLINE | ID: mdl-37000853

ABSTRACT

Advances in long-read sequencing technologies have dramatically improved the contiguity and completeness of genome assemblies. Using the latest nanopore-based sequencers, we can generate enough data for the assembly of a human genome from a single flow cell. With the long-read data from these sequences, we can now routinely produce de novo genome assemblies in which half or more of a genome is contained in megabase-scale contigs. Assemblies produced from nanopore data alone, though, have relatively high error rates and can benefit from a process called polishing, in which more-accurate reads are used to correct errors in the consensus sequence. In this manuscript, we present a novel tool for genome polishing called JASPER (Jellyfish-based Assembly Sequence Polisher for Error Reduction). In contrast to many other polishing methods, JASPER gains efficiency by avoiding the alignment of reads to the assembly. Instead, JASPER uses a database of k-mer counts that it creates from the reads to detect and correct errors in the consensus. Our experiments demonstrate that JASPER is faster than alignment-based polishers, and both faster and more accurate than other k-mer based polishing methods. We also introduce the idea of using a polishing tool to create population-specific reference genomes, and illustrate this idea using sequence data from multiple individuals from Tokyo, Japan.


Subjects
High-Throughput Nucleotide Sequencing , Nanopores , Humans , DNA Sequence Analysis , Human Genome/genetics , Metagenomics
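
The alignment-free idea in the abstract above (build a table of k-mer counts from the reads, then look up the k-mers of the consensus) can be sketched as follows. This is not JASPER's implementation: it uses a plain Python `Counter` in place of a Jellyfish database, it only flags suspicious positions rather than correcting them, and the function names, toy sequences, and `min_count` threshold are all illustrative assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(consensus, counts, k, min_count=2):
    """Return start positions of consensus k-mers that occur fewer than
    min_count times in the reads (candidate consensus errors)."""
    return [
        i
        for i in range(len(consensus) - k + 1)
        if counts[consensus[i:i + k]] < min_count
    ]


# Toy data: the consensus differs from the reads at index 4 (A -> T).
reads = ["ACGTACGGA", "CGTACGGAT", "ACGTACGGAT"]
consensus = "ACGTTCGGAT"
k = 4

counts = count_kmers(reads, k)
flagged = suspicious_positions(consensus, counts, k)
print(flagged)
```

Every k-mer overlapping the substituted base is absent from the reads, so a run of consecutive flagged positions points at the error without aligning any read to the assembly.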
7.
G3 (Bethesda) ; 13(3), 2023 03 09.
Article in English | MEDLINE | ID: mdl-36630290

ABSTRACT

We used long-read DNA sequencing to assemble the genome of a Southern Han Chinese male. We organized the sequence into chromosomes and filled in gaps using the recently completed T2T-CHM13 genome as a guide, yielding a gap-free genome, Han1, containing 3,099,707,698 bases. Using the T2T-CHM13 annotation as a reference, we mapped all genes onto the Han1 genome and identified additional gene copies, generating a total of 60,708 putative genes, of which 20,003 are protein-coding. A comprehensive comparison between the genes revealed that 235 protein-coding genes were substantially different between the individuals, with frameshifts or truncations affecting the protein-coding sequence. Most of these were heterozygous variants in which one gene copy was unaffected. This represents the first gene-level comparison between two finished, annotated individual human genomes.


Subjects
East Asian People , Human Genome , Humans , Male , East Asian People/genetics , Molecular Sequence Annotation , DNA Sequence Analysis
8.
Proc Natl Acad Sci U S A ; 119(28): e2122301119, 2022 07 12.
Article in English | MEDLINE | ID: mdl-35867761

ABSTRACT

The gastropod mollusk Aplysia is an important model for cellular and molecular neurobiological studies, particularly for investigations of molecular mechanisms of learning and memory. We developed an optimized assembly pipeline to generate an improved Aplysia nervous system transcriptome. This improved transcriptome enabled us to explore the evolution of cognitive capacity at the molecular level. Were there evolutionary expansions of neuronal genes between this relatively simple gastropod Aplysia (20,000 neurons) and Octopus (500 million neurons), the invertebrate with the most elaborate neuronal circuitry and greatest behavioral complexity? Are the tremendous advances in cognitive power in vertebrates explained by expansion of the synaptic proteome that resulted from multiple rounds of whole genome duplication in this clade? Overall, the complement of genes linked to neuronal function is similar between Octopus and Aplysia. As expected, a number of synaptic scaffold proteins have more isoforms in humans than in Aplysia or Octopus. However, several scaffold families present in mollusks and other protostomes are absent in vertebrates, including the Fifes, Lev10s, SOLs, and a NETO family. Thus, whereas vertebrates have more scaffold isoforms from select families, invertebrates have additional scaffold protein families not found in vertebrates. This analysis provides insights into the evolution of the synaptic proteome. Both synaptic proteins and synaptic plasticity evolved gradually, yet the last deuterostome-protostome common ancestor already possessed an elaborate suite of genes associated with synaptic function, and critical for synaptic plasticity.


Subjects
Aplysia , Biological Evolution , Cognition , Synapses , Animals , Aplysia/genetics , Aplysia/metabolism , Neuronal Plasticity/genetics , Neurons/metabolism , Protein Isoforms/genetics , Proteome , Synapses/metabolism , Transcriptome
9.
Nat Commun ; 13(1): 2047, 2022 04 19.
Article in English | MEDLINE | ID: mdl-35440538

ABSTRACT

The genus Quercus, which emerged ∼55 million years ago during globally warm temperatures, diversified into ∼450 extant species. We present a high-quality de novo genome assembly of a California endemic oak, Quercus lobata, revealing features consistent with oak evolutionary success. Effective population size remained large throughout history despite declining since early Miocene. Analysis of 39,373 mapped protein-coding genes outlined copious duplications consistent with genetic and phenotypic diversity, both by retention of genes created during the ancient γ whole genome hexaploid duplication event and by tandem duplication within families, including numerous resistance genes and a very large block of duplicated DUF247 genes, which have been found to be associated with self-incompatibility in grasses. An additional surprising finding is that subcontext-specific patterns of DNA methylation associated with transposable elements reveal broadly-distributed heterochromatin in intergenic regions, similar to grasses. Collectively, these features promote genetic and phenotypic variation that would facilitate adaptability to changing environments.


Subjects
Quercus , Biological Evolution , DNA Methylation/genetics , Epigenome , Molecular Evolution , Humans , Quercus/genetics
10.
PLoS Comput Biol ; 18(2): e1009860, 2022 02.
Article in English | MEDLINE | ID: mdl-35120119

ABSTRACT

Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.


Subjects
High-Throughput Nucleotide Sequencing/methods , Software
11.
G3 (Bethesda) ; 12(1), 2022 01 04.
Article in English | MEDLINE | ID: mdl-35100403

ABSTRACT

Sequencing, assembly, and annotation of the 26.5 Gbp hexaploid genome of coast redwood (Sequoia sempervirens) was completed, leading toward discovery of genes related to climate adaptation and investigation of the origin of the hexaploid genome. Deep-coverage short-read Illumina sequencing data from haploid tissue from a single seed were combined with long-read Oxford Nanopore Technologies sequencing data from diploid needle tissue to create an initial assembly, which was then scaffolded using proximity ligation data to produce a highly contiguous final assembly, SESE 2.1, with a scaffold N50 size of 44.9 Mbp. The assembly included several scaffolds that span entire chromosome arms, confirmed by the presence of telomere and centromere sequences on the ends of the scaffolds. The structural annotation produced 118,906 genes with 113 containing introns that exceed 500 kbp in length and one reaching 2 Mbp. Nearly 19 Gbp of the genome represented repetitive content with the vast majority characterized as long terminal repeats, with a 2.9:1 ratio of Copia to Gypsy elements that may aid in gene expression control. Comparison of coast redwood to other conifers revealed species-specific expansions for a plethora of abiotic and biotic stress response genes, including those involved in fungal disease resistance, detoxification, and physical injury/structural remodeling and others supporting flavonoid biosynthesis. Analysis of multiple genes that exist in triplicate in coast redwood but only once in its diploid relative, giant sequoia, supports a previous hypothesis that the hexaploidy is the result of autopolyploidy rather than any hybridizations with separate but closely related conifer species.


Subjects
Sequoia , Biological Evolution , Chromosomes , Genome , High-Throughput Nucleotide Sequencing , Sequoia/genetics
12.
Article in English | MEDLINE | ID: mdl-37602140

ABSTRACT

Kraken and KrakenUniq are widely used tools for classifying metagenomics sequences. A key requirement for these systems is a database containing all k-mers from all genomes that the users want to be able to detect, where k = 31 by default. This database can be very large, easily exceeding 100 gigabytes (GB) and sometimes 400 GB. Previously, Kraken and KrakenUniq required loading the entire database into main memory (RAM), and if RAM was insufficient, they used memory mapping, which significantly increased the running time for large datasets. We have implemented a new algorithm in KrakenUniq that allows it to load and process the database in chunks, with only a modest increase in running time. This enhancement now makes it feasible to run KrakenUniq on very large datasets and huge databases on virtually any computer, even a laptop, while providing the same very high classification accuracy as the previous system. Statement of need: The KrakenUniq software classifies reads from metagenomic samples to establish which organisms are present in the samples and estimate their abundance. The software is widely used by researchers and clinicians in medical diagnostics, microbiome and environmental studies. Typical databases used by KrakenUniq are tens to hundreds of gigabytes in size. The original KrakenUniq code required loading the entire database in RAM, which demanded expensive high-memory servers to run it efficiently. If a user did not have enough physical RAM to load the entire database, KrakenUniq resorted to memory-mapping the database, which significantly increased run times, frequently by a factor of more than 100. The new functionality described in this paper enables users who do not have access to high-memory servers to run KrakenUniq efficiently, with a 3- to 4-fold increase in CPU time, down from 100-fold or more.
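
The chunked-loading idea in the abstract above can be illustrated with a small sketch: rather than holding the whole k-mer database in memory, hold one fixed-size slice at a time, look up every read's k-mers against that slice, and accumulate the hits. This is not KrakenUniq's code or file format; the `(k-mer, taxon)` record layout, the function names, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield successive dicts holding at most chunk_size (k-mer, taxon)
    records, so only one chunk is in memory at a time."""
    it = iter(records)
    while True:
        chunk = dict(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def classify_in_chunks(reads, db_records, k, chunk_size):
    """Count, per read, how many of its k-mers hit each taxon,
    scanning the database one chunk at a time."""
    # Extract each read's k-mers once; they are reused for every chunk.
    read_kmers = {
        name: [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for name, seq in reads.items()
    }
    hits = {name: Counter() for name in reads}
    for chunk in iter_chunks(db_records, chunk_size):
        for name, kmers in read_kmers.items():
            for kmer in kmers:
                taxon = chunk.get(kmer)
                if taxon is not None:
                    hits[name][taxon] += 1
    return hits


# Hypothetical database and reads (k = 4 for readability; the abstract
# states that the real default is k = 31).
toy_db = [("ACGT", "taxA"), ("CGTA", "taxA"),
          ("GGCC", "taxB"), ("GCCA", "taxB"), ("TTTT", "taxC")]
toy_reads = {"read1": "ACGTA", "read2": "GGCCA"}

hits = classify_in_chunks(toy_reads, toy_db, k=4, chunk_size=2)
print(hits)
```

The trade-off is the one the abstract describes: every read is revisited once per chunk, which costs extra CPU time, but peak memory is bounded by the chunk size rather than the database size, and the accumulated hits are the same as with a single in-memory pass.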

13.
Genetics ; 220(2), 2022 02 04.
Article in English | MEDLINE | ID: mdl-34897437

ABSTRACT

Until 2019, the human genome was available in only one fully annotated version, GRCh38, which was the result of 18 years of continuous improvement and revision. Despite dramatic improvements in sequencing technology, no other genome was available as an annotated reference until 2019, when the genome of an Ashkenazi individual, Ash1, was released. In this study, we describe the assembly and annotation of a second individual genome, from a Puerto Rican individual whose DNA was collected as part of the Human Pangenome project. The new genome, called PR1, is the first true reference genome created from an individual of African descent. Due to recent improvements in both sequencing and assembly technology, and particularly to the use of the recently completed CHM13 human genome as a guide to assembly, PR1 is more complete and more contiguous than either GRCh38 or Ash1. Annotation revealed 37,755 genes (of which 19,999 are protein coding), including 12 additional gene copies that are present in PR1 and missing from CHM13. Fifty-seven genes have fewer copies in PR1 than in CHM13, 9 map only partially, and 3 genes (all noncoding) from CHM13 are entirely missing from PR1.


Subjects
Black People , Human Genome , Hispanic or Latino/genetics , Humans , Molecular Sequence Annotation
14.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: mdl-36762707

ABSTRACT

The orb web is a remarkable example of animal architecture that is observed in families of spiders that diverged over 200 million years ago. While several genomes exist for araneid orb-weavers, none exist for other orb-weaving families, hampering efforts to investigate the genetic basis of this complex behavior. Here we present a chromosome-level genome assembly for the cribellate orb-weaving spider Uloborus diversus. The assembly reinforces evidence of an ancient arachnid genome duplication and identifies complete open reading frames for every class of spidroin gene, which encode the proteins that are the key structural components of spider silks. We identified the 2 X chromosomes for U. diversus and identified candidate sex-determining loci. This chromosome-level assembly will be a valuable resource for evolutionary research into the origins of orb-weaving, spidroin evolution, chromosomal rearrangement, and chromosomal sex determination in spiders.


Subjects
Fibroins , Spiders , Animals , Phylogeny , Fibroins/genetics , Silk/genetics , Genome , Sex Chromosomes/genetics , Spiders/genetics
15.
Sci Adv ; 7(26), 2021 06.
Article in English | MEDLINE | ID: mdl-34162536

ABSTRACT

The American lobster, Homarus americanus, is integral to marine ecosystems and supports an important commercial fishery. This iconic species also serves as a valuable model for deciphering neural networks controlling rhythmic motor patterns and olfaction. Here, we report a high-quality draft assembly of the H. americanus genome with 25,284 predicted gene models. Analysis of the neural gene complement revealed extraordinary development of the chemosensory machinery, including a profound diversification of ligand-gated ion channels and secretory molecules. The discovery of a novel class of chimeric receptors coupling pattern recognition and neurotransmitter binding suggests a deep integration between the neural and immune systems. A robust repertoire of genes involved in innate immunity, genome stability, cell survival, chemical defense, and cuticle formation represents a diversity of defense mechanisms essential to thrive in the benthic marine environment. Together, these unique evolutionary adaptations contribute to the longevity and ecological success of this long-lived benthic predator.


Subjects
Longevity , Nephropidae , Animals , Ecosystem , Longevity/genetics , Nephropidae/genetics , Nephropidae/metabolism , Nervous System
16.
PLoS One ; 16(4): e0249899, 2021.
Article in English | MEDLINE | ID: mdl-33909645

ABSTRACT

Rocky Mountain elk (Cervus canadensis) populations have significant economic implications to the cattle industry, as they are a major reservoir for Brucella abortus in the Greater Yellowstone area. Vaccination attempts against intracellular bacterial diseases in elk populations have not been successful due to a negligible adaptive cellular immune response. A lack of genomic resources has impeded attempts to better understand why vaccination does not induce protective immunity. To overcome this limitation, PacBio, Illumina, and Hi-C sequencing with a total of 686-fold coverage was used to assemble the elk genome into 35 pseudomolecules. A robust gene annotation was generated resulting in 18,013 gene models and 33,422 mRNAs. The accuracy of the assembly was assessed using synteny to the red deer and cattle genomes identifying several chromosomal rearrangements, fusions and fissions. Because this genome assembly and annotation provide a foundation for genome-enabled exploration of Cervus species, we demonstrate its utility by exploring the conservation of immune system-related genes. We conclude by comparing cattle immune system-related genes to the elk genome, revealing eight putative gene losses in elk.


Subjects
Deer/genetics , Genome , Animals , Cattle , Gene Fusion , Gene Rearrangement , Immunity/genetics , Pseudogenes/genetics , Messenger RNA/metabolism
17.
F1000Res ; 9: 1137, 2020.
Article in English | MEDLINE | ID: mdl-33274050

ABSTRACT

We sequenced the genome of the North American groundhog, Marmota monax, also known as the woodchuck. Our sequencing strategy included a combination of short, high-quality Illumina reads plus long reads generated by both Pacific Biosciences and Oxford Nanopore instruments. Assembly of the combined data produced a genome of 2.74 Gbp in total length, with an N50 contig size of 1,094,236 bp. To annotate the genome, we mapped the genes from another M. monax genome and from the closely related Alpine marmot, Marmota marmota, onto our assembly, resulting in 20,559 annotated protein-coding genes and 28,135 transcripts. The genome assembly and annotation are available in GenBank under BioProject PRJNA587092.


Subjects
Marmota , Nanopores , Animals , Base Sequence , Genome , High-Throughput Nucleotide Sequencing , Marmota/genetics , United States
18.
G3 (Bethesda) ; 10(11): 3907-3919, 2020 11 05.
Article in English | MEDLINE | ID: mdl-32948606

ABSTRACT

The giant sequoias (Sequoiadendron giganteum) of California are massive, long-lived trees that grow along the U.S. Sierra Nevada mountains. Genomic data are limited in giant sequoia and producing a reference genome sequence has been an important goal to allow marker development for restoration and management. Using deep-coverage Illumina and Oxford Nanopore sequencing, combined with Dovetail chromosome conformation capture libraries, the genome was assembled into eleven chromosome-scale scaffolds containing 8.125 Gbp of sequence. Iso-Seq transcripts, assembled from three distinct tissues, were used as evidence to annotate a total of 41,632 protein-coding genes. The genome was found to contain, distributed unevenly across all 11 chromosomes and in 63 orthogroups, over 900 complete or partial predicted NLR genes, of which 375 are supported by annotation derived from protein evidence and gene modeling. This giant sequoia reference genome sequence represents the first genome sequenced in the Cupressaceae family, and lays a foundation for using genomic tools to aid in giant sequoia conservation and management.


Subjects
Sequoiadendron , Chromosomes , Genome , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Trees
19.
Genetics ; 216(2): 599-608, 2020 10.
Article in English | MEDLINE | ID: mdl-32796007

ABSTRACT

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus.


Subjects
Plant Chromosomes/genetics , Contig Mapping/methods , Gene Dosage , Triticum/genetics , Contig Mapping/standards , Plant Genome , Genomics/methods , Genomics/standards , Reference Standards
20.
Genome Biol ; 21(1): 129, 2020 06 02.
Article in English | MEDLINE | ID: mdl-32487205

ABSTRACT

BACKGROUND: Thousands of experiments and studies use the human reference genome as a resource each year. This single reference genome, GRCh38, is a mosaic created from a small number of individuals, representing a very small sample of the human population. There is a need for reference genomes from multiple human populations to avoid potential biases. RESULTS: Here, we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. Forty of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. Eleven genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes. CONCLUSIONS: The Ash1 genome is presented as a reference for any genetic studies involving Ashkenazi Jewish individuals.


Subjects
Human Genome , Humans , Molecular Sequence Annotation , Reference Values , Genetic Translocation