Results 1 - 20 of 72
1.
Nature ; 622(7981): 41-47, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37794265

ABSTRACT

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.


Subject(s)
Genes , Genome, Human , Molecular Sequence Annotation , Protein Isoforms , Humans , Genome, Human/genetics , Molecular Sequence Annotation/standards , Molecular Sequence Annotation/trends , Protein Isoforms/genetics , Human Genome Project , Pseudogenes , RNA/genetics
2.
PLoS Comput Biol ; 17(7): e1008984, 2021 07.
Article in English | MEDLINE | ID: mdl-34329294

ABSTRACT

Erroneous conversion of gene names into dates and other data types has been a frustration for computational biologists for years. We hypothesized that such errors in supplementary files might diminish after a report in 2016 highlighting the extent of the problem. To assess this, we performed a scan of supplementary files published in PubMed Central from 2014 to 2020. Overall, gene name errors continued to accumulate unabated in the period after 2016. An improved scanning software we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists, a figure significantly higher than previously estimated. This is due to gene names being converted not just to dates and floating-point numbers, but also to internal date format (five-digit numbers). These findings further reinforce that spreadsheets are ill-suited to use with large genomic data.
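
The three error classes named in this abstract (dates, floating-point numbers, and five-digit internal date values) can be illustrated with a small pattern check. This is a minimal sketch and not the scanning software the authors developed; the regular expressions and the function names `classify_cell` and `count_errors` are assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative patterns only; they are assumptions, not the published scanner's rules.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", re.IGNORECASE
)  # e.g. "1-Mar", what a spreadsheet may show for MARCH1
SERIAL_LIKE = re.compile(r"^\d{5}$")  # five-digit internal date value, e.g. "44256"
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)  # e.g. "2.31E+13"


def classify_cell(value: str) -> str:
    """Return 'date', 'serial', 'float', or 'ok' for one gene-name cell."""
    text = value.strip()
    if DATE_LIKE.match(text):
        return "date"
    if SERIAL_LIKE.match(text):
        return "serial"
    if FLOAT_LIKE.match(text):
        return "float"
    return "ok"


def count_errors(column: list[str]) -> Counter:
    """Count suspicious cells in a gene-name column, by error class."""
    labels = (classify_cell(cell) for cell in column)
    return Counter(label for label in labels if label != "ok")
```

A column that yields a non-empty counter would be a candidate for manual inspection; a five-digit value can also be a legitimate identifier, so the check only flags cells and does not prove an error.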


Subject(s)
Computational Biology/standards , Genes/genetics , Molecular Sequence Annotation/standards , Humans , PubMed , Software , Terminology as Topic
3.
Am J Hum Genet ; 108(9): 1551-1557, 2021 09 02.
Article in English | MEDLINE | ID: mdl-34329581

ABSTRACT

Clinical validity assessments of gene-disease associations underpin analysis and reporting in diagnostic genomics, and yet wide variability exists in practice, particularly in use of these assessments for virtual gene panel design and maintenance. Harmonization efforts are hampered by the lack of agreed terminology, agreed gene curation standards, and platforms that can be used to identify and resolve discrepancies at scale. We undertook a systematic comparison of the content of 80 virtual gene panels used in two healthcare systems by multiple diagnostic providers in the United Kingdom and Australia. The process was enabled by a shared curation platform, PanelApp, and resulted in the identification and review of 2,144 discordant gene ratings, demonstrating the utility of sharing structured gene-disease validity assessments and collaborative discordance resolution in establishing national and international consensus.


Subject(s)
Consensus , Data Curation/standards , Genetic Diseases, Inborn/genetics , Genomics/standards , Molecular Sequence Annotation/standards , Australia , Biomarkers/metabolism , Data Curation/methods , Delivery of Health Care , Gene Expression , Gene Ontology , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/pathology , Genomics/methods , Humans , Mobile Applications/supply & distribution , Terminology as Topic , United Kingdom
4.
Nature ; 594(7861): 77-81, 2021 06.
Article in English | MEDLINE | ID: mdl-33953399

ABSTRACT

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation1,2. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes1,3-5 and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.


Subject(s)
Evolution, Molecular , Genome/genetics , Genomics , Pan paniscus/genetics , Phylogeny , Animals , Eukaryotic Initiation Factor-4A/genetics , Female , Genes , Gorilla gorilla/genetics , Molecular Sequence Annotation/standards , Pan troglodytes/genetics , Pongo/genetics , Segmental Duplications, Genomic , Sequence Analysis, DNA
5.
Genomics ; 113(1 Pt 2): 748-754, 2021 01.
Article in English | MEDLINE | ID: mdl-33053411

ABSTRACT

Next Generation Sequencing (NGS), and specifically targeted panel sequencing, is the state of the art in clinical genetic diagnosis of Mendelian diseases. However, the bioinformatics analysis and interpretation of the generated data can be challenging. A spotlight on the default transcript selection of a user-friendly, commercially available software that is widely used by genetics professionals, i.e. Illumina® VariantStudio®, is presented. For the sake of comparison, we employed Ensembl VEP, an open-source command-line tool, as it provides flexibility regarding transcript selection. The analysis of NGS data deriving from sequencing of 857 germline DNA samples of cancer patients indicated a concordance of 82.82% between the two software programs. Significantly, using the default transcript configuration of VariantStudio®, we failed to annotate correctly 11.45% of the identified loss-of-function variants. Our results underline the importance of cautious software and transcript selection and the need for reliable, white-box data analysis, along with bioinformatics expertise in clinical diagnostics.


Subject(s)
Genetic Testing/methods , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Annotation/methods , Neoplasms/genetics , Genetic Testing/standards , Germ-Line Mutation , High-Throughput Nucleotide Sequencing/standards , Humans , Molecular Sequence Annotation/standards , Neoplasms/diagnosis , Sensitivity and Specificity , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards
6.
Cancer Res ; 81(2): 282-288, 2021 01 15.
Article in English | MEDLINE | ID: mdl-33115802

ABSTRACT

Although next-generation sequencing is widely used in cancer to profile tumors and detect variants, most somatic variant callers used in these pipelines identify variants at the lowest possible granularity, single-nucleotide variants (SNV). As a result, multiple adjacent SNVs are called individually instead of as a multi-nucleotide variant (MNV). With this approach, the amino acid change from the individual SNV within a codon could be different from the amino acid change based on the MNV that results from combining SNV, leading to incorrect conclusions about the downstream effects of the variants. Here, we analyzed 10,383 variant call files (VCF) from the Cancer Genome Atlas (TCGA) and found 12,141 incorrectly annotated MNVs. Analysis of seven commonly mutated genes from 178 studies in cBioPortal revealed that MNVs were consistently missed in 20 of these studies, whereas they were correctly annotated in 15 more recent studies. At the BRAF V600 locus, the most common example of MNV, several public datasets reported separate BRAF V600E and BRAF V600M variants instead of a single merged V600K variant. VCFs from the TCGA Mutect2 caller were used to develop a solution to merge SNV to MNV. Our custom script used the phasing information from the SNV VCF and determined whether SNVs were at the same codon and needed to be merged into MNV before variant annotation. This study shows that institutions performing NGS sequencing for cancer genomics should incorporate the step of merging MNV as a best practice in their pipelines. SIGNIFICANCE: Identification of incorrect mutation calls in TCGA, including clinically relevant BRAF V600 and KRAS G12, will influence research and potentially clinical decisions.
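
The BRAF V600 case in this abstract can be worked through in a few lines. This is a minimal sketch, not the authors' custom script: it assumes the SNVs have already been confirmed to lie on the same haplotype (the abstract says the script takes this from phasing information in the VCF), codons are written on the coding strand, and the names `CODON_TABLE`, `apply_snvs`, and `protein_change` are hypothetical.

```python
# Partial codon table: only the codons needed for the BRAF V600 example.
CODON_TABLE = {"GTG": "V", "GAG": "E", "ATG": "M", "AAG": "K"}


def apply_snvs(ref_codon: str, snvs: list[tuple[int, str]]) -> str:
    """Apply (offset_in_codon, alt_base) substitutions to a reference codon."""
    bases = list(ref_codon)
    for offset, alt_base in snvs:
        bases[offset] = alt_base
    return "".join(bases)


def protein_change(ref_codon: str, snvs: list[tuple[int, str]], position: int) -> str:
    """Name the amino acid change produced by the given in-phase SNVs."""
    alt_codon = apply_snvs(ref_codon, snvs)
    return f"{CODON_TABLE[ref_codon]}{position}{CODON_TABLE[alt_codon]}"


# Reference codon 600 of BRAF is GTG (valine).
second_base_only = protein_change("GTG", [(1, "A")], 600)        # GAG
first_base_only = protein_change("GTG", [(0, "A")], 600)         # ATG
merged = protein_change("GTG", [(0, "A"), (1, "A")], 600)        # AAG
```

Annotating each SNV alone gives V600E and V600M, while annotating the merged two-base change gives V600K, which is the discrepancy the abstract describes.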


Subject(s)
Genome, Human , Genomics/standards , Molecular Sequence Annotation/standards , Mutation , Neoplasms/genetics , Polymorphism, Single Nucleotide , Scientific Experimental Error/statistics & numerical data , Algorithms , High-Throughput Nucleotide Sequencing/methods , Humans , Neoplasms/pathology
7.
Proteins ; 89(2): 242-250, 2021 02.
Article in English | MEDLINE | ID: mdl-32935893

ABSTRACT

A major challenge for protein databases is reconciling information from diverse sources. This is especially difficult when some information consists of secondary, human-interpreted rather than primary data. For example, the Swiss-Prot database contains curated annotations of subcellular location that are based on predictions from protein sequence, statements in scientific articles, and published experimental evidence. The Human Protein Atlas (HPA) consists of millions of high-resolution microscopic images that show protein spatial distribution on a cellular and subcellular level. These images are manually annotated with protein subcellular locations by trained experts. The image annotations in HPA can capture the variation of subcellular location across different cell lines, tissues, or tissue states. Systematic investigation of the consistency between HPA and Swiss-Prot assignments of subcellular location, which is important for understanding and utilizing protein location data from the two databases, has not been described previously. In this paper, we quantitatively evaluate the consistency of subcellular location annotations between HPA and Swiss-Prot at multiple levels, as well as variation of protein locations across cell lines and tissues. Our results show that annotations of these two databases differ significantly in many cases, leading to proposed procedures for deriving and integrating the protein subcellular location data. We also find that proteins having highly variable locations are more likely to be biomarkers of diseases, providing support for incorporating analysis of subcellular location in protein biomarker identification and screening.


Subject(s)
Databases, Protein/standards , Molecular Sequence Annotation/standards , Proteins/metabolism , Atlases as Topic , Cell Compartmentation , Cell Line , Eukaryotic Cells/metabolism , Eukaryotic Cells/ultrastructure , Humans , Observer Variation , Proteins/chemistry , Proteins/genetics , Reproducibility of Results , Uncertainty
8.
PLoS Genet ; 16(12): e1009060, 2020 12.
Article in English | MEDLINE | ID: mdl-33320851

ABSTRACT

Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, use eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance identifying causal genes.


Subject(s)
Genome-Wide Association Study/methods , Molecular Sequence Annotation/methods , Algorithms , Genome-Wide Association Study/standards , Humans , Molecular Sequence Annotation/standards , Polymorphism, Genetic , Quantitative Trait Loci , Reproducibility of Results
10.
BMC Genomics ; 21(1): 708, 2020 Oct 12.
Article in English | MEDLINE | ID: mdl-33045985

ABSTRACT

BACKGROUND: Nematode model organisms such as Caenorhabditis elegans and Pristionchus pacificus are powerful systems for studying the evolution of gene function at a mechanistic level. However, the identification of P. pacificus orthologs of candidate genes known from C. elegans is complicated by the discrepancy in the quality of gene annotations, a common problem in nematode and invertebrate genomics. RESULTS: Here, we combine comparative genomic screens for suspicious gene models with community-based curation to further improve the quality of gene annotations in P. pacificus. We extend previous curations of one-to-one orthologs to larger gene families and also orphan genes. Cross-species comparisons of protein lengths, screens for atypical domain combinations and species-specific orphan genes resulted in 4311 candidate genes that were subject to community-based curation. Corrections for 2946 gene models were implemented in a new version of the P. pacificus gene annotations. The new set of gene annotations contains 28,896 genes and has a single copy ortholog completeness level of 97.6%. CONCLUSIONS: Our work demonstrates the effectiveness of comparative genomic screens to identify suspicious gene models and the scalability of community-based approaches to improve the quality of thousands of gene models. Similar community-based approaches can help to improve the quality of gene annotations in other invertebrate species, including parasitic nematodes.


Subject(s)
Molecular Sequence Annotation , Rhabditida , Animals , Caenorhabditis elegans/genetics , Genomics , Molecular Sequence Annotation/methods , Molecular Sequence Annotation/standards , Rhabditida/genetics , Species Specificity
11.
Biochemistry ; 59(35): 3258-3270, 2020 09 08.
Article in English | MEDLINE | ID: mdl-32786413

ABSTRACT

Free guanidine is increasingly recognized as a relevant molecule in biological systems. Recently, it was reported that urea carboxylase acts preferentially on guanidine, and consequently, it was considered to participate directly in guanidine biodegradation. Urea carboxylase combines with allophanate hydrolase to comprise the activity of urea amidolyase, an enzyme predominantly found in bacteria and fungi that catalyzes the carboxylation and subsequent hydrolysis of urea to ammonia and carbon dioxide. Here, we demonstrate that urea carboxylase and allophanate hydrolase from Pseudomonas syringae are insufficient to catalyze the decomposition of guanidine. Rather, guanidine is decomposed to ammonia through the combined activities of urea carboxylase, allophanate hydrolase, and two additional proteins of the DUF1989 protein family, expansively annotated as urea carboxylase-associated family proteins. These proteins comprise the subunits of a heterodimeric carboxyguanidine deiminase (CgdAB), which hydrolyzes carboxyguanidine to N-carboxyurea (allophanate). The genes encoding CgdAB colocalize with genes encoding urea carboxylase and allophanate hydrolase. However, 25% of urea carboxylase genes, including all fungal urea amidolyases, do not colocalize with cgdAB. This subset of urea carboxylases correlates with a notable Asp to Asn mutation in the carboxyltransferase active site. Consistent with this observation, we demonstrate that fungal urea amidolyase retains a strong substrate preference for urea. The combined activities of urea carboxylase, carboxyguanidine deiminase and allophanate hydrolase represent a newly recognized pathway for the biodegradation of guanidine. These findings reinforce the relevance of guanidine as a biological metabolite and reveal a broadly distributed group of enzymes that act on guanidine in bacteria.


Subject(s)
Guanidine/metabolism , Hydrolases/metabolism , Nitrogen/metabolism , Pseudomonas syringae/enzymology , Urea/metabolism , Allophanate Hydrolase/chemistry , Allophanate Hydrolase/metabolism , Ammonia/metabolism , Carbon-Nitrogen Ligases/chemistry , Carbon-Nitrogen Ligases/metabolism , Catalysis , Citrullination/physiology , Hydrolases/chemistry , Metabolic Networks and Pathways/physiology , Molecular Sequence Annotation/standards , Protein Subunits/chemistry , Protein Subunits/metabolism , Pseudomonas syringae/metabolism
12.
Nature ; 583(7818): 693-698, 2020 07.
Article in English | MEDLINE | ID: mdl-32728248

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression1. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.


Subject(s)
Databases, Genetic , Genome/genetics , Genomics , Molecular Sequence Annotation , Animals , Binding Sites , Chromatin/genetics , Chromatin/metabolism , DNA Methylation , Databases, Genetic/standards , Databases, Genetic/trends , Gene Expression Regulation/genetics , Genome, Human/genetics , Genomics/standards , Genomics/trends , Histones/metabolism , Humans , Mice , Molecular Sequence Annotation/standards , Quality Control , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/metabolism
13.
Nature ; 583(7817): 578-584, 2020 07.
Article in English | MEDLINE | ID: mdl-32699395

ABSTRACT

Bats possess extraordinary adaptations, including flight, echolocation, extreme longevity and unique immunity. High-quality genomes are crucial for understanding the molecular basis and evolution of these traits. Here we incorporated long-read sequencing and state-of-the-art scaffolding protocols1 to generate, to our knowledge, the first reference-quality genomes of six bat species (Rhinolophus ferrumequinum, Rousettus aegyptiacus, Phyllostomus discolor, Myotis myotis, Pipistrellus kuhlii and Molossus molossus). We integrated gene projections from our 'Tool to infer Orthologs from Genome Alignments' (TOGA) software with de novo and homology gene predictions as well as short- and long-read transcriptomics to generate highly complete gene annotations. To resolve the phylogenetic position of bats within Laurasiatheria, we applied several phylogenetic methods to comprehensive sets of orthologous protein-coding and noncoding regions of the genome, and identified a basal origin for bats within Scrotifera. Our genome-wide screens revealed positive selection on hearing-related genes in the ancestral branch of bats, which is indicative of laryngeal echolocation being an ancestral trait in this clade. We found selection and loss of immunity-related genes (including pro-inflammatory NF-κB regulators) and expansions of anti-viral APOBEC3 genes, which highlights molecular mechanisms that may contribute to the exceptional immunity of bats. Genomic integrations of diverse viruses provide a genomic record of historical tolerance to viral infection in bats. Finally, we found and experimentally validated bat-specific variation in microRNAs, which may regulate bat-specific gene-expression programs. Our reference-quality bat genomes provide the resources required to uncover and validate the genomic basis of adaptations of bats, and stimulate new avenues of research that are directly relevant to human health and disease1.


Subject(s)
Adaptation, Physiological/genetics , Chiroptera/genetics , Evolution, Molecular , Genome/genetics , Genomics/standards , Adaptation, Physiological/immunology , Animals , Chiroptera/classification , Chiroptera/immunology , DNA Transposable Elements/genetics , Immunity/genetics , Molecular Sequence Annotation/standards , Phylogeny , RNA, Untranslated/genetics , Reference Standards , Reproducibility of Results , Virus Integration/genetics , Viruses/genetics
14.
Trends Genet ; 36(7): 461-463, 2020 07.
Article in English | MEDLINE | ID: mdl-32544447

ABSTRACT

Since 2002, published miRNAs have been collected and named by the online repository miRBase. However, with 11 000 annual publications this has become challenging. Recently, four specialized miRNA databases were published, addressing particular needs for diverse scientific communities. This development provides major opportunities for the future of miRNA annotation and nomenclature.


Subject(s)
Databases, Nucleic Acid , Gene Expression Regulation , MicroRNAs/genetics , Molecular Sequence Annotation/standards , Sequence Analysis, RNA/standards , Software , Genomics , Humans
15.
Annu Rev Genomics Hum Genet ; 21: 55-79, 2020 08 31.
Article in English | MEDLINE | ID: mdl-32421357

ABSTRACT

Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.


Subject(s)
Genome, Human , Molecular Sequence Annotation/methods , Molecular Sequence Annotation/standards , Humans
16.
BMC Bioinformatics ; 21(1): 211, 2020 May 24.
Article in English | MEDLINE | ID: mdl-32448124

ABSTRACT

BACKGROUND: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions. RESULTS: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of "alerts" that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank's submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally. CONCLUSION: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. 
The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.


Subject(s)
Betacoronavirus , Coronavirus Infections , Databases, Nucleic Acid , Molecular Sequence Annotation , Pandemics , Pneumonia, Viral , Software , Betacoronavirus/genetics , COVID-19 , Coronavirus Infections/genetics , DNA Viruses , Genomics , Humans , Molecular Sequence Annotation/standards , Pneumonia, Viral/genetics , SARS-CoV-2 , Viruses
17.
FEMS Microbiol Rev ; 44(4): 418-431, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32386204

ABSTRACT

With the rapid increase in the number of sequenced prokaryotic genomes, relying on automated gene annotation became a necessity. Multiple lines of evidence, however, suggest that current bacterial genome annotations may contain inconsistencies and are incomplete, even for so-called well-annotated genomes. We here discuss underexplored sources of protein diversity and new methodologies for high-throughput genome reannotation. The expression of multiple molecular forms of proteins (proteoforms) from a single gene, particularly driven by alternative translation initiation, is gaining interest as a prominent contributor to bacterial protein diversity. In consequence, riboproteogenomic pipelines were proposed to comprehensively capture proteoform expression in prokaryotes by the complementary use of (positional) proteomics and the direct readout of translated genomic regions using ribosome profiling. To complement these discoveries, tailored strategies are required for the functional characterization of newly discovered bacterial proteoforms.


Subject(s)
Bacteria/genetics , Bacterial Proteins/genetics , Genome, Bacterial/genetics , Molecular Sequence Annotation/standards , Proteogenomics , Bacterial Proteins/chemistry
18.
Nature ; 581(7809): 452-458, 2020 05.
Article in English | MEDLINE | ID: mdl-32461655

ABSTRACT

The acceleration of DNA sequencing in samples from patients and population studies has resulted in extensive catalogues of human genetic variation, but the interpretation of rare genetic variants remains problematic. A notable example of this challenge is the existence of disruptive variants in dosage-sensitive disease genes, even in apparently healthy individuals. Here, by manual curation of putative loss-of-function (pLoF) variants in haploinsufficient disease genes in the Genome Aggregation Database (gnomAD)1, we show that one explanation for this paradox involves alternative splicing of mRNA, which allows exons of a gene to be expressed at varying levels across different cell types. Currently, no existing annotation tool systematically incorporates information about exon expression into the interpretation of variants. We develop a transcript-level annotation metric known as the 'proportion expressed across transcripts', which quantifies isoform expression for variants. We calculate this metric using 11,706 tissue samples from the Genotype Tissue Expression (GTEx) project2 and show that it can differentiate between weakly and highly evolutionarily conserved exons, a proxy for functional importance. We demonstrate that expression-based annotation selectively filters 22.8% of falsely annotated pLoF variants found in haploinsufficient disease genes in gnomAD, while removing less than 4% of high-confidence pathogenic variants in the same genes. Finally, we apply our expression filter to the analysis of de novo variants in patients with autism spectrum disorder and intellectual disability or developmental disorders to show that pLoF variants in weakly expressed regions have similar effect sizes to those of synonymous variants, whereas pLoF variants in highly expressed exons are most strongly enriched among cases. 
Our annotation is fast, flexible and generalizable, making it possible for any variant file to be annotated with any isoform expression dataset, and will be valuable for the genetic diagnosis of rare diseases, the analysis of rare variant burden in complex disorders, and the curation and prioritization of variants in recall-by-genotype studies.


Subject(s)
Disease/genetics , Haploinsufficiency/genetics , Loss of Function Mutation/genetics , Molecular Sequence Annotation , Transcription, Genetic , Transcriptome/genetics , Autism Spectrum Disorder/genetics , Datasets as Topic , Developmental Disabilities/genetics , Exons/genetics , Female , Genotype , Humans , Intellectual Disability/genetics , Male , Molecular Sequence Annotation/standards , Poisson Distribution , RNA, Messenger/analysis , RNA, Messenger/genetics , Rare Diseases/diagnosis , Rare Diseases/genetics , Reproducibility of Results , Exome Sequencing
19.
Gigascience ; 9(4)2020 04 01.
Article in English | MEDLINE | ID: mdl-32315029

ABSTRACT

BACKGROUND: Jellyfish belong to the phylum Cnidaria, which occupies an important phylogenetic location in the early-branching Metazoa lineages. The jellyfish Rhopilema esculentum is an important fishery resource in China. However, the genome resource of R. esculentum has not been reported to date. FINDINGS: In this study, we constructed a chromosome-level genome assembly of R. esculentum using Pacific Biosciences, Illumina, and Hi-C sequencing technologies. The final genome assembly was ∼275.42 Mb, with a contig N50 length of 1.13 Mb. Using Hi-C technology to identify the contacts among contigs, 260.17 Mb (94.46%) of the assembled genome were anchored onto 21 pseudochromosomes with a scaffold N50 of 12.97 Mb. We identified 17,219 protein-coding genes, with an average CDS length of 1,575 bp. The genome-wide phylogenetic analysis indicated that R. esculentum might have evolved more slowly than the other scyphozoan species used in this study. In addition, 127 toxin-like genes were identified, and 1 toxin-related "hub" was found by a genomic survey. CONCLUSIONS: We have generated a chromosome-level genome assembly of R. esculentum that could provide a valuable genomic background for studying the biology and pharmacology of jellyfish, as well as the evolutionary history of Cnidaria.
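
The contig and scaffold N50 values reported in this abstract follow the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_n50 = n50([2, 3, 4, 5, 6])  # total 20; 6 + 5 = 11 covers half, so N50 is 5
```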


Subject(s)
Chromosomes/genetics , Cnidaria/genetics , Genome/genetics , Reference Standards , Animals , China/epidemiology , Genomics/standards , High-Throughput Nucleotide Sequencing/standards , Molecular Sequence Annotation/standards
20.
Gigascience ; 9(3)2020 03 01.
Article in English | MEDLINE | ID: mdl-32170312

ABSTRACT

BACKGROUND: Over the past few years the variety of experimental designs and protocols for sequencing experiments increased greatly. To ensure the wide usability of the produced data beyond an individual project, rich and systematic annotation of the underlying experiments is crucial. FINDINGS: We first developed an annotation structure that captures the overall experimental design as well as the relevant details of the steps from the biological sample to the library preparation, the sequencing procedure, and the sequencing and processed files. Through various design features, such as controlled vocabularies and different field requirements, we ensured a high annotation quality, comparability, and ease of annotation. The structure can be easily adapted to a large variety of species. We then implemented the annotation strategy in a user-hosted web platform with data import, query, and export functionality. CONCLUSIONS: We present here an annotation structure and user-hosted platform for sequencing experiment data, suitable for lab-internal documentation, collaborations, and large-scale annotation efforts.


Subject(s)
Molecular Sequence Annotation/methods , Sequence Analysis/methods , Software , Molecular Sequence Annotation/standards , Sequence Analysis/standards