1.
Nucleic Acids Res ; 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38953162

ABSTRACT

Ribosome profiling experiments support the translation of a range of novel human open reading frames. By contrast, most peptides from large-scale proteomics experiments derive from just one source, 5' untranslated regions. Across the human genome we find evidence for 192 translated upstream regions, most of which would produce protein isoforms with extended N-terminal ends. Almost all of these N-terminal extensions are from highly abundant genes, which suggests that the novel regions we detect are just the tip of the iceberg. These upstream regions have characteristics that are not typical of coding exons. Their GC-content is remarkably high, even higher than 5' regions in other genes, and a large majority have non-canonical start codons. Although some novel upstream regions have cross-species conservation - five have orthologues in invertebrates for example - the reading frames of two thirds are not conserved beyond simians. These non-conserved regions also have no evidence of purifying selection, which suggests that much of this translation is not functional. In addition, non-conserved upstream regions have significantly more peptides in cancer cell lines than would be expected, a strong indication that an aberrant or noisy translation initiation process may play an important role in translation from upstream regions.

2.
Bioinform Adv ; 4(1): vbae029, 2024.
Article in English | MEDLINE | ID: mdl-38464973

ABSTRACT

Summary: The recently published T2T-CHM13 reference assembly completed the annotation of the final 8% of the human genome. It introduced 1956 genes, close to 100 of which are predicted to be coding because they have a protein coding parent gene. Here, we confirm the coding status and functional relevance of two of these genes, paralogues of WASHC1 and GPRIN2. We find that LOC124908094, one of four novel subtelomeric WASH1 genes uncovered in the new assembly, produces the WASH1 protein that forms part of the vital actin-regulatory WASH complex. Its coding status is supported by abundant proteomics, conservation, and cDNA evidence. It was previously assumed that gene WASHC1 produced the functional WASH1 protein, but new evidence shows that WASHC1 is a human-derived duplication and likely to be one of 12 WASH1 pseudogenes in the human gene set. We also find that the T2T-CHM13 assembly has added a functionally important copy of GPRIN2 to the human gene set. We demonstrate that uniquely mapping peptides from proteomics databases support the novel LOC124900631 rather than the GRCh38 assembly GPRIN2 gene. These new additions to the set of human coding genes underline the importance of the new T2T-CHM13 assembly. Availability and implementation: None.

3.
bioRxiv ; 2023 Jun 17.
Article in English | MEDLINE | ID: mdl-37398104

ABSTRACT

The WASH1 gene produces a protein that forms part of the developmentally important WASH complex. The WASH complex activates the Arp2/3 complex to initiate branched actin networks at the surface of endosomes. As a curiosity, the human reference gene set includes nine WASH1 genes. How many of these are pseudogenes and how many are bona fide coding genes is not clear. Eight of the nine WASH1 genes reside in rearrangement and duplication-prone subtelomeric regions. Many of these subtelomeric regions had gaps in the GRCh38 human genome assembly, but the recently published T2T-CHM13 assembly from the Telomere to Telomere (T2T) Consortium has filled in the gaps. As a result, the T2T Consortium has added four new WASH1 paralogues in previously unannotated subtelomeric regions. Here we show that one of these four novel WASH1 genes, LOC124908094, is the gene most likely to produce the functional WASH1 protein. We also demonstrate that the other twelve WASH1 genes derived from a single WASH8P pseudogene on chromosome 12. These 12 genes include WASHC1, the gene currently annotated as the functional WASH1 gene. We propose LOC124908094 should be annotated as a coding gene and all functional information relating to the WASHC1 gene on chromosome 9 should be transferred to LOC124908094. The remaining WASH1 genes, including WASHC1, should be annotated as pseudogenes. This work confirms that the T2T assembly has added at least one functionally relevant coding gene to the human reference set. It remains to be seen whether other important coding genes are missing from the GRCh38 reference assembly.

4.
Genome Biol Evol ; 14(12)2022 Dec 07.
Article in English | MEDLINE | ID: mdl-36346145

ABSTRACT

The mutually exclusive splicing of tandem duplicated exons produces protein isoforms that are identical save for a homologous region that allows for the fine tuning of protein function. Tandem duplicated exon substitution events are rare, yet highly important alternative splicing events. Most events are ancient, their isoforms are highly expressed, and they have significantly more pathogenic mutations than other splice events. Here, we analyzed the physicochemical properties and functional roles of the homologous polypeptide regions produced by the 236 tandem duplicated exon substitutions annotated in the human gene set. We find that the most important structural and functional residues in these homologous regions are maintained, and that most changes are conservative rather than drastic. Three quarters of the isoforms produced from tandem duplicated exon substitution events are tissue-specific, particularly in nervous and cardiac tissues, and tandem duplicated exon substitution events are enriched in functional terms related to structures in the brain and skeletal muscle. We find considerable evidence for the convergent evolution of tandem duplicated exon substitution events in vertebrates, arthropods, and nematodes. Twelve human gene families have orthologues with tandem duplicated exon substitution events in both Drosophila melanogaster and Caenorhabditis elegans. Six of these gene families are ion transporters, suggesting that tandem exon duplication in genes that control the flow of ions into the cell has an adaptive benefit. The ancient origins, the strong indications of tissue-specific functions, and the evidence of convergent evolution suggest that these events may have played important roles in the evolution of animal tissues and organs.


Subject(s)
Alternative Splicing , Drosophila melanogaster , Animals , Humans , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Exons , RNA Splicing , Protein Isoforms/genetics , Evolution, Molecular
5.
NPJ Genom Med ; 7(1): 59, 2022 Oct 18.
Article in English | MEDLINE | ID: mdl-36257961

ABSTRACT

Clinical variant interpretation is highly dependent on the choice of reference transcript. Although the longest transcript has traditionally been chosen as the reference, APPRIS principal and MANE Select transcripts, biologically supported reference sequences, are now available. In this study, we show that MANE Select and APPRIS principal transcripts are the best reference transcripts for clinical variation. APPRIS principal and MANE Select transcripts capture almost all ClinVar pathogenic variants, and they are particularly powerful over the 94% of coding genes in which they agree. We find that a vanishingly small number of ClinVar pathogenic variants affect alternative protein products. Alternative isoforms that are likely to be clinically relevant can be predicted using TRIFID scores; the highest scoring alternative transcripts are almost 700 times more likely to house pathogenic variants. We believe that APPRIS, MANE and TRIFID are essential tools for clinical variant interpretation.

6.
Genet Med ; 24(11): 2351-2366, 2022 11.
Article in English | MEDLINE | ID: mdl-36083290

ABSTRACT

PURPOSE: Germline loss-of-function variants in CTNNB1 cause neurodevelopmental disorder with spastic diplegia and visual defects (NEDSDV; OMIM 615075) and are the most frequent, recurrent monogenic cause of cerebral palsy (CP). We investigated the range of clinical phenotypes owing to disruptions of CTNNB1 to determine the association between NEDSDV and CP. METHODS: Genetic information from 404 individuals with collectively 392 pathogenic CTNNB1 variants was ascertained for the study. From these, detailed phenotypes for 52 previously unpublished individuals were collected and combined with 68 previously published individuals with comparable clinical information. The functional effects of selected CTNNB1 missense variants were assessed using TOPFlash assay. RESULTS: The phenotypes associated with pathogenic CTNNB1 variants were similar. A diagnosis of CP was not significantly associated with any set of traits that defined a specific phenotypic subgroup, indicating that CP is not additional to NEDSDV. Two CTNNB1 missense variants were dominant negative regulators of WNT signaling, highlighting the utility of the TOPFlash assay to functionally assess variants. CONCLUSION: NEDSDV is a clinically homogeneous disorder irrespective of initial clinical diagnoses, including CP, or entry points for genetic testing.


Subject(s)
Intellectual Disability , Neurodevelopmental Disorders , Humans , Phenotype , Neurodevelopmental Disorders/genetics , Wnt Signaling Pathway/genetics , Intellectual Disability/genetics , Genomics , beta Catenin/genetics
7.
Bioinformatics ; 38(Suppl_2): ii89-ii94, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124785

ABSTRACT

MOTIVATION: Selecting the splice variant that best represents a coding gene is a crucial first step in many experimental analyses, and vital for mapping clinically relevant variants. This study compares the longest isoforms, MANE Select transcripts, APPRIS principal isoforms, and expression data, and aims to determine which method is best for selecting biologically important reference splice variants for large-scale analyses. RESULTS: Proteomics analyses and human genetic variation data suggest that most coding genes have a single main protein isoform. We show that APPRIS principal isoforms and MANE Select transcripts best describe these main cellular isoforms, and find that using the longest splice variant as the representative is a poor strategy. Exons unique to the longest splice isoforms are not under selective pressure, and so are unlikely to be functionally relevant. Expression data are also a poor means of selecting the main splice variant. APPRIS principal and MANE Select exons are under purifying selection, while exons specific to alternative transcripts are not. There are MANE and APPRIS representatives for almost 95% of genes, and where they agree they are particularly effective, coinciding with the main proteomics isoform for over 98.2% of genes. AVAILABILITY AND IMPLEMENTATION: APPRIS principal isoforms for human, mouse and other model species can be downloaded from the APPRIS database (https://appris.bioinfo.cnio.es), GENCODE genes (https://www.gencodegenes.org/) and the Ensembl website (https://www.ensembl.org). MANE Select transcripts for the human reference set are available from the Ensembl, GENCODE and RefSeq databases (https://www.ncbi.nlm.nih.gov/refseq/). Lists of splice variants where MANE and APPRIS coincide are available from the APPRIS database. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteomics , Animals , Exons , Humans , Mice , Mutation , Protein Isoforms/genetics , Protein Isoforms/metabolism
8.
Nucleic Acids Res ; 50(D1): D54-D59, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34755885

ABSTRACT

APPRIS (https://appris.bioinfo.cnio.es) is a well-established database housing annotations for protein isoforms for a range of species. APPRIS selects principal isoforms based on protein structure and function features and on cross-species conservation. Most coding genes produce a single main protein isoform and the principal isoforms chosen by the APPRIS database best represent this main cellular isoform. Human genetic data, experimental protein evidence and the distribution of clinical variants all support the relevance of APPRIS principal isoforms. APPRIS annotations and principal isoforms have now been expanded to 10 model organisms. In this paper we highlight the most recent updates to the database. APPRIS annotations have been generated for two new species, cow and chicken, the protein structural information has been augmented with reliable models from the EMBL-EBI AlphaFold database, and we have substantially expanded the confirmatory proteomics evidence available for the human genome. The most significant change in APPRIS has been the implementation of TRIFID functional isoform scores. TRIFID functional scores are assigned to all splice isoforms, and APPRIS uses the TRIFID functional scores and proteomics evidence to determine principal isoforms when core methods cannot.


Subject(s)
Databases, Protein , Protein Isoforms/genetics , Proteins/genetics , Proteomics , Animals , Cattle , Chickens/genetics , Humans , Protein Conformation , Protein Isoforms/classification , Proteins/chemistry , Proteins/classification
9.
Nucleic Acids Res ; 49(14): 8232-8246, 2021 08 20.
Article in English | MEDLINE | ID: mdl-34302486

ABSTRACT

Most coding genes in the human genome are annotated with multiple alternative transcripts. However, clear evidence for the functional relevance of the protein isoforms produced by these alternative transcripts is often hard to find. Alternative isoforms generated from tandem exon duplication-derived substitutions are an exception. These splice events are rare, but have important functional consequences. Here, we have catalogued the 236 tandem exon duplication-derived substitutions annotated in the GENCODE human reference set. We find that more than 90% of the events have a last common ancestor in teleost fish, so are at least 425 million years old, and twenty-one can be traced back to the Bilateria clade. Alternative isoforms generated from tandem exon duplication-derived substitutions also have significantly more clinical impact than other alternative isoforms. Tandem exon duplication-derived substitutions have >25 times as many pathogenic and likely pathogenic mutations as other alternative events. Tandem exon duplication-derived substitutions appear to have vital functional roles in the cell and may have played a prominent part in metazoan evolution.


Subject(s)
Evolution, Molecular , Fishes/genetics , Genome, Human/genetics , Protein Isoforms/genetics , Alternative Splicing/genetics , Animals , Exons/genetics , Gene Duplication/genetics , Humans , Molecular Sequence Annotation , Sequence Alignment
10.
NAR Genom Bioinform ; 3(2): lqab044, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34046593

ABSTRACT

Alternative splicing of messenger RNA can generate an array of mature transcripts, but it is not clear how many go on to produce functionally relevant protein isoforms. There is only limited evidence for alternative proteins in proteomics analyses and data from population genetic variation studies indicate that most alternative exons are evolving neutrally. Determining which transcripts produce biologically important isoforms is key to understanding isoform function and to interpreting the real impact of somatic mutations and germline variations. Here we have developed a method, TRIFID, to classify the functional importance of splice isoforms. TRIFID was trained on isoforms detected in large-scale proteomics analyses and distinguishes these biologically important splice isoforms with high confidence. Isoforms predicted as functionally important by the algorithm had measurable cross species conservation and significantly fewer broken functional domains. Additionally, exons that code for these functionally important protein isoforms are under purifying selection, while exons from low scoring transcripts largely appear to be evolving neutrally. TRIFID has been developed for the human genome, but it could in principle be applied to other well-annotated species. We believe that this method will generate valuable insights into the cellular importance of alternative splicing.

11.
PLoS Comput Biol ; 16(10): e1008287, 2020 10.
Article in English | MEDLINE | ID: mdl-33017396

ABSTRACT

The role of alternative splicing is one of the great unanswered questions in cellular biology. There is strong evidence for alternative splicing at the transcript level, and transcriptomics experiments show that many splice events are tissue specific. It has been suggested that alternative splicing evolved in order to remodel tissue-specific protein-protein networks. Here we investigated the evidence for tissue-specific splicing among splice isoforms detected in a large-scale proteomics analysis. Although the data supporting alternative splicing is limited at the protein level, clear patterns emerged among the small numbers of alternative splice events that we could detect in the proteomics data. More than a third of these splice events were tissue-specific and most were ancient: over 95% of splice events that were tissue-specific in both proteomics and RNAseq analyses evolved prior to the ancestors of lobe-finned fish, at least 400 million years ago. By way of contrast, three in four alternative exons in the human gene set arose in the primate lineage, so our results cannot be extrapolated to the whole genome. Tissue-specific alternative protein forms in the proteomics analysis were particularly abundant in nervous and muscle tissues and their genes had roles related to the cytoskeleton and either the structure of muscle fibres or cell-cell connections. Our results suggest that this conserved tissue-specific alternative splicing may have played a role in the development of the vertebrate brain and heart.


Subject(s)
Alternative Splicing/genetics , Organ Specificity/genetics , Protein Isoforms , Animals , Computational Biology , Genome/genetics , Humans , Protein Isoforms/chemistry , Protein Isoforms/classification , Protein Isoforms/genetics , Proteomics
12.
NAR Genom Bioinform ; 2(1): lqz023, 2020 Mar.
Article in English | MEDLINE | ID: mdl-31886458

ABSTRACT

Transposable elements colonize genomes and with time may end up being incorporated into functional regions. SINE Alu elements, which appeared in the primate lineage, are ubiquitous in the human genome and more than a thousand overlap annotated coding exons. Although almost all Alu-derived coding exons appear to be in alternative transcripts, they have been incorporated into the main coding transcript in at least 11 genes. The extent to which Alu regions are incorporated into functional proteins is unclear, but we detected reliable peptide evidence to support the translation to protein of 33 Alu-derived exons. All but one of the Alu elements for which we detected peptides were frame-preserving and there was proportionally seven times more peptide evidence for Alu elements as for other primate exons. Despite this strong evidence for translation to protein we found no evidence of selection, either from cross species alignments or human population variation data, among these Alu-derived exons. Overall, our results confirm that SINE Alu elements have contributed to the expansion of the human proteome, and this contribution appears to be stronger than might be expected over such a relatively short evolutionary timeframe. Despite this, the biological relevance of these modifications remains open to question.

15.
Nucleic Acids Res ; 46(14): 7070-7084, 2018 08 21.
Article in English | MEDLINE | ID: mdl-29982784

ABSTRACT

Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases are annotated differently across the three sets. We have carried out an in-depth investigation on the 2764 genes classified as coding by one or more sets of manual curators and not coding by others. Data from large-scale genetic variation analyses suggests that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, complicating and adding noise to large-scale biomedical experiments. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.


Subject(s)
Genes , Antibodies , DNA Copy Number Variations , Genetic Variation , Genome, Human , Humans , Molecular Sequence Annotation , Proteins/genetics , Proteins/immunology , Proteins/metabolism , Pseudogenes
16.
Genome Res ; 2018 Feb 09.
Article in English | MEDLINE | ID: mdl-29440222

ABSTRACT

High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.

17.
Nucleic Acids Res ; 46(D1): D213-D217, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29069475

ABSTRACT

The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the 'principal' isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein coding genes and APPRIS principal isoforms are the best predictors of these main protein isoforms. Here, we present the updates to the database, new developments that include the addition of three new species (chimpanzee, Drosophila melanogaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species and refinements in the core methods that make up the annotation pipeline. In addition, APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants.


Subject(s)
Databases, Genetic , Protein Isoforms/genetics , Alternative Splicing , Amino Acid Sequence , Animals , Humans , Models, Molecular , Molecular Sequence Annotation , Protein Conformation , Protein Isoforms/chemistry , Proteome/genetics , Reproducibility of Results , Sequence Alignment
19.
Trends Biochem Sci ; 42(2): 98-110, 2017 02.
Article in English | MEDLINE | ID: mdl-27712956

ABSTRACT

Alternative splicing is commonly believed to be a major source of cellular protein diversity. However, although many thousands of alternatively spliced transcripts are routinely detected in RNA-seq studies, reliable large-scale mass spectrometry-based proteomics analyses identify only a small fraction of annotated alternative isoforms. The clearest finding from proteomics experiments is that most human genes have a single main protein isoform, while those alternative isoforms that are identified tend to be the most biologically plausible: those with the most cross-species conservation and those that do not compromise functional domains. Indeed, most alternative exons do not seem to be under selective pressure, suggesting that a large majority of predicted alternative transcripts may not even be translated into proteins.


Subject(s)
Alternative Splicing/genetics , Proteome/genetics , Exons , Protein Isoforms/genetics , Proteomics
20.
Genome Biol ; 17(1): 251, 2016 12 14.
Article in English | MEDLINE | ID: mdl-27964752

ABSTRACT

BACKGROUND: Genomic studies of endangered species provide insights into their evolution and demographic history, reveal patterns of genomic erosion that might limit their viability, and offer tools for their effective conservation. The Iberian lynx (Lynx pardinus) is the most endangered felid and a unique example of a species on the brink of extinction. RESULTS: We generate the first annotated draft of the Iberian lynx genome and carry out genome-based analyses of lynx demography, evolution, and population genetics. We identify a series of severe population bottlenecks in the history of the Iberian lynx that predate its known demographic decline during the 20th century and have greatly impacted its genome evolution. We observe drastically reduced rates of weak-to-strong substitutions associated with GC-biased gene conversion and increased rates of fixation of transposable elements. We also find multiple signatures of genetic erosion in the two remnant Iberian lynx populations, including a high frequency of potentially deleterious variants and substitutions, as well as the lowest genome-wide genetic diversity reported so far in any species. CONCLUSIONS: The genomic features observed in the Iberian lynx genome may hamper short- and long-term viability through reduced fitness and adaptive potential. The knowledge and resources developed in this study will boost the research on felid evolution and conservation genomics and will benefit the ongoing conservation and management of this emblematic species.


Subject(s)
Genetics, Population , Genome , Lynx/genetics , Animals , Endangered Species , Genetic Variation , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Sequence Analysis, DNA