1.
Database (Oxford) ; 2014: bau032, 2014.
Article in English | MEDLINE | ID: mdl-24705206

ABSTRACT

Protein databases are heavily contaminated with erroneous (mispredicted, abnormal and incomplete) sequences and these erroneous data significantly distort the conclusions drawn from genome-scale protein sequence analyses. In our earlier work we described the MisPred resource that serves to identify erroneous sequences; here we present the FixPred computational pipeline that automatically corrects sequences identified by MisPred as erroneous. The current version of the associated FixPred database contains corrected UniProtKB/Swiss-Prot and NCBI/RefSeq sequences from Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Danio rerio, Fugu rubripes, Ciona intestinalis, Branchiostoma floridae, Drosophila melanogaster and Caenorhabditis elegans; future releases of the FixPred database will include corrected sequences of additional Metazoan species. The FixPred computational pipeline and database (http://www.fixpred.com) are easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. Database URL: http://www.fixpred.com.


Subject(s)
Databases, Protein , Proteins/chemistry , Software , Amino Acid Sequence , Animals , Humans
2.
Database (Oxford) ; 2013: bat053, 2013.
Article in English | MEDLINE | ID: mdl-23864220

ABSTRACT

Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline that provide efficient means for the identification of erroneous sequences in public databases. The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases. Major releases of the database are automatically generated and updated regularly. The database (http://www.mispred.com) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. DATABASE URL: http://www.mispred.com.


Subject(s)
Databases, Protein , Proteins/chemistry , Amino Acid Sequence , Animals , Internet , Molecular Sequence Annotation , Molecular Sequence Data , Sequence Analysis, Protein , User-Computer Interface , Xenopus
3.
Biochem Soc Trans ; 39(5): 1416-20, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21936825

ABSTRACT

WFIKKN1 and WFIKKN2 are two closely related multidomain proteins consisting of a WAP (whey acidic protein) domain, a follistatin domain, an immunoglobulin domain, two Kunitz-type protease inhibitor domains and an NTR domain (netrin domain). Recent experiments have shown that both WFIKKN1 and WFIKKN2 bind myostatin (GDF8) and GDF11 (growth and differentiation factor 11) with high affinity and are potent antagonists of these growth factors. Structure-function studies on WFIKKN proteins have revealed that their interactions with GDF8 and GDF11 are mediated primarily by the follistatin and NTR domains.


Subject(s)
Milk Proteins/chemistry , Protein Structure, Tertiary , Proteins/chemistry , Proteins/metabolism , Animals , Humans , Intercellular Signaling Peptides and Proteins , Protein Binding , Proteins/genetics , Trypsin/metabolism , Trypsin Inhibitors/chemistry , Trypsin Inhibitors/metabolism
4.
Genes (Basel) ; 2(3): 599-607, 2011 Aug 16.
Article in English | MEDLINE | ID: mdl-26791658

ABSTRACT

We found some errors in the published versions of Figure S2, Figure S3 and Figure S8 of our paper [1]. The correct Figures are presented below. [...].

5.
Genes (Basel) ; 2(3): 449-501, 2011 Jul 13.
Article in English | MEDLINE | ID: mdl-24710207

ABSTRACT

In view of the fact that appearance of novel protein domain architectures (DA) is closely associated with biological innovations, there is a growing interest in the genome-scale reconstruction of the evolutionary history of the domain architectures of multidomain proteins. In such analyses, however, it is usually ignored that a significant proportion of the Metazoan sequences analyzed is mispredicted and that this may seriously affect the validity of the conclusions. To estimate the contribution of errors in gene prediction to differences in DA of predicted proteins, we have used the high quality manually curated UniProtKB/Swiss-Prot database as a reference. For genome-scale analysis of domain architectures of predicted proteins we focused on RefSeq, EnsEMBL and NCBI's GNOMON predicted sequences of Metazoan species with completely sequenced genomes. Comparison of the DA of UniProtKB/Swiss-Prot sequences of worm, fly, zebrafish, frog, chick, mouse, rat and orangutan with those of human Swiss-Prot entries identified relatively few cases where orthologs had different DA, although the percentage with different DA increased with evolutionary distance. In contrast, comparison of the DA of human, orangutan, rat, mouse, chicken, frog, zebrafish, worm and fly RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences with those of the corresponding/orthologous human Swiss-Prot entries identified a significantly higher proportion of domain architecture differences than the comparison of Swiss-Prot entries. Analysis of RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences with DAs different from those of their Swiss-Prot orthologs confirmed that the higher rate of domain architecture differences is due to errors in gene prediction, the majority of which could be corrected with our FixPred protocol.
We have also demonstrated that contamination of databases with incomplete, abnormal or mispredicted sequences introduces a bias in DA differences inasmuch as it increases the proportion of terminal over internal DA differences. Here we have shown that in the case of RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences of Metazoan species, the contribution of gene prediction errors to domain architecture differences of orthologs is comparable to or greater than that of true gene rearrangements. We have also demonstrated that domain architecture comparison may serve as a useful tool for the quality control of gene predictions and may thus guide the correction of sequence errors. Our findings caution that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. A reassessment of the DA evolution of orthologous and paralogous proteins is presented in an accompanying paper [1].
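
The distinction between terminal and internal DA differences drawn in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function name `classify_da_difference` and the rule used here (a difference counts as terminal when the two architectures share no common prefix or no common suffix) are assumptions made only for illustration, and domain architectures are assumed to be given as lists of domain names in N- to C-terminal order.

```python
def classify_da_difference(da_a, da_b):
    """Classify the difference between two domain architectures.

    Returns "none", "terminal" or "internal". This is an illustrative rule,
    not the published method: the shared prefix and suffix are stripped, and
    the difference is called terminal if it reaches either end of the chain.
    """
    a, b = list(da_a), list(da_b)
    if a == b:
        return "none"

    shortest = min(len(a), len(b))

    # Length of the common N-terminal run of domains.
    prefix = 0
    while prefix < shortest and a[prefix] == b[prefix]:
        prefix += 1

    # Length of the common C-terminal run, not overlapping the prefix.
    suffix = 0
    while suffix < shortest - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1

    # If nothing is shared at one end, the change touches that terminus.
    return "terminal" if prefix == 0 or suffix == 0 else "internal"
```

Under this rule a truncated prediction that lacks the last domain of its ortholog is reported as a terminal difference, which is the kind of case the abstract attributes largely to incomplete or mispredicted sequences. Architectures that share no domains at either end are also reported as terminal, a simplification of this sketch.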

6.
Genes (Basel) ; 2(3): 516-61, 2011 Aug 02.
Article in English | MEDLINE | ID: mdl-24710209

ABSTRACT

In the accompanying paper (Nagy, Szláma, Szarka, Trexler, Bányai, Patthy, Reassessing Domain Architecture Evolution of Metazoan Proteins: Major Impact of Gene Prediction Errors) we showed that in the case of UniProtKB/TrEMBL, RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences of Metazoan species the contribution of erroneous (incomplete, abnormal, mispredicted) sequences to domain architecture (DA) differences of orthologous proteins might be greater than those of true gene rearrangements. Based on these findings, we suggest that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. In this manuscript we examine the impact of confusing paralogous and epaktologous multidomain proteins (i.e., those that are related only through the independent acquisition of the same domain types) on conclusions drawn about DA evolution of multidomain proteins in Metazoa. To estimate the contribution of this type of error we have used as reference UniProtKB/Swiss-Prot sequences from protein families with well-characterized evolutionary histories. We have used two types of paralogy-group construction procedures and monitored the impact of various parameters on the separation of true paralogs from epaktologs on correctly annotated Swiss-Prot entries of multidomain proteins. Our studies have shown that, although public protein family databases are contaminated with epaktologs, analysis of the structure of sequence similarity networks of multidomain proteins provides an efficient means for the separation of epaktologs and paralogs. 
We have also demonstrated that contamination of protein families with epaktologs increases the apparent rate of DA change and introduces a bias in DA differences inasmuch as it increases the proportion of terminal over internal DA differences. We have shown that confusing paralogous and epaktologous multidomain proteins significantly increases the apparent rate of DA change in Metazoa and introduces a positional bias in favor of terminal over internal DA changes. Our findings caution that earlier studies based on analysis of datasets of protein families that were contaminated with epaktologs may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. A reassessment of the DA evolution of multidomain proteins is presented in an accompanying paper [1].

7.
Genes (Basel) ; 2(3): 578-98, 2011 Aug 05.
Article in English | MEDLINE | ID: mdl-24710211

ABSTRACT

In the accompanying papers we have shown that sequence errors of public databases and confusion of paralogs and epaktologs (proteins that are related only through the independent acquisition of the same domain types) significantly distort the picture that emerges from comparison of the domain architecture (DA) of multidomain Metazoan proteins since they introduce a strong bias in favor of terminal over internal DA change. The issue of whether terminal or internal DA changes occur with greater probability has very important implications for the DA evolution of multidomain proteins since gene fusion can add domains only at terminal positions, whereas domain-shuffling is capable of inserting domains both at internal and terminal positions. As a corollary, overestimation of terminal DA changes may be misinterpreted as evidence for a dominant role of gene fusion in DA evolution. In this manuscript we show that in several recent studies of DA evolution of Metazoa the authors used databases that are significantly contaminated with incomplete, abnormal and mispredicted sequences (e.g., UniProtKB/TrEMBL, EnsEMBL) and/or the authors failed to separate paralogs and epaktologs, explaining why these studies concluded that the major mechanism for gains of new domains in metazoan proteins is gene fusion. In contrast with the latter conclusion, our studies on high quality orthologous and paralogous Swiss-Prot sequences confirm that shuffling of mobile domains had a major role in the evolution of multidomain proteins of Metazoa and especially those formed in early vertebrates.

8.
Genome Biol ; 10(1): 201, 2009.
Article in English | MEDLINE | ID: mdl-19226436

ABSTRACT

The vast majority of the biology of a newly sequenced genome is inferred from the set of encoded proteins. Predicting this set is therefore invariably the first step after the completion of the genome DNA sequence. Here we review the main computational pipelines used to generate the human reference protein-coding gene sets.


Subject(s)
Computational Biology/methods , Genome/genetics , Proteins/genetics , Animals , Base Sequence , Genes , Humans
9.
BMC Bioinformatics ; 9: 353, 2008 Aug 27.
Article in English | MEDLINE | ID: mdl-18752676

ABSTRACT

BACKGROUND: Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes. RESULTS: Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries. 
CONCLUSION: MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
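
The principle behind the routines listed above, that a sequence is suspect when its features conflict with each other, can be sketched in a few lines of Python. This is only an illustration of routines (ii) and (iii), not the MisPred software: the `ProteinRecord` type, the `conflicts` function and the three small domain sets are hypothetical, and a real implementation would rely on curated domain classifications and predicted transmembrane segments.

```python
from dataclasses import dataclass, field

# Hypothetical, illustrative domain classifications (not MisPred's actual lists).
EXTRACELLULAR = {"EGF", "Kunitz", "WAP"}
CYTOPLASMIC = {"SH2", "Pkinase"}
NUCLEAR = {"Homeobox", "bZIP"}


@dataclass
class ProteinRecord:
    """Minimal stand-in for an annotated protein sequence."""
    identifier: str
    domains: list = field(default_factory=list)  # domain names, N- to C-terminal
    transmembrane_segments: int = 0               # number of predicted TM segments


def conflicts(record):
    """Return descriptions of the sketched rules that the record violates."""
    found = set(record.domains)
    has_extra = bool(found & EXTRACELLULAR)
    has_cyto = bool(found & CYTOPLASMIC)
    has_nuclear = bool(found & NUCLEAR)

    violated = []
    # Routine (ii): extracellular and cytoplasmic domains but no TM segment.
    if has_extra and has_cyto and record.transmembrane_segments == 0:
        violated.append("extracellular+cytoplasmic without transmembrane segment")
    # Routine (iii): extracellular and nuclear domains in the same protein.
    if has_extra and has_nuclear:
        violated.append("extracellular+nuclear co-occurrence")
    return violated
```

A record with an empty result passes these two checks only; it may still be incomplete or mispredicted in ways covered by the other routines.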


Subject(s)
Database Management Systems , Databases, Protein , Information Storage and Retrieval/methods , Internet , Natural Language Processing , Proteins/classification , Terminology as Topic , Artifacts , Proteins/chemistry , Proteins/metabolism , Quality Control , Sequence Analysis, Protein/methods
10.
Proc Natl Acad Sci U S A ; 104(13): 5495-500, 2007 Mar 27.
Article in English | MEDLINE | ID: mdl-17372197

ABSTRACT

Alternative premessenger RNA splicing enables genes to generate more than one gene product. Splicing events that occur within protein coding regions have the potential to alter the biological function of the expressed protein and even to create new protein functions. Alternative splicing has been suggested as one explanation for the discrepancy between the number of human genes and functional complexity. Here, we carry out a detailed study of the alternatively spliced gene products annotated in the ENCODE pilot project. We find that alternative splicing in human genes is more frequent than has commonly been suggested, and we demonstrate that many of the potential alternative gene products will have markedly different structure and function from their constitutively spliced counterparts. For the vast majority of these alternative isoforms, little evidence exists to suggest they have a role as functional proteins, and it seems unlikely that the spectrum of conventional enzymatic or structural functions can be substantially extended through alternative splicing.


Subject(s)
Alternative Splicing , RNA Precursors , Databases, Genetic , Gene Expression Regulation , Genome, Human , Humans , Internet , Models, Molecular , Protein Conformation , Protein Isoforms , Protein Sorting Signals , Protein Structure, Tertiary , Proteins/chemistry , RNA Splicing
12.
FEBS J ; 272(19): 5064-78, 2005 Oct.
Article in English | MEDLINE | ID: mdl-16176277

ABSTRACT

Originally the term 'protein module' was coined to distinguish mobile domains that frequently occur as building blocks of diverse multidomain proteins from 'static' domains that usually exist only as stand-alone units of single-domain proteins. Despite the widespread use of the term 'mobile domain', the distinction between static and mobile domains is rather vague as it is not easy to quantify the mobility of domains. In the present work we show that the most appropriate measure of the mobility of domains is the number of types of local environments in which a given domain is present. Ranking of domains with respect to this parameter in different evolutionary lineages highlighted marked differences in the propensity of domains to form multidomain proteins. Our analyses have also shown that there is a correlation between domain size and domain mobility: smaller domains are more likely to be used in the construction of multidomain proteins, whereas larger domains are more likely to be static, stand-alone domains. It is also shown that shuffling of a limited set of modules was facilitated by intronic recombination in the metazoan lineage and this has contributed significantly to the emergence of novel complex multidomain proteins, novel functions and increased organismic complexity of metazoa.
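
The mobility measure proposed in this abstract, the number of types of local environments in which a domain occurs, can be approximated with a short Python sketch. The abstract does not define a local environment precisely, so the sketch assumes it is the pair of immediately neighbouring domains (with `None` marking a chain end); the function names `local_environments` and `mobility_ranking` are likewise hypothetical.

```python
from collections import defaultdict


def local_environments(architectures):
    """Map each domain to the set of (left, right) neighbour pairs it occurs in.

    `architectures` is an iterable of domain-name lists in N- to C-terminal
    order. Chain ends are represented by None (an assumption of this sketch).
    """
    environments = defaultdict(set)
    for domains in architectures:
        padded = [None, *domains, None]
        for i in range(1, len(padded) - 1):
            environments[padded[i]].add((padded[i - 1], padded[i + 1]))
    return environments


def mobility_ranking(architectures):
    """Rank domains by their number of distinct local environments."""
    environments = local_environments(architectures)
    return sorted(
        ((domain, len(envs)) for domain, envs in environments.items()),
        key=lambda item: (-item[1], item[0]),  # most environments first, then by name
    )
```

With this definition a domain found only as a stand-alone unit has exactly one environment, `(None, None)`, whereas a domain reused in many multidomain proteins accumulates many distinct neighbour pairs.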


Subject(s)
Evolution, Molecular , Proteins/chemistry , Proteins/metabolism , Animals , Computational Biology , Exons/genetics , Models, Biological , Protein Structure, Tertiary , Proteins/genetics
13.
Eur J Biochem ; 270(9): 2101-7, 2003 May.
Article in English | MEDLINE | ID: mdl-12709070

ABSTRACT

Recently we have described a novel secreted protein (the WFIKKN protein) that consists of multiple types of protease inhibitory modules, including two tandem Kunitz-type protease inhibitor domains. On the basis of its homologies we have suggested that the WFIKKN protein is a multivalent protease inhibitor that may control the action of different proteases. In the present work we have expressed the second Kunitz-type protease inhibitor domain of the human protein WFIKKN in Escherichia coli, purified it by affinity chromatography on trypsin-Sepharose and characterized its structure by CD spectroscopy. The recombinant protein was found to inhibit trypsin (Ki = 9.6 nM), but chymotrypsin, elastase, plasmin, pancreatic kallikrein, lung tryptase, plasma kallikrein, thrombin, urokinase or tissue plasminogen activator were not inhibited by the recombinant protein even at 1 µM concentration. In view of the marked trypsin-specificity of the inhibitor it is suggested that its physiological target may be trypsin.


Subject(s)
Protease Inhibitors/chemistry , Protein Conformation , Proteins/chemistry , Amino Acid Sequence , Animals , Circular Dichroism , Cloning, Molecular , Humans , Intercellular Signaling Peptides and Proteins , Molecular Sequence Data , Protease Inhibitors/isolation & purification , Protease Inhibitors/metabolism , Proteins/genetics , Proteins/isolation & purification , Proteins/metabolism , Recombinant Proteins/chemistry , Recombinant Proteins/genetics , Recombinant Proteins/isolation & purification , Recombinant Proteins/metabolism , Sequence Alignment , Substrate Specificity , Trypsin/metabolism , Trypsin Inhibitors/metabolism