Results 1 - 20 of 33
1.
Autophagy ; 14(12): 2033-2034, 2018.
Article in English | MEDLINE | ID: mdl-30296899

ABSTRACT

I routinely see people use incorrect names for MAP1LC3/LC3 isoforms in scientific papers. In fact, it happens often enough that I decided to investigate the reason for the apparent confusion. It turns out that the sources of misinformation are abundant, including UniProt and antibody supplier web sites.


Subject(s)
Antibodies/classification , Microtubule-Associated Proteins/classification , Terminology as Topic , Autophagy-Related Proteins/chemistry , Autophagy-Related Proteins/immunology , Commerce/standards , Databases, Protein/classification , Databases, Protein/standards , Humans , Microtubule-Associated Proteins/chemistry , Microtubule-Associated Proteins/immunology , Protein Isoforms/classification , Protein Isoforms/immunology
2.
Acta Crystallogr F Struct Biol Commun ; 74(Pt 8): 463-472, 2018 08 01.
Article in English | MEDLINE | ID: mdl-30084395

ABSTRACT

Glycosylation is one of the most common forms of protein post-translational modification, but is also the most complex. Dealing with glycoproteins in structure model building, refinement, validation and PDB deposition is more error-prone than dealing with nonglycosylated proteins owing to limitations of the experimental data and available software tools. Also, experimentalists are typically less experienced in dealing with carbohydrate residues than with amino-acid residues. The results of the reannotation and re-refinement by PDB-REDO of 8114 glycoprotein structure models from the Protein Data Bank are analyzed. The positive aspects of 3620 reannotations and subsequent refinement, as well as the remaining challenges to obtaining consistently high-quality carbohydrate models, are discussed.


Subject(s)
Databases, Protein/classification , Databases, Protein/standards , Glycoproteins/chemistry , Glycoproteins/classification
3.
Sci Rep ; 6: 31971, 2016 08 18.
Article in English | MEDLINE | ID: mdl-27534507

ABSTRACT

The advances of omics technologies have triggered the production of an enormous volume of data coming from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast amount of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) combining functional and taxonomic information. FunTaxIS is able to define a taxon specific functional space by exploiting annotation frequencies in order to establish if a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints nearly cover the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to get useful insights on the generated taxonomic rules.


Subject(s)
Databases, Protein/classification , Gene Ontology , Proteins/classification , Proteome/classification , Animals , Humans , Proteins/genetics , Species Specificity
4.
Methods ; 93: 15-23, 2016 Jan 15.
Article in English | MEDLINE | ID: mdl-26318087

ABSTRACT

Argot2.5 (Annotation Retrieval of Gene Ontology Terms) is a web server designed to predict protein function. It is an updated version of the previous Argot2 enriched with new features in order to enhance its usability and its overall performance. The algorithmic strategy exploits the grouping of Gene Ontology terms by means of semantic similarity to infer protein function. The tool has been challenged over two independent benchmarks and compared to Argot2, PANNZER, and a baseline method relying on BLAST, proving to obtain a better performance thanks to the contribution of some key interventions in critical steps of the working pipeline. The most effective changes regard: (a) the selection of the input data from sequence similarity searches performed against a clustered version of UniProt databank and a remodeling of the weights given to Pfam hits, (b) the application of taxonomic constraints to filter out annotations that cannot be applied to proteins belonging to the species under investigation. The taxonomic rules are derived from our in-house developed tool, FunTaxIS, that extends those provided by the Gene Ontology consortium. The web server is free for academic users and is available online at http://www.medcomp.medicina.unipd.it/Argot2-5/.


Subject(s)
Databases, Protein/classification , Gene Ontology , Proteins/classification , Proteins/physiology , Web Browser , Algorithms , Forecasting , Internet
5.
Acta Crystallogr D Biol Crystallogr ; 69(Pt 11): 2209-15, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24189232

ABSTRACT

The estimate of the root-mean-square deviation (r.m.s.d.) in coordinates between the model and the target is an essential parameter for calibrating likelihood functions for molecular replacement (MR). Good estimates of the r.m.s.d. lead to good estimates of the variance term in the likelihood functions, which increases signal to noise and hence success rates in the MR search. Phaser has hitherto used an estimate of the r.m.s.d. that only depends on the sequence identity between the model and target and which was not optimized for the MR likelihood functions. Variance-refinement functionality was added to Phaser to enable determination of the effective r.m.s.d. that optimized the log-likelihood gain (LLG) for a correct MR solution. Variance refinement was subsequently performed on a database of over 21,000 MR problems that sampled a range of sequence identities, protein sizes and protein fold classes. Success was monitored using the translation-function Z-score (TFZ), where a TFZ of 8 or over for the top peak was found to be a reliable indicator that MR had succeeded for these cases with one molecule in the asymmetric unit. Good estimates of the r.m.s.d. are correlated with the sequence identity and the protein size. A new estimate of the r.m.s.d. that uses these two parameters in a function optimized to fit the mean of the refined variance is implemented in Phaser and improves MR outcomes. Perturbing the initial estimate of the r.m.s.d. from the mean of the distribution in steps of standard deviations of the distribution further increases MR success rates.


Subject(s)
Amino Acid Sequence , Amino Acid Substitution , Databases, Protein/trends , Signal-To-Noise Ratio , Amino Acid Sequence/genetics , Amino Acid Substitution/genetics , Crystallography, X-Ray/instrumentation , Crystallography, X-Ray/methods , Databases, Protein/classification , Likelihood Functions , Models, Molecular , Mutation , Protein Folding , Sequence Alignment , Software , X-Ray Diffraction
6.
BMC Bioinformatics ; 11: 530, 2010 Oct 25.
Article in English | MEDLINE | ID: mdl-20973947

ABSTRACT

BACKGROUND: The Gene Ontology project supports categorization of gene products according to their location of action, the molecular functions that they carry out, and the processes that they are involved in. Although the ontologies are intentionally developed to be taxon neutral, and to cover all species, there are inherent taxon specificities in some branches. For example, the process 'lactation' is specific to mammals and the location 'mitochondrion' is specific to eukaryotes. The lack of an explicit formalization of these constraints can lead to errors and inconsistencies in automated and manual annotation. RESULTS: We have formalized the taxonomic constraints implicit in some GO classes, and specified these at various levels in the ontology. We have also developed an inference system that can be used to check for violations of these constraints in annotations. Using the constraints in conjunction with the inference system, we have detected and removed errors in annotations and improved the structure of the ontology. CONCLUSIONS: Detection of inconsistencies in taxon-specificity enables gradual improvement of the ontologies, the annotations, and the formalized constraints. This is progressively improving the quality of our data. The full system is available for download, and new constraints or proposed changes to constraints can be submitted online at https://sourceforge.net/tracker/?atid=605890&group_id=36855.


Subject(s)
Classification/methods , Molecular Sequence Annotation/methods , Databases, Genetic/classification , Databases, Protein/classification , Terminology as Topic , Vocabulary, Controlled
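
The constraint check described in the abstract above can be sketched as follows. This is a minimal illustration, not the Gene Ontology project's inference system: the toy taxonomy, the `ONLY_IN_TAXON` table, and all function names are assumptions made for this example, and only the two constraints named in the abstract ('lactation' and 'mitochondrion') are encoded. Propagation of constraints through the ontology graph is left out.

```python
# Toy taxonomy mapping each taxon to its parent (hypothetical, heavily simplified).
PARENT = {
    "Homo sapiens": "Mammalia",
    "Mammalia": "Eukaryota",
    "Saccharomyces cerevisiae": "Eukaryota",
    "Escherichia coli": "Bacteria",
    "Eukaryota": "root",
    "Bacteria": "root",
}

# "Only in taxon" constraints, taken from the two examples in the abstract.
ONLY_IN_TAXON = {
    "lactation": "Mammalia",
    "mitochondrion": "Eukaryota",
}


def lineage(taxon):
    """Return the taxon followed by all of its ancestors in the toy taxonomy."""
    chain = [taxon]
    while taxon in PARENT:
        taxon = PARENT[taxon]
        chain.append(taxon)
    return chain


def violates(go_term, taxon):
    """True if the term is constrained to a taxon that is not in the lineage."""
    required = ONLY_IN_TAXON.get(go_term)
    return required is not None and required not in lineage(taxon)


def find_violations(annotations):
    """Filter (gene, go_term, taxon) tuples down to those breaking a constraint."""
    return [a for a in annotations if violates(a[1], a[2])]


# Hypothetical annotations; the gene names are placeholders.
annotations = [
    ("geneA", "lactation", "Homo sapiens"),
    ("geneB", "lactation", "Saccharomyces cerevisiae"),
    ("geneC", "mitochondrion", "Escherichia coli"),
    ("geneD", "mitochondrion", "Saccharomyces cerevisiae"),
]
flagged = find_violations(annotations)
```

With these inputs, `flagged` holds the yeast 'lactation' annotation and the E. coli 'mitochondrion' annotation, which are the kinds of inconsistency the abstract says the system detects.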
7.
BMC Struct Biol ; 9: 26, 2009 Apr 30.
Article in English | MEDLINE | ID: mdl-19402914

ABSTRACT

BACKGROUND: In addition to structural domains, most eukaryotic proteins possess intrinsically disordered (ID) regions. Although ID regions often play important functional roles, their accurate identification is difficult. As human transcription factors (TFs) constitute a typical group of proteins with long ID regions, we regarded them as a model of all proteins and attempted to accurately classify TFs into structural domains and ID regions. Although an extremely high fraction of ID regions besides DNA binding and/or other domains was detected in human TFs in our previous investigation, 20% of the residues were left unassigned. In this report, we exploit the generally higher sequence divergence in ID regions than in structural regions to completely divide proteins into structural domains and ID regions. RESULTS: The new dichotomic system first identifies domains of known structures, followed by assignment of structural domains and ID regions with a combination of pre-existing tools and a newly developed program based on sequence divergence, taking un-aligned regions into consideration. The system was found to be highly accurate: its application to a set of proteins with experimentally verified ID regions had an error rate as low as 2%. Application of this system to human TFs (401 proteins) showed that 38% of the residues were in structural domains, while 62% were in ID regions. The preponderance of ID regions makes a sharp contrast to TFs of Escherichia coli (229 proteins), in which only 5% fell in ID regions. The method also revealed that 4.0% and 11.8% of the total length in human and E. coli TFs, respectively, are comprised of structural domains whose structures have not been determined. CONCLUSION: The present system verifies that sequence divergence including information of unaligned regions is a good indicator of ID regions. The system for the first time estimates the complete fractioning of structured/un-structured regions in human TFs, also revealing structural domains without homology to known structures. These predicted novel structural domains are good targets of structural genomics. When applied to other proteins, the system is expected to uncover more novel structural domains.


Subject(s)
Bacterial Proteins , Databases, Protein/classification , Protein Folding , Sequence Analysis, Protein , Structure-Activity Relationship , Transcription Factors/chemistry , Artificial Intelligence , Computational Biology , Humans , Pattern Recognition, Automated , Protein Binding , Protein Conformation , Protein Structure, Tertiary , Software , Transcription Factors/genetics
8.
BMC Struct Biol ; 9: 27, 2009 May 01.
Article in English | MEDLINE | ID: mdl-19409097

ABSTRACT

BACKGROUND: Macromolecular docking is a challenging field of bioinformatics. Developing new algorithms is a slow process generally involving routine tasks that should be found in a robust library and not programmed from scratch for every new software application. RESULTS: We present an object-oriented Python/C++ library to help the development of new docking methods. This library contains low-level routines like PDB-format manipulation functions as well as high-level tools for docking and analyzing results. We also illustrate the ease of use of this library with the detailed implementation of a 3-body docking procedure. CONCLUSION: The PTools library can handle molecules at coarse-grained or atomic resolution and allows users to rapidly develop new software. The library is already in use for protein-protein and protein-DNA docking with the ATTRACT program and for simulation analysis. This library is freely available under the GNU GPL license, together with detailed documentation.


Subject(s)
Computational Biology/methods , Databases, Protein/classification , Proteins/chemistry , Access to Information , Algorithms , Computer Simulation , Information Storage and Retrieval , Libraries , Protein Binding , Protein Interaction Mapping , Sequence Alignment , Sequence Analysis, DNA , Sequence Analysis, Protein , Software
9.
Curr Opin Drug Discov Devel ; 12(3): 408-19, 2009 May.
Article in English | MEDLINE | ID: mdl-19396742

ABSTRACT

The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.


Subject(s)
Databases, Protein/classification , Drug Discovery/methods , Internet/trends , Databases as Topic , Models, Molecular , Mutagenesis/physiology , Protein Binding , Protein Interaction Domains and Motifs/drug effects , Software
10.
Biophys Chem ; 138(1-2): 11-22, 2008 Nov.
Article in English | MEDLINE | ID: mdl-18814947

ABSTRACT

Data reduction techniques are now a vital part of numerical analysis and principal component analysis is often used to identify important molecular features from a set of descriptors. We now take a different approach and apply data reduction techniques directly to protein structure. With this we can reduce the three-dimensional structural data into two-dimensions while preserving the correct relationships. With two-dimensional representations, structural comparisons between proteins are accelerated significantly. This means that protein-protein similarity comparisons are now feasible on a large scale. We show how the approach can help to predict the function of kinase structures according to the Hanks' classification based on their structural similarity to different kinase classes.


Subject(s)
Phosphotransferases/chemistry , Proteins/chemistry , Structural Homology, Protein , Computational Biology , Databases, Protein/classification , Models, Biological , Protein Conformation , Protein Folding , Protein Structure, Secondary , Protein Structure, Tertiary
11.
J Allergy Clin Immunol ; 121(4): 847-52.e7, 2008 Apr.
Article in English | MEDLINE | ID: mdl-18395549

ABSTRACT

BACKGROUND: Existing allergen databases classify their entries by source and route of exposure, thus lacking an evolutionary, structural, and functional classification of allergens. OBJECTIVE: We sought to build AllFam, a database of allergen families, and use it to extract common structural and functional properties of allergens. METHODS: Allergen data from the Allergome database and protein family definitions from the Pfam database were merged into AllFam, a database that is freely accessible on the Internet at http://www.meduniwien.ac.at/allergens/allfam/. A structural classification of allergens was established by matching Pfam families with families from the Structural Classification of Proteins database. Biochemical functions of allergens were extracted from the Gene Ontology Annotation database. RESULTS: Seven hundred seven allergens were classified by sequence into 134 AllFam families containing 184 Pfam domains (2% of 9318 Pfam families). A random set of 707 sequences with the same taxonomic distribution contained a significantly higher number of different Pfam domains (479 +/- 17). Classifying allergens by structure revealed that 5% of 3012 Structural Classification of Proteins families contained allergens. The biochemical functions of allergens most frequently found were limited to hydrolysis of proteins, polysaccharides, and lipids; binding of metal ions and lipids; storage; and cytoskeleton association. CONCLUSION: The small number of protein families that contain allergens and the narrow functional distribution of most allergens confirm the existence of yet unknown factors that render proteins allergenic.


Subject(s)
Allergens/chemistry , Allergens/physiology , Databases, Protein/classification , Multigene Family/immunology , Proteins/chemistry , Proteins/physiology , Terminology as Topic , Allergens/classification , Allergens/genetics , Animals , Humans , Plant Proteins/chemistry , Plant Proteins/classification , Plant Proteins/genetics , Plant Proteins/physiology , Proteins/classification , Proteins/genetics , Proteome/chemistry , Proteome/classification , Proteome/genetics , Proteome/physiology , Random Allocation , Sequence Analysis, DNA , Structure-Activity Relationship
13.
J Mol Biol ; 377(4): 1265-78, 2008 Apr 04.
Article in English | MEDLINE | ID: mdl-18313074

ABSTRACT

A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures due to a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases of homologs and analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile sequence scores computed based on structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database, Structural Classification of Proteins (SCOP). The SVM classifier recovers 76% of the remote homologs defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.


Subject(s)
Computational Biology , Databases, Protein , Sequence Alignment/methods , Sequence Homology, Amino Acid , Amino Acid Sequence , Databases, Protein/classification , Linear Models , Models, Molecular , Molecular Sequence Data , Probability Theory , Reproducibility of Results , Sequence Analysis, Protein/methods
14.
BMC Bioinformatics ; 9: 35, 2008 Jan 23.
Article in English | MEDLINE | ID: mdl-18215279

ABSTRACT

BACKGROUND: It has repeatedly been shown that interacting protein families tend to have similar phylogenetic trees. These similarities can be used to predict the mapping between two families of interacting proteins (i.e. which proteins from one family interact with which members of the other). The correct mapping will be that which maximizes the similarity between the trees. The two families may comprise both orthologs and paralogs if members of the two families are present in more than one organism. This fact can be exploited to restrict the possible mappings, simply by disallowing links between proteins of different organisms. We present here an algorithm to predict the mapping between families of interacting proteins which is able to incorporate information regarding orthologues, or any other assignment of proteins to "classes" that may restrict possible mappings. RESULTS: For the first time in methods for predicting mappings, we have tested this new approach on a large number of interacting protein domains in order to statistically assess its performance. The method accurately predicts around 80% in the most favourable cases. We also analysed in detail the results of the method for a well defined case of interacting families, the sensor and kinase components of the Ntr-type two-component system, for which up to 98% of the pairings predicted by the method were correct. CONCLUSION: Based on the well established relationship between tree similarity and interactions we developed a method for predicting the mapping between two interacting families using genomic information alone. The program is available through a web interface.


Subject(s)
Databases, Protein/classification , Information Systems , Proteins/classification , Proteins/genetics , Forecasting , Information Systems/trends , Protein Binding/physiology , Protein Interaction Mapping/classification , Protein Interaction Mapping/methods , Proteins/metabolism , Sequence Alignment/methods , Yeasts/genetics , Yeasts/metabolism
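
The idea in the abstract above, choosing the pairing that maximizes tree similarity while forbidding links across classes, can be illustrated with a brute-force sketch. Several things here are assumptions rather than the authors' method: tree similarity is approximated by the Pearson correlation between pairwise distance matrices, the search enumerates every permutation (feasible only for very small families), and the distance values and organism labels are made up.

```python
from itertools import permutations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def upper_triangle(dist, order):
    """Pairwise distances read in the given protein order, upper triangle only."""
    n = len(order)
    return [dist[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def best_mapping(dist_a, dist_b, class_a, class_b):
    """Try every pairing A[i] -> B[perm[i]] that stays within one class."""
    n = len(dist_a)
    reference = upper_triangle(dist_a, list(range(n)))
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        # Class restriction: never link proteins from different organisms.
        if any(class_a[i] != class_b[perm[i]] for i in range(n)):
            continue
        score = pearson(reference, upper_triangle(dist_b, perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Made-up distances for two families of four proteins in two organisms, X and Y.
dist_a = [[0, 1, 4, 6], [1, 0, 5, 7], [4, 5, 0, 2], [6, 7, 2, 0]]
dist_b = [[0, 1, 7, 5], [1, 0, 6, 4], [7, 6, 0, 2], [5, 4, 2, 0]]
classes = ["X", "X", "Y", "Y"]
mapping, score = best_mapping(dist_a, dist_b, classes, classes)
```

The class restriction cuts the search from 24 permutations to 4 in this toy case. For realistic family sizes an exhaustive search is not practical, so a heuristic search would be needed; the abstract does not say which one the authors use.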
15.
Proteins ; 67(4): 789-94, 2007 Jun 01.
Article in English | MEDLINE | ID: mdl-17380509

ABSTRACT

Searches using position specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is generated typically using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches the reference sequence is the same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and divergence of sequences used in the construction of a PSSM have a major influence on the performance of the multiple-profile-based search approach. We also identify that a simple parameter, defined by the number of PSSMs corresponding to a family that is hit for a query divided by the total number of PSSMs in the family, can effectively distinguish the true positives from the false positives in the multiple-profiles search approach.


Subject(s)
Databases, Protein , Amino Acid Sequence , Databases, Protein/classification , Sensitivity and Specificity , Sequence Homology, Amino Acid
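
The parameter defined at the end of the abstract above is a simple ratio, sketched below. The function name and the PSSM identifiers are hypothetical, and no cutoff is applied because the abstract does not state one.

```python
def hit_fraction(hit_pssms, family_pssms):
    """Fraction of a family's PSSMs that were hit by a given query.

    hit_pssms: identifiers of PSSMs that the query matched.
    family_pssms: identifiers of all PSSMs built for the family.
    """
    family = set(family_pssms)
    if not family:
        raise ValueError("family has no PSSMs")
    # Count only hits that belong to this family.
    return len(family & set(hit_pssms)) / len(family)


# Hypothetical family with four PSSMs, two of which the query hits.
fraction = hit_fraction(["pssm1", "pssm3"], ["pssm1", "pssm2", "pssm3", "pssm4"])
```

According to the abstract, a true positive tends to hit many of a family's profiles and a false positive only a few, so a higher value is stronger evidence of membership.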
17.
Mol Biotechnol ; 34(1): 69-93, 2006 Sep.
Article in English | MEDLINE | ID: mdl-16943573

ABSTRACT

By organizing and making widely accessible the increasing amounts of data from high-throughput analyses, protein interaction databases have become an integral resource for the biological community in relating sequence data with higher-order function. To provide a sense of the use and applicability of these databases, we describe each of the major comprehensive interaction databases as well as some of the more specialized ones. Content description, search/browse functionalities, and data presentation are discussed. A succinct explanation of database contents helps the user quickly identify whether the database contains applicable information to their research interest. Broad levels of search/browse functions as well as descriptions/examples allow users to quickly find and access pertinent data. At this point, clear presentation of search results as well as the primary content is necessary. Many databases display information graphically or divided into smaller digestible parts over a number of tabbed/linked pages. In addition, cross-linking between the databases promotes interconnectivity of the data and is an added layer of relational data for the user. Overall, although these protein interaction databases are under continual improvement, their current state shows that much time and effort has gone into organizing and presenting these large sets of data describing protein interactions.


Subject(s)
Databases, Protein/classification , Documentation/methods , Information Storage and Retrieval/methods , Protein Interaction Mapping/methods , Proteins/classification , Proteins/metabolism , Terminology as Topic , Database Management Systems , Proteins/chemistry , Vocabulary, Controlled
18.
Nat Biotechnol ; 24(7): 852-5, 2006 Jul.
Article in English | MEDLINE | ID: mdl-16823370

ABSTRACT

Antifreeze proteins (AFPs) are found in cold-adapted organisms and have the unusual ability to bind to and inhibit the growth of ice crystals. However, the underlying molecular basis of their ice-binding activity is unclear because of the difficulty of studying the AFP-ice interaction directly and the lack of a common motif, domain or fold among different AFPs. We have formulated a generic ice-binding model and incorporated it into a physicochemical pattern-recognition algorithm. It successfully recognizes ice-binding surfaces for a diverse range of AFPs, and clearly discriminates AFPs from other structures in the Protein Data Bank. The algorithm was used to identify a novel AFP from winter rye, and the antifreeze activity of this protein was subsequently confirmed. The presence of a common and distinct physicochemical pattern provides a structural basis for unifying AFPs from fish, insects and plants.


Subject(s)
Antifreeze Proteins/isolation & purification , Databases, Protein/classification , Algorithms , Antifreeze Proteins/classification , Models, Chemical , Molecular Sequence Data , Protein Conformation , Sequence Homology, Amino Acid , Structure-Activity Relationship
19.
BMC Bioinformatics ; 7: 187, 2006 Apr 04.
Article in English | MEDLINE | ID: mdl-16584572

ABSTRACT

BACKGROUND: The number and the arrangement of subunits that form a protein are referred to as quaternary structure. Quaternary structure is an important protein attribute that is closely related to its function. Proteins with quaternary structure are called oligomeric proteins. Oligomeric proteins are involved in various biological processes, such as metabolism, signal transduction, and chromosome replication. Thus, it is highly desirable to develop some computational methods to automatically classify the quaternary structure of proteins from their sequences. RESULTS: To explore this problem, we adopted an approach based on the functional domain composition of proteins. Every protein was represented by a vector calculated from the domains in the PFAM database. The nearest neighbor algorithm (NNA) was used for classifying the quaternary structure of proteins from this information. The jackknife cross-validation test was performed on the non-redundant protein dataset in which the sequence identity was less than 25%. The overall success rate obtained is 75.17%. Additionally, to demonstrate the effectiveness of this method, we predicted the proteins in an independent dataset and achieved an overall success rate of 84.11%. CONCLUSION: Compared with the amino acid composition method and Blast, the results indicate that the domain composition approach may be a more effective and promising high-throughput method in dealing with this complicated problem in bioinformatics.


Subject(s)
Databases, Protein/classification , Protein Structure, Quaternary , Sequence Analysis, Protein/methods , Algorithms , Computational Biology/methods , Protein Structure, Quaternary/genetics , Protein Structure, Tertiary/genetics
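
The pipeline in the abstract above (domain-composition vectors, nearest-neighbor classification, jackknife testing) can be sketched on toy data. The binary presence/absence encoding, the squared Euclidean distance, the domain identifiers, and the class labels are all assumptions for illustration; the abstract does not specify how the vectors or distances were computed.

```python
# Hypothetical domain vocabulary; real work would use Pfam accessions.
VOCABULARY = ["DOM_A", "DOM_B", "DOM_C", "DOM_D"]


def to_vector(domains):
    """Binary vector: 1 if the protein contains the domain, else 0."""
    return [1 if d in domains else 0 for d in VOCABULARY]


def sq_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_label(query, training):
    """Label of the training sample closest to the query vector."""
    _, label = min(training, key=lambda item: sq_distance(query, item[0]))
    return label


def jackknife_accuracy(samples):
    """Leave each sample out in turn and predict it from all the others."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_label(vector, rest) == label:
            correct += 1
    return correct / len(samples)


# Toy dataset: (domain set, quaternary-structure class), both made up.
proteins = [
    ({"DOM_A"}, "monomer"),
    ({"DOM_A", "DOM_B"}, "monomer"),
    ({"DOM_C"}, "homodimer"),
    ({"DOM_C", "DOM_D"}, "homodimer"),
]
samples = [(to_vector(domains), label) for domains, label in proteins]
accuracy = jackknife_accuracy(samples)
```

The jackknife (leave-one-out) test uses every protein as a test case exactly once, so the accuracy estimate does not depend on how a train/test split happens to be drawn.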
20.
Proteins ; 63(3): 490-500, 2006 May 15.
Article in English | MEDLINE | ID: mdl-16450363

ABSTRACT

Protein-protein interactions play a key role in many biological systems. High-throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false-positive and false-negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co-complex relationship, and (3) pathway co-membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions: Random Forest (RF), RF similarity-based k-Nearest-Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co-complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top-ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast-2-hybrid system were not among the top-ranking features under any condition.


Subject(s)
Computational Biology/classification , Computational Biology/methods , Databases, Protein/classification , Protein Interaction Mapping/classification , Protein Interaction Mapping/methods , Forecasting
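
The Gini index mentioned in the abstract above is the quantity a decision tree tries to reduce at each split, and summing those reductions per feature is a common way to score feature importance. The sketch below computes the reduction for a single threshold split. It is not the authors' code; the feature values, labels, and threshold are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity reduction from splitting samples at value <= threshold."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Made-up example: 1 = interacting pair, 0 = non-interacting pair.
labels = [1, 1, 0, 0]
informative = [0.9, 0.8, 0.2, 0.1]    # separates the classes perfectly
uninformative = [0.9, 0.1, 0.8, 0.2]  # unrelated to the class

gain_informative = gini_decrease(informative, labels, 0.5)
gain_uninformative = gini_decrease(uninformative, labels, 0.5)
```

Here the informative feature removes all impurity (a decrease of 0.5) and the uninformative one removes none. A Random Forest aggregates such decreases over every split and tree in which a feature is used.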