Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP scoring tools.

Thomas, Paul D; Kejariwal, Anish; Guo, Nan; Mi, Huaiyu; Campbell, Michael J; Muruganujan, Anushya; Lazareva-Ulitsky, Betty.

Nucleic Acids Res ; 34(Web Server issue): W645-50, 2006 Jul 01.

Article in English | MEDLINE | ID: mdl-16912992

ABSTRACT

The vast amount of protein sequence data now available, together with accumulating experimental knowledge of protein function, enables modeling of protein sequence and function evolution. The PANTHER database was designed to model evolutionary sequence-function relationships on a large scale. There are a number of applications for these data, and we have implemented web services that address three of them. The first is a protein classification service. Proteins can be classified, using only their amino acid sequences, to evolutionary groups at both the family and subfamily levels. Specific subfamilies, and often families, are further classified when possible according to their functions, including molecular function and the biological processes and pathways they participate in. The second application, then, is an expression data analysis service, where functional classification information can help find biological patterns in the data obtained from genome-wide experiments. The third application is a coding single-nucleotide polymorphism scoring service. In this case, information about evolutionarily related proteins is used to assess the likelihood of a deleterious effect on protein function arising from a single substitution at a specific amino acid position in the protein. All three web services are available at http://www.pantherdb.org/tools.
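As an illustration of the third service, the idea of scoring a substitution from evolutionarily related proteins can be sketched as a log-ratio of amino acid probabilities at one aligned position. This is a minimal sketch under stated assumptions, not the PANTHER implementation: the abstract does not give the scoring formula, and the function names, the additive pseudocount, and the toy alignment column are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=1.0):
    """Estimate amino acid probabilities at one alignment column.

    `column` is a string holding the residue each related protein has at
    the position of interest. Additive smoothing (an assumption of this
    sketch) keeps unseen residues from getting probability zero.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def substitution_score(column, wild_type, variant, pseudocount=1.0):
    """Return a score <= 0; more negative suggests a less tolerated change.

    The score is the negated absolute log-ratio of the two residues'
    estimated probabilities at this position (hypothetical scoring rule).
    """
    probs = column_probabilities(column, pseudocount)
    return -abs(math.log(probs[wild_type] / probs[variant]))


# Toy column: eight related proteins carry L, one carries I, one carries V.
column = "LLLLLLLLIV"
print(substitution_score(column, "L", "I"))  # observed in relatives: milder score
print(substitution_score(column, "L", "D"))  # never observed: more negative score
```

A substitution to a residue that related proteins already carry at that position scores closer to zero than a substitution to a residue never observed there, which is the intuition the abstract describes.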

Subject(s)

Databases, Protein; Evolution, Molecular; Polymorphism, Single Nucleotide; Proteins/genetics; Proteins/physiology; Sequence Analysis, Protein; Software; Amino Acid Substitution; Animals; Computer Graphics; Data Interpretation, Statistical; Drosophila melanogaster/genetics; Genomics; Humans; Internet; Markov Chains; Mice; Proteins/classification; RNA, Messenger/metabolism; Rats; User-Computer Interface

On the quality of tree-based protein classification.

Lazareva-Ulitsky, Betty; Diemer, Karen; Thomas, Paul D.

Bioinformatics ; 21(9): 1876-90, 2005 May 01.

Article in English | MEDLINE | ID: mdl-15647305

ABSTRACT

MOTIVATION: Phylogenetic analysis of protein sequences is widely used in protein function classification and delineation of subfamilies within larger families. In addition, the recent increase in the number of protein sequence entries with controlled vocabulary terms describing function (e.g. the Gene Ontology) suggests that it may be possible to overlay these terms onto phylogenetic trees to automatically locate functional divergence events in protein family evolution. Phylogenetic analysis of large datasets requires fast algorithms; and even 'fast', approximate distance matrix-based phylogenetic algorithms are slow on large datasets since they involve calculating maximum likelihood estimates of pairwise evolutionary distances. There have been many attempts to classify protein sequences on the family and subfamily level without reconstructing phylogenetic trees, but using hierarchical clustering with simpler distance measures, which also produce trees or dendrograms. How can these trees be compared in their ability to accurately classify protein sequences?

RESULTS: Given a 'reference classification' or 'group membership labels' for a set of related protein sequences as well as a tree describing their relationships (e.g. a phylogenetic tree), we propose a method for dividing the tree into monophyletic or paraphyletic groups so as to optimize the correspondence between the reference groups and the tree-derived groups. We call the achieved optimal correspondence the 'accuracy of a tree-based classification (TBC)', which measures the ability of a tree to separate proteins of similar function into monophyletic or paraphyletic groups. We apply this measure to compare classical NJ and UPGMA phylogenetic trees with the trees obtained from hierarchical clustering using different protein similarity measures. Our preliminary analysis on a set of expert-curated protein families and alignments suggests that there is no uniformly superior algorithm, and that simple protein similarity measures combined with hierarchical clustering produce trees with reasonable and often the most accurate TBC. We used our measure to help us to design TIPS, a tree-building algorithm, based on agglomerative clustering with a similarity measure derived from profile scoring. TIPS is comparable with phylogenetic algorithms in terms of classification accuracy and is much faster on large protein families. Due to its time scalability and acceptable accuracy, TIPS is being used in the large-scale PANTHER protein classification project. The trees produced by different algorithms for different protein families can be viewed at http://panther.appliedbiosystems.com/pub/tree_quality/trees.jsp. For every tree and every level of classification granularity we provide the optimal TBC along with the reference classification.

AVAILABILITY: The script that evaluates the accuracy of TBC is available at http://panther.appliedbiosystems.com/pub/tree_quality/index.jsp
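The idea of dividing a tree into groups so as to best match a reference classification can be illustrated with a small dynamic program. This is a simplified sketch, not the authors' TBC measure or their script: it assumes a rooted binary tree written as nested tuples, allows only monophyletic groups (whole clades), and scores a division by the fraction of leaves that carry the majority label of their clade. The function names and the toy tree are hypothetical.

```python
from collections import Counter
from functools import lru_cache


def leaves(tree):
    """Return the leaf names of a rooted binary tree given as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def best_clade_accuracy(tree, labels, max_groups):
    """Best majority-label accuracy over divisions into at most `max_groups` clades.

    `labels` maps each leaf name to its reference group. Each clade in a
    division is credited with the leaves that share its most common label.
    """

    @lru_cache(maxsize=None)
    def best(node, k):
        # Option 1: keep the whole subtree as a single group.
        names = leaves(node)
        whole = Counter(labels[n] for n in names).most_common(1)[0][1]
        if isinstance(node, str) or k == 1:
            return whole
        # Option 2: cut at this node and share the k groups between the children.
        left, right = node
        split = max(best(left, j) + best(right, k - j) for j in range(1, k))
        return max(whole, split)

    return best(tree, max_groups) / len(leaves(tree))


# Toy example: b1 sits next to the A clade, so two clades cannot separate A from B.
tree = ((("a1", "a2"), "b1"), ("b2", "b3"))
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B"}
for k in (1, 2, 3):
    print(k, best_clade_accuracy(tree, labels, k))
```

Varying `max_groups` plays the role of the classification granularity mentioned in the abstract. Because this sketch does not allow paraphyletic groups, it needs three clades to classify the toy tree perfectly, whereas a method that permits paraphyletic groups could treat b1, b2 and b3 as one group.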

Subject(s)

Algorithms; Evolution, Molecular; Phylogeny; Proteins/classification; Proteins/genetics; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods; Databases, Protein; Proteins/chemistry; Sequence Homology, Amino Acid; Software

The PANTHER database of protein families, subfamilies, functions and pathways.

Mi, Huaiyu; Lazareva-Ulitsky, Betty; Loo, Rozina; Kejariwal, Anish; Vandergriff, Jody; Rabkin, Steven; Guo, Nan; Muruganujan, Anushya; Doremieux, Olivier; Campbell, Michael J; Kitano, Hiroaki; Thomas, Paul D.

Nucleic Acids Res ; 33(Database issue): D284-8, 2005 Jan 01.

Article in English | MEDLINE | ID: mdl-15608197

ABSTRACT

PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (ontology terms and pathways), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences. The latest version, 5.0, contains 6683 protein families, divided into 31,705 subfamilies, covering approximately 90% of mammalian protein-coding genes. PANTHER 5.0 includes a number of significant improvements over previous versions, most notably (i) representation of pathways (primarily signaling pathways) and association with subfamilies and individual protein sequences; (ii) an improved methodology for defining the PANTHER families and subfamilies, and for building the HMMs; (iii) resources for scoring sequences against PANTHER HMMs both over the web and locally; and (iv) a number of new web resources to facilitate analysis of large gene lists, including data generated from high-throughput expression experiments. Efforts are underway to add PANTHER to the InterPro suite of databases, and to make PANTHER consistent with the PIRSF database. PANTHER is now publicly available without restriction at http://panther.appliedbiosystems.com.

Subject(s)

Databases, Protein; Proteins/classification; Sequence Analysis, Protein; Animals; Databases, Protein/statistics & numerical data; Gene Expression Profiling; Humans; Internet; Markov Chains; Mice; Proteins/chemistry; Proteins/physiology; Rats; Signal Transduction; Systems Integration; User-Computer Interface
