1.
Nucleic Acids Res ; 29(14): 2994-3005, 2001 Jul 15.
Article in English | MEDLINE | ID: mdl-11452024

ABSTRACT

PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
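
The abstract above mentions a ROC measure normalized to lie between 0 and 1 but does not give its formula. A minimal sketch of one common variant, the truncated ROC_n score, is shown here as an assumption rather than the authors' exact procedure. The function name `roc_n` and its arguments are hypothetical.

```python
def roc_n(ranked_labels, n, total_true):
    """Truncated, normalized ROC score (illustrative sketch).

    ranked_labels: booleans ordered from best-scoring hit to worst;
                   True marks a true positive, False a false positive.
    n:             number of false positives at which the curve is truncated.
    total_true:    total number of true positives that could be retrieved.
    """
    if n <= 0 or total_true <= 0:
        raise ValueError("n and total_true must be positive")
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            # Each false positive contributes the number of true
            # positives ranked above it.
            area += true_seen
            if false_seen == n:
                break
    # Assumption: if the list ends before n false positives are seen,
    # the missing ones are treated as ranked below every retrieved hit.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)
```

A ranking that places every true positive ahead of the first false positive scores 1, and one that retrieves no true positive before the n-th false positive scores 0.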


Subject(s)
Databases, Factual , Proteins/genetics , Sequence Alignment/methods , Software , Algorithms , Amino Acids/genetics , Animals , Computational Biology/methods , Computational Biology/statistics & numerical data , Humans , Information Storage and Retrieval , Reproducibility of Results , Sensitivity and Specificity
2.
Nucleic Acids Res ; 29(2): 351-61, 2001 Jan 15.
Article in English | MEDLINE | ID: mdl-11139604

ABSTRACT

The distribution of optimal local alignment scores of random sequences plays a vital role in evaluating the statistical significance of sequence alignments. These scores can be well described by an extreme-value distribution. The distribution's parameters depend upon the scoring system employed and the random letter frequencies; in general they cannot be derived analytically, but must be estimated by curve fitting. For obtaining accurate parameter estimates, a form of the recently described 'island' method has several advantages. We describe this method in detail, and use it to investigate the functional dependence of these parameters on finite-length edge effects.
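
The extreme-value description above can be made concrete with the standard Karlin-Altschul form, in which the expected number of chance alignments scoring at least S between sequences of lengths m and n is E = K m n exp(-lambda S). The sketch below assumes that form and ignores the edge-effect corrections the abstract investigates. The function names are hypothetical, and lambda and K must be supplied by the caller for the scoring system in use.

```python
import math


def e_value(score, m, n, lam, K):
    """Expected number of chance local alignments with score >= `score`.

    m, n: lengths of the two sequences (or query and database).
    lam, K: extreme-value parameters for the scoring system; they are
            inputs here, not estimated by this sketch.
    """
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam, K):
    """Probability of at least one chance alignment with score >= `score`.

    Uses P = 1 - exp(-E); expm1 keeps precision when E is tiny.
    """
    return -math.expm1(-e_value(score, m, n, lam, K))
```

For small E the two quantities nearly coincide, which is why E-values are often read directly as significance levels.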


Subject(s)
Sequence Alignment/statistics & numerical data , Statistical Distributions , Algorithms , Computational Biology/methods , Computational Biology/statistics & numerical data , Likelihood Functions , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/statistics & numerical data
3.
Genome Res ; 10(7): 1051-60, 2000 Jul.
Article in English | MEDLINE | ID: mdl-10899154

ABSTRACT

We have constructed a public gene expression data repository, with online data access and analysis through WWW and FTP sites, for serial analysis of gene expression (SAGE) data. The WWW and FTP components of this resource, SAGEmap, are located at http://www.ncbi.nlm.nih.gov/sage and ftp://ncbi.nlm.nih.gov/pub/sage, respectively. We herein describe SAGE data submission procedures, the construction and characteristics of SAGE tags to gene assignments, the derivation and use of a novel statistical test designed specifically for differential-type analyses of SAGE data, and the organization and use of this resource.


Subject(s)
Databases, Factual , Gene Expression/genetics , Internet , Female , Gene Library , Humans , Male , Sequence Analysis, DNA/methods , Sequence Tagged Sites , Signal Processing, Computer-Assisted
4.
Cancer Res ; 59(21): 5403-7, 1999 Nov 01.
Article in English | MEDLINE | ID: mdl-10554005

ABSTRACT

A public database, SAGEmap, was created as a component of the Cancer Genome Anatomy Project to provide a central location for depositing, retrieving, and analyzing human gene expression data. This database uses serial analysis of gene expression to quantify transcript levels in both malignant and normal human tissues. By accessing SAGEmap (http://www.ncbi.nlm.nih.gov/SAGE) the user can compare transcript populations between any of the posted libraries. As an initial demonstration of the database's utility, gene expression in human glioblastomas was compared with that of normal brain white matter. Of the 47,174 unique transcripts expressed in these two tissues, 471 (1.0%) were differentially expressed by more than 5-fold (P<0.001). Classification of these genes revealed functions consistent with the biological properties of glioblastomas, in particular: angiogenesis, transcription, and cell cycle related genes.


Subject(s)
Databases, Factual , Gene Expression , Neoplasms/genetics , Brain/metabolism , Cloning, Molecular , Glioblastoma/genetics , Humans , Internet , Models, Theoretical , RNA, Messenger/analysis , Reverse Transcriptase Polymerase Chain Reaction , Tissue Distribution
5.
Bioinformatics ; 15(12): 1000-11, 1999 Dec.
Article in English | MEDLINE | ID: mdl-10745990

ABSTRACT

MOTIVATION: Many studies have shown that database searches using position-specific score matrices (PSSMs) or profiles as queries are more effective at identifying distant protein relationships than are searches that use simple sequences as queries. One popular program for constructing a PSSM and comparing it with a database of sequences is Position-Specific Iterated BLAST (PSI-BLAST). RESULTS: This paper describes a new software package, IMPALA, designed for the complementary procedure of comparing a single query sequence with a database of PSI-BLAST-generated PSSMs. We illustrate the use of IMPALA to search a database of PSSMs for protein folds, and one for protein domains involved in signal transduction. IMPALA's sensitivity to distant biological relationships is very similar to that of PSI-BLAST. However, IMPALA employs a more refined analysis of statistical significance and, unlike PSI-BLAST, guarantees the output of the optimal local alignment by using the rigorous Smith-Waterman algorithm. Also, it is considerably faster when run with a large database of PSSMs than is BLAST or PSI-BLAST when run against the complete non-redundant protein database.
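
To illustrate what comparing a query sequence with a PSSM by Smith-Waterman involves, here is a minimal sketch that returns only the optimal local alignment score. It is not IMPALA's implementation: it assumes a linear gap penalty for brevity, a PSSM represented as a list of per-column dictionaries, and a hypothetical default score of -1 for residues missing from a column.

```python
def smith_waterman_pssm(pssm, seq, gap=4, missing=-1):
    """Best local alignment score of `seq` against a PSSM.

    pssm:    list of dicts, one per profile column, mapping residue -> score.
    seq:     sequence string.
    gap:     cost charged per gap position (linear gaps, a simplification).
    missing: score used when a residue is absent from a column (assumption).
    """
    cols = len(seq)
    prev = [0] * (cols + 1)  # previous row of the dynamic-programming table
    best = 0
    for column in pssm:
        cur = [0] * (cols + 1)
        for j in range(1, cols + 1):
            match = prev[j - 1] + column.get(seq[j - 1], missing)
            # Local alignment: a cell never drops below zero.
            cur[j] = max(0, match, prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Because every cell of the table is filled, the returned score is optimal for the given scoring scheme, in contrast to heuristic searches that examine only part of the table.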


Subject(s)
Databases, Factual , Information Storage and Retrieval/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Bacterial Proteins/genetics , False Negative Reactions , False Positive Reactions , Odds Ratio , Sequence Alignment , Sequence Homology
7.
J Exp Med ; 188(9): 1657-68, 1998 Nov 02.
Article in English | MEDLINE | ID: mdl-9802978

ABSTRACT

To characterize gene expression in activated mast cells more comprehensively than heretofore, we surveyed the changes in genetic transcripts by the method of serial analysis of gene expression in the RBL-2H3 line of rat mast cells before and after they were stimulated through their receptors with high affinity for immunoglobulin E (FcepsilonRI). A total of 40,759 transcripts derived from 11,300 genes were analyzed. Among the diverse genes that had not been previously associated with mast cells and that were constitutively expressed were those for the cytokine macrophage migration inhibitory factor, neurohormone receptors such as growth hormone-releasing factor and melatonin, and components of the exocytotic machinery. In addition, several dozen transcripts were differentially expressed in response to antigen-induced clustering of the FcepsilonRI. Included among these were the genes for preprorelaxin, mitogen-activated protein kinase kinase 3, and the dual specificity protein phosphatase, rVH6. Significantly, the majority of genes differentially expressed in this well-studied model of mast cell activation have not been identified before this analysis.


Subject(s)
Gene Expression , Mast Cells/immunology , Mast Cells/metabolism , Receptors, IgE/metabolism , Animals , Base Sequence , Cell Differentiation , DNA Primers/genetics , In Vitro Techniques , Mast Cells/cytology , Mitogen-Activated Protein Kinase Kinases , Protein Kinases/genetics , RNA, Messenger/genetics , RNA, Messenger/metabolism , Rats , Rats, Sprague-Dawley , Receptor Aggregation , Reverse Transcriptase Polymerase Chain Reaction , Signal Transduction
8.
Nucleic Acids Res ; 26(17): 3986-90, 1998 Sep 01.
Article in English | MEDLINE | ID: mdl-9705509

ABSTRACT

Protein families often are characterized by conserved sequence patterns or motifs. A researcher frequently wishes to evaluate the significance of a specific pattern within a protein, or to exploit knowledge of known motifs to aid the recognition of greatly diverged but homologous family members. To assist in these efforts, the pattern-hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein sequence and a pattern of interest that it contains. PHI-BLAST searches a protein database for other instances of the input pattern, and uses those found as seeds for the construction of local alignments to the query sequence. The random distribution of PHI-BLAST alignment scores is studied analytically and empirically. In many instances, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HS90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.


Subject(s)
Algorithms , Amino Acid Sequence , Caenorhabditis elegans Proteins , Pattern Recognition, Automated , Software , Adenosine Triphosphatases , Archaeal Proteins , Calcium-Binding Proteins , DNA Primase , Databases, Factual , HSP90 Heat-Shock Proteins , Helminth Proteins , RNA Nucleotidyltransferases
9.
Proteins ; 32(1): 88-96, 1998 Jul 01.
Article in English | MEDLINE | ID: mdl-9672045

ABSTRACT

Based on the observation that a single mutational event can delete or insert multiple residues, affine gap costs for sequence alignment charge a penalty for the existence of a gap, and a further length-dependent penalty. From structural or multiple alignments of distantly related proteins, it has been observed that conserved residues frequently fall into ungapped blocks separated by relatively nonconserved regions. To take advantage of this structure, a simple generalization of affine gap costs is proposed that allows nonconserved regions to be effectively ignored. The distribution of scores from local alignments using these generalized gap costs is shown empirically to follow an extreme value distribution. Examples are presented for which generalized affine gap costs yield superior alignments from the standpoints both of statistical significance and of alignment accuracy. Guidelines for selecting generalized affine gap costs are discussed, as is their possible application to multiple alignment.
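
The ordinary affine gap cost that this abstract generalizes can be written as a one-line function. The sketch below covers only the standard form (a fixed charge for opening a gap plus a charge per gapped residue), not the generalized costs proposed in the paper. The function name and the default values are illustrative assumptions, and some programs instead count the first residue as part of the opening charge.

```python
def affine_gap_cost(length, gap_open=11, gap_extend=1):
    """Cost of a single gap spanning `length` residues.

    Convention assumed here: gap_open is charged once for the existence
    of the gap, and gap_extend is charged for every residue in it.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return gap_open + gap_extend * length
```

Under this scheme one gap of ten residues is far cheaper than ten separate single-residue gaps, reflecting the observation that one mutational event can insert or delete several residues at once.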


Subject(s)
Protein Conformation , Proteins/chemistry , Sequence Alignment , Algorithms , Amino Acid Sequence , Molecular Sequence Data
10.
Nucleic Acids Res ; 25(17): 3389-402, 1997 Sep 01.
Article in English | MEDLINE | ID: mdl-9254694

ABSTRACT

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.


Subject(s)
DNA/chemistry , Databases, Factual , Proteins/chemistry , Sequence Alignment , Software , Algorithms , Amino Acid Sequence , Animals , Humans , Molecular Sequence Data
11.
FASEB J ; 11(1): 68-76, 1997 Jan.
Article in English | MEDLINE | ID: mdl-9034168

ABSTRACT

Computer analysis of a conserved domain, BRCT, first described at the carboxyl terminus of the breast cancer protein BRCA1, a p53 binding protein (53BP1), and the yeast cell cycle checkpoint protein RAD9 revealed a large superfamily of domains that occur predominantly in proteins involved in cell cycle checkpoint functions responsive to DNA damage. The BRCT domain consists of approximately 95 amino acid residues and occurs as a tandem repeat at the carboxyl terminus of numerous proteins, but has been observed also as a tandem repeat at the amino terminus or as a single copy. The BRCT superfamily presently includes approximately 40 nonorthologous proteins, namely, BRCA1, 53BP1, and RAD9; a protein family that consists of the fission yeast replication checkpoint protein Rad4, the oncoprotein ECT2, the DNA repair protein XRCC1, and yeast DNA polymerase subunit DPB11; DNA binding enzymes such as terminal deoxynucleotidyltransferases, deoxycytidyl transferase involved in DNA repair, and DNA-ligases III and IV; yeast multifunctional transcription factor RAP1; and several uncharacterized gene products. Another previously described domain that is shared by bacterial NAD-dependent DNA-ligases, the large subunits of eukaryotic replication factor C, and poly(ADP-ribose) polymerases appears to be a distinct version of the BRCT domain. The retinoblastoma protein (a universal tumor suppressor) and related proteins may contain a distant relative of the BRCT domain. Despite the functional diversity of all these proteins, participation in DNA damage-responsive checkpoints appears to be a unifying theme. Thus, the BRCT domain is likely to perform critical, yet uncharacterized, functions in the cell cycle control of organisms from bacteria to humans. The carboxyterminal BRCT domain of BRCA1 corresponds precisely to the recently identified minimal transcription activation domain of this protein, indicating one such function.


Subject(s)
BRCA1 Protein/chemistry , Cell Cycle Proteins/chemistry , DNA Damage , Amino Acid Sequence , Animals , BRCA1 Protein/physiology , Cell Cycle/physiology , Cell Cycle Proteins/physiology , Conserved Sequence , Databases, Factual , Humans , Molecular Sequence Data , Sequence Alignment , Sequence Analysis , Sequence Homology, Amino Acid
13.
Proc Natl Acad Sci U S A ; 93(9): 4345-9, 1996 Apr 30.
Article in English | MEDLINE | ID: mdl-8633068

ABSTRACT

Posttranscriptional regulation of genes of mammalian iron metabolism is mediated by the interaction of iron regulatory proteins (IRPs) with RNA stem-loop sequence elements known as iron-responsive elements (IREs). There are two identified IRPs, IRP1 and IRP2, each of which binds consensus IREs present in eukaryotic transcripts with equal affinity. Site-directed mutagenesis of IRP1 and IRP2 reveals that, although the binding affinities for consensus IREs are indistinguishable, the contributions of arginine residues in the active-site cleft to the binding affinity are different in the two RNA binding sites. Furthermore, although each IRP binds the consensus IRE with high affinity, each IRP also binds a unique alternative ligand, which was identified in an in vitro systematic evolution of ligands by exponential enrichment procedure. Differences in the two binding sites may be important in the function of the IRE-IRP regulatory system.


Subject(s)
Iron-Sulfur Proteins/metabolism , RNA, Messenger/metabolism , RNA-Binding Proteins/metabolism , Alternative Splicing , Amino Acid Sequence , Animals , Base Sequence , Binding Sites , Binding, Competitive , Cell Line , Chlorocebus aethiops , Consensus Sequence , DNA Primers , Humans , Iron Regulatory Protein 1 , Iron Regulatory Protein 2 , Iron-Regulatory Proteins , Iron-Sulfur Proteins/biosynthesis , Iron-Sulfur Proteins/chemistry , Kinetics , Molecular Sequence Data , Nucleic Acid Conformation , Point Mutation , Polymerase Chain Reaction , RNA, Messenger/chemistry , RNA-Binding Proteins/biosynthesis , RNA-Binding Proteins/chemistry , Recombinant Proteins/biosynthesis , Recombinant Proteins/chemistry , Recombinant Proteins/metabolism , Transfection
15.
Curr Opin Struct Biol ; 5(2): 236-44, 1995 Apr.
Article in English | MEDLINE | ID: mdl-7648327

ABSTRACT

The past two years have seen the rapid development of new recognition methods for protein structure prediction. These algorithms 'thread' the sequence of one protein through the known structure of another, looking for an alignment that corresponds to an energetically favorable model structure. Because they are based on energy calculation, rather than evolutionary distance, these methods extend the possibility of structure prediction by comparative modeling to a larger class of new sequences, where similarity to known structures is recognizable by no other means. The strength of the evidence they offer should be judged by objective statistical tests, however, so as to rule out the possibility that favorable scores arise from chance factors such as similarity of length, composition, or the consideration of a large number of alternative alignments. Calculation of objective p-values by analytical means is not yet possible, but it would appear that approximate values may be obtained by simulation, as they are in gapped, global sequence alignment. We propose that the results of threading experiments should include Z-scores relative to the composition-corrected score distribution obtained for shuffled and optimally aligned sequences.
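
The proposal to report Z-scores against a shuffled-sequence distribution can be sketched generically. The example below assumes the caller supplies a scoring function (`score_fn`, a hypothetical stand-in for a threading or alignment score) and shuffles one sequence, which preserves its length and composition. It is an illustration of the idea, not the authors' protocol.

```python
import random
import statistics


def shuffled_z_score(score_fn, seq_a, seq_b, n_shuffles=200, seed=0):
    """Z-score of score_fn(seq_a, seq_b) against shuffled versions of seq_a.

    score_fn:   callable taking two sequences and returning a number.
    n_shuffles: size of the simulated null sample (assumed value).
    seed:       makes the simulation reproducible.
    """
    rng = random.Random(seed)
    observed = score_fn(seq_a, seq_b)
    null_scores = []
    for _ in range(n_shuffles):
        residues = list(seq_a)
        rng.shuffle(residues)  # same composition and length, random order
        null_scores.append(score_fn("".join(residues), seq_b))
    spread = statistics.stdev(null_scores)
    if spread == 0:
        raise ValueError("null scores have zero variance")
    return (observed - statistics.mean(null_scores)) / spread
```

If `score_fn` itself optimizes over alignments, the null scores are also optimized, so the Z-score accounts for the large number of alternative alignments considered.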


Subject(s)
Protein Conformation , Sequence Alignment , Algorithms , Amino Acid Sequence , Databases, Factual , Protein Folding , Protein Structure, Secondary , Protein Structure, Tertiary
16.
Proc Natl Acad Sci U S A ; 91(25): 12091-5, 1994 Dec 06.
Article in English | MEDLINE | ID: mdl-7991589

ABSTRACT

We describe an approach to analyzing protein sequence databases that, starting from a single uncharacterized sequence or group of related sequences, generates blocks of conserved segments. The procedure involves iterative database scans with an evolving position-dependent weight matrix constructed from a coevolving set of aligned conserved segments. For each iteration, the expected distribution of matrix scores under a random model is used to set a cutoff score for the inclusion of a segment in the next iteration. This cutoff may be calculated to allow the chance inclusion of either a fixed number or a fixed proportion of false positive segments. With sufficiently high cutoff scores, the procedure converged for all alignment blocks studied, with varying numbers of iterations required. Different methods for calculating weight matrices from alignment blocks were compared. The most effective of those tested was a logarithm-of-odds, Bayesian-based approach that used prior residue probabilities calculated from a mixture of Dirichlet distributions. The procedure described was used to detect novel conserved motifs of potential biological importance.
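
A log-odds weight matrix of the kind compared in this abstract can be built from an alignment block as follows. This is a simplified sketch: it substitutes a single Dirichlet prior proportional to the background frequencies for the mixture of Dirichlet distributions the abstract reports as most effective. The function name, the `pseudocount` weight, and the input layout are assumptions.

```python
import math
from collections import Counter


def log_odds_matrix(block, background, pseudocount=1.0):
    """Position-dependent log-odds scores (in bits) from an alignment block.

    block:       list of equal-length aligned segments (strings).
    background:  dict mapping residue -> background probability.
    pseudocount: total prior weight, spread according to `background`
                 (a single-component stand-in for a Dirichlet mixture).
    """
    matrix = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount
        scores = {}
        for residue, bg in background.items():
            # Posterior mean estimate of the residue probability.
            prob = (counts[residue] + pseudocount * bg) / total
            scores[residue] = math.log2(prob / bg)
        matrix.append(scores)
    return matrix
```

Summing the per-column scores for a candidate segment gives the matrix score that an iterative search would compare with its cutoff.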


Subject(s)
Amino Acid Sequence , Conserved Sequence , Databases, Factual , Proteins/chemistry , Proteins/genetics , Bacteria/enzymology , Bacteria/genetics , Biological Evolution , Consensus Sequence , DNA Topoisomerases, Type I/chemistry , DNA Topoisomerases, Type I/genetics , Models, Theoretical , Molecular Sequence Data , Saccharomyces cerevisiae/enzymology , Saccharomyces cerevisiae/genetics , Statistics as Topic
17.
Protein Sci ; 3(11): 2045-54, 1994 Nov.
Article in English | MEDLINE | ID: mdl-7703850

ABSTRACT

Using computer methods for multiple alignment, sequence motif search, and tertiary structure modeling, we show that eukaryotic translation elongation factor 1 gamma (EF1 gamma) contains an N-terminal domain related to class theta glutathione S-transferases (GST). GST-like proteins related to class theta comprise a large group including, in addition to typical GSTs and EF1 gamma, stress-induced proteins from bacteria and plants, bacterial reductive dehalogenases and beta-etherases, and several uncharacterized proteins. These proteins share 2 conserved sequence motifs with GSTs of other classes (alpha, mu, and pi). Tertiary structure modeling showed that in spite of the relatively low sequence similarity, the GST-related domain of EF1 gamma is likely to form a fold very similar to that in the known structures of class alpha, mu, and pi GSTs. One of the conserved motifs is implicated in glutathione binding, whereas the other motif probably is involved in maintaining the proper conformation of the GST domain. We predict that the GST-like domain in EF1 gamma is enzymatically active and that to exhibit GST activity, EF1 gamma has to form homodimers. The GST activity may be involved in the regulation of the assembly of multisubunit complexes containing EF1 and aminoacyl-tRNA synthetases by shifting the balance between glutathione, disulfide glutathione, thiol groups of cysteines, and protein disulfide bonds. The GST domain is a widespread, conserved enzymatic module that may be covalently or noncovalently complexed with other proteins. Regulation of protein assembly and folding may be 1 of the functions of GST.


Subject(s)
Glutathione Transferase/chemistry , Peptide Elongation Factors/chemistry , Amino Acid Sequence , Animals , Binding Sites/genetics , Binding Sites/physiology , Biological Evolution , Computer Graphics , Conserved Sequence/genetics , Glutathione Transferase/metabolism , Humans , Models, Molecular , Molecular Sequence Data , Peptide Elongation Factor 1 , Peptide Elongation Factors/metabolism , Protein Structure, Secondary , Protein Structure, Tertiary , Sequence Alignment
18.
Nat Genet ; 6(2): 119-29, 1994 Feb.
Article in English | MEDLINE | ID: mdl-8162065

ABSTRACT

Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention has been paid to the precise algorithms these programs employ and to their relative speeds, there is a constellation of associated issues that are equally important to realize the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases, and network access to similarity search services.


Subject(s)
Databases, Factual , Information Storage and Retrieval , Sequence Alignment , Sequence Homology , Algorithms , Amino Acid Sequence , Animals , Base Sequence , Humans , Molecular Sequence Data , Software
19.
Science ; 262(5131): 208-14, 1993 Oct 08.
Article in English | MEDLINE | ID: mdl-8211139

ABSTRACT

A wealth of protein and DNA sequence data is being generated by genome projects and other sequencing efforts. A crucial barrier to deciphering these sequences and understanding the relations among them is the difficulty of detecting subtle local residue patterns common to multiple sequences. Such patterns frequently reflect similar molecular structures and biological properties. A mathematical definition of this "local multiple alignment" problem suitable for full computer automation has been used to develop a new and sensitive algorithm, based on the statistical method of iterative sampling. This algorithm finds an optimized local alignment model for N sequences in N-linear time, requiring only seconds on current workstations, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats. The method is illustrated as applied to helix-turn-helix proteins, lipocalins, and prenyltransferases.


Subject(s)
Algorithms , Carrier Proteins/chemistry , Helix-Loop-Helix Motifs , Sequence Alignment/methods , Transferases/chemistry , Amino Acid Sequence , Models, Statistical , Molecular Sequence Data , Protein Prenylation , Protein Structure, Secondary , Software
20.
Proc Natl Acad Sci U S A ; 90(12): 5873-7, 1993 Jun 15.
Article in English | MEDLINE | ID: mdl-8390686

ABSTRACT

Score-based measures of molecular-sequence features provide versatile aids for the study of proteins and DNA. They are used by many sequence data base search programs, as well as for identifying distinctive properties of single sequences. For any such measure, it is important to know what can be expected to occur purely by chance. The statistical distribution of high-scoring segments has been described elsewhere. However, molecular sequences will frequently yield several high-scoring segments for which some combined assessment is in order. This paper describes the statistical distribution for the sum of the scores of multiple high-scoring segments and illustrates its application to the identification of possible transmembrane segments and the evaluation of sequence similarity.


Subject(s)
Amino Acid Sequence , Base Sequence , DNA , Drosophila Proteins , Proteins , Receptor Protein-Tyrosine Kinases , Sequence Analysis , Sequence Homology, Amino Acid , Animals , Antithrombin III/genetics , Biological Evolution , Chickens , Drosophila/genetics , Eye Proteins/genetics , Fowlpox virus/genetics , Humans , Membrane Glycoproteins/genetics , Molecular Sequence Data , Probability , Receptors, Cell Surface/genetics , Receptors, Serotonin/genetics