Results 1 - 20 of 66
1.
Pac Symp Biocomput ; : 425-36, 2002.
Article in English | MEDLINE | ID: mdl-11928496

ABSTRACT

We report the identification of several putative muscle-specific regulatory elements, and genes which are expressed preferentially in the muscle of the nematode Caenorhabditis elegans. We used computational pattern finding methods to identify cis-regulatory motifs from promoter regions of a set of genes known to express preferentially in muscle; each motif describes the potential binding sites for an unknown regulatory factor. The significance and specificity of the identified motifs were evaluated using several different control sequence sets. Using the motifs, we searched the entire C. elegans genome for genes whose promoter regions have a high probability of being bound by the putative regulatory factors. Genes that met this criterion and were not included in our initial set were predicted to be good candidates for muscle expression. Some of these candidates are additional, known muscle expressed genes and several others are shown here to be preferentially expressed in muscle cells by using GFP (green fluorescent protein) constructs. The methods described here can be used to predict the spatial expression pattern of many uncharacterized genes.


Subject(s)
Caenorhabditis elegans/genetics , Genes, Helminth , Muscle Proteins/genetics , Animals , Base Sequence , Binding Sites , Consensus Sequence , DNA, Helminth/genetics , DNA, Helminth/metabolism , Regulatory Sequences, Nucleic Acid , Software
2.
Bioinformatics ; 17(11): 1067-76, 2001 Nov.
Article in English | MEDLINE | ID: mdl-11724738

ABSTRACT

MOTIVATION: High density DNA oligo microarrays are widely used in biomedical research. Selection of optimal DNA oligos that are deposited on the microarrays is critical. Based on sequence information and hybridization free energy, we developed a new algorithm to select optimal short (20-25 bases) or long (50 or 70 bases) oligos from genes or open reading frames (ORFs) and predict their hybridization behavior. Having optimized probes for each gene is valuable for two reasons. By minimizing background hybridization they provide more accurate determinations of true expression levels. Having optimum probes minimizes the number of probes needed per gene, thereby decreasing the cost of each microarray, raising the number of genes on each chip and increasing its usage. RESULTS: In this paper we describe algorithms to optimize the selection of specific probes for each gene in an entire genome. The criteria for truly optimum probes are easily stated but they are not computable at all levels currently. We have developed an heuristic approach that is efficiently computable at all levels and should provide a good approximation to the true optimum set. We have run the program on the complete genomes for several model organisms and deposited the results in a database that is available on-line (http://ural.wustl.edu/~lif/probe.pl). AVAILABILITY: The program is available upon request.


Subject(s)
Oligonucleotide Array Sequence Analysis/statistics & numerical data , Oligonucleotide Probes/genetics , Software , Algorithms , Base Sequence , Computational Biology , Databases, Nucleic Acid , Nucleic Acid Hybridization , Thermodynamics
4.
Bioinformatics ; 17(7): 608-21, 2001 Jul.
Article in English | MEDLINE | ID: mdl-11448879

ABSTRACT

MOTIVATION: Transcriptional activation in eukaryotic organisms normally requires combinatorial interactions of multiple transcription factors. Though several methods exist for identification of individual protein binding site patterns in DNA sequences, there are few methods for discovery of binding site patterns for cooperatively acting factors. Here we present an algorithm, Co-Bind (for COperative BINDing), for discovering DNA target sites for cooperatively acting transcription factors. The method utilizes a Gibbs sampling strategy to model the cooperativity between two transcription factors and defines position weight matrices for the binding sites. Sequences from both the training set and the entire genome are taken into account, in order to discriminate against commonly occurring patterns in the genome, and produce patterns which are significant only in the training set. RESULTS: We have tested Co-Bind on semi-synthetic and real data sets to show it can efficiently identify DNA target site patterns for cooperatively binding transcription factors. In cases where binding site patterns are weak and cannot be identified by other available methods, Co-Bind, by virtue of modeling the cooperativity between factors, can identify those sites efficiently. Though developed to model protein-DNA interactions, the scope of Co-Bind may be extended to combinatorial, sequence specific, interactions in other macromolecules. AVAILABILITY: The program is available upon request from the authors or may be downloaded from http://ural.wustl.edu.


Subject(s)
DNA-Binding Proteins/metabolism , DNA/genetics , DNA/metabolism , Algorithms , Bacterial Proteins/metabolism , Base Sequence , Binding Sites/genetics , Computational Biology , DNA, Bacterial/genetics , DNA, Bacterial/metabolism , Databases as Topic , Escherichia coli/genetics , Escherichia coli/metabolism , Genes, Fungal , Genome, Bacterial , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Software , Transcription Factors/metabolism , Transcriptional Activation
5.
Nucleic Acids Res ; 29(12): 2471-8, 2001 Jun 15.
Article in English | MEDLINE | ID: mdl-11410653

ABSTRACT

Salmonella bacteriophage repressor Mnt belongs to the ribbon-helix-helix class of transcription factors. Previous SELEX results suggested that interactions of Mnt with positions 16 and 17 of the operator DNA are not independent. Using a newly developed high-throughput quantitative multiple fluorescence relative affinity (QuMFRA) assay, we directly quantified the relative equilibrium binding constants (K(ref)) of Mnt to operators carrying all the possible dinucleotide combinations at these two positions. Results show that Mnt prefers binding to C, instead of wild-type A, at position 16 when wild-type C at position 17 is changed to other bases. The measured K(ref) values of double mutants were also higher than the values predicted from single mutants, demonstrating the non-independence of these two positions. The ability to produce a large number of quantitative binding data simultaneously and the potential to scale up makes QuMFRA a valuable tool for the large-scale study of macromolecular interaction.


Subject(s)
Bacteriophage P22/genetics , DNA/metabolism , Repressor Proteins/metabolism , Viral Proteins/metabolism , Base Sequence , Binding Sites , DNA/genetics , DNA-Binding Proteins/metabolism , Fluorescence , Fluorescent Dyes/metabolism , Models, Molecular , Mutation/genetics , Operator Regions, Genetic/genetics , Protein Binding , Salmonella/genetics , Salmonella/virology , Substrate Specificity , Thermodynamics , Viral Regulatory and Accessory Proteins
6.
Nucleic Acids Res ; 29(10): 2135-44, 2001 May 15.
Article in English | MEDLINE | ID: mdl-11353083

ABSTRACT

Post-transcriptional regulation of gene expression is often accomplished by proteins binding to specific sequence motifs in mRNA molecules, to affect their translation or stability. The motifs are often composed of a combination of sequence and structural constraints such that the overall structure is preserved even though much of the primary sequence is variable. While several methods exist to discover transcriptional regulatory sites in the DNA sequences of coregulated genes, the RNA motif discovery problem is much more difficult because of covariation in the positions. We describe the combined use of two approaches for RNA structure prediction, FOLDALIGN and COVE, that together can discover and model stem-loop RNA motifs in unaligned sequences, such as UTRs from post-transcriptionally coregulated genes. We evaluate the method on two datasets, one a section of rRNA genes with randomly truncated ends so that a global alignment is not possible, and the other a hyper-variable collection of IRE-like elements that were inserted into randomized UTR sequences. In both cases the combined method identified the motifs correctly, and in the rRNA example we show that it is capable of determining the structure, which includes bulge and internal loops as well as a variable length hairpin loop. Those automated results are quantitatively evaluated and found to agree closely with structures contained in curated databases, with correlation coefficients up to 0.9. A basic server, Stem-Loop Align SearcH (SLASH), which will perform stem-loop searches in unaligned RNA sequences, is available at http://www.bioinf.au.dk/slash/.


Subject(s)
Computational Biology , Nucleic Acid Conformation , RNA/chemistry , RNA/genetics , Software , Algorithms , Base Sequence , Databases as Topic , Internet , Molecular Sequence Data , RNA/metabolism , RNA, Archaeal/chemistry , RNA, Archaeal/genetics , RNA, Archaeal/metabolism , RNA, Ribosomal/chemistry , RNA, Ribosomal/genetics , RNA, Ribosomal/metabolism , Regulatory Sequences, Nucleic Acid/genetics , Sensitivity and Specificity , Sequence Alignment , Untranslated Regions/chemistry , Untranslated Regions/genetics , Untranslated Regions/metabolism
7.
Genome Res ; 11(4): 566-84, 2001 Apr.
Article in English | MEDLINE | ID: mdl-11282972

ABSTRACT

Identifying the complete transcriptional regulatory network for an organism is a major challenge. For each regulatory protein, we want to know all the genes it regulates, that is, its regulon. Examples of known binding sites can be used to estimate the binding specificity of the protein and to predict other binding sites. However, binding site predictions can be unreliable because determining the true specificity of the protein is difficult because of the considerable variability of binding sites. Because regulatory systems tend to be conserved through evolution, we can use comparisons between species to increase the reliability of binding site predictions. In this article, an approach is presented to evaluate the computational predictions of regulatory sites. We combine the prediction of transcription units having orthologous genes with the prediction of transcription factor binding sites based on probabilistic models. We augment the sets of genes in Escherichia coli that are expected to be regulated by two transcription factors, the cAMP receptor protein and the fumarate and nitrate reduction regulatory protein, through a comparison with the Haemophilus influenzae genome. At the same time, we learned more about the regulatory networks of H. influenzae, a species with much less experimental knowledge than E. coli. By studying orthologous genes subject to regulation by the same transcription factor, we also gained understanding of the evolution of the entire regulatory systems.


Subject(s)
Computational Biology , Escherichia coli Proteins , Genome, Bacterial , Genomics/methods , Regulon/genetics , Amino Acid Sequence , Bacterial Proteins/genetics , Binding Sites/genetics , Computational Biology/methods , Computational Biology/statistics & numerical data , Conserved Sequence , Cyclic AMP Receptor Protein/genetics , DNA-Binding Proteins/genetics , Escherichia coli/genetics , Genomics/statistics & numerical data , Iron-Sulfur Proteins/genetics , Molecular Sequence Data , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Transcription Factors/genetics
8.
Pac Symp Biocomput ; : 115-26, 2001.
Article in English | MEDLINE | ID: mdl-11262933

ABSTRACT

We are investigating the rules that govern protein-DNA interactions, using a statistical mechanics based formalism that is related to the Boltzmann Machine of the neural net literature. Our approach is data-driven: probabilistic algorithms are used to model protein-DNA interactions, given SELEX and/or phage data as input. In the current report, we trained the network using SELEX data, under the "one-to-one" model of interactions (i.e. one amino acid contacts one base). The trained network was able to successfully identify the wild-type binding sites of EGR and MIG protein families. Predictions from our method are as good as or better than those of existing methods in the literature. However, our methodology offers the potential to capitalise on quantitative detail, and to explore more general models of interaction, given the availability of data.


Subject(s)
Algorithms , DNA-Binding Proteins/chemistry , DNA/chemistry , Models, Chemical , Thermodynamics , Binding Sites , Data Interpretation, Statistical , Models, Statistical , Neural Networks, Computer , Peptide Library , Protein Binding , Saccharomyces cerevisiae/chemistry , Transcription Factors/chemistry
9.
Shock ; 15(3): 165-70, 2001 Mar.
Article in English | MEDLINE | ID: mdl-11236897

ABSTRACT

The traditional approach to the study of biology employs small-scale experimentation that results in the description of a molecular sequence of known function or relevance. In the era of the genome the reverse is true, as large-scale cloning and gene sequencing come first, followed by the use of computational methods to systematically determine gene function and regulation. The overarching goal of this new approach is to translate the knowledge learned from a systematic, global analysis of genomic data into a complete understanding of biology. For investigators who study shock, the specific goal is to increase understanding of the adaptive response to injury at the level of the entire genome. This review describes our initial experience using DNA microarrays to profile stress-induced changes in gene expression. We conclude that efforts to apply genomics to the study of injury are best coordinated by multi-disciplinary groups, because of the extensive expertise required.


Subject(s)
Genomics/trends , Research/trends , Wounds and Injuries/physiopathology , Forecasting , Genetic Techniques , Genome, Fungal , Genomics/methods , Humans , Multiple Organ Failure/genetics , Multiple Organ Failure/immunology , Multiple Organ Failure/pathology , Research Design , Saccharomyces cerevisiae/physiology , Spleen/immunology , Spleen/injuries , Spleen/physiopathology , Wounds and Injuries/genetics
10.
Genome Inform ; 12: 184-93, 2001.
Article in English | MEDLINE | ID: mdl-11791237

ABSTRACT

When a set of coregulated genes share a common structural RNA motif, e.g. a hairpin, most motif search approaches fail to locate the covarying but structurally conserved motif. There do exist methods that can locate structural RNA motifs, like FOLDALIGN, but the main problem with these methods is that they are computationally expensive. In FOLDALIGN, a major contribution to this is the use of a greedy algorithm to construct the multiple alignment. To ensure good quality many redundant computations must be made. However, by applying the greedy algorithm on a carefully selected subset of sequences, near full greedy quality can be obtained. The basic idea is to estimate the order in which the sequences entered a good greedy alignment. If such a ranking, found from all pairwise alignments, is in good agreement with the order of appearance in the multiple alignment, the core structural motif can be found by performing the greedy algorithm on just the top sequences in the ranking. The ranking used in this mini-greedy algorithm is found by using two complementing approaches: 1) When interpreting the FOLDALIGN score as an inner product (kernel), the sequences can be ranked according to their distance to their center of mass; 2) We construct an algorithm that attempts to find the K closest sequences in the vector space associated with the inner product, and the remaining sequences can be ranked by their minimum distance to any of the sequences, or to the center of mass in this set. The two approaches are compared and merged, and the results discussed. We also show that structural alignments of near full greedy quality can be found in significantly reduced time, using these methods. The algorithm is being included in the SLASH (Stem-Loop Align SearcH) server available at http://www.bioinf.au.dk/slash.


Subject(s)
Algorithms , RNA/chemistry , RNA/genetics , Base Sequence , Computational Biology , Databases, Nucleic Acid , Nucleic Acid Conformation , Sequence Alignment/statistics & numerical data
11.
Bioinformatics ; 16(6): 501-12, 2000 Jun.
Article in English | MEDLINE | ID: mdl-10980147

ABSTRACT

MOTIVATION: Methods that predict the structure of molecules by looking for statistical correlation have been quite effective. Unfortunately, these methods often disregard phylogenetic information in the sequences they analyze. Here, we present a number of statistics for RNA molecular-structure prediction. Besides common pair-wise comparisons, we consider a few reasonable statistics for base-triple predictions, and present an elaborate analysis of these methods. All these statistics incorporate phylogenetic relationships of the sequences in the analysis to varying degrees, and the different nature of these tests gives a wide choice of statistical tools for RNA structure prediction. RESULTS: Starting from statistics that incorporate phylogenetic information only as independent sequence evolution models for each position of a multiple alignment, and extending this idea to a joint evolution model of two positions, we enhance the usual purely statistical methods (e.g. methods based on the Mutual Information statistic) with the use of phylogenetic information available in the sequences. In particular, we present a joint model based on the HKY evolution model, and consequently a chi-squared (χ²) test of independence for two positions. A significant part of this work is devoted to some mathematical analysis of these methods. We tested these statistics on regions of 16S and 23S rRNA, and tRNA.


Subject(s)
RNA/chemistry , RNA/genetics , Sequence Analysis, RNA/statistics & numerical data , Base Sequence , Biometry , Escherichia coli/genetics , Evolution, Molecular , Likelihood Functions , Models, Genetic , Molecular Sequence Data , Nucleic Acid Conformation , Phylogeny , RNA, Bacterial/chemistry , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/chemistry , RNA, Ribosomal, 16S/genetics
12.
Pac Symp Biocomput ; : 467-78, 2000.
Article in English | MEDLINE | ID: mdl-10902194

ABSTRACT

This work describes ANN-Spec, a machine learning algorithm and its application to discovering un-gapped patterns in DNA sequence. The approach makes use of an Artificial Neural Network and a Gibbs sampling method to define the Specificity of a DNA-binding protein. ANN-Spec searches for the parameters of a simple network (or weight matrix) that will maximize the specificity for binding sequences of a positive set compared to a background sequence set. Binding sites in the positive data set are found with the resulting weight matrix and these sites are then used to define a local multiple sequence alignment. Training complexity is O(lN) where l is the width of the pattern and N is the size of the positive training data. A quantitative comparison of ANN-Spec and a few related programs is presented. The comparison shows that ANN-Spec finds patterns of higher specificity when training with a background data set. The program and documentation are available from the authors for UNIX systems.


Subject(s)
Algorithms , Software , Transcription Factors/metabolism , Binding Sites/genetics , Computer Simulation , DNA/genetics , DNA/metabolism , Gene Expression Regulation , Models, Biological , Neural Networks, Computer , Sensitivity and Specificity
13.
Bioinformatics ; 16(1): 16-23, 2000 Jan.
Article in English | MEDLINE | ID: mdl-10812473

ABSTRACT

The purpose of this article is to provide a brief history of the development and application of computer algorithms for the analysis and prediction of DNA binding sites. This problem can be conveniently divided into two subproblems. The first is, given a collection of known binding sites, develop a representation of those sites that can be used to search new sequences and reliably predict where additional binding sites occur. The second is, given a set of sequences known to contain binding sites for a common factor, but not knowing where the sites are, discover the location of the sites in each sequence and a representation for the specificity of the protein.
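
The first subproblem described in this abstract (building a representation from known sites and using it to search new sequences) can be illustrated with a log-odds weight matrix. This is only a minimal sketch under stated assumptions: the function names `build_pwm`, `score` and `best_site`, the pseudocount of 1, the uniform background of 0.25 per base, and the toy sites are all hypothetical choices for illustration and are not taken from the article.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length example sites.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for col in range(width):
        column = [s[col] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds weights for one candidate window."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_site(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Toy usage: three identical example sites, then a scan of a longer sequence.
pwm = build_pwm(["TGTGA", "TGTGA", "TGTGA"])
print(best_site(pwm, "CCCTGTGACCC"))
```

The second subproblem (sites at unknown locations) needs a search over alignments, for example a greedy or sampling strategy, and is not covered by this sketch.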


Subject(s)
DNA-Binding Proteins/metabolism , DNA/analysis , Binding Sites , DNA/history , DNA-Binding Proteins/history , History, 20th Century , Research/history
15.
Nucleic Acids Res ; 28(24): 4938-43, 2000 Dec 15.
Article in English | MEDLINE | ID: mdl-11121485

ABSTRACT

Recent biochemical studies have indicated a number of regions in both the 16S and 23S rRNA that are exposed on the ribosomal subunit surface. In order to predict potential interactions between these regions we applied novel phylogenetically-based statistical methods to detect correlated nucleotide changes occurring between the rRNA molecules. With these methods we discovered a number of highly significant correlated changes between different sets of nucleotides in the two ribosomal subunits. The predictions with the highest correlation values belong to regions of the rRNA subunits that are in close proximity according to recent crystal structures of the entire ribosome. We also applied a new statistical method of detecting base triple interactions within these same rRNA subunit regions. This base triple statistic predicted a number of new base triples not detected by pair-wise interaction statistics within the rRNA molecules. Our results suggest that these statistical methods may enhance the ability to detect novel structural elements both within and between RNA molecules.


Subject(s)
Phylogeny , RNA, Ribosomal, 16S/metabolism , RNA, Ribosomal, 23S/metabolism , Animals , Base Sequence , Binding Sites , Computational Biology , Databases as Topic , Genes, Archaeal/genetics , Genes, Bacterial/genetics , Molecular Sequence Data , RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 23S/genetics , Sequence Alignment , Statistics as Topic
16.
Bioinformatics ; 15(7-8): 563-77, 1999.
Article in English | MEDLINE | ID: mdl-10487864

ABSTRACT

MOTIVATION: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme. RESULTS: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein. AVAILABILITY: Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.
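
The information content score mentioned in this abstract is commonly written as the sum, over alignment columns, of f(b) * log2(f(b) / p(b)), where f(b) is the observed frequency of base b in the column and p(b) is its background frequency. The sketch below assumes that common form, raw frequencies without any small-sample correction, and a uniform background; the function name `information_content` is hypothetical, and the published programs may differ in these details.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Information content of an ungapped alignment, in bits.

    Sums f * log2(f / p) over every base observed in every column.
    Raw frequencies and a uniform background are illustrative assumptions.
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    total = 0.0
    for col in range(width):
        counts = Counter(s[col] for s in sites)
        for base, n in counts.items():
            f = n / len(sites)
            total += f * math.log2(f / background[base])
    return total

# A perfectly conserved 4-column DNA alignment carries 2 bits per column.
print(information_content(["ACGT", "ACGT"]))  # 8.0
```

A column in which all four bases are equally frequent contributes zero bits under a uniform background, which is why conserved columns dominate the score.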


Subject(s)
DNA/genetics , Proteins/genetics , Sequence Alignment/methods , Algorithms , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Base Sequence , Binding Sites/genetics , Carrier Proteins , Cyclic AMP Receptor Protein/genetics , Cyclic AMP Receptor Protein/metabolism , DNA, Bacterial/genetics , Escherichia coli/genetics , Escherichia coli/metabolism , Linear Models , Sequence Alignment/statistics & numerical data , Software
17.
Pac Symp Biocomput ; : 112-23, 1999.
Article in English | MEDLINE | ID: mdl-10380190

ABSTRACT

Systematic gene expression analyses provide comprehensive information about the transcriptional response to different environmental and developmental conditions. With enough gene expression data points, computational biologists may eventually generate predictive computer models of transcription regulation. Such models will require computational methodologies consistent with the behavior of known biological systems that remain tractable. We represent regulatory relationships between genes as linear coefficients or weights, with the "net" regulation influence on a gene's expression being the mathematical summation of the independent regulatory inputs. Test regulatory networks generated with this approach display stable and cyclically stable gene expression levels, consistent with known biological systems. We include variables to model the effect of environmental conditions on transcription regulation and observed various alterations in gene expression patterns in response to environmental input. Finally, we use a derivation of this model system to predict the regulatory network from simulated input/output data sets and find that it accurately predicts all components of the model, even with noisy expression data.
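
A weight-matrix model of regulation of the kind this abstract describes can be sketched as a discrete-time update in which each gene's net input is the weighted sum of the current expression levels of all genes. The logistic squashing function, the bias term, and the names `step` and `simulate` are assumptions made here for illustration; the abstract only states that regulatory influences are linear weights that are summed.

```python
import math

def step(weights, bias, levels):
    """Advance the network by one time step.

    weights[i][j] is the (hypothetical) influence of gene j on gene i.
    The net input is a weighted sum; the logistic squashing to (0, 1)
    is an assumption of this sketch.
    """
    new_levels = []
    for w_row, b in zip(weights, bias):
        net = sum(w * x for w, x in zip(w_row, levels)) + b
        new_levels.append(1.0 / (1.0 + math.exp(-net)))
    return new_levels

def simulate(weights, bias, levels, n_steps):
    """Return the list of expression-level vectors over n_steps updates."""
    trajectory = [list(levels)]
    for _ in range(n_steps):
        levels = step(weights, bias, levels)
        trajectory.append(levels)
    return trajectory

# Toy usage: two genes that repress each other, starting from unequal levels.
weights = [[0.0, -4.0],
           [-4.0, 0.0]]
bias = [2.0, 2.0]
for state in simulate(weights, bias, [0.9, 0.1], 5):
    print([round(x, 3) for x in state])
```

Environmental inputs, as mentioned in the abstract, could be modelled in such a sketch by adding extra input terms to the net sum, but that extension is not shown here.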


Subject(s)
Computational Biology/methods , Databases, Factual , Gene Expression Regulation , Models, Genetic , Computer Simulation , Environment , Gene Expression Regulation, Developmental , Reproducibility of Results , Software , Transcription, Genetic
18.
Article in English | MEDLINE | ID: mdl-10786281

ABSTRACT

Methods based on the Mutual Information statistic (MI methods) predict structure by looking for statistical correlations between sequence positions in a set of aligned sequences. Although MI methods are often quite effective, these methods ignore the underlying phylogenetic relationships of the sequences they analyze. Thus, they cannot distinguish between correlations due to structural interactions, and spurious correlations resulting from phylogenetic history. In this paper, we introduce a method analogous to MI that incorporates phylogenetic information. We show that this method accurately recovers the structures of well-known RNA molecules. We also demonstrate, with both real and simulated data, that this phylogenetically-based method outperforms standard MI methods, and improves the ability to distinguish interacting from non-interacting positions in RNA. This method is flexible, and may be applied to the prediction of protein structure given the appropriate evolutionary model. Because this method incorporates phylogenetic data, it also has the potential to be improved with the addition of more accurate phylogenetic information, although we show that even approximate phylogenies are helpful.
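
For reference, the standard Mutual Information statistic that this abstract takes as its baseline can be computed for two alignment columns as shown below. This is a sketch of the plain MI statistic only, not of the phylogenetically informed method the paper introduces; the function name `mutual_information` and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j of an alignment.

    MI = sum over observed pairs (a, b) of p(a, b) * log2(p(a, b) / (p(a) * p(b))).
    Sequences are treated as independent samples, which is exactly the
    assumption the abstract criticises.
    """
    n = len(alignment)
    col_i = Counter(s[i] for s in alignment)
    col_j = Counter(s[j] for s in alignment)
    pairs = Counter((s[i], s[j]) for s in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Toy usage: two columns that covary perfectly as base pairs.
toy = ["GC", "CG", "AU", "UA"]
print(mutual_information(toy, 0, 1))
```

Because every sequence counts equally here, closely related sequences that share a pair by descent inflate the score, which is the spurious correlation the phylogenetic method is designed to discount.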


Subject(s)
Computer Simulation , Nucleic Acid Conformation , RNA, Ribosomal, 16S/chemistry , RNA, Transfer/chemistry , RNA/chemistry , Models, Statistical , Phylogeny
20.
Bioinformatics ; 14(8): 691-9, 1998.
Article in English | MEDLINE | ID: mdl-9789095

ABSTRACT

MOTIVATION: Recently, we described a Maximum Weighted Matching (MWM) method for RNA structure prediction. The MWM method is capable of detecting pseudoknots and other tertiary base-pairing interactions in a computationally efficient manner (Cary and Stormo, Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pp. 75-80, 1995). Here we report on the results of our efforts to improve the MWM method's predictive accuracy, and show how the method can be extended to detect base interactions formerly inaccessible to automated RNA modeling techniques. RESULTS: Improved performance in MWM structure prediction was achieved in two ways. First, new ways of calculating base pair likelihoods have been developed. These allow experimental data and combined statistical and thermodynamic information to be used by the program. Second, accuracy was improved by developing techniques for filtering out spurious base pairs predicted by the MWM program. We also demonstrate here a means by which the MWM folding method may be used to detect the presence of base triples in RNAs. AVAILABILITY: http://www.cshl.org/mzhanglab/tabaska/jaxpage.html CONTACT: tabaska@cshl.org


Subject(s)
Algorithms , Nucleic Acid Conformation , RNA/chemistry , Bacillus subtilis/genetics , Base Sequence , Escherichia coli/genetics , Molecular Sequence Data , Phylogeny , RNA, Bacterial/chemistry , Thermodynamics