1.
Article in English | MEDLINE | ID: mdl-17381287

ABSTRACT

Genome sequence analysis of RNAs presents special challenges to computational biology, because conserved RNA secondary structure plays a large part in RNA analysis. Algorithms well suited for RNA secondary structure and sequence analysis have been borrowed from computational linguistics. These "stochastic context-free grammar" (SCFG) algorithms have enabled the development of new RNA gene-finding and RNA homology search software. The aim of this paper is to provide an accessible introduction to the strengths and weaknesses of SCFG methods and to describe the state of the art in one particular kind of application: SCFG-based RNA similarity searching. The INFERNAL and RSEARCH programs are capable of identifying distant RNA homologs in a database search by looking for both sequence and secondary structure conservation.
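
To make the SCFG idea concrete, here is a minimal sketch of a CYK-style parse for a toy grammar with three rules: S -> a S b S (a pairs with b), S -> a S (a is unpaired), and S -> epsilon. The grammar, the probabilities, and the function names are assumptions chosen for illustration; this is not the model used by INFERNAL or RSEARCH.

```python
import math
from functools import lru_cache

# Hypothetical rule and emission probabilities, chosen only for illustration.
P_PAIR, P_SINGLE, P_END = 0.4, 0.3, 0.3
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_logp(a, b):
    """Log probability of rule S -> a S b S emitting the pair (a, b)."""
    if (a, b) in WATSON_CRICK:
        emit = 0.22
    elif (a, b) in WOBBLE:
        emit = 0.04
    else:
        emit = 0.004
    return math.log(P_PAIR * emit)


SINGLE_LOGP = math.log(P_SINGLE * 0.25)  # S -> a S with uniform emission
END_LOGP = math.log(P_END)               # S -> epsilon


def cyk_parse(seq):
    """Return (log probability, dot-bracket string) of the best parse."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best parse of the subsequence seq[i:j].
        if i == j:
            return END_LOGP, ""
        score, struct = best(i + 1, j)
        top = (SINGLE_LOGP + score, "." + struct)
        for k in range(i + 1, j):  # try pairing seq[i] with seq[k]
            s_in, t_in = best(i + 1, k)
            s_out, t_out = best(k + 1, j)
            cand = (pair_logp(seq[i], seq[k]) + s_in + s_out,
                    "(" + t_in + ")" + t_out)
            if cand[0] > top[0]:
                top = cand
        return top

    return best(0, len(seq))
```

Calling `cyk_parse("GGGAAACCC")` scores sequence and structure jointly: nested pairs are rewarded by the pair-emission table, which is how a grammar of this kind captures secondary structure conservation as well as sequence.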


Subject(s)
Sequence Analysis, RNA/statistics & numerical data , Algorithms , Base Sequence , Computational Biology , Databases, Nucleic Acid , Models, Statistical , Molecular Sequence Data , Nucleic Acid Conformation , RNA/analysis , RNA/chemistry , RNA/genetics , Sequence Alignment/statistics & numerical data , Software , Stochastic Processes
2.
Nat Rev Genet ; 2(12): 919-29, 2001 Dec.
Article in English | MEDLINE | ID: mdl-11733745

ABSTRACT

Non-coding RNA (ncRNA) genes produce functional RNA molecules rather than encoding proteins. However, almost all means of gene identification assume that genes encode proteins, so even in the era of complete genome sequences, ncRNA genes have been effectively invisible. Recently, several different systematic screens have identified a surprisingly large number of new ncRNA genes. Non-coding RNAs seem to be particularly abundant in roles that require highly specific nucleic acid recognition without complex catalysis, such as in directing post-transcriptional regulation of gene expression or in guiding RNA modifications.


Subject(s)
RNA, Untranslated/genetics , Base Sequence , Escherichia coli/genetics , Humans , Molecular Sequence Data , Nucleic Acid Conformation , RNA, Bacterial/genetics , RNA, Untranslated/chemistry
3.
Bioinformatics ; 17(9): 821-8, 2001 Sep.
Article in English | MEDLINE | ID: mdl-11590098

ABSTRACT

MOTIVATION: When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation ('phylogenomics') is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Our goal is to automate phylogenomics using explicit phylogenetic inference. A necessary component is an algorithm to infer speciation and duplication events in a given gene tree. RESULTS: We give an algorithm to infer speciation and duplication events on a gene tree by comparison to a trusted species tree. This algorithm has a worst-case running time of O(n^2), which is inferior to two previous algorithms that are approximately O(n) for a gene tree of n sequences. However, our algorithm is extremely simple, and its asymptotic worst-case behavior is only realized on pathological data sets. We show empirically, using 1750 gene trees constructed from the Pfam protein family database, that it appears to be a practical (and often superior) algorithm for analyzing real gene trees. AVAILABILITY: http://www.genetics.wustl.edu/eddy/forester.
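
One common way to infer such events is to map each gene-tree node to the last common ancestor of its children's species in the species tree, and to call the node a duplication when it maps to the same species-tree node as one of its children. The sketch below assumes that approach; the abstract does not spell out the method, so this should not be read as the paper's exact algorithm. Trees are nested tuples, gene-tree leaves are labeled by species name, both trees are assumed binary, and all function names are hypothetical.

```python
def build_parent_and_depth(species_tree):
    """Index every species-tree node by its parent and its depth."""
    parent, depth = {}, {}

    def walk(node, par, d):
        parent[node] = par
        depth[node] = d
        if isinstance(node, tuple):
            for child in node:
                walk(child, node, d + 1)

    walk(species_tree, None, 0)
    return parent, depth


def lca(a, b, parent, depth):
    """Climb from the deeper node until both pointers meet."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def infer_events(gene_tree, species_tree):
    """Label each internal gene-tree node, in postorder."""
    parent, depth = build_parent_and_depth(species_tree)
    events = []

    def visit(node):
        if isinstance(node, str):
            return node  # a leaf maps to its own species
        left, right = (visit(child) for child in node)
        mapped = lca(left, right, parent, depth)
        kind = "duplication" if mapped in (left, right) else "speciation"
        events.append((node, kind))
        return mapped

    visit(gene_tree)
    return events
```

For the species tree `(("human", "mouse"), "fly")` and the gene tree `((("human", "mouse"), ("human", "mouse")), "fly")`, the node joining the two human/mouse clades is labeled a duplication and the other internal nodes are speciations. The simple climbing step in `lca` is linear in tree height, which is where a quadratic worst case comes from on very unbalanced trees.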


Subject(s)
Algorithms , Computational Biology/methods , Evolution, Molecular , Gene Duplication , Phylogeny , Animals , Cattle , Chick Embryo , Drosophila melanogaster , Humans , Mice , Models, Genetic , Rats , Sequence Homology, Nucleic Acid , Software , Species Specificity , Swine , Xenopus laevis
4.
BMC Bioinformatics ; 2: 7, 2001.
Article in English | MEDLINE | ID: mdl-11667947

ABSTRACT

BACKGROUND: Currently, most genome annotation is curated by centralized groups with limited resources. Efforts to share annotations transparently among multiple groups have not yet been satisfactory. RESULTS: Here we introduce a concept called the Distributed Annotation System (DAS). DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. The communication between client and servers in DAS is defined by the DAS XML specification. Annotations are displayed in layers, one per server. Any client or server adhering to the DAS XML specification can participate in the system; we describe a simple prototype client and server example. CONCLUSIONS: The DAS specification is being used experimentally by Ensembl, WormBase, and the Berkeley Drosophila Genome Project. Continued success will depend on the readiness of the research community to adopt DAS and provide annotations. All components are freely available from the project website http://www.biodas.org/.


Subject(s)
Computational Biology/methods , Base Sequence/physiology , Computational Biology/instrumentation , Computer Terminals , Databases, Genetic/standards , Genome, Human , Humans , Internet , Reference Values , Software
5.
Curr Biol ; 11(17): 1369-73, 2001 Sep 04.
Article in English | MEDLINE | ID: mdl-11553332

ABSTRACT

Some genes produce noncoding transcripts that function directly as structural, regulatory, or even catalytic RNAs [1, 2]. Unlike protein-coding genes, which can be detected as open reading frames with distinctive statistical biases, noncoding RNA (ncRNA) gene sequences have no obvious inherent statistical biases [3]. Thus, genome sequence analyses reveal novel protein-coding genes, but any novel ncRNA genes remain invisible. Here, we describe a computational comparative genomic screen for ncRNA genes. The key idea is to distinguish conserved RNA secondary structures from a background of other conserved sequences using probabilistic models of expected mutational patterns in pairwise sequence alignments. We report the first whole-genome screen for ncRNA genes done with this method, in which we applied it to the "intergenic" spacers of Escherichia coli using comparative sequence data from four related bacteria. Starting from >23,000 conserved interspecies pairwise alignments, the screen predicted 275 candidate structural RNA loci. A sample of 49 candidate loci was assayed experimentally. At least 11 loci expressed small, apparently noncoding RNA transcripts of unknown function. Our computational approach may be used to discover structural ncRNA genes in any genome for which appropriate comparative genome sequence data are available.


Subject(s)
Escherichia coli/genetics , RNA, Bacterial/analysis , RNA, Untranslated/analysis , Animals , Gene Expression , Genome, Bacterial , Humans
6.
Genome Res ; 11(8): 1346-52, 2001 Aug.
Article in English | MEDLINE | ID: mdl-11483575

ABSTRACT

Gene expression in a developmentally arrested, long-lived dauer population of Caenorhabditis elegans was compared with a nondauer (mixed-stage) population by using serial analysis of gene expression (SAGE). Dauer (152,314) and nondauer (148,324) SAGE tags identified 11,130 of the predicted 19,100 C. elegans genes. Genes implicated previously in longevity were expressed abundantly in the dauer library, and new genes potentially important in dauer biology were discovered. Two thousand six hundred eighteen genes were detected only in the nondauer population, whereas 2016 genes were detected only in the dauer, indicating that dauer larvae have a surprisingly complex gene expression profile. Evidence for differentially expressed gene transcript isoforms was obtained for 162 genes. H1 histones were differentially expressed, raising the possibility of alternative chromatin packaging. The most abundant tag from dauer larvae (20-fold more abundant than in the nondauer profile) corresponds to a new, unpredicted gene we have named tts-1 (transcribed telomere-like sequence), which may interact with telomeres or telomere-associated proteins. Abundant antisense mitochondrial transcripts (2% of all tags) suggest the existence of an antisense-mediated regulatory mechanism in C. elegans mitochondria. In addition to providing a robust tool for gene expression studies, the SAGE approach already has provided the advantage of new gene/transcript discovery in a metazoan.


Subject(s)
Caenorhabditis elegans/growth & development , Caenorhabditis elegans/genetics , Gene Expression Regulation, Developmental/physiology , Genes, Helminth/physiology , Animals , Gene Expression Profiling/methods , Longevity/genetics , RNA, Messenger/analysis
7.
Bioinformatics ; 17(4): 383-4, 2001 Apr.
Article in English | MEDLINE | ID: mdl-11301314

ABSTRACT

A Tree Viewer (ATV) is a Java tool for the display and manipulation of annotated phylogenetic trees. It can be utilized both as a standalone application and as an applet in a web browser.


Subject(s)
Proteins/classification , Software , Databases, Factual , Humans , Internet , Phylogeny
8.
BMC Bioinformatics ; 2: 8, 2001.
Article in English | MEDLINE | ID: mdl-11801179

ABSTRACT

BACKGROUND: Noncoding RNA genes produce transcripts that exert their function without ever producing proteins. Noncoding RNA gene sequences do not have strong statistical signals, unlike protein coding genes. A reliable general purpose computational genefinder for noncoding RNA genes has been elusive. RESULTS: We describe a comparative sequence analysis algorithm for detecting novel structural RNA genes. The key idea is to test the pattern of substitutions observed in a pairwise alignment of two homologous sequences. A conserved coding region tends to show a pattern of synonymous substitutions, whereas a conserved structural RNA tends to show a pattern of compensatory mutations consistent with some base-paired secondary structure. We formalize this intuition using three probabilistic "pair-grammars": a pair stochastic context free grammar modeling alignments constrained by structural RNA evolution, a pair hidden Markov model modeling alignments constrained by coding sequence evolution, and a pair hidden Markov model modeling a null hypothesis of position-independent evolution. Given an input pairwise sequence alignment (e.g. from a BLASTN comparison of two related genomes) we classify the alignment into the coding, RNA, or null class according to the posterior probability of each class. CONCLUSIONS: We have implemented this approach as a program, QRNA, which we consider to be a prototype structural noncoding RNA genefinder. Tests suggest that this approach detects noncoding RNA genes with a fair degree of reliability.
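
The final classification step described above can be sketched as a Bayes-rule comparison of three model likelihoods. The sketch assumes the per-model log-likelihoods of an alignment have already been computed elsewhere; the class labels, the uniform default prior, and the function names are illustrative assumptions and are not taken from the QRNA program.

```python
import math


def posterior(log_likelihoods, priors=None):
    """Turn per-model log-likelihoods into posterior probabilities."""
    models = list(log_likelihoods)
    if priors is None:
        # Assumption: a uniform prior over the competing models.
        priors = {m: 1.0 / len(models) for m in models}
    joint = {m: log_likelihoods[m] + math.log(priors[m]) for m in models}
    # Log-sum-exp keeps the normalization numerically stable.
    peak = max(joint.values())
    total = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return {m: math.exp(joint[m] - total) for m in models}


def classify(log_likelihoods, priors=None):
    """Return the most probable class and the full posterior."""
    post = posterior(log_likelihoods, priors)
    return max(post, key=post.get), post
```

With made-up inputs such as `classify({"rna": -120.0, "coding": -125.0, "null": -130.0})`, the alignment is assigned to the RNA class because a 5-unit gap in natural-log likelihood dominates the uniform prior.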


Subject(s)
RNA, Untranslated/genetics , Sequence Analysis, RNA/methods , Algorithms , Animals , Base Sequence , Bayes Theorem , Caenorhabditis/genetics , Caenorhabditis elegans/genetics , Computational Biology/methods , Computational Biology/statistics & numerical data , Computer Simulation , Escherichia coli/genetics , Genome , Genome, Bacterial , Models, Genetic , Molecular Sequence Data , RNA, Bacterial/genetics , RNA, Helminth/genetics , Salmonella typhi/genetics , Sensitivity and Specificity , Sequence Analysis, RNA/statistics & numerical data
9.
Bioinformatics ; 16(7): 583-605, 2000 Jul.
Article in English | MEDLINE | ID: mdl-11038329

ABSTRACT

MOTIVATION: Several results in the literature suggest that biologically interesting RNAs have secondary structures that are more stable than expected by chance. Based on these observations, we developed a scanning algorithm for detecting noncoding RNA genes in genome sequences, using a fully probabilistic version of the Zuker minimum-energy folding algorithm. RESULTS: Preliminary results were encouraging, but certain anomalies led us to do a carefully controlled investigation of this class of methods. Ultimately, our results argue that for the probabilistic model there is indeed a statistical effect, but it comes mostly from local base-composition bias and not from RNA secondary structure. For the thermodynamic implementation (which evaluates statistical significance by doing Monte Carlo shuffling in fixed-length sequence windows, thus eliminating the base-composition effect) the signals for noncoding RNAs are still usually indistinguishable from noise, especially when certain statistical artifacts resulting from local base-composition inhomogeneity are taken into account. We conclude that although a distinct, stable secondary structure is undoubtedly important in most noncoding RNAs, the stability of most noncoding RNA secondary structures is not sufficiently different from the predicted stability of a random sequence to be useful as a general genefinding approach.
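
The Monte Carlo shuffling test mentioned above can be sketched as a z-score of an observed score against scores of shuffled copies of the same window. This is a simplified illustration under stated assumptions: the scoring function is passed in by the caller, and `toy_pair_count` is a hypothetical stand-in, not a thermodynamic folding energy. A real analysis would plug in a folding program's predicted energy.

```python
import random
import statistics


def shuffle_zscore(seq, score_fn, n_shuffles=100, seed=0):
    """Z-score of score_fn(seq) against mononucleotide-shuffled copies."""
    rng = random.Random(seed)
    observed = score_fn(seq)
    null_scores = []
    for _ in range(n_shuffles):
        letters = list(seq)
        rng.shuffle(letters)  # preserves length and base composition
        null_scores.append(score_fn("".join(letters)))
    mean = statistics.fmean(null_scores)
    sd = statistics.pstdev(null_scores)
    return 0.0 if sd == 0 else (observed - mean) / sd


def toy_pair_count(seq):
    """Hypothetical score: complementary bases at mirrored positions."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    n = len(seq)
    return sum(1 for i in range(n // 2) if comp.get(seq[i]) == seq[n - 1 - i])
```

Because each shuffle keeps the base composition of the window, a large z-score cannot be explained by composition alone, which is the control the abstract describes.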


Subject(s)
Algorithms , Nucleic Acid Conformation , RNA/chemistry , Animals , Base Composition , Caenorhabditis elegans/genetics , Data Interpretation, Statistical , Methanococcus/genetics , Models, Statistical , RNA, Helminth/analysis
10.
Bioinformatics ; 16(4): 334-40, 2000 Apr.
Article in English | MEDLINE | ID: mdl-10869031

ABSTRACT

MOTIVATION: In a previous paper, we presented a polynomial time dynamic programming algorithm for predicting optimal RNA secondary structure including pseudoknots. However, a formal grammatical representation for RNA secondary structure with pseudoknots was still lacking. RESULTS: Here we show a one-to-one correspondence between that algorithm and a formal transformational grammar. This grammar class encompasses the context-free grammars and goes beyond to generate pseudoknotted structures. The pseudoknot grammar avoids the use of general context-sensitive rules by introducing a small number of auxiliary symbols used to reorder the strings generated by an otherwise context-free grammar. This formal representation of the residue correlations in RNA structure is important because it means we can build full probabilistic models of RNA secondary structure, including pseudoknots, and use them to optimally parse sequences in polynomial time.


Subject(s)
Algorithms , Nucleic Acid Conformation , RNA/chemistry
11.
Science ; 288(5465): 517-22, 2000 Apr 21.
Article in English | MEDLINE | ID: mdl-10775111

ABSTRACT

In eukaryotes, dozens of posttranscriptional modifications are directed to specific nucleotides in ribosomal RNAs (rRNAs) by small nucleolar RNAs (snoRNAs). We identified homologs of snoRNA genes in both branches of the Archaea. Eighteen small sno-like RNAs (sRNAs) were cloned from the archaeon Sulfolobus acidocaldarius by coimmunoprecipitation with archaeal fibrillarin and NOP56, the homologs of eukaryotic snoRNA-associated proteins. We trained a probabilistic model on these sRNAs to search for more sRNAs in archaeal genomic sequences. Over 200 additional sRNAs were identified in seven archaeal genomes representing both the Crenarchaeota and the Euryarchaeota. snoRNA-based rRNA processing was therefore probably present in the last common ancestor of Archaea and Eukarya, predating the evolution of a morphologically distinct nucleolus.


Subject(s)
Archaea/genetics , RNA, Archaeal/genetics , RNA, Small Nucleolar/genetics , Sulfolobus acidocaldarius/genetics , Archaeal Proteins/genetics , Base Sequence , Chromosomal Proteins, Non-Histone/genetics , Cloning, Molecular , Genome, Archaeal , Methylation , Models, Statistical , Molecular Sequence Data , Nuclear Proteins/genetics , RNA Processing, Post-Transcriptional , RNA, Archaeal/chemistry , RNA, Archaeal/metabolism , RNA, Ribosomal/chemistry , RNA, Ribosomal/genetics , RNA, Ribosomal/metabolism , RNA, Small Nucleolar/chemistry , RNA, Small Nucleolar/metabolism , RNA, Small Untranslated
12.
Nucleic Acids Res ; 28(1): 263-6, 2000 Jan 01.
Article in English | MEDLINE | ID: mdl-10592242

ABSTRACT

Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the WWW in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgr.ki.se/Pfam/ and in the US at http://pfam.wustl.edu/. The latest version (4.3) of Pfam contains 1815 families. These Pfam families match 63% of proteins in SWISS-PROT 37 and TrEMBL 9. For complete genomes Pfam currently matches up to half of the proteins. Genomic DNA can be directly searched against the Pfam library using the Wise2 package.


Subject(s)
Databases, Factual , Proteins/chemistry , Genome , Information Storage and Retrieval , Internet , Quality Control
13.
Curr Opin Genet Dev ; 9(6): 695-9, 1999 Dec.
Article in English | MEDLINE | ID: mdl-10607607

ABSTRACT

Some genes produce RNAs that are functional instead of encoding proteins. Noncoding RNA genes are surprisingly numerous. Recently active research areas include small nucleolar RNAs, antisense riboregulator RNAs, and RNAs involved in X-dosage compensation. Genome sequences and new algorithms have begun to make systematic computational screens for noncoding RNA genes possible.


Subject(s)
Genes/genetics , RNA/genetics , Animals , Computational Biology , Genes/physiology , Genome , Humans , RNA/chemistry , RNA/physiology , RNA, Antisense/genetics , RNA, Ribosomal/genetics , RNA, Small Nuclear/genetics , RNA, Small Nucleolar/genetics , RNA, Transfer/genetics
14.
J Mol Biol ; 285(5): 2053-68, 1999 Feb 05.
Article in English | MEDLINE | ID: mdl-9925784

ABSTRACT

We describe a dynamic programming algorithm for predicting optimal RNA secondary structure, including pseudoknots. The algorithm has a worst case complexity of O(N^6) in time and O(N^4) in storage. The description of the algorithm is complex, which led us to adopt a useful graphical representation (Feynman diagrams) borrowed from quantum field theory. We present an implementation of the algorithm that generates the optimal minimum energy structure for a single RNA sequence, using standard RNA folding thermodynamic parameters augmented by a few parameters describing the thermodynamic stability of pseudoknots. We demonstrate the properties of the algorithm by using it to predict structures for several small pseudoknotted and non-pseudoknotted RNAs. Although the time and memory demands of the algorithm are steep, we believe this is the first algorithm to be able to fold optimal (minimum energy) pseudoknotted RNAs with the accepted RNA thermodynamic model.


Subject(s)
Algorithms , Nucleic Acid Conformation , RNA/chemistry , HIV Reverse Transcriptase/metabolism , Models, Genetic , RNA, Transfer/chemistry , RNA, Viral/chemistry , RNA, Viral/metabolism , Thermodynamics
15.
Science ; 283(5405): 1168-71, 1999 Feb 19.
Article in English | MEDLINE | ID: mdl-10024243

ABSTRACT

Small nucleolar RNAs (snoRNAs) are required for ribose 2'-O-methylation of eukaryotic ribosomal RNA. Many of the genes for this snoRNA family have remained unidentified in Saccharomyces cerevisiae, despite the availability of a complete genome sequence. Probabilistic modeling methods akin to those used in speech recognition and computational linguistics were used to computationally screen the yeast genome and identify 22 methylation guide snoRNAs, snR50 to snR71. Gene disruptions and other experimental characterization confirmed their methylation guide function. In total, 51 of the 55 ribose methylated sites in yeast ribosomal RNA were assigned to 41 different guide snoRNAs.


Subject(s)
Algorithms , Models, Genetic , Models, Statistical , RNA, Fungal/analysis , RNA, Ribosomal/metabolism , RNA, Small Nuclear/analysis , Saccharomyces cerevisiae/genetics , Base Pairing , Cell Nucleolus/metabolism , Methylation , RNA, Fungal/chemistry , RNA, Fungal/genetics , RNA, Fungal/metabolism , RNA, Ribosomal/chemistry , RNA, Ribosomal/genetics , RNA, Small Nuclear/chemistry , RNA, Small Nuclear/genetics , Ribose/metabolism , Software
16.
Nucleic Acids Res ; 27(1): 260-2, 1999 Jan 01.
Article in English | MEDLINE | ID: mdl-9847196

ABSTRACT

Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families. Release 3.1 is a major update of the Pfam database and contains 1313 families which are available on the World Wide Web in Europe at http://www.sanger.ac.uk/Software/Pfam/ and http://www.cgr.ki.se/Pfam/, and in the US at http://pfam.wustl.edu/. Over 54% of proteins in SWISS-PROT-35 and SP-TrEMBL-5 match a Pfam family. The primary changes of Pfam since release 2.1 are that we now use the more advanced version 2 of the HMMER software, which is more sensitive and provides expectation values for matches, and that it now includes proteins from both SP-TrEMBL and SWISS-PROT.


Subject(s)
Databases, Factual , Proteins/chemistry , Sequence Alignment , Software , Algorithms , Amino Acid Sequence , Databases, Factual/standards , Information Storage and Retrieval , Internet , Protein Conformation , Proteins/genetics , Quality Control , Sequence Homology, Amino Acid , Statistics as Topic
17.
Nucleic Acids Res ; 26(1): 320-2, 1998 Jan 01.
Article in English | MEDLINE | ID: mdl-9399864

ABSTRACT

Pfam contains multiple alignments and hidden Markov model based profiles (HMM-profiles) of complete protein domains. The definition of domain boundaries, family members and alignment is done semi-automatically based on expert knowledge, sequence similarity, other protein family databases and the ability of HMM-profiles to correctly identify and align the members. Release 2.0 of Pfam contains 527 manually verified families which are available for browsing and on-line searching via the World Wide Web in the UK at http://www.sanger.ac.uk/Pfam/ and in the US at http://genome.wustl.edu/Pfam/. Pfam 2.0 matches one or more domains in 50% of Swissprot-34 sequences, and 25% of a large sample of predicted proteins from the Caenorhabditis elegans genome.


Subject(s)
Databases, Factual , Proteins/chemistry , Sequence Alignment , Amino Acid Sequence , Animals , Caenorhabditis elegans , Computer Communication Networks , Information Storage and Retrieval , Markov Chains , Models, Molecular , Molecular Sequence Data
18.
Bioinformatics ; 14(9): 755-63, 1998.
Article in English | MEDLINE | ID: mdl-9918945

ABSTRACT

The recent literature on profile hidden Markov model (profile HMM) methods and software is reviewed. Profile HMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis. Several software implementations and two large libraries of profile HMMs of common protein domains are available. HMM methods performed comparably to threading methods in the CASP2 structure prediction exercise.
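
The phrase "position-specific scoring system" can be illustrated with a stripped-down sketch that builds per-column log-odds scores from an ungapped alignment. This covers only the match-state emission part of a profile HMM; insert and delete states, transition probabilities, and realistic priors are omitted. The uniform background, the pseudocount value, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def build_profile(alignment, alphabet, pseudocount=1.0):
    """Per-column log2-odds scores from an ungapped alignment."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return profile


def score(profile, seq):
    """Sum the column scores for a sequence of the same length."""
    return sum(col[res] for col, res in zip(profile, seq))
```

For example, a profile built from `["ACGT", "ACGA", "ACGT"]` scores `"ACGT"` higher than `"TGCA"`, because each column rewards the residues it has seen most often. The pseudocount keeps unseen residues from receiving a score of negative infinity.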


Subject(s)
Markov Chains , Humans , Peptide Library , Sequence Alignment/methods , Software
19.
Proteins ; 28(3): 405-20, 1997 Jul.
Article in English | MEDLINE | ID: mdl-9223186

ABSTRACT

Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new kazal, Fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences.


Subject(s)
Amino Acid Sequence , Databases, Factual , Plant Proteins/chemistry , Protein Structure, Tertiary , Sequence Alignment , Models, Chemical , Molecular Sequence Data , Multigene Family , Seeds/chemistry , Sequence Homology, Amino Acid
20.
Nucleic Acids Res ; 25(5): 955-64, 1997 Mar 01.
Article in English | MEDLINE | ID: mdl-9023104

ABSTRACT

We describe a program, tRNAscan-SE, which identifies 99-100% of transfer RNA genes in DNA sequence while giving less than one false positive per 15 gigabases. Two previously described tRNA detection programs are used as fast, first-pass prefilters to identify candidate tRNAs, which are then analyzed by a highly selective tRNA covariance model. This work represents a practical application of RNA covariance models, which are general, probabilistic secondary structure profiles based on stochastic context-free grammars. tRNAscan-SE searches at approximately 30 000 bp/s. Additional extensions to tRNAscan-SE detect unusual tRNA homologues such as selenocysteine tRNAs, tRNA-derived repetitive elements and tRNA pseudogenes.


Subject(s)
RNA, Transfer/genetics , Software , Animals , Databases, Factual , Evaluation Studies as Topic , Genome , Introns , RNA/genetics , RNA, Bacterial/analysis , RNA, Bacterial/genetics , RNA, Mitochondrial , RNA, Transfer/analysis , RNA, Transfer, Amino Acid-Specific/genetics