Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 21
Filter
Add more filters










Publication year range
1.
PeerJ ; 9: e11717, 2021.
Article in English | MEDLINE | ID: mdl-34458017

ABSTRACT

BACKGROUND: High-throughput sequencing has become an essential technology in life science research. Despite continuous improvements in technology, the produced sequences are still not entirely accurate. Consequently, the sequences are usually equipped with error probabilities. The quality information is already employed to find better solutions to a number of bioinformatics problems (e.g. read mapping). Data processing pipelines benefit in particular (especially when incorporating the quality information early), since enhanced outcomes of one step can improve all subsequent ones. Preprocessing steps, thus, quite regularly consider the sequence quality to fix errors or discard low-quality data. Other steps, however, like clustering sequences into operational taxonomic units (OTUs), a common task in the analysis of microbial communities, are typically performed without making use of the available quality information. RESULTS: In this paper, we present quality-aware clustering methods inspired by quality-weighted alignments and model-based denoising, and explore their applicability to OTU clustering. We implemented the quality-aware methods in a revised version of our de novo clustering tool GeFaST and evaluated their clustering quality and performance on mock-community data sets. Quality-weighted alignments were able to improve the clustering quality of GeFaST by up to 10%. The examination of the model-supported methods provided a more diverse picture, hinting at a narrower applicability, but they were able to attain similar improvements. Considering the quality information enlarged both runtime and memory consumption, even though the increase of the former depended heavily on the applied method and clustering threshold. CONCLUSIONS: The quality-aware methods expand the iterative, de novo clustering approach by new clustering and cluster refinement methods. Our results indicate that OTU clustering constitutes yet another analysis step benefiting from the integration of quality information. Beyond the shown potential, the quality-aware methods offer a range of opportunities for fine-tuning and further extensions.

2.
BMC Bioinformatics ; 19(1): 321, 2018 Sep 12.
Article in English | MEDLINE | ID: mdl-30208838

ABSTRACT

BACKGROUND: Massive genomic data sets from high-throughput sequencing allow for new insights into complex biological systems such as microbial communities. Analyses of their diversity and structure are typically preceded by clustering millions of 16S rRNA gene sequences into OTUs. Swarm introduced a new clustering strategy which addresses important conceptual and performance issues of the popular de novo clustering approach. However, some parts of the new strategy, e.g. the fastidious option for increased clustering quality, come with their own restrictions. RESULTS: In this paper, we present the new exact, alignment-based de novo clustering tool GeFaST, which implements a generalisation of Swarm's fastidious clustering. Our tool extends the fastidious option to arbitrary clustering thresholds and allows to adjust its greediness. GeFaST was evaluated on mock-community and natural data and achieved higher clustering quality and performance for small to medium clustering thresholds compared to Swarm and other de novo tools. Clustering with GeFaST was between 6 and 197 times as fast as with Swarm, while the latter required up to 38% less memory for non-fastidious clustering but at least three times as much memory for fastidious clustering. CONCLUSIONS: GeFaST extends the scope of Swarm's clustering strategy by generalising its fastidious option, thereby allowing for gains in clustering quality, and by increasing its performance (especially in the fastidious case). Our evaluations showed that GeFaST has the potential to leverage the use of the (fastidious) clustering strategy for higher thresholds and on larger data sets.


Subject(s)
Algorithms , High-Throughput Nucleotide Sequencing/methods , Cluster Analysis , Databases as Topic , RNA, Ribosomal, 16S/genetics , Time Factors
3.
J Math Biol ; 72(3): 527-71, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26001743

ABSTRACT

Various tools used to predict the secondary structure for a given RNA sequence are based on dynamic programming used to compute a conformation of minimum free energy. For structures without pseudoknots, a worst-case runtime proportional to n3, with n being the length of the sequence, results since a table of dimension n2 has to be filled in while a single entry gives rise to a linear computational effort. However, it was recently observed that reformulating the corresponding dynamic programming recursion together with the bookkeeping of potential folding alternatives (a technique called sparsification) may reduce the runtime to n2 on average, assuming that nucleotides of distance d form a hydrogen bond (i..e., are paired) with probability b/d(c) for some constants b > 0, c > 1. The latter is called the polymer-zeta model and plays a crucial role in speeding up the above mentioned algorithm. In this paper we discuss the application of the polymer-zeta property for the analysis of sparsification, showing that it must be applied conditionally on first and last positions to pair. Afterwards, we will investigate the combinatorics of RNA secondary structures assuming that the corresponding conditional probabilities behave according to a polymer-zeta probability model. We show that even if some of the structural parameters exhibit an almost realistic behavior on average, the expected shape of a folding in that model must be assumed to highly differ from those observed in nature. More precisely, we prove our polymer-zeta model to be appropriate for mRNA molecules but to fail in connection with almost every other family of RNA. Those findings explain the huge speedup of the dynamic programming algorithm observed empirically by Wexler et al. when applying sparsification in connection with mRNA data.


Subject(s)
Nucleic Acid Conformation , RNA/chemistry , Algorithms , Computational Biology/methods , Inverted Repeat Sequences , Linear Models , Mathematical Concepts , Models, Molecular , Probability , RNA/genetics
4.
J Comput Biol ; 22(7): 619-48, 2015 Jul.
Article in English | MEDLINE | ID: mdl-26098199

ABSTRACT

The structure of RNA has been the subject of intense research over the last decades due to its importance for the correct functioning of RNA molecules in biological processes. Hence, a large number of models for RNA folding and corresponding algorithms for structure prediction have been developed. However, previous models often only consider base pairs, although every base is capable of up to three edge-to-edge interactions with other bases. Recently, Höner zu Siederdissen et al. presented an extended model of RNA secondary structure, including base triples together with a folding algorithm-the first thermodynamics-based algorithm that allows the prediction of secondary structures with base triples. In this article, we investigate the search space processed by this new algorithm, that is, the combinatorics of extended RNA secondary structures with base triples. We present generalized definitions for structural motifs like hairpins, stems, bulges, or interior loops occurring in structures with base triples. Furthermore, we prove precise asymptotic results for the number of different structures (size of search space) and expectations for various parameters associated with structural motifs (typical shape of folding). Our analysis shows that the asymptotic number of secondary structures of size n increases exponentially to [Formula: see text] compared to the classic model by Stein and Waterman for which [Formula: see text] structures exist. A comparison with the classic model reveals large deviations in the expected structural appearance, too. The inclusion of base triples constitutes a significant refinement of the combinatorial model of RNA secondary structure, which, by our findings, is quantitatively characterized. Our results are of special theoretical interest, because a closer look at the numbers involved suggests that extended RNA secondary structures constitute a new combinatorial class not bijective with any other combinatorial objects studied so far.


Subject(s)
RNA/chemistry , Algorithms , Base Pairing , Base Sequence , Inverted Repeat Sequences , Models, Chemical
5.
Math Biosci ; 245(2): 216-25, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23900061

ABSTRACT

In this paper we present a sampling framework for RNA structures of fixed topological genus. We introduce a novel, linear time, uniform sampling algorithm for RNA structures of fixed topological genus g, for arbitrary g>0. Furthermore we develop a linear time sampling algorithm for RNA structures of fixed topological genus g that are weighted by a simplified, loop-based energy functional. For this process the partition function of the energy functional has to be computed once, which has O(n(2)) time complexity.


Subject(s)
Nucleic Acid Conformation , RNA/chemistry , Algorithms , Computational Biology , Mathematical Concepts , Models, Molecular
6.
J Comput Biol ; 19(10): 1134-50, 2012 Oct.
Article in English | MEDLINE | ID: mdl-23057823

ABSTRACT

Predicting RNA structures with pseudoknots in general is an NP-complete problem. Accordingly, several authors have suggested subclasses that provide polynomial time prediction algorithms by allowing (respectively, disallowing) certain structural motives. In this article, we introduce a unifying algebraic view on most of these classes. That way it becomes possible to find linear time recognition algorithms that decide whether or not a given structure is member of a class (we offer these algorithms as a web service to the scientific community). Furthermore, by presenting a general translation scheme of our algebraic descriptions into multiple context-free grammars, and proving a new correspondence of multiple context-free grammars and generating functions, it becomes possible to derive the precise asymptotic size of all the classes, solving some open problems such as enumerating the Rivas & Eddy class of pseudoknots.


Subject(s)
Algorithms , Models, Chemical , Models, Molecular , Nucleic Acid Conformation , RNA , RNA/chemistry , RNA/genetics
7.
BMC Bioinformatics ; 13: 159, 2012 Jul 09.
Article in English | MEDLINE | ID: mdl-22776037

ABSTRACT

BACKGROUND: Over the past years, statistical and Bayesian approaches have become increasingly appreciated to address the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for the prediction of RNA secondary structures from a single sequence has been studied which is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (both with respect to prediction accuracy and quality of generated samples), where neither of these two competing approaches generally outperforms the other. RESULTS: In this work, we will consider the SCFG based approach in order to perform an analysis on how the quality of generated sample sets and the corresponding prediction accuracy changes when different degrees of disturbances are incorporated into the needed sampling probabilities. This is motivated by the fact that if the results prove to be resistant to large errors on the distinct sampling probabilities (compared to the exact ones), then it will be an indication that these probabilities do not need to be computed exactly, but it may be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures can be observed to strongly react to slight disturbances, there is little hope for improving the complexity by heuristic procedures. We hence provide a reliable test for the hypothesis that a heuristic method could be implemented to improve the time scaling of RNA secondary structure prediction in the worst-case - without sacrificing much of the accuracy of the results. CONCLUSIONS: Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only small negative impact on both the predictive accuracy and the overall quality of resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method guaranteeing an acceptable predictive accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms.


Subject(s)
Models, Statistical , Nucleic Acid Conformation , RNA/chemistry , Algorithms , Bayes Theorem , Probability , RNA/genetics , Software , Thermodynamics
9.
Theory Biosci ; 130(4): 313-36, 2011 Dec.
Article in English | MEDLINE | ID: mdl-22135038

ABSTRACT

Predicting secondary structures of RNA molecules is one of the fundamental problems of and thus a challenging task in computational structural biology. Over the past decades, mainly two different approaches have been considered to compute predictions of RNA secondary structures from a single sequence: the first one relies on physics-based and the other on probabilistic RNA models. Particularly, the free energy minimization (MFE) approach is usually considered the most popular and successful method. Moreover, based on the paradigm-shifting work by McCaskill which proposes the computation of partition functions (PFs) and base pair probabilities based on thermodynamics, several extended partition function algorithms, statistical sampling methods and clustering techniques have been invented over the last years. However, the accuracy of the corresponding algorithms is limited by the quality of underlying physics-based models, which include a vast number of thermodynamic parameters and are still incomplete. The competing probabilistic approach is based on stochastic context-free grammars (SCFGs) or corresponding generalizations, like conditional log-linear models (CLLMs). These methods abstract from free energies and instead try to learn about the structural behavior of the molecules by learning (a manageable number of) probabilistic parameters from trusted RNA structure databases. In this work, we introduce and evaluate a sophisticated SCFG design that mirrors state-of-the-art physics-based RNA structure prediction procedures by distinguishing between all features of RNA that imply different energy rules. This SCFG actually serves as the foundation for a statistical sampling algorithm for RNA secondary structures of a single sequence that represents a probabilistic counterpart to the sampling extension of the PF approach. Furthermore, some new ways to derive meaningful structure predictions from generated sample sets are presented. They are used to compare the predictive accuracy of our model to that of other probabilistic and energy-based prediction methods. Particularly, comparisons to lightweight SCFGs and corresponding CLLMs for RNA structure prediction indicate that more complex SCFG designs might yield higher accuracy but eventually require more comprehensive and pure training sets. Investigations on both the accuracies of predicted foldings and the overall quality of generated sample sets (especially on an abstraction level, called abstract shapes of generated structures, that is relevant for biologists) yield the conclusion that the Boltzmann distribution of the PF sampling approach is more centered than the ensemble distribution induced by the sophisticated SCFG model, which implies a greater structural diversity within generated samples. In general, neither of the two distinct ensemble distributions is more adequate than the other and the corresponding results obtained by statistical sampling can be expected to bare fundamental differences, such that the method to be preferred for a particular input sequence strongly depends on the considered RNA type.


Subject(s)
Models, Chemical , Models, Statistical , Nucleic Acid Conformation , RNA/chemistry , Algorithms , Stochastic Processes , Thermodynamics
10.
J Bioinform Comput Biol ; 9(6): 749-73, 2011 Dec.
Article in English | MEDLINE | ID: mdl-22084012

ABSTRACT

BACKGROUND: The study of microbial diversity and community structures heavily relies on the analyses of sequence data, predominantly taxonomic marker genes like the small subunit of the ribosomal RNA (SSU rRNA) amplified from environmental samples. Until recently, the "gold standard" for this strategy was the cloning and Sanger sequencing of amplified target genes, usually restricted to a few hundred sequences per sample due to relatively high costs and labor intensity. The recent introduction of massive parallel tag sequencing strategies like pyrosequencing (454 sequencing) has opened a new window into microbial biodiversity research. Due to its swift nature and relatively low expense, this strategy produces millions of environmental SSU rDNA sequences granting the opportunity to gain deep insights into the true diversity and complexity of microbial communities. The bottleneck, however, is the computational processing of these massive sequence data, without which, biologists are hardly able to exploit the full information included in these sequence data. RESULTS: The freely available standalone software package JAGUC implements a broad regime of different functions, allowing for efficient and convenient processing of a huge number of sequence tags, including importing custom-made reference data bases for basic local alignment searches, user-defined quality and search filters for analyses of specific sets of sequences, pairwise alignment-based sequence similarity calculations and clustering as well as sampling saturation and rank abundance analyses. In initial applications, JAGUC successfully analyzed hundreds of thousands of sequence data (eukaryote SSU rRNA genes) from aquatic samples and also was applied for quality assessments of different pyrosequencing platforms. CONCLUSIONS: The new program package JAGUC is a tool that bridges the gap between computational and biological sciences. It enables biologists to process large sequence data sets in order to infer biological meaning from hundreds of thousands of raw sequence data. JAGUC offers advantages over available tools which are further discussed in this manuscript.


Subject(s)
Biodiversity , Software , Base Sequence , DNA, Ribosomal/chemistry , Eukaryota , Genes, rRNA , Phylogeny , Sequence Analysis, DNA
11.
Algorithms Mol Biol ; 6: 24, 2011 Oct 12.
Article in English | MEDLINE | ID: mdl-21992500

ABSTRACT

BACKGROUND: Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random object of a more complicated structure like RNA molecules or proteins are of interest. RESULTS: In this article, we present a new general framework for deriving algorithms for the non-uniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic context-free grammar. Briefly, the framework extends on the well-known recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for the non-uniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from real-life data by using a very detailed context-free grammar (that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to well-known sampling approaches used in several structure prediction tools (such as SFold) ours has two major advantages: Firstly, after a preprocessing step in time O(n2) for the computation of all weighted class sizes needed, with our approach a set of m random secondary structures of a given structure size n can be computed in worst-case time complexity Om⋅n⋅ log(n) while other algorithms typically have a runtime in O(m⋅n2). Secondly, our approach works with integer arithmetic only which is faster and saves us from all the discomforting details of using floating point arithmetic with logarithmized probabilities. CONCLUSION: A number of experimental results shows that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://wwwagak.cs.uni-kl.de/NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.

12.
J Comput Biol ; 18(12): 1793-806, 2011 Dec.
Article in English | MEDLINE | ID: mdl-21417777

ABSTRACT

In this article, we compute the limit distributions of the numbers of hairpin-loops, interior-loops and bulges in k-noncrossing RNA structures. The latter are coarse-grained RNA structures allowing for cross-serial interactions, subject to the constraint that there are at most k - 1 mutually crossing arcs in the diagram representation of the molecule. We prove central limit theorems by means of studying the corresponding bivariate generating functions. These generating functions are obtained by symbolic inflation of [Formula: see text]-shapes introduced by Reidys and Wang (2009).


Subject(s)
Nucleic Acid Conformation , RNA/chemistry , Base Sequence , Models, Molecular , Molecular Sequence Data , RNA, Transfer, Amino Acyl/chemistry , RNA, Transfer, Amino Acyl/genetics
13.
Bioinformatics ; 27(8): 1076-85, 2011 Apr 15.
Article in English | MEDLINE | ID: mdl-21335320

ABSTRACT

MOTIVATION: Several dynamic programming algorithms for predicting RNA structures with pseudoknots have been proposed that differ dramatically from one another in the classes of structures considered. RESULTS: Here, we use the natural topological classification of RNA structures in terms of irreducible components that are embeddable in the surfaces of fixed genus. We add to the conventional secondary structures four building blocks of genus one in order to construct certain structures of arbitrarily high genus. A corresponding unambiguous multiple context-free grammar provides an efficient dynamic programming approach for energy minimization, partition function and stochastic sampling. It admits a topology-dependent parametrization of pseudoknot penalties that increases the sensitivity and positive predictive value of predicted base pairs by 10-20% compared with earlier approaches. More general models based on building blocks of higher genus are also discussed. AVAILABILITY: The source code of gfold is freely available at http://www.combinatorics.cn/cbpc/gfold.tar.gz. CONTACT: duck@santafe.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
RNA/chemistry , Algorithms , Base Pairing , Nucleic Acid Conformation , RNA/classification , Sequence Analysis, RNA , Software
14.
Environ Microbiol ; 13(2): 340-9, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21281421

ABSTRACT

Initial environmental pyrosequencing studies suggested highly complex protistan communities with phylotype richness decisively higher than previously estimated. However, recent studies on individual bacteria or artificial bacterial communities evidenced that pyrosequencing errors may skew our view of the true complexity of microbial communities. We pyrosequenced two diversity markers (hypervariable regions V4 and V9 of the small-subunit rDNA) of an intertidal protistan model community, using the Roche GS-FLX and the most recent GS-FLX Titanium sequencing systems. After pyrosequencing 24 reference sequences we obtained up to 2039 unique tags (from 3879 V4 GS-FLX Titanium reads), 77% of which were singletons. Even binning sequences that share 97% similarity still emulated a pseudodiversity exceeding the true complexity of the model community up to three times (V9 GS-FLX). Pyrosequencing error rates were higher for V4 fragments compared with the V9 domain and for the GS-FLX Titanium compared with the GS-FLX system. Furthermore, this experiment revealed that error rates are taxon-specific. As an outcome of this study we suggest a fast and efficient strategy to discriminate pyrosequencing signals from noise in order to more realistically depict the structure of protistan communities using simple tools that are implemented in standard tag data-processing pipelines.


Subject(s)
Bacteria/classification , DNA, Bacterial/genetics , Genes, rRNA , Sequence Analysis, DNA/methods , Bacteria/genetics , Biodiversity , Cluster Analysis
15.
Environ Microbiol Rep ; 3(2): 154-8, 2011 Apr.
Article in English | MEDLINE | ID: mdl-23761246

ABSTRACT

Delineating operational taxonomic units (OTUs) is a central element in any culture-independent analysis of environmental microbial eukaryotic diversity. Previous studies either have not justified their choice in sequence distance used to bin small-subunit ribosomal RNA (SSU rRNA) gene sequences amplified from environmental samples into OTUs, or have used a value based on the average across a broad sampling of microbial eukaryotes. Here, we analyse distances (320 922 pairwise comparisons) among sequences just from identified ciliates, and compare these with their taxonomic hierarchy. Our results show that no single sequence similarity value can always and unambiguously delineate species boundaries and higher taxa. Nevertheless, we suggest the use of 98% similarity to delineate ciliate OTUs because this threshold at least accounts for intra-specific polymorphism among multiple rRNA cistron copies. However, we suggest refraining from reconciling SSU rRNA gene-based OTUs and ciliate morphotypes; these OTUs should be used to analyse ciliate phylotype diversity, not ciliate species diversity.

16.
Article in English | MEDLINE | ID: mdl-21116040

ABSTRACT

There are two custom ways for predicting RNA secondary structures: minimizing the free energy of a conformation according to a thermodynamic model and maximizing the probability of a folding according to a stochastic model. In most cases, stochastic grammars are used for the latter alternative applying the maximum likelihood principle for determining a grammar's probabilities. In this paper, building on such a stochastic model, we will analyze the expected minimum free energy of an RNA molecule according to Turner's energy rules. Even if the parameters of our grammar are chosen with respect to structural properties of native molecules only (and therefore, independent of molecules' free energy), we prove formulae for the expected minimum free energy and the corresponding variance as functions of the molecule's size which perfectly fit the native behavior of free energies. This gives proof for a high quality of our stochastic model making it a handy tool for further investigations. In fact, the stochastic model for RNA secondary structures presented in this work has, for example, been used as the basis of a new algorithm for the (nonuniform) generation of random RNA secondary structures.


Subject(s)
Computational Biology/methods , Nucleic Acid Conformation , RNA/chemistry , Models, Molecular , RNA Folding , Thermodynamics
17.
Mol Ecol ; 19 Suppl 1: 21-31, 2010 Mar.
Article in English | MEDLINE | ID: mdl-20331767

ABSTRACT

Sequencing of ribosomal DNA clone libraries amplified from environmental DNA has revolutionized our understanding of microbial eukaryote diversity and ecology. The results of these analyses have shown that protist groups are far more genetically heterogeneous than their morphological diversity suggests. However, the clone library approach is labour-intensive, relatively expensive, and methodologically biased. Therefore, even the most intensive rDNA library analyses have recovered only small samples of much larger assemblages, indicating that global environments harbour a vast array of unexplored biodiversity. High-throughput parallel tag 454 sequencing offers an unprecedented scale of sampling for molecular detection of microbial diversity. Here, we report a 454 protocol for sampling and characterizing assemblages of eukaryote microbes. We use this approach to sequence two SSU rDNA diversity markers-the variable V4 and V9 regions-from 10 L of anoxic Norwegian fjord water. We identified 38 116 V4 and 15 156 V9 unique sequences. Both markers detect a wide range of taxonomic groups but in both cases the diversity detected was dominated by dinoflagellates and close relatives. Long-tailed rank abundance curves suggest that the 454 sequencing approach provides improved access to rare genotypes. Most tags detected represent genotypes not currently in GenBank, although many are similar to database sequences. We suggest that current understanding of the ecological complexity of protist communities, genetic diversity, and global species richness are severely limited by the sequence data hitherto available, and we discuss the biological significance of this high amplicon diversity.


Subject(s)
Biodiversity , DNA, Ribosomal/analysis , Seawater/microbiology , Sequence Analysis, DNA/methods , Water Microbiology , Cluster Analysis , Computational Biology , DNA, Ribosomal/genetics , DNA, Ribosomal/isolation & purification , Dinoflagellida/classification , Dinoflagellida/genetics , Gene Library , Sequence Tagged Sites
18.
Theory Biosci ; 128(4): 211-25, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19756808

ABSTRACT

Over the last few decades, much effort has been taken to develop approaches for identifying good predictions of RNA secondary structure. This is due to the fact that most computational prediction methods based on free energy minimization compute a number of suboptimal foldings and we have to identify the native folding among all these possible secondary structures. Using the abstract shapes approach as introduced by Giegerich et al. (Nucleic Acids Res 32(16):4843-4851, 2004), each class of similar secondary structures is represented by one shape and the native structures can be found among the top shape representatives. In this article, we derive some interesting results answering enumeration problems for abstract shapes and secondary structures of RNA. We compute precise asymptotics for the number of different shape representations of size n and for the number of different shapes showing up when abstracting from secondary structures of size n under a combinatorial point of view. A more realistic model taking primary structures into account remains an open challenge. We give some arguments why the present techniques cannot be applied in this case.


Subject(s)
Computational Biology , Models, Molecular , Nucleic Acid Conformation , RNA/chemistry , Algorithms , Base Pairing , Base Sequence , RNA/genetics
19.
J Math Biol ; 56(1-2): 161-81, 2008 Jan.
Article in English | MEDLINE | ID: mdl-17589847

ABSTRACT

The most probable secondary structure of an RNA molecule, given the nucleotide sequence, can be computed efficiently if a stochastic context-free grammar (SCFG) is used as the prior distribution of the secondary structure. The structures of some RNA molecules contain so-called pseudoknots. Allowing all possible configurations of pseudoknots is not compatible with context-free grammar models and makes the search for an optimal secondary structure NP-complete. We suggest a probabilistic model for RNA secondary structures with pseudoknots and present a Markov-chain Monte-Carlo Method for sampling RNA structures according to their posterior distribution for a given sequence. We favor Bayesian sampling over optimization methods in this context, because it makes the uncertainty of RNA structure predictions assessable. We demonstrate the benefit of our method in examples with tmRNA and also with simulated data. McQFold, an implementation of our method, is freely available from http://www.cs.uni-frankfurt.de/~metzler/McQFold.


Subject(s)
Computational Biology/methods , Markov Chains , Models, Chemical , Nucleic Acid Conformation , RNA, Messenger/chemistry , Monte Carlo Method , RNA, Messenger/genetics , Software , Treponema pallidum/genetics
20.
Bull Math Biol ; 66(5): 925-64, 2004 Sep.
Article in English | MEDLINE | ID: mdl-15294413

ABSTRACT

Within this paper we investigate the Bernoulli model for random secondary structures of ribonucleic acid (RNA) molecules. Assuming that two random bases can form a hydrogen bond with probability p we prove asymptotic equivalents for the averaged number of hairpins and bulges, the averaged loop length, the expected order, the expected number of secondary structures of size n and order k and further parameters all depending on p. In this way we get an insight into the change of shape of a random structure during the process 1 p -->0. Afterwards we compare the computed parameters for random structures in the Bernoulli model to the corresponding quantities for real existing secondary structures of large subunit rRNA molecules found in the database of Wuyts et al. That is how it becomes possible to identify those parameters which behave (almost) randomly and those which do not and thus should be considered as interesting, e.g., with respect to the biological functions or the algorithmic prediction of RNA secondary structures.


Subject(s)
Models, Chemical , RNA/chemistry , Hydrogen Bonding , Models, Molecular , Nucleic Acid Conformation
SELECTION OF CITATIONS
SEARCH DETAIL
...