Search | VHL Regional Portal

Reconstructing phylogenies from noisy quartets in polynomial time with a high success probability.

Wu, Gang; Kao, Ming-Yang; Lin, Guohui; You, Jia-Huai.

Algorithms Mol Biol ; 3: 1, 2008 Jan 24.

Article in English | MEDLINE | ID: mdl-18218120

ABSTRACT

BACKGROUND: In recent years, quartet-based phylogeny reconstruction methods have received considerable attention in the computational biology community. Traditionally, the accuracy of a phylogeny reconstruction method is measured by simulations on synthetic datasets with known "true" phylogenies, while little theoretical analysis has been done. In this paper, we present a new model-based approach to measuring the accuracy of a quartet-based phylogeny reconstruction method. Under this model, we propose three efficient algorithms to reconstruct the "true" phylogeny with a high success probability. RESULTS: The first algorithm can reconstruct the "true" phylogeny from an input quartet topology set without quartet errors in O(n^2) time by querying at most (n - 4) log(n - 1) quartet topologies, where n is the number of taxa. When the input quartet topology set contains errors, the second algorithm can reconstruct the "true" phylogeny with probability approximately 1 - p in O(n^4 log n) time, where p is the probability that a quartet topology is erroneous. The third algorithm improves this probability to approximately [equation; see text], where [equation, see text], which is at least 0.984 when p < 0.05, with a running time of O(n^5). CONCLUSION: The three proposed algorithms are mathematically guaranteed to reconstruct the "true" phylogeny with a high success probability. The experimental results showed that the third algorithm produced phylogenies with a higher probability than its aforementioned theoretical lower bound and outperformed some existing phylogeny reconstruction methods in both speed and accuracy.
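To make the notion of a quartet topology query concrete, the following is a minimal sketch of how the topology induced on four taxa can be read off a known tree. It is not one of the three algorithms from the paper. It assumes a fully resolved, unrooted tree stored as an adjacency dictionary and uses the standard four-point condition on path lengths; the names `path_lengths`, `quartet_topology`, and `EXAMPLE_TREE` are hypothetical.

```python
from collections import deque


def path_lengths(tree, source):
    """Breadth-first search: number of edges from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in tree[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def quartet_topology(tree, a, b, c, d):
    """Return the split induced on {a, b, c, d}, e.g. ((a, b), (c, d)) for ab|cd.

    Four-point condition: in a fully resolved tree, the pairing with the
    smallest sum of within-pair path lengths is the induced split.
    """
    dist = {taxon: path_lengths(tree, taxon) for taxon in (a, b, c, d)}
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(
        pairings,
        key=lambda p: dist[p[0][0]][p[0][1]] + dist[p[1][0]][p[1][1]],
    )


# Hypothetical five-taxon tree: leaves A..E, internal nodes u, v, w.
EXAMPLE_TREE = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["w"], "E": ["w"],
    "u": ["A", "B", "v"], "v": ["u", "C", "w"], "w": ["v", "D", "E"],
}

print(quartet_topology(EXAMPLE_TREE, "A", "B", "C", "D"))
```

An error-prone input in the sense of the abstract would correspond to returning one of the other two pairings with probability p; that noise model is not simulated here.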

Design optimization methods for genomic DNA tiling arrays.

Bertone, Paul; Trifonov, Valery; Rozowsky, Joel S; Schubert, Falk; Emanuelsson, Olof; Karro, John; Kao, Ming-Yang; Snyder, Michael; Gerstein, Mark.

Genome Res ; 16(2): 271-81, 2006 Feb.

Article in English | MEDLINE | ID: mdl-16365382

ABSTRACT

A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.
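The tiling formulation in the abstract can be illustrated with a small dynamic program. This is a sketch under stated assumptions, not the authors' implementation: the sequence is reduced to a hypothetical boolean mask `is_unique` marking non-repetitive positions, tiles must lie entirely inside non-repetitive runs, and the objective is the number of covered bases. This simple version runs in O(n * (max_len - min_len + 1)) time rather than the linear time reported for the paper's algorithm.

```python
def max_tile_coverage(is_unique, min_len, max_len):
    """Maximum number of bases covered by non-overlapping tiles.

    Each tile has a length in [min_len, max_len] and may only cover
    positions where is_unique is true.
    """
    n = len(is_unique)
    best = [0] * (n + 1)  # best[i]: optimal coverage of the first i bases
    run = 0               # length of the non-repetitive run ending at base i - 1
    for i in range(1, n + 1):
        run = run + 1 if is_unique[i - 1] else 0
        best[i] = best[i - 1]  # option 1: leave base i - 1 uncovered
        # Option 2: end a tile at base i - 1, if the run is long enough.
        for length in range(min_len, min(max_len, run) + 1):
            best[i] = max(best[i], best[i - length] + length)
    return best[n]


# Runs of length 3, 5, and 2 separated by repeats (0 = repetitive).
mask = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]
print(max_tile_coverage(mask, min_len=3, max_len=4))
```

With tile sizes 3 to 4, the first run contributes 3 bases, the second 4 (two tiles of 3 do not fit in 5 bases), and the third none.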

Subject(s)

Algorithms , Gene Expression Profiling , Genome, Human , Interspersed Repetitive Sequences , Oligonucleotide Array Sequence Analysis , Animals , Gene Expression Profiling/methods , Gene Expression Profiling/standards , Genome, Human/genetics , Humans , Interspersed Repetitive Sequences/genetics , Oligonucleotide Array Sequence Analysis/methods , Oligonucleotide Array Sequence Analysis/standards , Reproducibility of Results , Sensitivity and Specificity

Fast optimal genome tiling with applications to microarray design and homology search.

Berman, Piotr; Bertone, Paul; Dasgupta, Bhaskar; Gerstein, Mark; Kao, Ming-Yang; Snyder, Michael.

J Comput Biol ; 11(4): 766-85, 2004.

Article in English | MEDLINE | ID: mdl-15579244

ABSTRACT

In this paper, we consider several variations of the following basic tiling problem: given a sequence of real numbers with two size-bound parameters, we want to find a set of tiles of maximum total weight such that each tile satisfies the size bounds. A solution to this problem is important to a number of computational biology applications such as selecting genomic DNA fragments for PCR-based amplicon microarrays and performing homology searches with long sequence queries. Our goal is to design efficient algorithms with linear or near-linear time and space in the normal range of parameter values for these problems. For this purpose, we first discuss the solution to a basic online interval maximum problem via a sliding-window approach and show how to use this solution in a nontrivial manner for many of the tiling problems introduced. We also discuss NP-hardness results and approximation algorithms for generalizing our basic tiling problem to higher dimensions. Finally, computational results from applying our tiling algorithms to genomic sequences of five model eukaryotes are reported.
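The sliding-window idea mentioned in the abstract can be sketched with the standard monotonic-deque technique for window maxima. Whether the paper's online interval maximum routine takes exactly this form is an assumption; the function name `sliding_window_max` and its parameters are hypothetical.

```python
from collections import deque


def sliding_window_max(values, width):
    """Return the maximum of every window of `width` consecutive values.

    Values are consumed one at a time (online); each index enters and
    leaves the deque at most once, so the total time is linear.
    """
    candidates = deque()  # indices whose values are strictly decreasing
    result = []
    for i, value in enumerate(values):
        # Older entries that are no larger than the new value can never
        # be a window maximum again.
        while candidates and values[candidates[-1]] <= value:
            candidates.pop()
        candidates.append(i)
        # Drop the front index once it falls out of the current window.
        if candidates[0] <= i - width:
            candidates.popleft()
        if i >= width - 1:
            result.append(values[candidates[0]])
    return result


print(sliding_window_max([1, 3, -1, -3, 5, 3, 6, 7], 3))
```

In a tiling setting the values would typically be prefix-based scores rather than raw sequence entries, so that the best start position for a tile ending at each index can be found in amortized constant time.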

Subject(s)

Genomics/statistics & numerical data , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Sequence Alignment/statistics & numerical data , Algorithms , Computational Biology , DNA/genetics , Mathematics , Models, Genetic

Predicting RNA secondary structures with arbitrary pseudoknots by maximizing the number of stacking pairs.

Ieong, Samuel; Kao, Ming-Yang; Lam, Tak-Wah; Sung, Wing-Kin; Yiu, Siu-Ming.

J Comput Biol ; 10(6): 981-95, 2003.

Article in English | MEDLINE | ID: mdl-14980021

ABSTRACT

The paper investigates the computational problem of predicting RNA secondary structures. The general belief is that allowing pseudoknots makes the problem hard. Existing polynomial-time algorithms are heuristic algorithms with no performance guarantee and can handle only limited types of pseudoknots. In this paper, we initiate the study of predicting RNA secondary structures with a maximum number of stacking pairs while allowing arbitrary pseudoknots. We obtain two approximation algorithms with worst-case approximation ratios of 1/2 and 1/3 for planar and general secondary structures, respectively. For an RNA sequence of n bases, the approximation algorithm for planar secondary structures runs in O(n^3) time while that for the general case runs in linear time. Furthermore, we prove that allowing pseudoknots makes it NP-hard to maximize the number of stacking pairs in a planar secondary structure. This result is in contrast with the recent NP-hardness results on pseudoknots, which are based on optimizing some general and complicated energy functions.
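The objective being maximized can be illustrated by counting stacking pairs in a given structure. This sketch assumes that a stacking pair consists of two consecutive base pairs (i, j) and (i+1, j-1); it only evaluates a structure and does not implement either approximation algorithm. The name `count_stacking_pairs` is hypothetical.

```python
def count_stacking_pairs(base_pairs):
    """Count stacking pairs in a secondary structure.

    base_pairs is an iterable of (i, j) index pairs. A stacking pair is
    assumed here to be two base pairs (i, j) and (i + 1, j - 1).
    """
    # Normalize so that every pair is stored with the smaller index first.
    pairs = {(min(i, j), max(i, j)) for i, j in base_pairs}
    return sum(1 for (i, j) in pairs if (i + 1, j - 1) in pairs)


# A three-pair helix contributes two stacking pairs; the isolated pair
# (3, 12) crosses the helix (a pseudoknot) and contributes none.
structure = [(0, 9), (1, 8), (2, 7), (3, 12)]
print(count_stacking_pairs(structure))
```

Because the count depends only on which pairs are adjacent, crossing pairs are handled exactly like nested ones, which matches the abstract's setting of arbitrary pseudoknots.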

Subject(s)

Computational Biology/methods , Nucleic Acid Conformation , RNA/chemistry , Algorithms , Base Pairing , Base Sequence , Molecular Sequence Data , RNA/genetics

A combinatorial toolbox for protein sequence design and landscape analysis in the grand canonical model.

Aspnes, James; Hartling, Julia; Kao, Ming-Yang; Kim, Junhyong; Shah, Gauri.

J Comput Biol ; 9(5): 721-41, 2002.

Article in English | MEDLINE | ID: mdl-12487760

ABSTRACT

In modern biology, one of the most important research problems is to understand how protein sequences fold into their native 3D structures. To investigate this problem at a high level, one wishes to analyze the protein landscapes, i.e., the structures of the space of all protein sequences and their native 3D structures. Perhaps the most basic computational problem at this level is to take a target 3D structure as input and design a fittest protein sequence with respect to one or more fitness functions of the target 3D structure. We develop a toolbox of combinatorial techniques for protein landscape analysis in the Grand Canonical model of Sun, Brem, Chan, and Dill. The toolbox is based on linear programming, network flow, and a linear-size representation of all minimum cuts of a network. It not only substantially expands the network flow technique for protein sequence design in Kleinberg's seminal work but also is applicable to a considerably broader collection of computational problems than those considered by Kleinberg. We have used this toolbox to obtain a number of efficient algorithms and hardness results. We have further used the algorithms to analyze 3D structures drawn from the Protein Data Bank and have discovered some novel relationships between such native 3D structures and the Grand Canonical model.

Subject(s)

Models, Theoretical , Protein Conformation , Proteins/chemistry , Software Design , Algorithms , Computer Simulation , Models, Chemical , Mutation , Programming, Linear , Protein Engineering/methods , Proteins/genetics , Reproducibility of Results
