Search | VHL Regional Portal

1.

A Resolution of the Static Formulation Question for the Problem of Computing the History Bound.

Matsieva, Julia; Kelk, Steven; Scornavacca, Celine; Whidden, Chris; Gusfield, Dan.

IEEE/ACM Trans Comput Biol Bioinform ; 14(2): 404-417, 2017.

Article in English | MEDLINE | ID: mdl-26887004

ABSTRACT

Evolutionary data has been traditionally modeled via phylogenetic trees; however, branching alone cannot model conflicting phylogenetic signals, so networks are used instead. Ancestral recombination graphs (ARGs) are used to model the evolution of incompatible sets of SNP data, allowing each site to mutate only once. The model often aims to minimize the number of recombinations. Similarly, incompatible cluster data can be represented by a reticulation network that minimizes reticulation events. The ARG literature has traditionally been disjoint from the reticulation network literature. By building on results from the reticulation network literature, we resolve an open question of interest to the ARG community. We explicitly prove that the History Bound, a lower bound on the number of recombinations in an ARG for a binary matrix, which was previously only defined procedurally, is equal to the minimum number of reticulation nodes in a network for the corresponding cluster data. To facilitate the proof, we give an algorithm that constructs this network using intermediate values from the procedural History Bound definition. We then develop a top-down algorithm for computing the History Bound, which has the same worst-case runtime as the known dynamic program, and show that it is likely to run faster in typical cases.

Subject(s)

Algorithms , Computational Biology/methods , Evolution, Molecular , Phylogeny , Recombination, Genetic/genetics , Cluster Analysis , Models, Genetic , Polymorphism, Single Nucleotide

2.

An improved Four-Russians method and sparsified Four-Russians algorithm for RNA folding.

Frid, Yelena; Gusfield, Dan.

Algorithms Mol Biol ; 11: 22, 2016.

Article in English | MEDLINE | ID: mdl-27499801

ABSTRACT

BACKGROUND: The basic RNA secondary structure prediction problem or single sequence folding problem (SSF) was solved 35 years ago by a now well-known [Formula: see text]-time dynamic programming method. Recently three methodologies-Valiant, Four-Russians, and Sparsification-have been applied to speedup RNA secondary structure prediction. The sparsification method exploits two properties of the input: the number of subsequence Z with the endpoints belonging to the optimal folding set and the maximum number base-pairs L. These sparsity properties satisfy [Formula: see text] and [Formula: see text], and the method reduces the algorithmic running time to O(LZ). While the Four-Russians method utilizes tabling partial results. RESULTS: In this paper, we explore three different algorithmic speedups. We first expand the reformulate the single sequence folding Four-Russians [Formula: see text]-time algorithm, to utilize an on-demand lookup table. Second, we create a framework that combines the fastest Sparsification and new fastest on-demand Four-Russians methods. This combined method has worst-case running time of [Formula: see text], where [Formula: see text] and [Formula: see text]. Third we update the Four-Russians formulation to achieve an on-demand [Formula: see text]-time parallel algorithm. This then leads to an asymptotic speedup of [Formula: see text] where [Formula: see text] and [Formula: see text] the number of subsequence with the endpoint j belonging to the optimal folding set. CONCLUSIONS: The on-demand formulation not only removes all extraneous computation and allows us to incorporate more realistic scoring schemes, but leads us to take advantage of the sparsity properties. Through asymptotic analysis and empirical testing on the base-pair maximization variant and a more biologically informative scoring scheme, we show that this Sparse Four-Russians framework is able to achieve a speedup on every problem instance, that is asymptotically never worse, and empirically better than achieved by the minimum of the two methods alone.

3.

Faster algorithms for RNA-folding using the Four-Russians method.

Venkatachalam, Balaji; Gusfield, Dan; Frid, Yelena.

Algorithms Mol Biol ; 9(1): 5, 2014 Mar 06.

Article in English | MEDLINE | ID: mdl-24602450

ABSTRACT

BACKGROUND: The secondary structure that maximizes the number of non-crossing matchings between complimentary bases of an RNA sequence of length n can be computed in O(n3) time using Nussinov's dynamic programming algorithm. The Four-Russians method is a technique that reduces the running time for certain dynamic programming algorithms by a multiplicative factor after a preprocessing step where solutions to all smaller subproblems of a fixed size are exhaustively enumerated and solved. Frid and Gusfield designed an O(n3logn) algorithm for RNA folding using the Four-Russians technique. In their algorithm the preprocessing is interleaved with the algorithm computation. THEORETICAL RESULTS: We simplify the algorithm and the analysis by doing the preprocessing once prior to the algorithm computation. We call this the two-vector method. We also show variants where instead of exhaustive preprocessing, we only solve the subproblems encountered in the main algorithm once and memoize the results. We give a simple proof of correctness and explore the practical advantages over the earlier method.The Nussinov algorithm admits an O(n2) time parallel algorithm. We show a parallel algorithm using the two-vector idea that improves the time bound to O(n2logn). PRACTICAL RESULTS: We have implemented the parallel algorithm on graphics processing units using the CUDA platform. We discuss the organization of the data structures to exploit coalesced memory access for fast running times. The ideas to organize the data structures also help in improving the running time of the serial algorithms. For sequences of length up to 6000 bases the parallel algorithm takes only about 2.5 seconds and the two-vector serial method takes about 57 seconds on a desktop and 15 seconds on a server. Among the serial algorithms, the two-vector and memoized versions are faster than the Frid-Gusfield algorithm by a factor of 3, and are faster than Nussinov by up to a factor of 20. The source-code for the algorithms is available at http://github.com/ijalabv/FourRussiansRNAFolding.

4.

Constructing perfect phylogenies and proper triangulations for three-state characters.

Gysel, Rob; Lam, Fumei; Gusfield, Dan.

Algorithms Mol Biol ; 7(1): 26, 2012 Sep 24.

Article in English | MEDLINE | ID: mdl-23006612

ABSTRACT

: In this paper, we study the problem of constructing perfect phylogenies for three-state characters. Our work builds on two recent results. The first result states that for three-state characters, the local condition of examining all subsets of three characters is sufficient to determine the global property of admitting a perfect phylogeny. The second result applies tools from minimal triangulation theory to the partition intersection graph to determine if a perfect phylogeny exists. Despite the wealth of combinatorial tools and algorithms stemming from the chordal graph and minimal triangulation literature, it is unclear how to use such approaches to efficiently construct a perfect phylogeny for three-state characters when the data admits one. We utilize structural properties of both the partition intersection graph and the original data in order to achieve a competitive time bound.

5.

Extensions and improvements to the chordal graph approach to the multistate perfect phylogeny problem.

Gysel, Rob; Gusfield, Dan.

IEEE/ACM Trans Comput Biol Bioinform ; 8(4): 912-7, 2011.

Article in English | MEDLINE | ID: mdl-21301033

ABSTRACT

The multistate perfect phylogeny problem is a classic problem in computational biology. When no perfect phylogeny exists, it is of interest to find a set of characters to remove in order to obtain a perfect phylogeny in the remaining data. This is known as the character removal problem. We show how to use chordal graphs and triangulations to solve the character removal problem for an arbitrary number of states, which was previously unsolved. We outline a preprocessing technique that speeds up the computation of the minimal separators of a graph. Minimal separators are used in our solution to the missing data character removal problem and to Gusfield's solution of the perfect phylogeny problem with missing data.

Subject(s)

Algorithms , Computational Biology/methods , Models, Genetic , Phylogeny

6.

Untangling tanglegrams: comparing trees by their drawings.

Venkatachalam, Balaji; Apple, Jim; St John, Katherine; Gusfield, Dan.

IEEE/ACM Trans Comput Biol Bioinform ; 7(4): 588-97, 2010.

Article in English | MEDLINE | ID: mdl-20530818

ABSTRACT

A tanglegram is a pair of trees on the same set of leaves with matching leaves in the two trees joined by an edge. Tanglegrams are widely used in biology--to compare evolutionary histories of host and parasite species and to analyze genes of species in the same geographical area. We consider optimization problems in tanglegram drawings. We show a linear time algorithm to decide if a tanglegram admits a planar embedding by a reduction to the planar graph drawing problem. This problem was also studied by Fernau et al. A similar reduction to a graph crossing problem also helps to solve an open problem they posed, showing a fixed-parameter tractable algorithm for minimizing the number of crossings over all d-ary trees. For the case where one tree is fixed, we show an O(n log n) algorithm to determine the drawing of the second tree that minimizes the number of crossings. This improves the bound from earlier methods. We introduce a new optimization criterion using Spearman's footrule distance and give an O(n²) algorithm. We also show integer programming formulations to quickly obtain tanglegram drawings that minimize the two optimization measures discussed. We prove lower bounds on the maximum gap between the optimal solution and the heuristic of Dwyer and Schreiber to minimize crossings.

Subject(s)

Algorithms , Phylogeny , Evolution, Molecular

7.

The Multi-State Perfect Phylogeny Problem with missing and removable data: solutions via integer-programming and chordal graph theory.

Gusfield, Dan.

J Comput Biol ; 17(3): 383-99, 2010 Mar.

Article in English | MEDLINE | ID: mdl-20377452

ABSTRACT

The Multi-State Perfect Phylogeny Problem is an extension of the Binary Perfect Phylogeny Problem, allowing characters to take on more than two states. In this article, we consider three problems that extend the utility of the multi-state perfect phylogeny model: (1) the Missing Data (MD) Problem, where some entries in the input are missing and the question is whether (bounded) values for the missing data can be imputed so that the resulting data has a multi-state perfect phylogeny; (2) the Character-Removal (CR) Problem, where we want to minimize the number of characters to remove from the data so that the resulting data has a multi-state perfect phylogeny; and (3) the Missing-Data Character-Removal (MDCR) Problem, where the input has missing data and we want to impute values for the missing data to minimize the solution to the resulting Character-Removal Problem. We discuss Integer Linear Programming (ILP) solutions to these problems for the special case of three, four, and five permitted states per character, and we report on extensive empirical testing of these solutions. Then we develop a general theory to solve the MD problem for an arbitrary number of permitted states, using chordal graph theory and results on minimal triangulation of non-chordal graphs. This establishes new necessary and sufficient conditions for the existence of a perfect phylogeny with (or without) missing data. We implement the general theory using integer linear programming, although other optimization methods are possible. We extensively explore the empirical behavior of the general solution, showing that the methods are very practical for data of size and complexity that is characteristic of many current applications in phylogenetics. Some of the empirical results for the MD problem with an arbitrary number of permitted states are very surprising, suggesting the existence of additional combinatorial structure in multi-state perfect phylogenies. Finally, we note some relationships between our chordal-graph approach to the multi-state perfect phylogeny, without missing data, and prior methods.

Subject(s)

Phylogeny , Programming, Linear , Statistics as Topic , Algorithms , Animals , Computational Biology , Time Factors

8.

A simple, practical and complete O(n3/log n)-time algorithm for RNA folding using the Four-Russians speedup.

Frid, Yelena; Gusfield, Dan.

Algorithms Mol Biol ; 5: 13, 2010 Jan 04.

Article in English | MEDLINE | ID: mdl-20047670

ABSTRACT

BACKGROUND: The problem of computationally predicting the secondary structure (or folding) of RNA molecules was first introduced more than thirty years ago and yet continues to be an area of active research and development. The basic RNA-folding problem of finding a maximum cardinality, non-crossing, matching of complimentary nucleotides in an RNA sequence of length n, has an O(n3)-time dynamic programming solution that is widely applied. It is known that an o(n3) worst-case time solution is possible, but the published and suggested methods are complex and have not been established to be practical. Significant practical improvements to the original dynamic programming method have been introduced, but they retain the O(n3) worst-case time bound when n is the only problem-parameter used in the bound. Surprisingly, the most widely-used, general technique to achieve a worst-case (and often practical) speed up of dynamic programming, the Four-Russians technique, has not been previously applied to the RNA-folding problem. This is perhaps due to technical issues in adapting the technique to RNA-folding. RESULTS: In this paper, we give a simple, complete, and practical Four-Russians algorithm for the basic RNA-folding problem, achieving a worst-case time-bound of O(n3/log(n)). CONCLUSIONS: We show that this time-bound can also be obtained for richer nucleotide matching scoring-schemes, and that the method achieves consistent speed-ups in practice. The contribution is both theoretical and practical, since the basic RNA-folding problem is often solved multiple times in the inner-loop of more complex algorithms, and for long RNA molecules in the study of RNA virus genomes.

9.

Algorithms to distinguish the role of gene-conversion from single-crossover recombination in the derivation of SNP sequences in populations.

Song, Yun S; Ding, Zhihong; Gusfield, Dan; Langley, Charles H; Wu, Yufeng.

J Comput Biol ; 14(10): 1273-86, 2007 Dec.

Article in English | MEDLINE | ID: mdl-18047424

ABSTRACT

Meiotic recombination is a fundamental biological event and one of the principal evolutionary forces responsible for shaping genetic variation within species. In addition to its fundamental role, recombination is central to several critical applied problems. The most important example is "association mapping" in populations, which is widely hoped to help find genes that influence genetic diseases (Carlson et al., 2004; Clark, 2003). Hence, a great deal of recent attention has focused on problems of inferring the historical derivation of sequences in populations when both mutations and recombinations have occurred. In the algorithms literature, most of that recent work has been directed to single-crossover recombination. However, gene-conversion is an important, and more common, form of (two-crossover) recombination which has been much less investigated in the algorithms literature. In this paper, we explicitly incorporate gene-conversion into discrete methods to study historical recombination. We are concerned with algorithms for identifying and locating the extent of historical crossing-over and gene-conversion (along with single-nucleotide mutation), and problems of constructing full putative histories of those events. The novel technical issues concern the incorporation of gene-conversion into recently developed discrete methods (Myers and Griffiths, 2003; Song et al., 2005) that compute lower and upper-bound information on the amount of needed recombination without gene-conversion. We first examine the most natural extension of the lower bound methods from Myers and Griffiths (2003), showing that the extension can be computed efficiently, but that this extension can only yield weak lower bounds. We then develop additional ideas that lead to higher lower bounds, and show how to solve, via integer-linear programming, a more biologically realistic version of the lower bound problem. We also show how to compute effective upper bounds on the number of needed single-crossovers and gene-conversions, along with explicit networks showing a putative history of mutations, single-crossovers and gene-conversions. Both lower and upper bound methods can handle data with missing entries, and the upper bound method can be used to infer missing entries with high accuracy. We validate the significance of these methods by showing that they can be effectively used to distinguish simulation-derived sequences generated without gene-conversion from sequences that were generated with gene-conversion. We apply the methods to recently studied sequences of Arabidopsis thaliana, identifying many more regions in the sequences than were previously identified (Plagnol et al., 2006), where gene-conversion may have played a significant role. Demonstration software is available at www.csif.cs.ucdavis.edu/~gusfield.

Subject(s)

Algorithms , Arabidopsis/genetics , Crossing Over, Genetic/genetics , Gene Conversion , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA/methods , Computer Simulation , Databases, Genetic , Meiosis/genetics , Software

10.

A decomposition theory for phylogenetic networks and incompatible characters.

Gusfield, Dan; Bansal, Vikas; Bafna, Vineet; Song, Yun S.

J Comput Biol ; 14(10): 1247-72, 2007 Dec.

Article in English | MEDLINE | ID: mdl-18047426

ABSTRACT

Phylogenetic networks are models of evolution that go beyond trees, incorporating non-tree-like biological events such as recombination (or more generally reticulation), which occur either in a single species (meiotic recombination) or between species (reticulation due to lateral gene transfer and hybrid speciation). The central algorithmic problems are to reconstruct a plausible history of mutations and non-tree-like events, or to determine the minimum number of such events needed to derive a given set of binary sequences, allowing one mutation per site. Meiotic recombination, reticulation and recurrent mutation can cause conflict or incompatibility between pairs of sites (or characters) of the input. Previously, we used "conflict graphs" and "incompatibility graphs" to compute lower bounds on the minimum number of recombination nodes needed, and to efficiently solve constrained cases of the minimization problem. Those results exposed the structural and algorithmic importance of the non-trivial connected components of those two graphs. In this paper, we more fully develop the structural importance of non-trivial connected components of the incompatibility and conflict graphs, proving a general decomposition theorem (Gusfield and Bansal, 2005) for phylogenetic networks. The decomposition theorem depends only on the incompatibilities in the input sequences, and hence applies to many types of phylogenetic networks, and to any biological phenomena that causes pairwise incompatibilities. More generally, the proof of the decomposition theorem exposes a maximal embedded tree structure that exists in the network when the sequences cannot be derived on a perfect phylogenetic tree. This extends the theory of perfect phylogeny in a natural and important way. The proof is constructive and leads to a polynomial-time algorithm to find the unique underlying maximal tree structure. We next examine and fully solve the major open question from Gusfield and Bansal (2005): Is it true that for every input there must be a fully decomposed phylogenetic network that minimizes the number of recombination nodes used, over all phylogenetic networks for the input. We previously conjectured that the answer is yes. In this paper, we show that the answer in is no, both for the case that only single-crossover recombination is allowed, and also for the case that unbounded multiple-crossover recombination is allowed. The latter case also resolves a conjecture recently stated in (Huson and Klopper, 2007) in the context of reticulation networks. Although the conjecture from Gusfield and Bansal (2005) is disproved in general, we show that the answer to the conjecture is yes in several natural special cases, and establish necessary combinatorial structure that counterexamples to the conjecture must possess. We also show that counterexamples to the conjecture are rare (for the case of single-crossover recombination) in simulated data.

Subject(s)

Models, Genetic , Phylogeny , Algorithms , Databases, Genetic

11.

Efficient computation of minimum recombination with genotypes (not haplotypes).

Wu, Yufeng; Gusfield, Dan.

J Bioinform Comput Biol ; 5(2a): 181-200, 2007 Apr.

Article in English | MEDLINE | ID: mdl-17589959

ABSTRACT

A current major focus in genomics is the large-scale collection of genotype data in populations in order to detect variations in the population. The variation data are sought in order to address fundamental and applied questions in genetics that concern the haplotypes in the population. Since, almost all the collected data is in the form of genotypes, but the downstream genetics questions concern haplotypes, the standard approach to this issue has been to try to first infer haplotypes from the genotypes, and then answer the downstream questions using the inferred haplotypes. That two-stage approach has potential deficiencies, giving rise to the general question of how well one can answer the downstream questions using genotype data without first inferring haplotypes, and giving rise to the goal of computing the range of downstream answers that would be obtained over the range of possible inferred haplotype solutions. This paper provides some tools for the study of those issues, and some partial answers. We present algorithms to solve downstream questions concerning the minimum amount of recombination needed to derive given genotypic data, without first fixing a choice of haplotypes. We apply these algorithms to the goal of finding recombination hotspots, obtaining as good results as a published method that first infers haplotypes; and to the case of estimating the minimum amount of recombination needed to derive the true haplotypes underlying the genotypic data, obtaining weaker results compared to first inferring haplotypes using the program PHASE.

Subject(s)

Chromosome Mapping/methods , Genotype , Haplotypes/genetics , Models, Genetic , Population Dynamics , Recombination, Genetic/genetics , Sequence Analysis, DNA/methods , Computing Methodologies , Evolution, Molecular , Genetic Variation/genetics , Polymorphism, Single Nucleotide/genetics , Software

12.

A linear-time algorithm for the perfect phylogeny haplotyping (PPH) problem.

Ding, Zhihong; Filkov, Vladimir; Gusfield, Dan.

J Comput Biol ; 13(2): 522-53, 2006 Mar.

Article in English | MEDLINE | ID: mdl-16597255

ABSTRACT

Since the introduction of the Perfect Phylogeny Haplotyping (PPH) Problem in RECOMB 2002 (Gusfield, 2002), the problem of finding a linear-time (deterministic, worst-case) solution for it has remained open, despite broad interest in the PPH problem and a series of papers on various aspects of it. In this paper, we solve the open problem, giving a practical, deterministic linear-time algorithm based on a simple data structure and simple operations on it. The method is straightforward to program and has been fully implemented. Simulations show that it is much faster in practice than prior nonlinear methods. The value of a linear-time solution to the PPH problem is partly conceptual and partly for use in the inner loop of algorithms for more complex problems, where the PPH problem must be solved repeatedly.

Subject(s)

Algorithms , Computational Biology , Haplotypes , Phylogeny , Data Interpretation, Statistical , Genetic Predisposition to Disease , Humans

13.

Efficient computation of minimum recombination with genotypes (not haplotypes).

Wu, Yufeng; Gusfield, Dan.

Comput Syst Bioinformatics Conf ; : 145-56, 2006.

Article in English | MEDLINE | ID: mdl-17369633

ABSTRACT

A current major focus in genomics is the large-scale collection of genotype data in populations in order to detect variations in the population. The variation data are sought in order to address fundamental and applied questions in genetics that concern the haplotypes in the population. Since almost all the collected data is in the form of genotypes, but the downstream genetics questions concern haplotypes, the standard approach to this issue has been to try to first infer haplotypes from the genotypes, and then answer the downstream questions using the inferred haplotypes. That two-stage approach has potential deficiencies, giving rise to the general question of how well one can answer the downstream questions using genotype data without first inferring haplotypes, and also giving rise to the goal of computing the range of downstream answers that would be obtained over the range of possible inferred haplotype solutions. This paper provides some tools for the study of those issues, and some partial answers. We present algorithms to solve downstream questions concerning the minimum amount of recombination needed to derive given genotypic data, without first fixing a choice of haplotypes. We apply these algorithms to the goal of finding recombination hotspots, obtaining as good results as a published method that first infers haplotypes; and to the case of estimating the minimum amount of recombination needed to derive the true haplotypes underlying the genotypic data, obtaining weaker results compared to first inferring haplotypes using the program PHASE. Hence our tools allow an initial study of the two-stage versus one-stage issue, in the context of specific downstream questions, but our experiments certainly do not fully resolve the issue.

Subject(s)

Computational Biology/methods , Genotype , Haplotypes , Recombination, Genetic , Algorithms , Evolution, Molecular , Models, Genetic , Models, Statistical , Models, Theoretical

14.

Efficient computation of close lower and upper bounds on the minimum number of recombinations in biological sequence evolution.

Song, Yun S; Wu, Yufeng; Gusfield, Dan.

Bioinformatics ; 21 Suppl 1: i413-22, 2005 Jun.

Article in English | MEDLINE | ID: mdl-15961486

ABSTRACT

MOTIVATION: We are interested in studying the evolution of DNA single nucleotide polymorphism sequences which have undergone (meiotic) recombination. For a given set of sequences, computing the minimum number of recombinations needed to explain the sequences (with one mutation per site) is a standard question of interest, but it has been shown to be NP-hard, and previous algorithms that compute it exactly work either only on very small datasets or on problems with special structure. RESULTS: In this paper, we present efficient, practical methods for computing both upper and lower bounds on the minimum number of needed recombinations, and for constructing evolutionary histories that explain the input sequences. We study in detail the efficiency and accuracy of these algorithms on both simulated and real data sets. The algorithms produce very close upper and lower bounds, which match exactly in a surprisingly wide range of data. Thus, with the use of new, very effective lower bounding methods and an efficient algorithm for computing upper bounds, this approach allows the efficient, exact computation of the minimum number of needed recombinations, with high frequency in a large range of data. When upper and lower bounds match, evolutionary histories found by our algorithm correspond to the most parsimonious histories. AVAILABILITY: HapBound and SHRUB, programs implementing the new algorithms discussed in this paper, are available at http://wwwcsif.cs.ucdavis.edu/~gusfield/lu.html

Subject(s)

Computational Biology/methods , Algorithms , DNA/chemistry , Databases, Protein , Evolution, Molecular , Haplotypes , Humans , Polymorphism, Single Nucleotide , Recombination, Genetic , Sequence Alignment , Sequence Analysis/methods , Software

15.

Optimal, efficient reconstruction of phylogenetic networks with constrained recombination.

Gusfield, Dan; Eddhu, Satish; Langley, Charles.

J Bioinform Comput Biol ; 2(1): 173-213, 2004 Mar.

Article in English | MEDLINE | ID: mdl-15272438

ABSTRACT

A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. In a seminal paper, Wang et al.(1) studied the problem of constructing a phylogenetic network, allowing recombination between sequences, with the constraint that the resulting cycles must be disjoint. We call such a phylogenetic network a "galled-tree". They gave a polynomial-time algorithm that was intended to determine whether or not a set of sequences could be generated on galled-tree. Unfortunately, the algorithm by Wang et al.(1) is incomplete and does not constitute a necessary test for the existence of a galled-tree for the data. In this paper, we completely solve the problem. Moreover, we prove that if there is a galled-tree, then the one produced by our algorithm minimizes the number of recombinations over all phylogenetic networks for the data, even allowing multiple-crossover recombinations. We also prove that when there is a galled-tree for the data, the galled-tree minimizing the number of recombinations is "essentially unique". We also note two additional results: first, any set of sequences that can be derived on a galled tree can be derived on a true tree (without recombination cycles), where at most one back mutation per site is allowed; second, the site compatibility problem (which is NP-hard in general) can be solved in polynomial time for any set of sequences that can be derived on a galled tree. Perhaps more important than the specific results about galled-trees, we introduce an approach that can be used to study recombination in general phylogenetic networks. This paper greatly extends the conference version that appears in an earlier work.(8) PowerPoint slides of the conference talk can be found at our website.(7).

Subject(s)

Algorithms , Gene Expression Profiling/methods , Models, Genetic , Phylogeny , Recombination, Genetic/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Genetic Variation , Polymorphism, Single Nucleotide/genetics

16.

A note on efficient computation of haplotypes via perfect phylogeny.

Bafna, Vineet; Gusfield, Dan; Hannenhalli, Sridhar; Yooseph, Shibu.

J Comput Biol ; 11(5): 858-66, 2004.

Article in English | MEDLINE | ID: mdl-15700406

ABSTRACT

The problem of inferring haplotype phase from a population of genotypes has received a lot of attention recently. This is partly due to the observation that there are many regions on human genomic DNA where genetic recombination is rare (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). A Haplotype Map project has been announced by NIH to identify and characterize populations in terms of these haplotypes. Recently, Gusfield introduced the perfect phylogeny haplotyping problem, as an algorithmic implication of the no-recombination in long blocks observation, together with the standard population-genetic assumption of infinite sites. Gusfield's solution based on matroid theory was followed by direct theta(nm2) solutions that use simpler techniques (Bafna et al., 2003; Eskin et al., 2003), and also bound the number of solutions to the PPH problem. In this short note, we address two questions that were left open. First, can the algorithms of Bafna et al. (2003) and Eskin et al. (2003) be sped-up to O(nm + m2) time, which would imply an O(nm) time-bound for the PPH problem? Second, if there are multiple solutions, can we find one that is most parsimonious in terms of the number of distinct haplotypes. We give reductions that suggests that the answer to both questions is "no." For the first problem, we show that computing the output of the first step (in either method) is equivalent to Boolean matrix multiplication. Therefore, the best bound we can presently achieve is O(nm(omega-1)), where omega < or = 2.52 is the exponent of matrix multiplication. Thus, any linear time solution to the PPH problem likely requires a different approach. For the second problem of computing a PPH solution that minimizes the number of distinct haplotypes, we show that the problem is NP-hard using a reduction from Vertex Cover (Garey and Johnson, 1979).

Subject(s)

Computational Biology , Haplotypes/genetics , Phylogeny , Data Interpretation, Statistical , Genotype

17.

On the complexity of fundamental computational problems in pedigree analysis.

Piccolboni, Antonio; Gusfield, Dan.

J Comput Biol ; 10(5): 763-73, 2003.

Article in English | MEDLINE | ID: mdl-14633398

ABSTRACT

Pedigree analysis is a central component of many current efforts to locate genes that contribute to diseases or to valuable traits. The analysis usually involves solving one of two very computation-intense problems. We analyze the complexity of these two problems. Surprisingly, we show that both problems are NP-hard even for pedigrees that contain no inbreeding loops.

Subject(s)

Pedigree , Computational Biology/methods , Models, Genetic , Probability

18.

Haplotyping as perfect phylogeny: a direct approach.

Bafna, Vineet; Gusfield, Dan; Lancia, Giuseppe; Yooseph, Shibu.

J Comput Biol ; 10(3-4): 323-40, 2003.

Article in English | MEDLINE | ID: mdl-12935331

ABSTRACT

A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). In this paper we explore the algorithmic implications of the no-recombination in long blocks observation, for the problem of inferring haplotypes in populations. This assumption, together with the standard population-genetic assumption of infinite sites, motivates a model of haplotype evolution where the haplotypes in a population are assumed to evolve along a coalescent, which as a rooted tree is a perfect phylogeny. We consider the following algorithmic problem, called the perfect phylogeny haplotyping problem (PPH), which was introduced by Gusfield (2002) - given n genotypes of length m each, does there exist a set of at most 2n haplotypes such that each genotype is generated by a pair of haplotypes from this set, and such that this set can be derived on a perfect phylogeny? The approach taken by Gusfield (2002) to solve this problem reduces it to established, deep results and algorithms from matroid and graph theory. Although that reduction is quite simple and the resulting algorithm nearly optimal in speed, taken as a whole that approach is quite involved, and in particular, challenging to program. Moreover, anyone wishing to fully establish, by reading existing literature, the correctness of the entire algorithm would need to read several deep and difficult papers in graph and matroid theory. However, as stated by Gusfield (2002), many simplifications are possible and the list of "future work" in Gusfield (2002) began with the task of developing a simpler, more direct, yet still efficient algorithm. This paper accomplishes that goal, for both the rooted and unrooted PPH problems. It establishes a simple, easy-to-program, O(nm(2))-time algorithm that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all of the solutions. The approach allows complete, self-contained proofs. In addition to algorithmic simplicity, the approach here makes the representation of all solutions more intuitive than in Gusfield (2002), and solves another goal from that paper, namely, to prove a nontrivial upper bound on the number of PPH solutions, showing that that number is vastly smaller than the number of haplotype solutions (each solution being a set of n pairs of haplotypes that can generate the genotypes) when the perfect phylogeny requirement is not imposed.

Subject(s)

Haplotypes , Phylogeny , Algorithms , Data Interpretation, Statistical , Genetic Predisposition to Disease

19.

Perfect phylogeny haplotyper: haplotype inferral using a tree model.

Chung, Ren Hua; Gusfield, Dan.

Bioinformatics ; 19(6): 780-1, 2003 Apr 12.

Article in English | MEDLINE | ID: mdl-12691994

ABSTRACT

SUMMARY: We have developed an efficient program, the Perfect Phylogeny Haplotyper (PPH) that takes in unphased population genotype data, and determines if that data can be explained by haplotype pairs that could have evolved on a perfect phylogeny.

Subject(s)

Algorithms , Gene Expression Profiling/methods , Genetic Variation/genetics , Haplotypes/genetics , Models, Genetic , Phylogeny , Software , Internet

20.

Efficient reconstruction of phylogenetic networks with constrained recombination.

Gusfield, Dan; Eddhu, Satish; Langley, Charles.

Proc IEEE Comput Soc Bioinform Conf ; 2: 363-74, 2003.

Article in English | MEDLINE | ID: mdl-16452812

ABSTRACT

A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. With the growth of genomic data, much of which does not fit ideal tree models, there is greater need to understand the algorithmics and combinatorics of phylogenetic networks [10, 11]. However, to date, very little has been published on this, with the notable exception of the paper by Wang et al.[12]. Other related papers include [4, 5, 7] We consider the problem introduced in [12], of determining whether the sequences can be derived on a phylogenetic network where the recombination cycles are node disjoint. In this paper, we call such a phylogenetic network a "galled-tree". By more deeply analysing the combinatorial constraints on cycle-disjoint phylogenetic networks, we obtain an efficient algorithm that is guaranteed to be both a necessary and sufficient test for the existence of a galled-tree for the data. If there is a galled-tree, the algorithm constructs one and obtains an implicit representation of all the galled trees for the data, and can create these in linear time for each one. We also note two additional results related to galled trees: first, any set of sequences that can be derived on a galled tree can be derived on a true tree (without recombination cycles), where at most one back mutation is allowed per site; second, the site compatibility problem (which is NP-hard in general) can be solved in linear time for any set of sequences that can be derived on a galled tree. The combinatorial constraints we develop apply (for the most part) to node-disjoint cycles in any phylogenetic network (not just galled-trees), and can be used for example to prove that a given site cannot be on a node-disjoint cycle in any phylogenetic network. Perhaps more important than the specific results about galled-trees, we introduce an approach that can be used to study recombination in phylogenetic networks that go beyond galled-trees.

Subject(s)

Models, Genetic , Phylogeny , Recombination, Genetic/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Signal Transduction/genetics , Computer Simulation

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL