Search | VHL Regional Portal

Heterogeneous Compression of Large Collections of Evolutionary Trees.

Matthews, Suzanne J.

IEEE/ACM Trans Comput Biol Bioinform ; 12(4): 807-14, 2015.

Article in English | MEDLINE | ID: mdl-26357320

ABSTRACT

Compressing heterogeneous collections of trees is an open problem in computational phylogenetics. In a heterogeneous tree collection, each tree can contain a unique set of taxa. An ideal compression method would allow for the efficient archival of large tree collections and enable scientists to identify common evolutionary relationships over disparate analyses. In this paper, we extend TreeZip to compress heterogeneous collections of trees. TreeZip is the most efficient algorithm for compressing homogeneous tree collections. To the best of our knowledge, no other domain-based compression algorithm exists for large heterogeneous tree collections or enables their rapid analysis. Our experimental results indicate that TreeZip averages 89.03 percent (72.69 percent) space savings on unweighted (weighted) collections of trees when the level of heterogeneity in a collection is moderate. The organization of the TRZ file allows for efficient computations over heterogeneous data. For example, consensus trees can be computed in mere seconds. Lastly, combining the TreeZip compressed (TRZ) file with general-purpose compression yields average space savings of 97.34 percent (81.43 percent) on unweighted (weighted) collections of trees. Our results lead us to believe that TreeZip will prove invaluable in the efficient archival of tree collections, and will enable scientists to develop novel methods for relating heterogeneous collections of trees.

Subject(s)

Algorithms , Biological Evolution , Phylogeny , Computational Biology , Data Compression
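
The space-savings percentages quoted in the abstract above can be read as the fraction of the original file size that compression removes. The sketch below assumes the usual definition, (1 - compressed/original) × 100, which the abstract itself does not spell out. The helper names `space_savings` and `savings_with_lzma` are hypothetical, and Python's `lzma` module is only a stand-in for "general-purpose compression"; none of this is TreeZip's own code or the TRZ format.

```python
import lzma


def space_savings(original_size, compressed_size):
    """Percent of space saved, assuming (1 - compressed/original) * 100."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return (1.0 - compressed_size / original_size) * 100.0


def savings_with_lzma(text):
    """Space savings from a general-purpose compressor (lzma as a stand-in)."""
    raw = text.encode("utf-8")
    return space_savings(len(raw), len(lzma.compress(raw)))


# A toy, highly repetitive Newick collection (illustrative data only).
newick = "((A,B),C,(D,E));\n" * 1000
print(round(savings_with_lzma(newick), 2))
```

Real tree collections are far less repetitive at the string level than this toy input, so the printed figure says nothing about the results reported in the paper.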

GeoFold: topology-based protein unfolding pathways capture the effects of engineered disulfides on kinetic stability.

Ramakrishnan, Vibin; Srinivasan, Sai Praveen; Salem, Saeed M; Matthews, Suzanne J; Colón, Wilfredo; Zaki, Mohammed; Bystroff, Christopher.

Proteins ; 80(3): 920-34, 2012 Mar.

Article in English | MEDLINE | ID: mdl-22189917

ABSTRACT

Protein unfolding is modeled as an ensemble of pathways, where each step in each pathway is the addition of one topologically possible conformational degree of freedom. Starting with a known protein structure, GeoFold hierarchically partitions (cuts) the native structure into substructures using revolute joints and translations. The energy of each cut and its activation barrier are calculated using buried solvent accessible surface area, side chain entropy, hydrogen bonding, buried cavities, and backbone degrees of freedom. A directed acyclic graph is constructed from the cuts, representing a network of simultaneous equilibria. Finite difference simulations on this graph simulate native unfolding pathways. Experimentally observed changes in the unfolding rates for disulfide mutants of barnase, T4 lysozyme, dihydrofolate reductase, and factor for inversion stimulation were qualitatively reproduced in these simulations. Detailed unfolding pathways for each case explain the effects of changes in the chain topology on the folding energy landscape. GeoFold is a useful tool for the inference of the effects of disulfide engineering on the energy landscape of protein unfolding.

Subject(s)

Disulfides/chemistry , Protein Unfolding , Proteins/chemistry , Software , Bacillus/enzymology , Bacillus/genetics , Bacterial Proteins , Bacteriophage T4/enzymology , Bacteriophage T4/genetics , Entropy , Escherichia coli/enzymology , Escherichia coli/genetics , Kinetics , Models, Molecular , Muramidase/chemistry , Muramidase/genetics , Mutation , Protein Conformation , Protein Stability , Proteins/genetics , Ribonucleases/chemistry , Ribonucleases/genetics , Tetrahydrofolate Dehydrogenase/chemistry , Tetrahydrofolate Dehydrogenase/genetics

An efficient and extensible approach for compressing phylogenetic trees.

Matthews, Suzanne J; Williams, Tiffani L.

BMC Bioinformatics ; 12 Suppl 10: S16, 2011 Oct 18.

Article in English | MEDLINE | ID: mdl-22165819

ABSTRACT

BACKGROUND: Biologists require new algorithms to efficiently compress and store their large collections of phylogenetic trees. Our previous work showed that TreeZip is a promising approach for compressing phylogenetic trees. In this paper, we extend our TreeZip algorithm by handling trees with weighted branches. Furthermore, by using the compressed TreeZip file as input, we have designed an extensible decompressor that can extract subcollections of trees, compute majority and strict consensus trees, and merge tree collections using set operations such as union, intersection, and set difference. RESULTS: On unweighted phylogenetic trees, TreeZip is able to compress Newick files by more than 98%. On weighted phylogenetic trees, TreeZip is able to compress a Newick file by at least 73%. TreeZip can be combined with 7zip with little overhead, allowing space savings in excess of 99% (unweighted) and 92% (weighted). Unlike TreeZip, 7zip is not immune to branch rotations, and performs worse as the level of variability in the Newick string representation increases. Finally, since the TreeZip compressed text (TRZ) file contains all the semantic information in a collection of trees, we can easily filter and decompress a subset of trees of interest (such as the set of unique trees), or build the resulting consensus tree in a matter of seconds. We also show the ease with which set operations can be performed on TRZ files, at speeds quicker than those performed on Newick or 7zip compressed Newick files, and without loss of space savings. CONCLUSIONS: TreeZip is an efficient approach for compressing large collections of phylogenetic trees. The semantic and compact nature of the TRZ file allows it to be operated upon directly and quickly, without a need to decompress the original Newick file. We believe that TreeZip will be vital for compressing and archiving trees in the biological community.

Subject(s)

Algorithms , Classification/methods , Phylogeny , Animals , Humans , Software
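
The consensus and set operations named in the abstract above can be illustrated on a simplified representation. The sketch below assumes each tree has already been reduced to a frozenset of its non-trivial bipartitions, each written as a frozenset of taxon names. That encoding, the function `consensus_splits`, and the toy data are assumptions made for illustration; they are not the TRZ file layout or TreeZip's decompressor.

```python
from collections import Counter


def consensus_splits(trees, strict=False):
    """Return the bipartitions kept by a strict or majority consensus.

    strict=True keeps splits present in every tree; otherwise splits
    present in more than half of the trees are kept.
    """
    counts = Counter(split for tree in trees for split in tree)
    n = len(trees)
    if strict:
        return {s for s, c in counts.items() if c == n}
    return {s for s, c in counts.items() if c > n / 2}


# Toy splits over taxa A..E (illustrative data only).
DE = frozenset({"D", "E"})
CDE = frozenset({"C", "D", "E"})
BDE = frozenset({"B", "D", "E"})
BC = frozenset({"B", "C"})

t1 = frozenset({DE, CDE})
t2 = frozenset({DE, BDE})
t3 = frozenset({DE, CDE})   # same topology as t1
t4 = frozenset({BC, BDE})

majority = consensus_splits([t1, t2, t3])
strict = consensus_splits([t1, t2, t3], strict=True)

# Collections of trees behave like ordinary sets under this encoding.
unique_trees = set([t1, t2, t3])
collection_a = {t1, t2}
collection_b = {t2, t4}
shared = collection_a & collection_b
merged = collection_a | collection_b
only_a = collection_a - collection_b
```

Because a tree is identified by its set of splits, two Newick strings that differ only by branch rotation map to the same element, which is why the unique-tree filter and the set operations reduce to plain set arithmetic here.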

MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees.

Matthews, Suzanne J; Williams, Tiffani L.

BMC Bioinformatics ; 11 Suppl 1: S15, 2010 Jan 18.

Article in English | MEDLINE | ID: mdl-20122186

ABSTRACT

BACKGROUND: MapReduce is a parallel framework that has been used effectively to design large-scale parallel applications for large computing clusters. In this paper, we evaluate the viability of the MapReduce framework for designing phylogenetic applications. The problem of interest is generating the all-to-all Robinson-Foulds distance matrix, which has many applications for visualizing and clustering large collections of evolutionary trees. We introduce MrsRF (MapReduce Speeds up RF), a multi-core algorithm to generate a t × t Robinson-Foulds distance matrix between t trees using the MapReduce paradigm. RESULTS: We studied the performance of our MrsRF algorithm on two large biological tree sets consisting of 20,000 trees of 150 taxa each and 33,306 trees of 567 taxa each. Our experiments show that MrsRF is a scalable approach reaching a speedup of over 18 on 32 total cores. Our results also show that achieving top speedup on a multi-core cluster requires different cluster configurations. Finally, we show how to use an RF matrix to summarize collections of phylogenetic trees visually. CONCLUSION: Our results show that MapReduce is a promising paradigm for developing multi-core phylogenetic applications. The results also demonstrate that different multi-core configurations must be tested in order to obtain optimum performance. We conclude that RF matrices play a critical role in developing techniques to summarize large collections of trees.

Subject(s)

Algorithms , Evolution, Molecular , Phylogeny , Databases, Genetic , Software
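
For readers unfamiliar with the t × t Robinson-Foulds matrix described in the abstract above, a plain sequential sketch may help. It is not the MrsRF algorithm and uses no MapReduce. It assumes trees over the same taxa given as nested tuples of taxon names, and it reports the size of the symmetric difference between bipartition sets (some authors halve this value). The helpers `bipartitions` and `rf_matrix` are hypothetical names.

```python
from itertools import combinations


def leaf_set(tree):
    """All taxon names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of a tree, each stored as the side without the anchor taxon."""
    taxa = leaf_set(tree)
    anchor = min(taxa)          # fixed reference taxon so each split has one canonical form
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        if 1 < len(clade) < len(taxa) - 1:
            splits.add(taxa - clade if anchor in clade else clade)
        return clade

    visit(tree)
    return splits


def rf_matrix(trees):
    """Symmetric matrix of pairwise RF distances (symmetric-difference counts)."""
    splits = [bipartitions(t) for t in trees]
    n = len(trees)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = len(splits[i] ^ splits[j])
    return matrix


t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
t3 = (("E", "D"), "C", ("B", "A"))   # t1 with branches rotated
print(rf_matrix([t1, t2, t3]))
# → [[0, 2, 0], [2, 0, 2], [0, 2, 0]]
```

The pairwise loop does t(t-1)/2 set comparisons, which is the work a parallel approach would distribute across cores for collections the size of those in the abstract.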
