1.
bioRxiv ; 2024 May 29.
Article in English | MEDLINE | ID: mdl-38853926

ABSTRACT

All eukaryotes share a common ancestor from roughly 1.5 - 1.8 billion years ago, a single-celled, swimming microbe known as LECA, the Last Eukaryotic Common Ancestor. Nearly half of the genes in modern eukaryotes were present in LECA, and many current genetic diseases and traits stem from these ancient molecular systems. To better understand these systems, we compared genes across modern organisms and identified a core set of 10,092 shared protein-coding gene families likely present in LECA, a quarter of which are uncharacterized. We then integrated >26,000 mass spectrometry proteomics analyses from 31 species to infer how these proteins interact in higher-order complexes. The resulting interactome describes the biochemical organization of LECA, revealing both known and new assemblies. We analyzed these ancient protein interactions to find new human gene-disease relationships for bone density and congenital birth defects, demonstrating the value of ancestral protein interactions for guiding functional genetics today.

2.
J Cell Biol ; 222(7)2023 07 03.
Article in English | MEDLINE | ID: mdl-37102997

ABSTRACT

Homotypic membrane fusion catalyzed by the atlastin (ATL) GTPase sustains the branched endoplasmic reticulum (ER) network in metazoans. Our recent discovery that two of the three human ATL paralogs (ATL1/2) are C-terminally autoinhibited implied that relief of autoinhibition would be integral to the ATL fusion mechanism. An alternative hypothesis is that the third paralog ATL3 promotes constitutive ER fusion with relief of ATL1/2 autoinhibition used conditionally. However, published studies suggest ATL3 is a weak fusogen at best. Contrary to expectations, we demonstrate here that purified human ATL3 catalyzes efficient membrane fusion in vitro and is sufficient to sustain the ER network in triple knockout cells. Strikingly, ATL3 lacks any detectable C-terminal autoinhibition, like the invertebrate Drosophila ATL ortholog. Phylogenetic analysis of ATL C-termini indicates that C-terminal autoinhibition is a recent evolutionary innovation. We suggest that ATL3 is a constitutive ER fusion catalyst and that ATL1/2 autoinhibition likely evolved in vertebrates as a means of upregulating ER fusion activity on demand.


Subject(s)
GTP Phosphohydrolases , Membrane Fusion , Animals , Humans , Drosophila , GTP Phosphohydrolases/genetics , Phylogeny
3.
Bioinformatics ; 38(Suppl 1): i134-i142, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758772

ABSTRACT

MOTIVATION: Simulation is an essential technique for generating biomolecular data with a 'known' history for use in validating phylogenetic inference and other evolutionary methods. On longer time scales, simulation supports investigations of equilibrium behavior and provides a formal framework for testing competing evolutionary hypotheses. Twenty years of molecular evolution research have produced a rich repertoire of simulation methods. However, current models do not capture the stringent constraints acting on the domain insertions, duplications, and deletions by which multidomain architectures evolve. Although these processes have the potential to generate any combination of domains, only a tiny fraction of possible domain combinations are observed in nature. Modeling these stringent constraints on domain order and co-occurrence is a fundamental challenge in domain architecture simulation that does not arise with sequence and gene family simulation. RESULTS: Here, we introduce a stochastic model of domain architecture evolution to simulate evolutionary trajectories that reflect the constraints on domain order and co-occurrence observed in nature. This framework is implemented in a novel domain architecture simulator, DomArchov, using the Metropolis-Hastings algorithm with data-driven transition probabilities. The use of a data-driven event module enables quick and easy redeployment of the simulator for use in different taxonomic and protein function contexts. Using empirical evaluation with metazoan datasets, we demonstrate that domain architectures simulated by DomArchov recapitulate properties of genuine domain architectures that reflect the constraints on domain order and adjacency seen in nature. This work expands the realm of evolutionary processes that are amenable to simulation. AVAILABILITY AND IMPLEMENTATION: DomArchov is written in Python 3 and is available at http://www.cs.cmu.edu/~durand/DomArchov. 
The data underlying this article are available via the same link. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
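
As a rough illustration of the sampling idea named in this abstract, the sketch below runs a generic Metropolis-Hastings chain over toy domain architectures. It is not DomArchov's code: the domain names, the `PAIR_COUNTS` table, and the functions `log_weight` and `propose` are hypothetical stand-ins. The move set is simplified to a symmetric single-position replacement so that no Hastings correction is needed, whereas the insertions, duplications, and deletions described in the abstract would require one.

```python
import math
import random


def metropolis_hastings(initial, propose, log_weight, steps, rng):
    """Run Metropolis-Hastings with a symmetric proposal; return visited states."""
    state = initial
    visited = []
    for _ in range(steps):
        candidate = propose(state, rng)
        # Accept with probability min(1, weight(candidate) / weight(state)).
        log_ratio = log_weight(candidate) - log_weight(state)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state = candidate
        visited.append(state)
    return visited


# Hypothetical adjacency counts standing in for data-driven probabilities.
DOMAINS = ["SH3", "PDZ", "GuK"]
PAIR_COUNTS = {("PDZ", "SH3"): 40, ("SH3", "GuK"): 35, ("PDZ", "PDZ"): 20}


def log_weight(architecture):
    # Score an architecture by its adjacent domain pairs (pseudocount of 1).
    return sum(math.log(PAIR_COUNTS.get(pair, 0) + 1)
               for pair in zip(architecture, architecture[1:]))


def propose(architecture, rng):
    # Symmetric move: overwrite one position with a uniformly chosen domain.
    position = rng.randrange(len(architecture))
    changed = list(architecture)
    changed[position] = rng.choice(DOMAINS)
    return tuple(changed)


if __name__ == "__main__":
    rng = random.Random(0)
    chain = metropolis_hastings(("PDZ", "PDZ", "PDZ"), propose, log_weight, 1000, rng)
    print(chain[-1])
```

Architectures whose adjacent pairs are frequent in the count table are visited more often, which is the sense in which such a chain can reflect constraints on domain order and co-occurrence.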


Subject(s)
Evolution, Molecular , Proteins , Algorithms , Animals , Computer Simulation , Phylogeny , Proteins/genetics
4.
J Bioinform Comput Biol ; 19(6): 2140013, 2021 12.
Article in English | MEDLINE | ID: mdl-34806953

ABSTRACT

The exon shuffling theory posits that intronic recombination creates new domain combinations, facilitating the evolution of novel protein function. This theory predicts that introns will be preferentially situated near domain boundaries. Many studies have sought evidence for exon shuffling by testing the correspondence between introns and domain boundaries against chance intron positioning. Here, we present an empirical investigation of how the choice of null model influences significance. Although genome-wide studies have used a uniform null model, exclusively, more realistic null models have been proposed for single gene studies. We extended these models for genome-wide analyses and applied them to 21 metazoan and fungal genomes. Our results show that compared with the other two models, the uniform model does not recapitulate genuine exon lengths, dramatically underestimates the probability of chance agreement, and overestimates the significance of intron-domain correspondence by as much as 100 orders of magnitude. Model choice had much greater impact on the assessment of exon shuffling in fungal genomes than in metazoa, leading to different evolutionary conclusions in seven of the 16 fungal genomes tested. Genome-wide studies that use this overly permissive null model may exaggerate the importance of exon shuffling as a general mechanism of multidomain evolution.
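
To make the role of the null model concrete, here is a minimal sketch of a permutation-style test under the uniform null only, in which each intron is placed independently and uniformly along the gene. The function names, the `window` tolerance, and the coordinate convention are assumptions for illustration. The more realistic null models discussed in the abstract are not reproduced here.

```python
import random


def boundary_hits(introns, boundaries, window):
    """Number of introns within `window` positions of any domain boundary."""
    return sum(any(abs(i - b) <= window for b in boundaries) for i in introns)


def uniform_null_pvalue(introns, boundaries, gene_length, window, trials, rng):
    """Empirical p-value for intron-boundary agreement under a uniform null."""
    observed = boundary_hits(introns, boundaries, window)
    as_extreme = 0
    for _ in range(trials):
        # Uniform null: every intron position is equally likely.
        simulated = [rng.randrange(gene_length) for _ in introns]
        if boundary_hits(simulated, boundaries, window) >= observed:
            as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (as_extreme + 1) / (trials + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    print(uniform_null_pvalue([120, 300, 610], [118, 305, 600], 900, 10, 999, rng))
```

Swapping the line that draws `simulated` for a sampler that respects realistic exon lengths is where an alternative null model would plug in.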


Subject(s)
Genome-Wide Association Study , Genome , Animals , Evolution, Molecular , Exons , Introns , Proteins
5.
PLoS Genet ; 14(9): e1007470, 2018 09.
Article in English | MEDLINE | ID: mdl-30212463

ABSTRACT

The evolution of signal transduction pathways is constrained by the requirements of signal fidelity, yet flexibility is necessary to allow pathway remodeling in response to environmental challenges. A detailed understanding of how flexibility and constraint shape bacterial two component signaling systems is emerging, but how new signal transduction architectures arise remains unclear. Here, we investigate pathway remodeling using the Firmicute sporulation initiation (Spo0) pathway as a model. The present-day Spo0 pathways in Bacilli and Clostridia share common ancestry, but possess different architectures. In Clostridium acetobutylicum, sensor kinases directly phosphorylate Spo0A, the master regulator of sporulation. In Bacillus subtilis, Spo0A is activated via a four-protein phosphorelay. The current view favors an ancestral direct phosphorylation architecture, with the phosphorelay emerging in the Bacillar lineage. Our results reject this hypothesis. Our analysis of 84 broadly distributed Firmicute genomes predicts phosphorelays in numerous Clostridia, contrary to the expectation that the Spo0 phosphorelay is unique to Bacilli. Our experimental verification of a functional Spo0 phosphorelay encoded by Desulfotomaculum acetoxidans (Class Clostridia) further supports functional phosphorelays in Clostridia, which strongly suggests that the ancestral Spo0 pathway was a phosphorelay. Cross complementation assays between Bacillar and Clostridial phosphorelays demonstrate conservation of interaction specificity since their divergence over 2.7 BYA. Further, the distribution of direct phosphorylation Spo0 pathways is patchy, suggesting multiple, independent instances of remodeling from phosphorelay to direct phosphorylation. We provide evidence that these transitions are likely the result of changes in sporulation kinase specificity or acquisition of a sensor kinase with specificity for Spo0A, which is remarkably conserved in both architectures. 
We conclude that flexible encoding of interaction specificity, a phenotype that is only intermittently essential, and the recruitment of kinases to recognize novel environmental signals resulted in a consistent and repeated pattern of remodeling of the Spo0 pathway.


Subject(s)
Bacterial Proteins/genetics , Evolution, Molecular , Firmicutes/physiology , Signal Transduction/genetics , Spores, Bacterial/metabolism , Bacterial Proteins/metabolism , Gene Expression Regulation, Bacterial , Histidine Kinase/genetics , Histidine Kinase/metabolism , Phosphorylation/physiology , Phylogeny
6.
Nucleic Acids Res ; 46(5): 2265-2278, 2018 03 16.
Article in English | MEDLINE | ID: mdl-29432573

ABSTRACT

Highly Iterated Palindrome 1 (HIP1, GCGATCGC) is hyper-abundant in most cyanobacterial genomes. In some cyanobacteria, average HIP1 abundance exceeds one motif per gene. Such high abundance suggests a significant role in cyanobacterial biology. However, 20 years of study have not revealed whether HIP1 has a function, much less what that function might be. We show that HIP1 is 15- to 300-fold over-represented in genomes analyzed. More importantly, HIP1 sites are conserved both within and between open reading frames, suggesting that their overabundance is maintained by selection rather than by continual replenishment by neutral processes, such as biased DNA repair. This evidence for selection suggests a functional role for HIP1. No evidence was found to support a functional role as a peptide or RNA motif or a role in the regulation of gene expression. Rather, we demonstrate that the distribution of HIP1 along cyanobacterial chromosomes is significantly periodic, with periods ranging from 10 to 90 kb, consistent in scale with periodicities reported for co-regulated, co-expressed and evolutionarily correlated genes. The periodicity we observe is also comparable in scale to chromosomal interaction domains previously described in other bacteria. In this context, our findings imply HIP1 functions associated with chromosome and nucleoid structure.
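
A fold over-representation of the kind reported above can be illustrated by comparing the observed motif count with the count expected by chance. The sketch below assumes a zero-order background (independent bases at the sequence's own frequencies). That background is an assumption made here for simplicity, and the abstract does not say which model the authors used. Because GCGATCGC equals its own reverse complement, counting one strand covers both.

```python
def count_motif(seq, motif):
    """Count occurrences of `motif` in `seq`, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def expected_count(seq, motif):
    """Expected occurrences if bases were independent (zero-order model)."""
    n = len(seq)
    freqs = {base: seq.count(base) / n for base in "ACGT"}
    probability = 1.0
    for base in motif:
        probability *= freqs[base]
    # One possible start position per window of the motif's length.
    return probability * (n - len(motif) + 1)


def fold_enrichment(seq, motif="GCGATCGC"):
    """Observed / expected motif count; assumes the expected count is nonzero."""
    return count_motif(seq, motif) / expected_count(seq, motif)
```

On a real genome, `seq` would be the full chromosome sequence in upper case; a higher-order background model would typically give a different expected count.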


Subject(s)
Bacterial Proteins/genetics , Cyanobacteria/genetics , Genome, Bacterial/genetics , Selection, Genetic , Bacterial Proteins/metabolism , Base Sequence , Chromosomes, Bacterial/genetics , Cyanobacteria/classification , Cyanobacteria/metabolism , DNA, Bacterial/genetics , Gene Expression Regulation, Bacterial , Periodicity , Phylogeny
7.
mSphere ; 2(5)2017.
Article in English | MEDLINE | ID: mdl-29085912

ABSTRACT

Streptococcus pneumoniae (pneumococcus) displays broad tissue tropism and infects multiple body sites in the human host. However, infections of the conjunctiva are limited to strains within a distinct phyletic group with multilocus sequence types ST448, ST344, ST1186, ST1270, and ST2315. In this study, we sequenced the genomes of six pneumococcal strains isolated from eye infections. The conjunctivitis isolates are grouped in a distinct phyletic group together with a subset of nasopharyngeal isolates. The keratitis (infection of the cornea) and endophthalmitis (infection of the vitreous body) isolates are grouped with the remainder of pneumococcal strains. Phenotypic characterization is consistent with morphological differences associated with the distinct phyletic group. Specifically, isolates from the distinct phyletic group form aggregates in planktonic cultures and chain-like structures in biofilms grown on abiotic surfaces. To begin to investigate the association between genotype and epidemiology, we focused on a predicted surface-exposed adhesin (SspB) encoded exclusively by this distinct phyletic group. Phylogenetic analysis of the gene encoding SspB in the context of a streptococcal species tree suggests that sspB was acquired by lateral gene transfer from Streptococcus suis. Furthermore, an sspB deletion mutant displays decreased adherence to cultured cells from the ocular epithelium compared to the isogenic wild-type and complemented strains. Together these findings suggest that acquisition of genes from outside the species has contributed to pneumococcal tissue tropism by enhancing the ability of a subset of strains to infect the ocular epithelium causing conjunctivitis. IMPORTANCE: Changes in the gene content of pathogens can modify their ability to colonize and/or survive in different body sites in the human host. In this study, we investigate a gene acquisition event and its role in the pathogenesis of Streptococcus pneumoniae (pneumococcus). 
Our findings suggest that the gene encoding the predicted surface protein SspB has been transferred from Streptococcus suis (a distantly related streptococcal species) into a distinct set of pneumococcal strains. This group of strains distinguishes itself from the remainder of pneumococcal strains by extensive differences in genomic composition and by the ability to cause conjunctivitis. We find that the presence of sspB increases adherence of pneumococcus to the ocular epithelium. Thus, our data support the hypothesis that a subset of pneumococcal strains has gained genes from neighboring species that enhance their ability to colonize the epithelium of the eye, thus expanding into a new niche.

8.
Bioinformatics ; 33(5): 640-649, 2017 03 01.
Article in English | MEDLINE | ID: mdl-27998934

ABSTRACT

Motivation: Orthology analysis is a fundamental tool in comparative genomics. Sophisticated methods have been developed to distinguish between orthologs and paralogs and to classify paralogs into subtypes depending on the duplication mechanism and timing, relative to speciation. However, no comparable framework exists for xenologs: gene pairs whose history, since their divergence, includes a horizontal transfer. Further, the diversity of gene pairs that meet this broad definition calls for classification of xenologs with similar properties into subtypes. Results: We present a xenolog classification that uses phylogenetic reconciliation to assign each pair of genes to a class based on the event responsible for their divergence and the historical association between genes and species. Our classes distinguish between genes related through transfer alone and genes related through duplication and transfer. Further, they separate closely-related genes in distantly-related species from distantly-related genes in closely-related species. We present formal rules that assign gene pairs to specific xenolog classes, given a reconciled gene tree with an arbitrary number of duplications and transfers. These xenology classification rules have been implemented in software and tested on a collection of ∼13 000 prokaryotic gene families. In addition, we present a case study demonstrating the connection between xenolog classification and gene function prediction. Availability and Implementation: The xenolog classification rules have been implemented in Notung 2.9, a freely available phylogenetic reconciliation software package: http://www.cs.cmu.edu/~durand/Notung. Gene trees are available at http://dx.doi.org/10.7488/ds/1503. Contact: durand@cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genes, Bacterial , Genomics/methods , Phylogeny , Software , Algorithms , Bacteria/genetics , Evolution, Molecular , Sequence Homology, Nucleic Acid
9.
BMC Bioinformatics ; 16 Suppl 14: S8, 2015.
Article in English | MEDLINE | ID: mdl-26451642

ABSTRACT

BACKGROUND: Reconstructing evolution provides valuable insights into the processes of gene evolution and function. However, while there have been great advances in algorithms and software to reconstruct the history of gene families, these tools do not model the domain shuffling events (domain duplication, insertion, transfer, and deletion) that drive the evolution of multidomain protein families. Protein evolution through domain shuffling events allows for rapid exploration of functions by introducing new combinations of existing folds. This powerful mechanism was key to some significant evolutionary innovations, such as multicellularity and the vertebrate immune system. A method for reconstructing this important evolutionary process is urgently needed. RESULTS: Here, we introduce a novel, event-based framework for studying multidomain evolution by reconciling a domain tree with a gene tree, with additional information provided by the species tree. In the context of this framework, we present the first reconciliation algorithms to infer domain shuffling events, while addressing the challenges inherent in the inference of evolution across three levels of organization. CONCLUSIONS: We apply these methods to the evolution of domains in the Membrane associated Guanylate Kinase family. These case studies reveal a more vivid and detailed evolutionary history than previously provided. Our algorithms have been implemented in software, freely available at http://www.cs.cmu.edu/~durand/Notung.


Subject(s)
Algorithms , Evolution, Molecular , Guanylate Kinases/genetics , Multigene Family , Phylogeny , Software , Animals , Gene Duplication , Protein Structure, Tertiary , Vertebrates/genetics
10.
BMC Genomics ; 15 Suppl 6: S9, 2014.
Article in English | MEDLINE | ID: mdl-25572914

ABSTRACT

BACKGROUND: Phylogenetic birth-death models are opening a new window on the processes of genome evolution in studies of the evolution of gene and protein families, protein-protein interaction networks, microRNAs, and copy number variation. Given a species tree and a set of genomic characters in present-day species, the birth-death approach estimates the most likely rates required to explain the observed data and returns the expected ancestral character states and the history of character state changes. Achieving a balance between model complexity and generalizability is a fundamental challenge in the application of birth-death models. While more parameters promise greater accuracy and more biologically realistic models, increasing model complexity can lead to overfitting and a heavy computational cost. RESULTS: Here we present a systematic, empirical investigation of these tradeoffs, using protein domain families in six metazoan genomes as a case study. We compared models of increasing complexity, implemented in the Count program, with respect to model fit, robustness, and stability. In addition, we used a bootstrapping procedure to assess estimator variability. The results show that the most complex model, which allows for both branch-specific and family-specific rate variation, achieves the best fit, without overfitting. Variance remains low with increasing complexity, except for family-specific loss rates. This variance is reduced when the number of discrete rate categories is increased. CONCLUSIONS: The work presented here evaluates model choice for genomic birth-death models in a systematic way and presents the first use of bootstrapping to assess estimator variance in birth-death models. We find that a model incorporating both lineage and family rate variation yields more accurate estimators without sacrificing generality. 
Our results indicate that model choice can lead to fundamentally different evolutionary conclusions, emphasizing the importance of more biologically realistic and complex models.
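
The bootstrapping step mentioned in this abstract can be sketched generically: resample the observations with replacement, recompute the estimate on each resample, and report the spread of those estimates. In the sketch below, `bootstrap_variance` is a hypothetical helper and the estimator is a placeholder (a simple mean). The study itself fitted birth-death rates with the Count program, which is not reproduced here.

```python
import random
import statistics


def bootstrap_variance(observations, estimator, replicates, rng):
    """Variance of `estimator` across bootstrap resamples of `observations`."""
    n = len(observations)
    estimates = []
    for _ in range(replicates):
        # Resample n observations with replacement and re-estimate.
        resample = [observations[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.pvariance(estimates)


if __name__ == "__main__":
    # Hypothetical per-family values standing in for fitted quantities.
    values = [0.8, 1.1, 0.9, 2.4, 1.0, 1.3]
    print(bootstrap_variance(values, statistics.mean, 500, random.Random(0)))
```

A low variance across replicates suggests the estimate is stable under resampling, which is the property the abstract reports for most parameters.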


Subject(s)
Evolution, Molecular , Genome , Genomics/methods , Models, Genetic , Phylogeny
11.
Trends Genet ; 29(11): 659-68, 2013 Nov.
Article in English | MEDLINE | ID: mdl-23915718

ABSTRACT

Gene functions, interactions, disease associations, and ecological distributions are all correlated with gene age. However, it is challenging to estimate the intricate series of evolutionary events leading to a modern-day gene and then to reduce this history to a single age estimate. Focusing on eukaryotic gene families, we introduce a framework that can be used to compare current strategies for quantifying gene age, discuss key differences between these methods, and highlight several common problems. We argue that genes with complex evolutionary histories do not have a single well-defined age. As a result, care must be taken to articulate the goals and assumptions of any analysis that uses gene age estimates. Recent algorithmic advances offer the promise of gene age estimates that are fast, accurate, and consistent across gene families. This will enable a shift to integrated genome-wide analyses of all events in gene evolutionary histories in the near future.


Subject(s)
Evolution, Molecular , Genes/physiology , Models, Genetic , Paleontology/methods , Computational Biology , Databases, Genetic , Phylogeny
12.
Bioinformatics ; 28(18): i409-i415, 2012 Sep 15.
Article in English | MEDLINE | ID: mdl-22962460

ABSTRACT

MOTIVATION: Gene duplication (D), transfer (T), loss (L) and incomplete lineage sorting (I) are crucial to the evolution of gene families and the emergence of novel functions. The history of these events can be inferred via comparison of gene and species trees, a process called reconciliation, yet current reconciliation algorithms model only a subset of these evolutionary processes. RESULTS: We present an algorithm to reconcile a binary gene tree with a nonbinary species tree under a DTLI parsimony criterion. This is the first reconciliation algorithm to capture all four evolutionary processes driving tree incongruence and the first to reconcile non-binary species trees with a transfer model. Our algorithm infers all optimal solutions and reports complete, temporally feasible event histories, giving the gene and species lineages in which each event occurred. It is fixed-parameter tractable, with polytime complexity when the maximum species outdegree is fixed. Application of our algorithms to prokaryotic and eukaryotic data show that use of an incomplete event model has substantial impact on the events inferred and resulting biological conclusions. AVAILABILITY: Our algorithms have been implemented in Notung, a freely available phylogenetic reconciliation software package, available at http://www.cs.cmu.edu/~durand/Notung. CONTACT: mstolzer@andrew.cmu.edu.


Subject(s)
Algorithms , Evolution, Molecular , Multigene Family , Gene Duplication , Gene Transfer, Horizontal , Models, Genetic , Phylogeny , Software
13.
Curr Protoc Bioinformatics ; Chapter 6: Unit 6.11, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21400696

ABSTRACT

Inferring a protein's function by homology is a powerful tool for biologists. The Princeton Protein Orthology Database (P-POD) offers a simple way to visualize and analyze the relationships between homologous proteins in order to infer function. P-POD contains computationally generated analysis distinguishing orthologs from paralogs combined with curated published information on functional complementation and on human diseases. P-POD also features an applet, Notung, for users to explore and modify phylogenetic trees and generate their own ortholog/paralogs calls. This unit describes how to search P-POD for precomputed data, how to find and use the associated curated information from the literature, and how to use Notung to analyze and refine the results.


Subject(s)
Databases, Protein , Genomics/methods , Proteins/chemistry , Sequence Homology, Amino Acid , Evolution, Molecular , Phylogeny , Proteins/classification , Proteins/metabolism , Sequence Alignment , Sequence Analysis, Protein
14.
Bioinformatics ; 25(12): i45-53, 2009 Jun 15.
Article in English | MEDLINE | ID: mdl-19478015

ABSTRACT

MOTIVATION: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymy clustering algorithms. RESULTS: Here, we investigate a network-rewiring strategy designed to eliminate edges due to promiscuous domains. We show that this strategy can reduce noise in and restore structure to artificial networks with simulated noise, as well as to the yeast genome homology network. We further evaluate this approach on a hand-curated set of multidomain sequences in mouse and human, and demonstrate that classification using the rewired network delivers dramatic improvement in Precision and Recall, compared with current methods. Families in our test set exhibit a broad range of domain architectures and sequence conservation, demonstrating that our method is flexible, robust and suitable for high-throughput, automated processing of heterogeneous, genome-scale data.


Subject(s)
Computational Biology/methods , Proteins/classification , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Animals , Conserved Sequence , Databases, Protein , Genome , Humans , Mice , Proteins/chemistry , Proteins/genetics
15.
Mol Biol Evol ; 26(5): 957-68, 2009 May.
Article in English | MEDLINE | ID: mdl-19150803

ABSTRACT

Identifying genomic regions that descended from a common ancestor is important for understanding the function and evolution of genomes. In distantly related genomes, clusters of homologous gene pairs are evidence of candidate homologous regions. Demonstrating the statistical significance of such "gene clusters" is an essential component of comparative genomic analyses. However, currently there are no practical statistical tests for gene clusters that model the influence of the number of homologs in each gene family on cluster significance. In this work, we demonstrate empirically that failure to incorporate gene family size in gene cluster statistics results in overestimation of significance, leading to incorrect conclusions. We further present novel analytical methods for estimating gene cluster significance that take gene family size into account. Our methods do not require complete genome data and are suitable for testing individual clusters found in local regions, such as contigs in an unfinished assembly. We consider pairs of regions drawn from the same genome (paralogous clusters), as well as regions drawn from two different genomes (orthologous clusters). Determining cluster significance under general models of gene family size is computationally intractable. By assuming that all gene families are of equal size, we obtain analytical expressions that allow fast approximation of cluster probabilities. We evaluate the accuracy of this approximation by comparing the resulting gene cluster probabilities with cluster probabilities obtained by simulating a realistic, power-law distributed model of gene family size, with parameters inferred from genomic data. Surprisingly, despite the simplicity of the underlying assumption, our method accurately approximates the true cluster probabilities. It slightly overestimates these probabilities, yielding a conservative test. 
We present additional simulation results indicating the best choice of parameter values for data analysis in genomes of various sizes and illustrate the utility of our methods by applying them to gene clusters recently reported in the literature. Mathematical code to compute cluster probabilities using our methods is available as supplementary material.


Subject(s)
Models, Statistical , Multigene Family/genetics , Arabidopsis/genetics , Evolution, Molecular , Genes, Plant , Genome/genetics , Raphanus/genetics , Sequence Homology, Nucleic Acid
17.
J Comput Biol ; 15(8): 981-1006, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18808330

ABSTRACT

Reconciliation extracts information from the topological incongruence between gene and species trees to infer duplications and losses in the history of a gene family. The inferred duplication-loss histories provide valuable information for a broad range of biological applications, including ortholog identification, estimating gene duplication times, and rooting and correcting gene trees. While reconciliation for binary trees is a tractable and well studied problem, there are no algorithms for reconciliation with non-binary species trees. Yet a striking proportion of species trees are non-binary. For example, 64% of branch points in the NCBI taxonomy have three or more children. When applied to non-binary species trees, current algorithms overestimate the number of duplications because they cannot distinguish between duplication and incomplete lineage sorting. We present the first algorithms for reconciling binary gene trees with non-binary species trees under a duplication-loss parsimony model. Our algorithms utilize an efficient mapping from gene to species trees to infer the minimum number of duplications in O(|V(G)| × (k(S) + h(S))) time, where |V(G)| is the number of nodes in the gene tree, h(S) is the height of the species tree and k(S) is the size of its largest polytomy. We present a dynamic programming algorithm which also minimizes the total number of losses. Although this algorithm is exponential in the size of the largest polytomy, it performs well in practice for polytomies with outdegree of 12 or less. We also present a heuristic which estimates the minimal number of losses in polynomial time. In empirical tests, this algorithm finds an optimal loss history 99% of the time. Our algorithms have been implemented in NOTUNG, a robust, production quality, tree-fitting program, which provides a graphical user interface for exploratory analysis and also supports automated, high-throughput analysis of large data sets.
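
For orientation, the well-studied binary case mentioned in this abstract can be sketched with the classic least-common-ancestor mapping: each gene-tree node is mapped to the smallest species-tree clade containing its species, and a node is counted as a duplication when it maps to the same clade as one of its children. This is not the non-binary algorithm the paper introduces, nor NOTUNG's code. The nested-tuple tree encoding and the convention that gene leaves carry species names are assumptions made for illustration.

```python
def clades(tree):
    """List every clade (frozenset of leaf names) in a nested-tuple tree.

    The clade of `tree` itself is always the last element of the result.
    """
    if isinstance(tree, str):
        return [frozenset([tree])]
    found = []
    leaves = set()
    for child in tree:
        sub = clades(child)
        found.extend(sub)
        leaves |= sub[-1]
    found.append(frozenset(leaves))
    return found


def lca(species_clades, species_set):
    """Smallest species-tree clade that contains `species_set`."""
    return min((c for c in species_clades if species_set <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Duplications implied by LCA reconciliation of a binary gene tree."""
    species_clades = clades(species_tree)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(species_clades, frozenset([node]))
        left, right = node
        mapped_left, mapped_right = visit(left), visit(right)
        mapped = lca(species_clades, mapped_left | mapped_right)
        # A node that maps to the same clade as a child is a duplication.
        if mapped == mapped_left or mapped == mapped_right:
            duplications += 1
        return mapped

    visit(gene_tree)
    return duplications
```

For example, with species tree `(("A", "B"), "C")`, a gene tree with the same shape implies no duplications, while `(("A", "C"), "B")` implies one at the root.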


Subject(s)
Algorithms , Gene Duplication , Models, Genetic , Phylogeny , Evolution, Molecular , Genetic Speciation , Multigene Family , Software
18.
PLoS Comput Biol ; 4(4): e1000063, 2008 May 16.
Article in English | MEDLINE | ID: mdl-18475320

ABSTRACT

We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them. Our study demonstrates the utility of mining network structure for evolutionary information, suggesting this is a fertile approach for investigating evolutionary processes in the post-genomic era.
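
The idea of scoring a sequence pair by its network neighborhood can be illustrated with a minimal Python sketch. It assumes, as a simplification not stated in the abstract, that the score is the Pearson correlation between the two sequences' similarity profiles against every sequence in the data set. The function names and the toy scores are hypothetical and are not taken from the published tool.

```python
from math import sqrt


def build_scores(pairs, names, self_score=100.0):
    """Build a symmetric similarity table; missing pairs mean 'no hit'."""
    scores = {a: {a: self_score} for a in names}
    for (a, b), s in pairs.items():
        scores[a][b] = s
        scores[b][a] = s
    return scores


def neighborhood_correlation(scores, x, y, names):
    """Pearson correlation of the similarity profiles of x and y (assumed form)."""
    xs = [scores[x].get(n, 0.0) for n in names]
    ys = [scores[y].get(n, 0.0) for n in names]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no neighborhood signal
    return cov / sqrt(var_x * var_y)


# Toy network (hypothetical scores): A-D form one family, E-G another,
# and A shares only an inserted domain with E.
names = ["A", "B", "C", "D", "E", "F", "G"]
pairs = {
    ("A", "B"): 90, ("A", "C"): 80, ("A", "D"): 85,
    ("B", "C"): 82, ("B", "D"): 88, ("C", "D"): 90,
    ("E", "F"): 90, ("E", "G"): 85, ("F", "G"): 88,
    ("A", "E"): 60,
}
scores = build_scores(pairs, names)
nc_family = neighborhood_correlation(scores, "A", "B", names)
nc_domain_only = neighborhood_correlation(scores, "A", "E", names)
```

In this toy network the pair A and B has a positive correlation because both match the same neighbors, while A and E has a negative one despite a direct hit of 60, which is the kind of contrast the abstract attributes to gene duplication versus domain shuffling.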


Subject(s)
Phylogeny , Proteins/chemistry , Proteins/genetics , Sequence Homology, Amino Acid , Amino Acid Sequence , Conserved Sequence , Family , Female , Genomics , Humans , Male , Receptors, Platelet-Derived Growth Factor/chemistry , Receptors, Platelet-Derived Growth Factor/genetics
19.
J Bioinform Comput Biol ; 6(1): 1-22, 2008 Feb.
Article in English | MEDLINE | ID: mdl-18324742

ABSTRACT

Gene clusters that span three or more chromosomal regions are of increasing importance, yet statistical tests to validate such clusters are in their infancy. Current approaches either conduct several pairwise comparisons or consider only the number of genes that occur in all of the regions. In this paper, we provide statistical tests for clusters spanning exactly three regions based on genome models of typical comparative genomics problems, including analysis of conserved linkage within multiple species and identification of large-scale duplications. Our tests are the first to combine evidence from genes shared among all three regions and genes shared between pairs of regions. We show that our tests of clusters spanning three regions are more sensitive than existing approaches, and can thus be used to identify more diverged homologous regions.


Subject(s)
Algorithms , Chromosome Mapping/methods , Data Interpretation, Statistical , Multigene Family/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Base Sequence , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity
20.
J Comput Biol ; 13(2): 320-35, 2006 Mar.
Article in English | MEDLINE | ID: mdl-16597243

ABSTRACT

Gene family evolution is determined by microevolutionary processes (e.g., point mutations) and macroevolutionary processes (e.g., gene duplication and loss), yet macroevolutionary considerations are rarely incorporated into gene phylogeny reconstruction methods. We present a dynamic program to find the most parsimonious gene family tree with respect to a macroevolutionary optimization criterion, the weighted sum of the number of gene duplications and losses. The existence of a polynomial delay algorithm for duplication/loss phylogeny reconstruction stands in contrast to most formulations of phylogeny reconstruction, which are NP-complete. We next extend this result to obtain a two-phase method for gene tree reconstruction that takes both micro- and macroevolution into account. In the first phase, a gene tree is constructed from sequence data, using any of the previously known algorithms for gene phylogeny construction. In the second phase, the tree is refined by rearranging regions of the tree that do not have strong support in the sequence data to minimize the duplication/loss cost. Components of the tree with strong support are left intact. This hybrid approach incorporates both micro- and macroevolutionary considerations, yet its computational requirements are modest in practice because the two-phase approach constrains the search space. Our hybrid algorithm can also be used to resolve nonbinary nodes in a multifurcating gene tree. We have implemented these algorithms in a software tool, NOTUNG 2.0, that can be used as a unified framework for gene tree reconstruction or as an exploratory analysis tool that can be applied post hoc to any rooted tree with bootstrap values. The NOTUNG 2.0 graphical user interface can be used to visualize alternate duplication/loss histories, root trees according to duplication and loss parsimony, manipulate and annotate gene trees, and estimate gene duplication times. It also offers a command line option that enables high-throughput analysis of a large number of trees.
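
The optimization criterion named in the abstract, a weighted sum of duplications and losses, can be illustrated by scoring one fixed gene tree against a species tree. The Python sketch below uses standard lowest-common-ancestor reconciliation for binary trees; it only evaluates the cost of a given tree and does not implement the dynamic program or the rearrangement phase. The nested-tuple tree encoding, the function names, and the weights in the example are assumptions made for illustration, not NOTUNG's interface.

```python
def index_species_tree(tree):
    """Record the parent and depth of every node in a nested-tuple species tree."""
    parent, depth = {tree: None}, {tree: 0}
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            for child in node:
                parent[child] = node
                depth[child] = depth[node] + 1
                stack.append(child)
    return parent, depth


def lca(a, b, parent, depth):
    """Lowest common ancestor of two species-tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def reconcile(gene_tree, parent, depth):
    """Return (mapped species node, duplications, losses) for a binary gene tree.

    Gene-tree leaves are labeled with the species they were sampled from.
    """
    if not isinstance(gene_tree, tuple):
        return gene_tree, 0, 0
    left, right = gene_tree
    m1, d1, l1 = reconcile(left, parent, depth)
    m2, d2, l2 = reconcile(right, parent, depth)
    m = lca(m1, m2, parent, depth)
    gap1 = depth[m1] - depth[m]
    gap2 = depth[m2] - depth[m]
    if m == m1 or m == m2:
        # Duplication: every species node skipped below m implies one loss.
        return m, d1 + d2 + 1, l1 + l2 + gap1 + gap2
    # Speciation: each child is expected exactly one level below m.
    return m, d1 + d2, l1 + l2 + (gap1 - 1) + (gap2 - 1)


def dl_cost(gene_tree, species_tree, dup_weight, loss_weight):
    """Weighted sum of inferred duplications and losses."""
    parent, depth = index_species_tree(species_tree)
    _, dups, losses = reconcile(gene_tree, parent, depth)
    return dup_weight * dups + loss_weight * losses


species = (("human", "mouse"), "chicken")
congruent = (("human", "mouse"), "chicken")
incongruent = (("human", "chicken"), "mouse")
cost_congruent = dl_cost(congruent, species, dup_weight=1.5, loss_weight=1.0)
cost_incongruent = dl_cost(incongruent, species, dup_weight=1.5, loss_weight=1.0)
```

Under this scoring, a gene tree that agrees with the species tree costs nothing, whereas the incongruent tree requires one duplication and three losses. A cost difference of this kind is what would make rearranging a weakly supported branch toward the congruent topology attractive.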


Subject(s)
Algorithms , Evolution, Molecular , Gene Deletion , Gene Duplication , Multigene Family , Mutation , ATP-Binding Cassette Transporters/chemistry , ATP-Binding Cassette Transporters/genetics , Computational Biology , Glutathione Transferase/chemistry , Glutathione Transferase/genetics , Models, Genetic , Software