1.
bioRxiv ; 2024 Mar 13.
Article in English | MEDLINE | ID: mdl-38559261

ABSTRACT

Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4 000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
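The approximate Bayesian computation (ABC) step mentioned above can be illustrated with a generic rejection sampler. This is a minimal sketch of the general technique, not the authors' algorithm: the toy simulator (exponentially distributed segment lengths), the summary statistic (mean length), the uniform prior, and all function names are assumptions made for illustration.

```python
import random

def simulate_lengths(rate, n, rng):
    """Toy simulator (assumption): n segment lengths from an exponential distribution."""
    return [rng.expovariate(rate) for _ in range(n)]

def summary(lengths):
    """Summary statistic (assumption): the mean segment length."""
    return sum(lengths) / len(lengths)

def abc_rejection(observed, prior_low, prior_high, n_draws, tolerance, seed=0):
    """Rejection ABC: keep prior draws whose simulated summary is close to the observed one."""
    rng = random.Random(seed)
    target = summary(observed)
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(prior_low, prior_high)        # draw a parameter from the prior
        simulated = simulate_lengths(rate, len(observed), rng)
        if abs(summary(simulated) - target) <= tolerance:
            accepted.append(rate)                        # approximate posterior sample
    return accepted

if __name__ == "__main__":
    observed = simulate_lengths(2.0, 200, random.Random(42))
    posterior = abc_rejection(observed, 0.5, 5.0, 5000, 0.02)
    print(len(posterior), "accepted draws; posterior mean rate:",
          sum(posterior) / len(posterior))
```

The accepted draws approximate the posterior of the rate; a tighter tolerance gives a better approximation at the cost of fewer accepted draws.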

2.
Syst Biol ; 2024 Feb 08.
Article in English | MEDLINE | ID: mdl-38330161

ABSTRACT

The evolution of gene families is complex, involving gene-level evolutionary events such as gene duplication, horizontal gene transfer, and gene loss (DTL), and other processes such as incomplete lineage sorting (ILS). Because of this, topological differences often exist between gene trees and species trees. A number of models have been recently developed to explain these discrepancies, the most realistic of which attempt to consider both gene-level events and ILS. When unified in a single model, the interaction between ILS and gene-level events can cause polymorphism in gene copy number, which we refer to as copy number hemiplasy (CNH). In this paper we extend the Wright-Fisher process to include duplications and losses over several species, and show that the probability of CNH for this process can be significant. We study how well two unified models - MLMSC (MultiLocus MultiSpecies Coalescent), which models CNH, and DLCoal (Duplication, Loss, and Coalescence), which does not - approximate the Wright-Fisher process with duplication and loss. We then study the effect of CNH on gene family evolution by comparing MLMSC and DLCoal. We generate comparable gene trees under both models, showing significant differences in various summary statistics; most importantly, CNH reduces the number of gene copies greatly. If this is not taken into account, the traditional method of estimating duplication rates (by counting the number of gene copies) becomes inaccurate. The simulated gene trees are also used for species tree inference with the summary methods ASTRAL and ASTRAL-Pro, demonstrating that their accuracy, based on CNH-unaware simulations calibrated on real data, may have been overestimated.

3.
bioRxiv ; 2023 Nov 07.
Article in English | MEDLINE | ID: mdl-37986738

ABSTRACT

The var multigene family encodes the P. falciparum erythrocyte membrane protein 1 (PfEMP1), which is important in host-parasite interaction as a virulence factor and major surface antigen of the blood stages of the parasite, responsible for maintaining chronic infection. Whilst important in the biology of P. falciparum, these genes (50 to 60 genes per parasite genome) are routinely excluded from whole genome analyses due to their hyper-diversity, achieved primarily through recombination. The PfEMP1 head structure almost always consists of a DBLα-CIDR tandem. Categorised into different groups (upsA, upsB, upsC), different head structures have been associated with different ligand-binding affinities and disease severities. We study how conserved individual DBLα types are at the country, regional, and local scales in Sub-Saharan Africa. Using publicly-available sequence datasets and a novel ups classification algorithm, cUps, we performed an in silico exploration of DBLα conservation through time and space in Africa. In all three ups groups, the population structure of DBLα types in Africa consists of variants occurring at rare, low, moderate, and high frequencies. Non-rare variants were found to be temporally stable in a local area in endemic Ghana. When inspected across different geographical scales, we report different levels of conservation; while some DBLα types were consistently found in high frequencies in multiple African countries, others were conserved only locally, signifying local preservation of specific types. Underlying this population pattern is the composition of DBLα types within each isolate DBLα repertoire, revealed to also consist of a mix of types found at rare, low, moderate, and high frequencies in the population. We further discuss the adaptive forces and balancing selection, including host genetic factors, potentially shaping the evolution and diversity of DBLα types in Africa.

4.
J Math Biol ; 85(3): 22, 2022 08 17.
Article in English | MEDLINE | ID: mdl-35976512

ABSTRACT

Summary methods seek to infer a species tree from a set of gene trees. A desirable property of such methods is that of statistical consistency; that is, the probability of inferring the wrong species tree (the error probability) tends to 0 as the number of input gene trees becomes large. A popular paradigm is to infer a species tree that agrees with the maximum number of quartets from the input set of gene trees; this has been proved to be statistically consistent under several models of gene evolution. In this paper, we study the asymptotic behaviour of the error probability of such methods in this limit, and show that it decays exponentially. For a 4-taxon species tree, we derive a closed form for the asymptotic behaviour in terms of the probability that the gene evolution process produces the correct topology. We also derive bounds for the sample complexity (the number of gene trees required to infer the true species tree with a given probability), which outperform existing bounds. We then extend our results to bounds for the asymptotic behaviour of the error probability for any species tree, and compare these to the true error probability for some model species trees using simulations.
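For a 4-taxon species tree, a quartet-based method reduces to choosing the most frequent of the three possible quartet topologies among the input gene trees. The sketch below computes the exact error probability of that majority vote by enumerating multinomial counts. It assumes the correct topology arises with probability p and the two alternatives are equally likely (a symmetry assumption not stated in the abstract), and it counts ties as errors by convention; the function name is hypothetical and this is not the paper's closed form.

```python
from math import comb

def error_probability(n, p):
    """Probability that the true quartet topology is NOT the unique most
    frequent topology among n independent gene trees (ties count as errors)."""
    q = (1.0 - p) / 2.0                  # probability of each wrong topology
    correct = 0.0
    for a in range(n + 1):               # gene trees with the true topology
        for b in range(n - a + 1):       # gene trees with the first alternative
            c = n - a - b                # gene trees with the second alternative
            if a > b and a > c:
                correct += comb(n, a) * comb(n - a, b) * p**a * q**b * q**c
    return 1.0 - correct

if __name__ == "__main__":
    for n in (10, 50, 100, 200):
        print(n, error_probability(n, 0.5))
```

Printing the values for increasing n gives a quick numerical view of how fast the error probability shrinks as gene trees are added.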


Subject(s)
Evolution, Molecular , Models, Genetic , Genetic Speciation , Phylogeny , Probability
5.
PLoS Comput Biol ; 18(3): e1009960, 2022 03.
Article in English | MEDLINE | ID: mdl-35263345

ABSTRACT

We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.


Subject(s)
Models, Genetic , Software , Algorithms , Bayes Theorem , Markov Chains , Phylogeny , Recombination, Genetic/genetics
6.
Bioinformatics ; 38(7): 1823-1829, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35025988

ABSTRACT

MOTIVATION: Recombination is a fundamental process in molecular evolution, and the identification of recombinant sequences is thus of major interest. However, current methods for detecting recombinants are primarily designed for aligned sequences. Thus, they struggle with analyses of highly diverse genes, such as the var genes of the malaria parasite Plasmodium falciparum, which are known to diversify primarily through recombination. RESULTS: We introduce an algorithm to detect recent recombinant sequences from a dataset without a full multiple alignment. Our algorithm can handle thousands of gene-length sequences without the need for a reference panel. We demonstrate the accuracy of our algorithm through extensive numerical simulations; in particular, it maintains its effectiveness in the presence of insertions and deletions. We apply our algorithm to a dataset of 17 335 DBLα types in var genes from Ghana, observing that sequences belonging to the same ups group or domain subclass recombine amongst themselves more frequently, and that non-recombinant DBLα types are more conserved than recombinant ones. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://github.com/qianfeng2/detREC_program. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genetic Variation , Protozoan Proteins , Protozoan Proteins/genetics , Plasmodium falciparum/genetics , Software , Evolution, Molecular
7.
Article in English | MEDLINE | ID: mdl-36998722

ABSTRACT

The enormous diversity and complexity of var genes that diversify rapidly by recombination has led to the exclusion of assembly of these genes from major genome initiatives (e.g., Pf6). A scalable solution in epidemiological surveillance of var genes is to use a small 'tag' region encoding the immunogenic DBLα domain as a marker to estimate var diversity. As var genes diversify by recombination, it is not clear the extent to which the same tag can appear in multiple var genes. This relationship between marker and gene has not been investigated in natural populations. Analyses of in vitro recombination within and between var genes have suggested that this relationship would not be exclusive. Using a dataset of publicly-available assembled var sequences, we test this hypothesis by studying DBLα-var relationships for four study sites in four countries: Pursat (Cambodia) and Mae Sot (Thailand), representing low malaria transmission, and Navrongo (Ghana) and Chikwawa (Malawi), representing high malaria transmission. In all study sites, DBLα-var relationships were shown to be predominantly 1-to-1, followed by a second largest proportion of 1-to-2 DBLα-var relationships. This finding indicates that DBLα tags can be used to estimate not just DBLα diversity but var gene diversity when applied in a local endemic area. Epidemiological applications of this result are discussed.

8.
PLoS Genet ; 17(2): e1009269, 2021 02.
Article in English | MEDLINE | ID: mdl-33630855

ABSTRACT

Malaria remains a major public health problem in many countries. Unlike influenza and HIV, where diversity in immunodominant surface antigens is understood geographically to inform disease surveillance, relatively little is known about the global population structure of PfEMP1, the major variant surface antigen of the malaria parasite Plasmodium falciparum. The complexity of the var multigene family that encodes PfEMP1 and that diversifies by recombination, has so far precluded its use in malaria surveillance. Recent studies have demonstrated that cost-effective deep sequencing of the region of var genes encoding the PfEMP1 DBLα domain and subsequent classification of within host sequences at 96% identity to define unique DBLα types, can reveal structure and strain dynamics within countries. However, to date there has not been a comprehensive comparison of these DBLα types between countries. By leveraging a bioinformatic approach (jumping hidden Markov model) designed specifically for the analysis of recombination within var genes and applying it to a dataset of DBLα types from 10 countries, we are able to describe population structure of DBLα types at the global scale. The sensitivity of the approach allows for the comparison of the global dataset to ape samples of Plasmodium Laverania species. Our analyses show that the evolution of the parasite population emerging out of Africa underlies current patterns of DBLα type diversity. Most importantly, we can distinguish geographic population structure within Africa between Gabon and Ghana in West Africa and Uganda in East Africa. Our evolutionary findings have translational implications in the context of globalization. Firstly, DBLα type diversity can provide a simple diagnostic framework for geographic surveillance of the rapidly evolving transmission dynamics of P. falciparum. It can also inform efforts to understand the presence or absence of global, regional and local population immunity to major surface antigen variants. Additionally, we identify a number of highly conserved DBLα types that are present globally that may be of biological significance and warrant further characterization.


Subject(s)
Antigens, Protozoan/genetics , Malaria, Falciparum/parasitology , Plasmodium falciparum/genetics , Protozoan Proteins/genetics , Antigenic Variation , Evolution, Molecular , Gabon , Ghana , Humans , Malaria, Falciparum/epidemiology , Markov Chains , Models, Statistical , Protein Domains , Protozoan Proteins/metabolism , Uganda
9.
Syst Biol ; 70(4): 822-837, 2021 06 16.
Article in English | MEDLINE | ID: mdl-33169795

ABSTRACT

Incomplete lineage sorting (ILS), the interaction between coalescence and speciation, can generate incongruence between gene trees and species trees, as can gene duplication (D), transfer (T), and loss (L). These processes are usually modeled independently, but in reality, ILS can affect gene copy number polymorphism, that is, interfere with DTL. This has been previously recognized, but not treated in a satisfactory way, mainly because DTL events are naturally modeled forward-in-time, while ILS is naturally modeled backward-in-time with the coalescent. Here, we consider the joint action of ILS and DTL on the gene tree/species tree problem in all its complexity. In particular, we show that the interaction between ILS and duplications/transfers (without losses) can result in patterns usually interpreted as resulting from gene loss, and that the realized rate of D, T, and L becomes nonhomogeneous in time when ILS is taken into account. We introduce algorithmic solutions to these problems. Our new model, the multilocus multispecies coalescent, which also accounts for any level of linkage between loci, generalizes the multispecies coalescent (MSC) model and offers a versatile, powerful framework for proper simulation, and inference of gene family evolution. [Gene duplication; gene loss; horizontal gene transfer; incomplete lineage sorting; multispecies coalescent; hemiplasy; recombination.].


Subject(s)
Evolution, Molecular , Gene Duplication , Models, Genetic , Multigene Family , Computer Simulation , Gene Transfer, Horizontal , Genetic Speciation , Phylogeny
10.
J Theor Biol ; 472: 54-66, 2019 07 07.
Article in English | MEDLINE | ID: mdl-30951730

ABSTRACT

The phylogenetic trees of genes and the species which they belong to are similar, but distinct due to various evolutionary processes which affect genes but do not create new species. Reconciliations map the gene tree into the species tree, explaining the discrepancies by events including gene duplications and losses. However, when duplicate genes undergo recombination (a phenomenon known as paralog exchange, or non-allelic homologous recombination), the phylogeny of the genes becomes a network, not a tree. In this paper, we explore how to reconcile a gene network to a species tree with duplications and losses. We propose an extension of the lowest common ancestor (LCA) mapping which solves the problem for tree-child gene networks, show that a restricted version of the problem is polynomial-time solvable and bounds the optimal position of each gene node in the full problem, and show that the full problem is fixed-parameter tractable in the level of the gene network. This provides a formal foundation for the development of efficient algorithms to solve this problem.
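As background for the extension described above, the classical lowest common ancestor (LCA) mapping for an ordinary binary gene tree can be sketched as follows. This covers only the tree case, not the tree-child network extension proposed in the paper. The data layout (a child-to-parent dictionary for the species tree, nested tuples for the gene tree, leaves labelled directly by species name) and all function names are assumptions chosen for brevity.

```python
def compute_depths(parent):
    """Depth of every species-tree node, given a child -> parent map (root maps to None)."""
    depth = {}
    def visit(node):
        if node not in depth:
            depth[node] = 0 if parent[node] is None else visit(parent[node]) + 1
        return depth[node]
    for node in parent:
        visit(node)
    return depth

def lca(parent, depth, a, b):
    """Lowest common ancestor of species-tree nodes a and b."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a

def reconcile(gene_tree, parent, depth):
    """Return (species node assigned to the gene-tree root, number of duplications)."""
    if isinstance(gene_tree, str):          # leaf: a gene labelled by its species
        return gene_tree, 0
    left, right = gene_tree
    left_map, left_dups = reconcile(left, parent, depth)
    right_map, right_dups = reconcile(right, parent, depth)
    here = lca(parent, depth, left_map, right_map)
    # A node mapped to the same species node as one of its children is a duplication.
    is_duplication = here in (left_map, right_map)
    return here, left_dups + right_dups + int(is_duplication)

if __name__ == "__main__":
    parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": None}
    depth = compute_depths(parent)
    print(reconcile((("A", "B"), ("A", "C")), parent, depth))
```

In the usage example, the gene tree contains two copies from species A, so the mapping places a duplication at the root of the species tree.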


Subject(s)
Gene Regulatory Networks , Phylogeny , Algorithms , Models, Genetic
11.
Brief Bioinform ; 20(2): 426-435, 2019 03 22.
Article in English | MEDLINE | ID: mdl-28673025

ABSTRACT

We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.


Subject(s)
Evolution, Molecular , Genome , Phylogeny , Algorithms , Animals , Humans , Microbiota/genetics , Models, Genetic , Sequence Alignment , Sequence Analysis, DNA , Viruses/genetics
12.
J Theor Biol ; 432: 1-13, 2017 11 07.
Article in English | MEDLINE | ID: mdl-28801222

ABSTRACT

Gene trees and species trees can be discordant due to several processes. Standard models of reconciliations consider macro-evolutionary events at the gene level: duplications, losses and transfers of genes. However, another common source of gene tree-species tree discordance is incomplete lineage sorting (ILS), whereby gene divergences corresponding to speciations occur "out of order". ILS is nonetheless seldom considered in reconciliation models. In this paper, we devise a unified formal IDTL reconciliation model which includes all the above-mentioned processes. We show how to properly cost ILS under this model, and then give a fixed-parameter tractable (FPT) algorithm which calculates the most parsimonious IDTL reconciliation, with guaranteed time-consistency of transfer events. Provided that the number of branches in contiguous regions of the species tree in which ILS is allowed is bounded by a constant, this algorithm is linear in the number of genes and quadratic in the number of species. This provides a formal foundation to the inference of ILS in a reconciliation framework.


Subject(s)
Gene Duplication , Gene Transfer, Horizontal , Phylogeny , Algorithms , Haploidy , Models, Genetic
13.
Front Microbiol ; 8: 21, 2017.
Article in English | MEDLINE | ID: mdl-28154557

ABSTRACT

Bacteria and archaea can exchange genetic material across lineages through processes of lateral genetic transfer (LGT). Collectively, these exchange relationships can be modeled as a network and analyzed using concepts from graph theory. In particular, densely connected regions within an LGT network have been defined as genetic exchange communities (GECs). However, it has been problematic to construct networks in which edges solely represent LGT. Here we apply term frequency-inverse document frequency (TF-IDF), an alignment-free method originating from document analysis, to infer regions of lateral origin in bacterial genomes. We examine four empirical datasets of different size (number of genomes) and phyletic breadth, varying a key parameter (word length k) within bounds established in previous work. We map the inferred lateral regions to genes in recipient genomes, and construct networks in which the nodes are groups of genomes, and the edges natively represent LGT. We then extract maximum and maximal cliques (i.e., GECs) from these graphs, and identify nodes that belong to GECs across a wide range of k. Most surviving lateral transfer has happened within these GECs. Using Gene Ontology enrichment tests we demonstrate that biological processes associated with metabolism, regulation and transport are often over-represented among the genes affected by LGT within these communities. These enrichments are largely robust to change of k.

14.
Sci Rep ; 6: 29319, 2016 07 25.
Article in English | MEDLINE | ID: mdl-27452976

ABSTRACT

Many microbes can acquire genetic material from their environment and incorporate it into their genome, a process known as lateral genetic transfer (LGT). Computational approaches have been developed to detect genomic regions of lateral origin, but typically lack sensitivity, ability to distinguish donor from recipient, and scalability to very large datasets. To address these issues we have introduced an alignment-free method based on ideas from document analysis, term frequency-inverse document frequency (TF-IDF). Here we examine the performance of TF-IDF on three empirical datasets: 27 genomes of Escherichia coli and Shigella, 110 genomes of enteric bacteria, and 143 genomes across 12 bacterial and three archaeal phyla. We investigate the effect of k-mer size, gap size and delineation of groups on the inference of genomic regions of lateral origin, finding an interplay among these parameters and sequence divergence. Because TF-IDF identifies donor groups and delineates regions of lateral origin within recipient genomes, aggregating these regions by gene enables us to explore, for the first time, the mosaic nature of lateral genes including the multiplicity of biological sources, ancestry of transfer and over-writing by subsequent transfers. We carry out Gene Ontology enrichment tests to investigate which biological processes are potentially affected by LGT.


Subject(s)
Gene Transfer, Horizontal , Genome, Microbial , Bacteria/genetics , Databases, Genetic , Gene Regulatory Networks , Phylogeny
15.
Sci Rep ; 6: 30308, 2016 07 25.
Article in English | MEDLINE | ID: mdl-27453035

ABSTRACT

Lateral genetic transfer (LGT) plays an important role in the evolution of microbes. Existing computational methods for detecting genomic regions of putative lateral origin scale poorly to large data. Here, we propose a novel method based on TF-IDF (Term Frequency-Inverse Document Frequency) statistics to detect not only regions of lateral origin, but also their origin and direction of transfer, in sets of hierarchically structured nucleotide or protein sequences. This approach is based on the frequency distributions of k-mers in the sequences. If a set of contiguous k-mers appears sufficiently more frequently in another phyletic group than in its own, we infer that they have been transferred from the first group to the second. We performed rigorous tests of TF-IDF using simulated and empirical datasets. With the simulated data, we tested our method under different parameter settings for sequence length, substitution rate between and within groups and post-LGT, deletion rate, length of transferred region and k size, and found that we can detect LGT events with high precision and recall. Our method performs better than an established method, ALFY, which has high recall but low precision. Our method is efficient, with runtime increasing approximately linearly with sequence length.
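The core idea, that a run of contiguous k-mers seen more often in another group than in the sequence's own group suggests a transfer, can be sketched with a simplified frequency comparison. This is not the published TF-IDF statistic: the per-group frequency (fraction of sequences containing a k-mer), the `margin` and `min_run` thresholds, and all function names are assumptions made for illustration.

```python
def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def group_frequency(groups, k):
    """For each group, map k-mer -> fraction of that group's sequences containing it."""
    freq = {}
    for name, seqs in groups.items():
        counts = {}
        for seq in seqs:
            for kmer in set(kmers(seq, k)):
                counts[kmer] = counts.get(kmer, 0) + 1
        freq[name] = {kmer: c / len(seqs) for kmer, c in counts.items()}
    return freq

def candidate_regions(query, own_group, groups, k, margin=0.5, min_run=3):
    """Return (start, end, donor) for runs of at least min_run contiguous k-mers
    that are more frequent (by margin) in another group; end is exclusive."""
    freq = group_frequency(groups, k)
    labels = []
    for kmer in kmers(query, k):
        own = freq[own_group].get(kmer, 0.0)
        donor, best = None, own + margin
        for name in groups:
            if name == own_group:
                continue
            f = freq[name].get(kmer, 0.0)
            if f >= best:
                donor, best = name, f
        labels.append(donor)
    regions, start = [], 0
    while start < len(labels):
        end = start
        while end < len(labels) and labels[end] == labels[start]:
            end += 1
        if labels[start] is not None and end - start >= min_run:
            regions.append((start, end + k - 1, labels[start]))
        start = end
    return regions

if __name__ == "__main__":
    query = "AAAAAAGATTACAGAAAAAA"
    groups = {"X": ["A" * 14] * 3 + [query], "Y": ["GATTACAGGC"] * 2}
    print(candidate_regions(query, "X", groups, 4))
```

In the usage example, the stretch of the group-X query that matches group-Y sequences is reported as a candidate region with Y as the putative donor.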


Subject(s)
Evolution, Molecular , Gene Transfer, Horizontal/genetics , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Computational Biology/methods , Genome , Phylogeny
16.
J Math Biol ; 71(5): 1179-209, 2015 Nov.
Article in English | MEDLINE | ID: mdl-25502987

ABSTRACT

Reconciliations between gene and species trees have important applications in the study of genome evolution (e.g. sequence orthology prediction or quantification of transfer events). While numerous methods have been proposed to infer them, little has been done to study the underlying reconciliation space. In this paper, we characterise the reconciliation space for two evolutionary models: the DTL (duplication, loss and transfer) model and a variant of it, the no-TL model, which does not allow TL events (a transfer immediately followed by a loss). We provide formulae to compute the size of the corresponding spaces and define a set of transformation operators sufficient to explore the entire reconciliation space. We also define a distance between two reconciliations as the minimal number of operations needed to transform one into the other and prove that this distance is easily computable in the no-TL model. Computing this distance in the DTL model is more difficult and it is an open question whether it is NP-hard or not. This work constitutes an important step toward reconciliation space characterisation and reconciliation comparison, needed to better assess the performance of reconciliation inference methods through simulations.


Subject(s)
Evolution, Molecular , Models, Genetic , Phylogeny , Algorithms , Computer Simulation , Gene Deletion , Gene Duplication , Gene Transfer, Horizontal , Genetic Speciation , Mathematical Concepts , Multigene Family , Species Specificity
17.
BMC Bioinformatics ; 14: 332, 2013 Nov 20.
Article in English | MEDLINE | ID: mdl-24252193

ABSTRACT

BACKGROUND: Genes located in the same chromosome region share common evolutionary events more often than other genes (e.g. a segmental duplication of this region). Their evolution may also be related if they are involved in the same protein complex or biological process. Identifying co-evolving genes can thus shed light on ancestral genome structures and functional gene interactions. RESULTS: We devise a simple, fast and accurate probability method based on species tree-gene tree reconciliations to detect when two gene families have co-evolved. Our method observes the number and location of predicted macro-evolutionary events, and estimates the probability of having the observed number of common events by chance. CONCLUSIONS: Simulation studies confirm that our method effectively identifies co-evolving families. This opens numerous perspectives on genome-scale analysis where this method could be used to pinpoint co-evolving gene families and thus help to unravel ancestral genome arrangements or undocumented gene interactions.
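The notion of "the probability of having the observed number of common events by chance" can be illustrated with a simple null model. The sketch below assumes that each family's events fall on distinct species-tree branches chosen uniformly at random, so the number of shared branches follows a hypergeometric distribution. This null model and the function name are illustrative assumptions, not the probability model of the paper.

```python
from math import comb

def shared_event_pvalue(n_branches, events_a, events_b, shared):
    """P(at least `shared` branches carry events of both families) when family B's
    events_b branches are drawn uniformly without replacement from n_branches,
    of which events_a carry an event of family A (hypergeometric upper tail)."""
    total = comb(n_branches, events_b)
    upper = min(events_a, events_b)
    tail = sum(
        comb(events_a, x) * comb(n_branches - events_a, events_b - x)
        for x in range(shared, upper + 1)
    )
    return tail / total

if __name__ == "__main__":
    # Two families with 3 events each on a 10-branch species tree, all 3 shared.
    print(shared_event_pvalue(10, 3, 3, 3))
```

A small value indicates that the two families share more event locations than this null model would usually produce.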


Subject(s)
Evolution, Molecular , Multigene Family/genetics , Computer Simulation , Genome, Bacterial , Phylogeny , Probability , Proteobacteria/genetics , Random Allocation , Segmental Duplications, Genomic