Search | VHL Regional Portal

Exact and efficient phylodynamic simulation from arbitrarily large populations.

Celentano, Michael; DeWitt, William S; Prillo, Sebastian; Song, Yun S.

ArXiv ; 2024 Feb 27.

Article in English | MEDLINE | ID: mdl-38463501

ABSTRACT

Many biological studies involve inferring the genealogical history of a sample of individuals from a large population and interpreting the reconstructed tree. Such an ascertained tree typically represents only a small part of a comprehensive population tree and is distorted by survivorship and sampling biases. Inferring evolutionary parameters from ascertained trees requires modeling both the underlying population dynamics and the ascertainment process. A crucial component of this phylodynamic modeling involves tree simulation, which is used to benchmark probabilistic inference methods. To simulate an ascertained tree, one must first simulate the full population tree and then prune unobserved lineages. Consequently, the computational cost is determined not by the size of the final simulated tree, but by the size of the population tree in which it is embedded. In most biological scenarios, simulations of the entire population are prohibitively expensive due to computational demands placed on lineages without sampled descendants. Here, we address this challenge by proving that, for any partially ascertained process from a general multi-type birth-death-mutation-sampling (BDMS) model, there exists an equivalent pure birth process (i.e., no death) with mutation and complete sampling. The final trees generated under these processes have exactly the same distribution. Leveraging this property, we propose a highly efficient algorithm for simulating trees under a general BDMS model. Our algorithm scales linearly with the size of the final simulated tree and is independent of the population size, enabling simulations from extremely large populations beyond the reach of current methods but essential for various biological applications. We anticipate that this unprecedented speedup will significantly advance the development of novel inference methods that require extensive training data.

ConvexML: Scalable and accurate inference of single-cell chronograms from CRISPR/Cas9 lineage tracing data.

Prillo, Sebastian; Ravoor, Akshay; Yosef, Nir; Song, Yun S.

bioRxiv ; 2023 Dec 03.

Article in English | MEDLINE | ID: mdl-38076815

ABSTRACT

CRISPR/Cas9 gene editing technology has enabled lineage tracing for thousands of cells in vivo. However, most of the analysis of CRISPR/Cas9 lineage tracing data has so far been limited to the reconstruction of single-cell tree topologies, which depict lineage relationships between cells, but not the amount of time that has passed between ancestral cell states and the present. Time-resolved trees, known as chronograms, would allow one to study the evolutionary dynamics of cell populations at an unprecedented level of resolution. Indeed, time-resolved trees would reveal the timing of events on the tree, the relative fitness of subclones, and the dynamics underlying phenotypic changes in the cell population - among other important applications. In this work, we introduce the first scalable and accurate method to refine any given single-cell tree topology into a single-cell chronogram by estimating its branch lengths. To do this, we leverage a statistical model of CRISPR/Cas9 cutting with missing data, paired with a conservative version of maximum parsimony that reconstructs only the ancestral states that we are confident about. As part of our method, we propose a novel approach to represent and handle missing data - specifically, double-resection events - which greatly simplifies and speeds up branch length estimation without compromising quality. All this leads to a convex maximum likelihood estimation (MLE) problem that can be readily solved in seconds with off-the-shelf convex optimization solvers. To stabilize estimates in low-information regimes, we propose a simple penalized version of MLE using a minimum branch length and pseudocounts. We benchmark our method using simulations and show that it performs well on several tasks, outperforming more naive baselines. Our method, which we name 'ConvexML', is available through the cassiopeia open source Python package.

CherryML: scalable maximum likelihood estimation of phylogenetic models.

Prillo, Sebastian; Deng, Yun; Boyeau, Pierre; Li, Xingyu; Chen, Po-Yen; Song, Yun S.

Nat Methods ; 20(8): 1232-1236, 2023 08.

Article in English | MEDLINE | ID: mdl-37386188

ABSTRACT

Phylogenetic models of molecular evolution are central to numerous biological applications spanning diverse timescales, from hundreds of millions of years involving orthologous proteins to just tens of days relating to single cells within an organism. A fundamental problem in these applications is estimating model parameters, for which maximum likelihood estimation is typically employed. Unfortunately, maximum likelihood estimation is a computationally expensive task, in some cases prohibitively so. To address this challenge, we here introduce CherryML, a broadly applicable method that achieves several orders of magnitude speedup by using a quantized composite likelihood over cherries in the trees. The massive speedup offered by our method should enable researchers to consider more complex and biologically realistic models than previously possible. Here we demonstrate CherryML's utility by applying it to estimate a general 400 × 400 rate matrix for residue-residue coevolution at contact sites in three-dimensional protein structures; we estimate that using current state-of-the-art methods such as the expectation-maximization algorithm for the same task would take >100,000 times longer.

Subject(s)

Evolution, Molecular , Proteins , Phylogeny , Likelihood Functions , Algorithms , Models, Genetic

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL