This article is a Preprint
Preprints are preliminary research reports that have not been certified by peer review. They should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Preprints posted online allow authors to receive rapid feedback and the entire scientific community can appraise the work for themselves and respond appropriately. Those comments are posted alongside the preprints for anyone to read them and serve as a post publication assessment.
Maximum likelihood pandemic-scale phylogenetics
Preprint
in English
| bioRxiv
| ID: ppbiorxiv-485312
ABSTRACT
Phylogenetics plays a crucial role in the interpretation of genomic data1. Phylogenetic analyses of SARS-CoV-2 genomes have allowed the detailed study of the viruss origins2, of its international3,4 and local4-9 spread, and of the emergence10 and reproductive success11 of new variants, among many applications. These analyses have been enabled by the unparalleled volumes of genome sequence data generated and employed to study and help contain the pandemic12. However, preferred model-based phylogenetic approaches including maximum likelihood and Bayesian methods, mostly based on Felsensteins pruning algorithm13,14, cannot scale to the size of the datasets from the current pandemic4,15, hampering our understanding of the viruss evolution and transmission16. We present new approaches, based on reworking Felsensteins algorithm, for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. We exploit near-certainty regarding ancestral genomes, and the similarities between closely related and densely sampled genomes, to greatly reduce computational demands for memory and time. Combined with new methods for searching amongst candidate evolutionary trees, this results in our MAPLE ( MAximum Parsimonious Likelihood Estimation) software giving better results than popular approaches such as FastTree 217, IQ-TREE 218, RAxML-NG19 and UShER15. Our approach therefore allows complex and accurate proba-bilistic phylogenetic analyses of millions of microbial genomes, extending the reach of genomic epidemiology. Future epidemiological datasets are likely to be even larger than those currently associated with COVID-19, and other disciplines such as metagenomics and biodiversity science are also generating huge numbers of genome sequences20-22. Our methods will permit continued use of preferred likelihood-based phylogenetic analyses.
cc_by
Full text:
Available
Collection:
Preprints
Database:
bioRxiv
Language:
English
Year:
2022
Document type:
Preprint