Search | VHL Regional Portal

OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees.

Steenwyk, Jacob L; Goltz, Dayna C; Buida, Thomas J; Li, Yuanning; Shen, Xing-Xing; Rokas, Antonis.

PLoS Biol ; 20(10): e3001827, 2022 10.

Article in English | MEDLINE | ID: mdl-36228036

ABSTRACT

Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of selection, often rely on gene families of single-copy orthologs (SC-OGs). Large gene families with multiple homologs in 1 or more species (a phenomenon observed among several important families of genes such as transporters and transcription factors) are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed OrthoSNAP, a software tool that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they are identified using a splitting and pruning procedure analogous to snapping branches on a tree. From 415,129 orthologous groups of genes inferred across 7 eukaryotic phylogenomic datasets, we identified 9,821 SC-OGs; using OrthoSNAP on the remaining 405,308 orthologous groups of genes, we identified an additional 10,704 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar, even in complex datasets that contain a whole-genome duplication, complex patterns of duplication and loss, transcriptome data where each gene typically has multiple transcripts, and contentious branches in the tree of life. OrthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.

Subject(s)

Algorithms , Evolution, Molecular , Phylogeny , Pedigree , Transcription Factors
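
As a rough illustration of the single-copy criterion the OrthoSNAP abstract refers to, the sketch below checks whether an orthologous group contains at most one gene per species. It is not OrthoSNAP's tree-based splitting and pruning procedure; the function name, the gene identifiers, and the gene-to-species mapping are hypothetical and chosen only for the example.

```python
from collections import Counter


def is_single_copy(gene_to_species):
    """Return True if no species contributes more than one gene.

    gene_to_species is a hypothetical mapping of gene ID -> species name
    for one orthologous group; real pipelines derive it from sequence
    headers or tree tip labels.
    """
    copies_per_species = Counter(gene_to_species.values())
    return all(n == 1 for n in copies_per_species.values())


# A group with one gene per species qualifies as single-copy.
single = {"g1": "sp_A", "g2": "sp_B", "g3": "sp_C"}
# A group in which sp_A has two homologs does not.
multi = {"g1": "sp_A", "g2": "sp_A", "g3": "sp_B"}

print(is_single_copy(single), is_single_copy(multi))
```

Groups that fail this check are the ones the abstract describes as usually being ignored, and from which OrthoSNAP attempts to recover nested single-copy subgroups.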

BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data.

Steenwyk, Jacob L; Buida, Thomas J; Gonçalves, Carla; Goltz, Dayna C; Morales, Grace; Mead, Matthew E; LaBella, Abigail L; Chavez, Christina M; Schmitz, Jonathan E; Hadjifrangiskou, Maria; Li, Yuanning; Rokas, Antonis.

Genetics ; 221(3), 2022 07 04.

Article in English | MEDLINE | ID: mdl-35536198

ABSTRACT

Bioinformatic analysis (such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, file format conversion, and processing and analysis) is integrated into diverse disciplines in the biological sciences. Several command-line pieces of software have been developed to conduct some of these individual analyses, but unified toolkits that conduct all these analyses are lacking. To address this gap, we introduce BioKIT, a versatile command line toolkit that has, upon publication, 42 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we conducted a comprehensive examination of relative synonymous codon usage across 171 fungal genomes that use alternative genetic codes, showed that the novel metric of gene-wise relative synonymous codon usage can accurately estimate gene-wise codon optimization, evaluated the quality and characteristics of 901 eukaryotic genome assemblies, and calculated alignment summary statistics for 10 phylogenomic data matrices. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPi (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).

Subject(s)

Computational Biology , Software , Codon , Sequence Alignment , Sequence Analysis, DNA
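
For readers unfamiliar with relative synonymous codon usage (RSCU), mentioned in the BioKIT abstract, the following is a minimal sketch of the usual definition: a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally. It assumes the standard genetic code (the abstract's fungal analysis involves alternative codes, which would need a different table), and the function name `rscu` is hypothetical; this is not BioKIT's implementation or command-line interface.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds):
    """Compute RSCU for every codon of each amino acid observed in a CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by the amino acid (or stop) they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in family:
            # observed count / (total / number of synonymous codons)
            values[c] = counts[c] * len(family) / total
    return values


# Alanine has four synonymous codons; GCT is used twice, GCG never.
print(rscu("GCTGCTGCCGCA"))
```

A value above 1 indicates a codon used more often than expected under equal usage, and a value below 1 indicates one used less often.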

PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data.

Steenwyk, Jacob L; Buida, Thomas J; LaBella, Abigail L; Li, Yuanning; Shen, Xing-Xing; Rokas, Antonis.

Bioinformatics ; 37(16): 2325-2331, 2021 Aug 25.

Article in English | MEDLINE | ID: mdl-33560364

ABSTRACT

MOTIVATION: Diverse disciplines in biology process and analyze multiple sequence alignments (MSAs) and phylogenetic trees to evaluate their information content, infer evolutionary events and processes and predict gene function. However, automated processing of MSAs and trees remains a challenge due to the lack of a unified toolkit. To fill this gap, we introduce PhyKIT, a toolkit for the UNIX shell environment with 30 functions that process MSAs and trees, including but not limited to estimation of mutation rate, evaluation of sequence composition biases, calculation of the degree of violation of a molecular clock and collapsing bipartitions (internal branches) with low support. RESULTS: To demonstrate the utility of PhyKIT, we detail three use cases: (1) summarizing information content in MSAs and phylogenetic trees for diagnosing potential biases in sequence or tree data; (2) evaluating gene-gene covariation of evolutionary rates to identify functional relationships, including novel ones, among genes and (3) identifying lack of resolution events or polytomies in phylogenetic trees, which are suggestive of rapid radiation events or lack of data. We anticipate PhyKIT will be useful for processing, examining and deriving biological meaning from increasingly large phylogenomic datasets. AVAILABILITY AND IMPLEMENTATION: PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), PyPi (https://pypi.org/project/phykit/) and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/phykit) under the MIT license with extensive documentation and user tutorials (https://jlsteenwyk.com/PhyKIT). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference.

Steenwyk, Jacob L; Buida, Thomas J; Li, Yuanning; Shen, Xing-Xing; Rokas, Antonis.

PLoS Biol ; 18(12): e3001007, 2020 12.

Article in English | MEDLINE | ID: mdl-33264284

ABSTRACT

Highly divergent sites in multiple sequence alignments (MSAs), which can stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Thus, several different trimming strategies have been developed for identifying and removing these sites prior to phylogenetic inference. However, a recent study reported that doing so can worsen inference, underscoring the need for alternative alignment trimming strategies. Here, we introduce ClipKIT, an alignment trimming software that, rather than identifying and removing putatively phylogenetically uninformative sites, instead aims to identify and retain parsimony-informative sites, which are known to be phylogenetically informative. To test the efficacy of ClipKIT, we examined the accuracy and support of phylogenies inferred from 14 different alignment trimming strategies, including those implemented in ClipKIT, across nearly 140,000 alignments from a broad sampling of evolutionary histories. Phylogenies inferred from ClipKIT-trimmed alignments are accurate, robust, and time saving. Furthermore, ClipKIT consistently outperformed other trimming methods across diverse datasets, suggesting that strategies based on identifying and retaining parsimony-informative sites provide a robust framework for alignment trimming.

Subject(s)

Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Computer Simulation , Evolution, Molecular , Models, Genetic , Phylogeny , Software
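
The ClipKIT abstract centers on retaining parsimony-informative sites. Under the standard definition, an alignment column is parsimony-informative when at least two different character states each occur in at least two sequences. The sketch below applies only that definition; it is not ClipKIT's code and omits its other trimming behavior. The function names, the toy alignment, and the choice of `-` and `?` as gap or missing-data symbols are assumptions made for the example.

```python
from collections import Counter


def is_parsimony_informative(column, gap_chars="-?"):
    """True if >= 2 character states each appear in >= 2 sequences.

    gap_chars is an assumed set of symbols treated as missing data.
    """
    counts = Counter(ch.upper() for ch in column if ch not in gap_chars)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def keep_parsimony_informative(alignment):
    """Return a copy of the alignment with only informative columns.

    alignment is a dict of sequence name -> aligned sequence; all
    sequences are assumed to have equal length.
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    keep = [i for i, col in enumerate(columns) if is_parsimony_informative(col)]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


toy = {"a": "AAGT", "b": "AAGC", "c": "ATCA", "d": "ATC-"}
# Column 1 is constant and column 4 has only singleton states,
# so only columns 2 and 3 are retained.
print(keep_parsimony_informative(toy))
```

Constant columns and columns whose variation is confined to singletons are dropped, which is the sense in which the abstract contrasts retaining informative sites with removing divergent ones.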
