Results 1 - 20 of 35
1.
Curr Protoc ; 4(5): e1046, 2024 May.
Article in English | MEDLINE | ID: mdl-38717471

ABSTRACT

Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes.
© 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol: Predicting single nucleotide polymorphisms and structural variations
Support Protocol 1: Downloading publicly available sequencing data
Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer
Support Protocol 3: Converting between VCF and aligned FASTA formats


Subject(s)
Polymorphism, Single Nucleotide , Software , Workflow , Polymorphism, Single Nucleotide/genetics , Computational Biology/methods , Genomics/methods , Molecular Sequence Annotation/methods , Whole Genome Sequencing/methods
2.
Bioinformatics ; 37(20): 3654-3656, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33904572

ABSTRACT

MOTIVATION: Structure-conditioned information statistics have proven useful to predict and visualize tRNA Class-Informative Features (CIFs) and their evolutionary divergences. Although permutation P-values can quantify the significance of CIF divergences between two taxa, their naive Monte Carlo approximation is slow and inaccurate. The Peaks-over-Threshold approach of Knijnenburg et al. (2009) promises improvements to both speed and accuracy of permutation P-values, but has no publicly available API. RESULTS: We present tRNA Structure-Function Mapper (tSFM) v1.0, an open-source, multi-threaded application that efficiently computes, visualizes and assesses significance of single- and paired-site CIFs and their evolutionary divergences for any RNA, protein, gene or genomic element sequence family. Multiple estimators of permutation P-values for CIF evolutionary divergences are provided along with confidence intervals. tSFM is implemented in Python 3 with compiled C extensions and is freely available through GitHub (https://github.com/tlawrence3/tSFM) and PyPI. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available on GitHub at https://github.com/tlawrence3/tSFM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
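
For reference, the naive Monte Carlo approximation that the abstract describes as slow and inaccurate can be sketched as follows. This is not tSFM's implementation or API: the function name, the test statistic (absolute difference in group means), and the add-one correction are illustrative assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Naive Monte Carlo permutation P-value for |mean(a) - mean(b)|.

    Hypothetical helper for illustration only; not part of tSFM.
    """
    rng = random.Random(seed)

    def statistic(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    exceed = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        if statistic(pooled[:n_a], pooled[n_a:]) >= observed:
            exceed += 1

    # Add-one correction: the estimate can never be exactly zero.
    return (exceed + 1) / (n_permutations + 1)
```

With this estimator the smallest reportable P-value is 1 / (n_permutations + 1), so resolving very small P-values requires very many permutations. That cost is one reason tail-modelling approaches such as Peaks-over-Threshold are attractive.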

3.
J Mol Evol ; 89(1-2): 103-116, 2021 02.
Article in English | MEDLINE | ID: mdl-33528599

ABSTRACT

The evolution of tRNA multigene families remains poorly understood, exhibiting unusual phenomena such as functional conversions of tRNA genes through anticodon shift substitutions. We improved FlyBase tRNA gene annotations from twelve Drosophila species, incorporating previously identified ortholog sets to compare substitution rates across tRNA bodies at single-site and base-pair resolution. All rapidly evolving sites fell within the same metal ion-binding pocket that lies at the interface of the two major stacked helical domains. We applied our tRNA Structure-Function Mapper (tSFM) method independently to each Drosophila species and one outgroup species Musca domestica and found that, although predicted tRNA structure-function maps are generally highly conserved in flies, one tRNA Class-Informative Feature (CIF) within the rapidly evolving ion-binding pocket, Cytosine 17 (C17), ancestrally informative for lysylation identity, independently gained asparaginylation identity and substituted in parallel across tRNA(Asn) paralogs at least once, possibly multiple times, during evolution of the genus. In D. melanogaster, most tRNA(Lys) and tRNA(Asn) genes are co-arrayed in one large heterologous gene cluster, suggesting that heterologous gene conversion as well as structural similarities of tRNA-binding interfaces in the closely related asparaginyl-tRNA synthetase (AsnRS) and lysyl-tRNA synthetase (LysRS) proteins may have played a role in these changes. A previously identified Asn-to-Lys anticodon shift substitution in D. ananassae may have arisen to compensate for the convergent and parallel gains of C17 in tRNA(Asn) paralogs in that lineage. Our results underscore the functional and evolutionary relevance of our tRNA structure-function map predictions and illuminate multiple genomic and structural factors contributing to rapid, parallel and compensatory evolution of tRNA multigene families.


Subject(s)
Drosophila melanogaster , RNA, Transfer , Animals , Anticodon/genetics , Drosophila melanogaster/genetics , Genome, Insect , RNA, Transfer/genetics
4.
PLoS Negl Trop Dis ; 14(2): e0007983, 2020 02.
Article in English | MEDLINE | ID: mdl-32106219

ABSTRACT

The development of chemotherapies against eukaryotic pathogens is especially challenging because of both the evolutionary conservation of drug targets between host and parasite, and the evolution of strain-dependent drug resistance. There is a strong need for new nontoxic drugs with broad-spectrum activity against trypanosome parasites such as Leishmania and Trypanosoma. A relatively untested approach is to target macromolecular interactions in parasites rather than small molecular interactions, under the hypothesis that the features specifying macromolecular interactions diverge more rapidly through coevolution. We computed tRNA Class-Informative Features in humans and independently in eight distinct clades of trypanosomes, identifying parasite-specific informative features, including base pairs and base mis-pairs, that are broadly conserved over approximately 250 million years of trypanosome evolution. Validating these observations, we demonstrated biochemically that tRNA:aminoacyl-tRNA synthetase (aaRS) interactions are a promising target for anti-trypanosomal drug discovery. From a marine natural products extract library, we identified several fractions with inhibitory activity toward Leishmania major alanyl-tRNA synthetase (AlaRS) but no activity against the human homolog. These marine natural products extracts showed cross-reactivity towards Trypanosoma cruzi AlaRS indicating the broad-spectrum potential of our network predictions. We also identified Leishmania major threonyl-tRNA synthetase (ThrRS) inhibitors from the same library. We discuss why chemotherapies targeting multiple aaRSs should be less prone to the evolution of resistance than monotherapeutic or synergistic combination chemotherapies targeting only one aaRS.


Subject(s)
Alanine-tRNA Ligase/antagonists & inhibitors , Antiprotozoal Agents/pharmacology , Enzyme Inhibitors/pharmacology , Leishmania/enzymology , Protozoan Proteins/antagonists & inhibitors , Threonine-tRNA Ligase/antagonists & inhibitors , Trypanosoma/drug effects , Alanine-tRNA Ligase/genetics , Alanine-tRNA Ligase/metabolism , Antiprotozoal Agents/chemistry , Enzyme Inhibitors/chemistry , Humans , Leishmania/drug effects , Leishmania/genetics , Leishmaniasis/parasitology , Protozoan Proteins/genetics , Protozoan Proteins/metabolism , Threonine-tRNA Ligase/genetics , Threonine-tRNA Ligase/metabolism , Trypanosoma/enzymology , Trypanosoma/genetics , Trypanosomiasis/parasitology
5.
BMC Evol Biol ; 19(1): 224, 2019 12 09.
Article in English | MEDLINE | ID: mdl-31818253

ABSTRACT

BACKGROUND: Eukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles. Despite recent advances in the phylogenomics of Cyanobacteria, the phylogenetic root of plastids remains controversial. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies are contradictory on whether plastids branch early or late within Cyanobacteria. One underlying cause may be poor fit of evolutionary models to complex phylogenomic data. RESULTS: Using Posterior Predictive Analysis, we show that recently applied evolutionary models poorly fit three phylogenomic datasets curated from cyanobacteria and plastid genomes because of heterogeneities both in substitution processes across sites and in compositions across lineages. To circumvent these sources of bias, we developed CYANO-MLP, a machine learning algorithm that consistently and accurately phylogenetically classifies ("phyloclassifies") cyanobacterial genomes to their clade of origin based on bioinformatically predicted function-informative features in tRNA gene complements. Classification of cyanobacterial genomes with CYANO-MLP is accurate and robust to deletion of clades, unbalanced sampling, and compositional heterogeneity in input tRNA data. CYANO-MLP consistently classifies plastid genomes into a late-branching cyanobacterial sub-clade containing single-cell, starch-producing, nitrogen-fixing ecotypes, consistent with metabolic and gene transfer data. CONCLUSIONS: Phylogenomic data of cyanobacteria and plastids exhibit both site-process heterogeneities and compositional heterogeneities across lineages. These aspects of the data require careful modeling to avoid bias in phylogenomic estimation. Furthermore, we show that amino acid recoding strategies may be insufficient to mitigate bias from compositional heterogeneities.
However, the combination of our novel tRNA-specific strategy with machine learning in CYANO-MLP appears robust to these sources of bias with high accuracy in phyloclassification of cyanobacterial genomes. CYANO-MLP consistently classifies plastids as late-branching Cyanobacteria, consistent with independent evidence from signature-based approaches and some previous phylogenetic studies.


Subject(s)
Cyanobacteria/genetics , Eukaryota/cytology , Eukaryota/genetics , Plastids/genetics , Biological Evolution , Models, Biological , Photosynthesis , Phylogeny , RNA, Transfer , Symbiosis
6.
BMC Bioinformatics ; 20(1): 434, 2019 Aug 22.
Article in English | MEDLINE | ID: mdl-31438847

ABSTRACT

BACKGROUND: The epidermal growth factor receptor (EGFR) is a major regulator of proliferation in tumor cells. Elevated expression levels of EGFR are associated with prognosis and clinical outcomes of patients in a variety of tumor types. There are at least four splice variants of the mRNA encoding four protein isoforms of EGFR in humans, named I through IV. EGFR isoform I is the full-length protein, whereas isoforms II-IV are shorter protein isoforms. Nevertheless, all EGFR isoforms bind the epidermal growth factor (EGF). Although EGFR is an essential target of long-established and successful tumor therapeutics, the exact function and biomarker potential of alternative EGFR isoforms II-IV are unclear, motivating more in-depth analyses. Hence, we analyzed transcriptome data from glioblastoma cell line SF767 to predict target genes regulated by EGFR isoforms II-IV, but not by EGFR isoform I nor other receptors such as HER2, HER3, or HER4. RESULTS: We analyzed the differential expression of potential target genes in a glioblastoma cell line in two nested RNAi experimental conditions and one negative control, contrasting expression with EGF stimulation against expression without EGF stimulation. In one RNAi experiment, we selectively knocked down EGFR splice variant I, while in the other we knocked down all four EGFR splice variants, so the associated effects of EGFR II-IV knock-down can only be inferred indirectly. For this type of nested experimental design, we developed a two-step bioinformatics approach based on the Bayesian Information Criterion for predicting putative target genes of EGFR isoforms II-IV. Finally, we experimentally validated a set of six putative target genes, and we found that qPCR validations confirmed the predictions in all cases. 
CONCLUSIONS: By performing RNAi experiments for three poorly investigated EGFR isoforms, we were able to successfully predict 1140 putative target genes specifically regulated by EGFR isoforms II-IV using the developed Bayesian Gene Selection Criterion (BGSC) approach. This approach is easily utilizable for the analysis of data of other nested experimental designs, and we provide an implementation in R that is easily adaptable to similar data or experimental designs together with all raw datasets used in this study in the BGSC repository, https://github.com/GrosseLab/BGSC .


Subject(s)
Alternative Splicing/genetics , Computational Biology/methods , ErbB Receptors/genetics , Glioblastoma/genetics , Bayes Theorem , Cell Line, Tumor , ErbB Receptors/metabolism , Humans , Probability , Protein Isoforms/genetics , Protein Isoforms/metabolism , RNA Interference , RNA, Small Interfering/metabolism , Signal Transduction
7.
Theor Popul Biol ; 129: 68-80, 2019 10.
Article in English | MEDLINE | ID: mdl-31042487

ABSTRACT

Advances in structural biology of aminoacyl-tRNA synthetases (aaRSs) have revealed incredible diversity in how aaRSs bind their tRNA substrates. The causes of this diversity remain mysterious. We developed a new class of highly rugged fitness landscape models called match landscapes, through which genes encode the assortative interactions of their gene products through the complementarity and identifiability of their structural features. We used results from coding theory to prove bounds and equalities on fitness in match landscapes assuming additive interaction energies, macroscopic aminoacylation kinetics including proofreading, site-specific modifiers of interaction, and selection for translational accuracy in multiple, perfectly encoded site-types. Using genotypes based on extended Hamming codes we show that over a wide array of interface sizes and numbers of encoded cognate pairs, selection for translational accuracy alone is insufficient to displace the tRNA-binding interfaces of aaRSs. Yet, under combined selection for translational accuracy and rate, site-specific modifiers are selected to adaptively displace the tRNA-binding interfaces of non-cognate aaRS-tRNA pairs. We describe a remarkable correspondence between the lengths of perfect RNA (quaternary) codes and the modal sizes of small non-coding RNA families.
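
The code lengths referred to in the last sentence can be enumerated directly. This is a sketch under one assumption: that the perfect quaternary codes in question have Hamming-code parameters, with length n = (q^r - 1)/(q - 1) for r check symbols over an alphabet of q = 4 letters. The function name is hypothetical, and no RNA family sizes are asserted here.

```python
def quaternary_hamming_parameters(max_r, q=4):
    """Return (r, n, k) for Hamming codes over a q-letter alphabet, r = 2..max_r.

    r = number of check symbols, n = code length, k = information symbols.
    """
    params = []
    for r in range(2, max_r + 1):
        n = (q**r - 1) // (q - 1)
        k = n - r
        # Perfect-code (sphere-packing) equality: radius-1 Hamming balls
        # around the q**k codewords exactly fill the space of q**n words.
        assert q**k * (1 + n * (q - 1)) == q**n
        params.append((r, n, k))
    return params


print(quaternary_hamming_parameters(4))
```

For r = 2, 3, 4 this gives lengths 5, 21, and 85.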


Subject(s)
Amino Acyl-tRNA Synthetases/genetics , Genetic Fitness/genetics , RNA, Transfer/genetics , Humans , Metagenomics , Models, Genetic , Models, Statistical
8.
Genome Biol Evol ; 9(7): 1971-1977, 2017 07 01.
Article in English | MEDLINE | ID: mdl-28810711

ABSTRACT

Candida albicans is the most common cause of life-threatening fungal infections in humans, especially in immunocompromised individuals. Crucial to its success as an opportunistic pathogen is the considerable dynamism of its genome, which readily undergoes genetic changes generating new phenotypes and shaping the evolution of new strains. Candida africana is an intriguing C. albicans biovariant strain that exhibits remarkable genetic and phenotypic differences when compared with standard C. albicans isolates. Candida africana is well-known for its low degree of virulence compared with C. albicans and for its inability to produce chlamydospores that C. albicans, characteristically, produces under certain environmental conditions. Chlamydospores are large, spherical structures, whose biological function is still unknown. For this reason, we have sequenced, assembled, and annotated the whole transcriptomes obtained from an efficient C. albicans chlamydospore-producing clinical strain (GE1), compared with the natural chlamydospore-negative C. africana clinical strain (CBS 11016). The transcriptomes of both C. albicans (GE1) and C. africana (CBS 11016) clinical strains, grown under chlamydospore-inducing conditions, were sequenced and assembled into 7,442 (GE1 strain) and 8,370 (CBS 11016 strain) high quality transcripts, respectively. The release of the first assembly of the C. africana transcriptome will allow future comparative studies to better understand the biology and evolution of this important human fungal pathogen.


Subject(s)
Candida albicans/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Spores, Fungal/genetics , Transcriptome , Candida albicans/classification , Gene Expression Regulation, Fungal , Species Specificity
9.
BMC Genomics ; 17(1): 1003, 2016 12 08.
Article in English | MEDLINE | ID: mdl-27927177

ABSTRACT

BACKGROUND: While the CCA sequence at the mature 3' end of tRNAs is conserved and critical for translational function, a genetic template for this sequence is not always contained in tRNA genes. In eukaryotes and Archaea, the CCA ends of tRNAs are synthesized post-transcriptionally by CCA-adding enzymes. In Bacteria, tRNA genes template CCA sporadically. RESULTS: In order to understand the variation in how prokaryotic tRNA genes template CCA, we re-annotated tRNA genes in tRNAdb-CE database version 0.8. Among 132,129 prokaryotic tRNA genes, initiator tRNA genes template CCA at the highest average frequency (74.1%) over all functional classes except selenocysteine and pyrrolysine tRNA genes (88.1% and 100% respectively). Across bacterial phyla and a wide range of genome sizes, many lineages exist in which predominantly initiator tRNA genes template CCA. Convergent and parallel retention of CCA templating in initiator tRNA genes evolved in independent histories of reductive genome evolution in Bacteria. Also, in a majority of cyanobacterial and actinobacterial genera, predominantly initiator tRNA genes template CCA. We also found that a surprising fraction of archaeal tRNA genes template CCA. CONCLUSIONS: We suggest that cotranscriptional synthesis of initiator tRNA CCA 3' ends can complement inefficient processing of initiator tRNA precursors, "bootstrap" rapid initiation of protein synthesis from a non-growing state, or contribute to an increase in cellular growth rates by reducing overheads of mass and energy to maintain nonfunctional tRNA precursor pools. More generally, CCA templating in structurally non-conforming tRNA genes can afford cells robustness and greater plasticity to respond rapidly to environmental changes and stimuli.
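
The per-class templating frequencies reported above amount to counting genes whose annotated 3' end is CCA. The sketch below illustrates that tally. It assumes each sequence is the sense strand of an annotated tRNA gene extending through the 3' terminal positions; the function names and input layout are hypothetical, and the paper's re-annotation procedure is not reproduced.

```python
def templates_cca(gene_seq):
    """True if a tRNA gene sequence (sense strand, 5'->3') ends in CCA."""
    return gene_seq.strip().upper().endswith("CCA")


def cca_frequency_by_class(genes):
    """Fraction of genes templating CCA in each functional class.

    genes: iterable of (functional_class, sequence) pairs.
    """
    counts = {}
    for functional_class, seq in genes:
        hits, total = counts.get(functional_class, (0, 0))
        counts[functional_class] = (hits + int(templates_cca(seq)), total + 1)
    return {cls: hits / total for cls, (hits, total) in counts.items()}
```

Real annotations need more care than a suffix check, since the gene's 3' boundary must be placed correctly before the terminal trinucleotide is meaningful.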


Subject(s)
Bacteria/genetics , RNA Precursors/metabolism , Anticodon , Archaea/genetics , Base Pairing , Base Sequence , Databases, Genetic , Genes, Archaeal , Genes, Bacterial , RNA Precursors/chemistry , RNA, Transfer, Met/chemistry , RNA, Transfer, Met/metabolism
10.
Front Genet ; 6: 172, 2015.
Article in English | MEDLINE | ID: mdl-26042145

ABSTRACT

FAST (FAST Analysis of Sequences Toolbox) provides simple, powerful open source command-line tools to filter, transform, annotate and analyze biological sequence data. Modeled after the GNU (GNU's Not Unix) Textutils such as grep, cut, and tr, FAST tools such as fasgrep, fascut, and fastr make it easy to rapidly prototype expressive bioinformatic workflows in a compact and generic command vocabulary. Compact combinatorial encoding of data workflows with FAST commands can simplify the documentation and reproducibility of bioinformatic protocols, supporting better transparency in biological data science. Interface self-consistency and conformity with conventions of GNU, Matlab, Perl, BioPerl, R, and GenBank help make FAST easy and rewarding to learn. FAST automates numerical, taxonomic, and text-based sorting, selection and transformation of sequence records and alignment sites based on content, index ranges, descriptive tags, annotated features, and in-line calculated analytics, including composition and codon usage. Automated content- and feature-based extraction of sites and support for molecular population genetic statistics make FAST useful for molecular evolutionary analysis. FAST is portable, easy to install and secure thanks to the relative maturity of its Perl and BioPerl foundations, with stable releases posted to CPAN. Development as well as a publicly accessible Cookbook and Wiki are available on the FAST GitHub repository at https://github.com/tlawrence3/FAST. The default data exchange format in FAST is Multi-FastA (specifically, a restriction of BioPerl FastA format). Sanger and Illumina 1.8+ FastQ formatted files are also supported. FAST makes it easier for non-programmer biologists to interactively investigate and control biological data at the speed of thought.
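
To illustrate the kind of record-level filtering described above, here is a minimal Python sketch that selects Multi-FastA records whose description line matches a regular expression. It only mimics the general idea of a grep-like sequence tool. It is not FAST code and does not reproduce fasgrep's interface or options; both function names are hypothetical.

```python
import re


def parse_fasta(text):
    """Parse Multi-FastA text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def grep_records(records, pattern):
    """Keep records whose description line matches the regular expression."""
    rx = re.compile(pattern)
    return [(desc, seq) for desc, seq in records if rx.search(desc)]
```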

11.
Neural Dev ; 10: 11, 2015 Apr 21.
Article in English | MEDLINE | ID: mdl-25896902

ABSTRACT

BACKGROUND: Gene expression patterns are determined by rates of mRNA transcription and decay. While transcription is known to regulate many developmental processes, the role of mRNA decay is less extensively defined. A critical step toward defining the role of mRNA decay in neural development is to measure genome-wide mRNA decay rates in neural tissue. Such information should reveal the degree to which mRNA decay contributes to differential gene expression and provide a foundation for identifying regulatory mechanisms that affect neural mRNA decay. RESULTS: We developed a technique that allows genome-wide mRNA decay measurements in intact Drosophila embryos, across all tissues and specifically in the nervous system. Our approach revealed neural-specific decay kinetics, including stabilization of transcripts encoding regulators of axonogenesis and destabilization of transcripts encoding ribosomal proteins and histones. We also identified correlations between mRNA stability and physiologic properties of mRNAs; mRNAs that are predicted to be translated within axon growth cones or dendrites have long half-lives while mRNAs encoding transcription factors that regulate neurogenesis have short half-lives. A search for candidate cis-regulatory elements identified enrichment of the Pumilio recognition element (PRE) in mRNAs encoding regulators of neurogenesis. We found that decreased expression of the RNA-binding protein Pumilio stabilized predicted neural mRNA targets and that a PRE is necessary to trigger reporter-transcript decay in the nervous system. CONCLUSIONS: We found that differential mRNA decay contributes to the relative abundance of transcripts involved in cell-fate decisions, axonogenesis, and other critical events during Drosophila neural development. Neural-specific decay kinetics and the functional specificity of mRNA decay suggest the existence of a dynamic neurodevelopmental mRNA decay network. 
We found that Pumilio is one component of this network, revealing a novel function for this RNA-binding protein.
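
Decay rates and half-lives such as those discussed above are commonly related through first-order kinetics, N(t) = N0 * exp(-k t), which gives a half-life of ln(2) / k. The sketch below assumes that model and fits k by least squares on log abundances. It is an illustration only and does not reproduce the measurement or normalization procedure used in the study; the function names are hypothetical.

```python
import math


def decay_rate(times, abundances):
    """Estimate a first-order decay rate k from a time course.

    Fits ln(abundance) = ln(N0) - k * t by ordinary least squares.
    """
    logs = [math.log(a) for a in abundances]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return -sxy / sxx


def half_life(k):
    """Half-life implied by a first-order decay rate k."""
    return math.log(2) / k
```

Under this model a transcript whose abundance halves every 10 minutes has k = ln(2) / 10 per minute.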


Subject(s)
Drosophila melanogaster/genetics , Gene Expression Regulation, Developmental/genetics , Nervous System/embryology , Neurogenesis/genetics , RNA Stability/physiology , RNA, Messenger/metabolism , 3' Untranslated Regions/genetics , Animals , Dactinomycin/pharmacology , Dendrites/metabolism , Drosophila Proteins/biosynthesis , Drosophila Proteins/genetics , Drosophila Proteins/physiology , Drosophila melanogaster/embryology , Drosophila melanogaster/metabolism , Embryo, Nonmammalian/metabolism , Gene Ontology , Growth Cones/metabolism , Half-Life , Nervous System/metabolism , RNA-Binding Proteins/biosynthesis , RNA-Binding Proteins/genetics , RNA-Binding Proteins/physiology , Regulatory Sequences, Ribonucleic Acid/genetics , Thiouridine/metabolism , Transcription, Genetic/drug effects , Transcription, Genetic/genetics , Zygote/metabolism
12.
Front Psychol ; 5: 410, 2014.
Article in English | MEDLINE | ID: mdl-24904450

ABSTRACT

Recent eye-tracking research typically relies on constrained visual contexts, in particular goal-oriented ones in which participants view a small array of objects on a computer screen and perform some overt decision or identification. Eye-tracking paradigms that use pictures as a measure of word or sentence comprehension are sometimes criticized as ecologically invalid because pictures and explicit tasks are not always present during language comprehension. This study compared the comprehension of sentences with two different grammatical forms: the past progressive (e.g., was walking), which emphasizes the ongoing nature of actions, and the simple past (e.g., walked), which emphasizes the end-state of an action. The results showed that the distribution and timing of eye movements mirror the underlying conceptual structure of this linguistic difference in the absence of any visual stimuli or task constraint: fixations were shorter and saccades were more dispersed across the screen when listening to the past progressive stories, as if listeners were thinking about more dynamic events. Thus, eye movement data suggest that visual inputs or an explicit task are unnecessary to elicit analog representations of features such as movement, which could be a key perceptual component of grammatical comprehension.

13.
PLoS Comput Biol ; 10(2): e1003454, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24586126

ABSTRACT

Molecular phylogenetics and phylogenomics are subject to noise from horizontal gene transfer (HGT) and bias from convergence in macromolecular compositions. Extensive variation in size, structure and base composition of alphaproteobacterial genomes has complicated their phylogenomics, sparking controversy over the origins and closest relatives of the SAR11 strains. SAR11 are highly abundant, cosmopolitan aquatic Alphaproteobacteria with streamlined, A+T-biased genomes. A dominant view holds that SAR11 are monophyletic and related to both Rickettsiales and the ancestor of mitochondria. Other studies dispute this, finding evidence of a polyphyletic origin of SAR11 with most strains distantly related to Rickettsiales. Although careful evolutionary modeling can reduce bias and noise in phylogenomic inference, entirely different approaches may be useful to extract robust phylogenetic signals from genomes. Here we develop simple phyloclassifiers from bioinformatically derived tRNA Class-Informative Features (CIFs), features predicted to target tRNAs for specific interactions within the tRNA interaction network. Our tRNA CIF-based model robustly and accurately classifies alphaproteobacterial genomes into one of seven undisputed monophyletic orders or families, despite great variability in tRNA gene complement sizes and base compositions. Our model robustly rejects monophyly of SAR11, classifying all but one strain as Rhizobiales with strong statistical support. Yet remarkably, conventional phylogenetic analysis of tRNAs classifies all SAR11 strains identically as Rickettsiales. We attribute this discrepancy to convergence of SAR11 and Rickettsiales tRNA base compositions. Thus, tRNA CIFs appear more robust to compositional convergence than tRNA sequences generally. Our results suggest that tRNA-CIF-based phyloclassification is robust to HGT of components of the tRNA interaction network, such as aminoacyl-tRNA synthetases. 
We explain why tRNAs are especially advantageous for prediction of traits governing macromolecular interactions from genomic data, and why such traits may be advantageous in the search for robust signals to address difficult problems in classification and phylogeny.


Subject(s)
Alphaproteobacteria/classification , Alphaproteobacteria/genetics , RNA, Bacterial/genetics , RNA, Transfer/genetics , Bacterial Proteins/genetics , Computational Biology , Evolution, Molecular , Gene Regulatory Networks , Gene Transfer, Horizontal , Genome, Bacterial , Models, Genetic , Phylogeny , Rhodospirillales/classification , Rhodospirillales/genetics
14.
Evol Bioinform Online ; 9: 111-25, 2013.
Article in English | MEDLINE | ID: mdl-23532367

ABSTRACT

Code-message coevolution (CMC) models represent coevolution of a genetic code and a population of protein-coding genes ("messages"). Formally, CMC models are sets of quasispecies coupled together for fitness through a shared genetic code. Although CMC models display plausible explanations for the origin of multiple genetic code traits by natural selection, useful modern implementations of CMC models are not currently available. To meet this need we present CMCpy, an object-oriented Python API and command-line executable front-end that can reproduce all published results of CMC models. CMCpy implements multiple solvers for leading eigenpairs of quasispecies models. We also present novel analytical results that extend and generalize applications of perturbation theory to quasispecies models and pioneer the application of a homotopy method for quasispecies with non-unique maximally fit genotypes. Our results therefore facilitate the computational and analytical study of a variety of evolutionary systems. CMCpy is free open-source software available from http://pypi.python.org/pypi/CMCpy/.
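
In a quasispecies model the equilibrium genotype distribution is the leading eigenvector of the mutation-selection matrix, and the corresponding eigenvalue is the equilibrium mean fitness. The sketch below finds that eigenpair by power iteration. It is not CMCpy's API or one of its solvers: the function name is hypothetical, and the code assumes a small non-negative matrix with a unique, positive leading eigenvalue.

```python
def leading_eigenpair(matrix, iterations=1000, tol=1e-12):
    """Power iteration for the leading eigenpair of a non-negative matrix.

    matrix: list of rows, e.g. W[i][j] = mutation probability j -> i
            times the fitness of genotype j (an assumed convention).
    Returns (eigenvalue, eigenvector), with the eigenvector summing to 1.
    """
    n = len(matrix)
    vec = [1.0 / n] * n
    value = 0.0
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        total = sum(new)  # eigenvalue estimate, because vec sums to 1
        new = [x / total for x in new]
        converged = max(abs(a - b) for a, b in zip(new, vec)) < tol
        vec, value = new, total
        if converged:
            break
    return value, vec
```

Power iteration converges slowly when the two largest eigenvalues are close, and it does not handle non-unique maximally fit genotypes. The abstract's homotopy method and multiple solvers target such cases.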

15.
FEBS Lett ; 584(2): 325-33, 2010 Jan 21.
Article in English | MEDLINE | ID: mdl-19944694

ABSTRACT

I review recent developments in computational analysis of tRNA identity. I suggest that the tRNA-protein interaction network is hierarchically organized, and coevolutionarily flexible. Its functional specificity of recognition and discrimination persists despite generic structural constraints and perturbative evolutionary forces. This flexibility comes from its arbitrary nature as a self-recognizing shape code. A revisualization of predicted Proteobacterial tRNA identity highlights open research problems. tRNA identity elements and their coevolution with proteins must be mapped structurally over the Tree of Life. These traits can also resolve deep roots in the Tree. I show that histidylation identity elements phylogenetically reposition Pelagibacter ubique within alpha-Proteobacteria.


Subject(s)
Computational Biology/methods , Computational Biology/trends , RNA, Transfer/chemistry , RNA, Transfer/genetics , Sequence Analysis, RNA/methods , RNA Processing, Post-Transcriptional , RNA, Transfer/metabolism , Sequence Alignment
16.
BMC Bioinformatics ; 10: 271, 2009 Aug 28.
Article in English | MEDLINE | ID: mdl-19715597

ABSTRACT

BACKGROUND: Promoter identification is a first step in the quest to explain gene regulation in bacteria. It has been demonstrated that the initiation of bacterial transcription depends upon the stability and topology of DNA in the promoter region as well as the binding affinity between the RNA polymerase sigma-factor and promoter. However, promoter prediction algorithms to date have not explicitly used an ensemble of these factors as predictors. In addition, most promoter models have been trained on data from Escherichia coli. Although it has been shown that transcriptional mechanisms are similar among various bacteria, it is quite possible that the differences between Escherichia coli and Chlamydia trachomatis are large enough to recommend an organism-specific modeling effort.
RESULTS: Here we present an iterative stochastic model-building procedure that combines such biophysical metrics as DNA stability, curvature, twist and stress-induced DNA duplex destabilization along with duration hidden Markov model parameters to model Chlamydia trachomatis sigma66 promoters from 29 experimentally verified sequences. Initially, iterative duration hidden Markov modeling of the training set sequences provides a scoring algorithm for Chlamydia trachomatis RNA polymerase sigma66/DNA binding. Subsequently, an iterative application of Stepwise Binary Logistic Regression selects multiple promoter predictors and deletes/replaces training set sequences to determine an optimal training set. The resulting model predicts the final training set with a high degree of accuracy and provides insights into the structure of the promoter region. Model-based genome-wide predictions are provided so that optimal promoter candidates can be experimentally evaluated, and refined models developed. Co-predictions with three other algorithms are also supplied to enhance reliability.
CONCLUSION: This strategy and resulting model support the conjecture that DNA biophysical properties, along with RNA polymerase sigma-factor/DNA binding, collaboratively contribute to a sequence's ability to promote transcription. This work provides a baseline model that can evolve as new Chlamydia trachomatis sigma66 promoters are identified with assistance from the provided genome-wide predictions. The proposed methodology is ideal for organisms with few identified promoters and relatively small genomes.


Subject(s)
Bacterial Proteins/genetics , Chlamydia trachomatis/genetics , Computational Biology/methods , Markov Chains , Promoter Regions, Genetic , Sigma Factor/chemistry , Sigma Factor/genetics , Algorithms , Bacterial Proteins/chemistry , Biophysics/methods , Genes, Bacterial , Genome, Bacterial
17.
Proteins ; 77(3): 499-508, 2009 Nov 15.
Article in English | MEDLINE | ID: mdl-19507241

ABSTRACT

Protein structures change during evolution in response to mutations. Here, we analyze the mapping between sequence and structure in a set of structurally aligned protein domains. To avoid artifacts, we restricted our attention to the core components of these structures. We found that, on average, protein cores evolve linearly with evolutionary distance (amino acid substitutions per site). This is true irrespective of which measure of structural change we used, whether RMSD or discrete structural descriptors for secondary structure, accessibility, or contacts. This linear response allows us to quantify the claim that structure is more conserved than sequence. Using structural alphabets of similar cardinality to the sequence alphabet, structural cores evolve three to ten times more slowly than sequences. Although we observed an average linear response, we found a wide variance. Different domain families varied fivefold in structural response to evolution. An attempt to categorically analyze this variance among subgroups by structural and functional category revealed only one statistically significant trend. This trend can be explained by the fact that beta-sheets change faster than alpha-helices, most likely because they are shorter and because change occurs at the ends of the secondary structure elements.
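
As a rough illustration of the linear response described in this abstract, the sketch below fits an ordinary least-squares line to structural change versus evolutionary distance and reads a relative rate off the slope. The function names and the data points are hypothetical and are not taken from the paper. The interpretation of `1 / slope` as "times slower than sequence" assumes both axes are expressed as changes per site.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def relative_slowdown(slope):
    """How many times slower structure changes than sequence.

    Assumption: sequence change per unit distance is 1 (substitutions
    per site), so the ratio is simply 1 / slope.
    """
    return 1.0 / slope


# Hypothetical data: evolutionary distance vs. fraction of changed
# structural-alphabet letters per core site.
distance = [0.1, 0.2, 0.4, 0.8]
structural_change = [0.02, 0.04, 0.08, 0.16]
slope, intercept = fit_line(distance, structural_change)
print(slope, intercept, relative_slowdown(slope))
```

A slope well below 1 on such a plot would correspond to structure being more conserved than sequence.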


Subject(s)
Computational Biology/methods , Proteins/chemistry , Amino Acids/chemistry , Conserved Sequence , Databases, Protein , Evolution, Molecular , Molecular Conformation , Mutation , Protein Conformation , Protein Structure, Secondary , Protein Structure, Tertiary , Proteomics/methods , Regression Analysis , Sequence Alignment
18.
Genome Res ; 18(6): 888-99, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18347326

ABSTRACT

Genome data are increasingly important in the computational identification of novel regulatory non-coding RNAs (ncRNAs). However, most ncRNA gene-finders are either specialized to well-characterized ncRNA gene families or require comparisons of closely related genomes. We developed a method for de novo screening for ncRNA genes with a nucleotide composition that stands out against the background genome based on a partial sum process. We compared the performance when assuming independent and first-order Markov-dependent nucleotides, respectively, and used Karlin-Altschul and Karlin-Dembo statistics to evaluate the significance of hits. We hypothesized that a first-order Markov-dependent process might have better power to detect ncRNA genes since nearest-neighbor models have been shown to be successful in predicting RNA structures. A model based on a first-order partial sum process (analyzing overlapping dinucleotides) had better sensitivity and specificity than a zeroth-order model when applied to the AT-rich genome of the amoeba Dictyostelium discoideum. In this genome, we detected 94% of previously known ncRNA genes (at this sensitivity, the false positive rate was estimated to be 25% in a simulated background). The predictions were further refined by clustering candidate genes according to sequence similarity and/or searching for an ncRNA-associated upstream element. We experimentally verified six out of 10 tested ncRNA gene predictions. We conclude that higher-order models, in combination with other information, are useful for identification of novel ncRNA gene families in single-genome analysis of D. discoideum. Our generalizable approach extends the range of genomic data that can be searched for novel ncRNA genes using well-grounded statistical methods.
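
The first-order partial sum idea in this abstract can be sketched as a scan over overlapping dinucleotides, where each step adds a log-odds score and the running sum is reset whenever it drops to zero. This is a minimal sketch under stated assumptions, not the authors' implementation. The use of an explicit target sequence to estimate the foreground model, the pseudocount, and all function names are hypothetical. Significance evaluation (Karlin-Altschul or Karlin-Dembo statistics) is not shown.

```python
import math
from collections import Counter


def dinucleotide_log_odds(background, target, alphabet="ACGT", pseudocount=1.0):
    """Score each dinucleotide (a, b) as log P_target(b|a) / P_background(b|a)."""

    def transition_probs(seq):
        counts = Counter(zip(seq, seq[1:]))  # overlapping dinucleotides
        probs = {}
        for a in alphabet:
            total = sum(counts[(a, b)] + pseudocount for b in alphabet)
            for b in alphabet:
                probs[(a, b)] = (counts[(a, b)] + pseudocount) / total
        return probs

    bg = transition_probs(background)
    tg = transition_probs(target)
    return {pair: math.log(tg[pair] / bg[pair]) for pair in bg}


def max_scoring_segment(seq, scores):
    """Return (score, start, end) of the best-scoring run, end exclusive.

    The running partial sum restarts at zero whenever it becomes
    non-positive; unknown dinucleotides contribute a score of 0.
    """
    best = (0.0, 0, 0)
    running, start = 0.0, 0
    for i in range(len(seq) - 1):
        running += scores.get((seq[i], seq[i + 1]), 0.0)
        if running <= 0.0:
            running, start = 0.0, i + 1
        elif running > best[0]:
            best = (running, start, i + 2)
    return best


# Toy example: a GC-rich insert inside an AT-rich background.
scores = dinucleotide_log_odds("AT" * 200, "GC" * 20)
seq = "AT" * 10 + "GCGCGCGC" + "AT" * 10
score, start, end = max_scoring_segment(seq, scores)
print(seq[start:end])
```

Replacing the dinucleotide scores with single-nucleotide log-odds would give the zeroth-order variant that the abstract compares against.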


Subject(s)
Dictyostelium/genetics , Genomics/methods , RNA, Untranslated/genetics , Adenine/analysis , Animals , Base Composition , Base Sequence , Conserved Sequence , Genes, Protozoan , Genome, Protozoan , Markov Chains , Molecular Sequence Data , Multigene Family , Nucleotides/analysis , RNA, Untranslated/chemistry , RNA, Untranslated/metabolism , Thymine/analysis
19.
Nature ; 450(7167): 203-18, 2007 Nov 08.
Article in English | MEDLINE | ID: mdl-17994087

ABSTRACT

Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.


Subject(s)
Drosophila/classification , Drosophila/genetics , Evolution, Molecular , Genes, Insect/genetics , Genome, Insect/genetics , Genomics , Phylogeny , Animals , Codon/genetics , DNA Transposable Elements/genetics , Drosophila/immunology , Drosophila/metabolism , Drosophila Proteins/genetics , Gene Order/genetics , Genome, Mitochondrial/genetics , Immunity/genetics , Multigene Family/genetics , RNA, Untranslated/genetics , Reproduction/genetics , Sequence Alignment , Sequence Analysis, DNA , Synteny/genetics
20.
J Bacteriol ; 189(24): 8993-9000, 2007 Dec.
Article in English | MEDLINE | ID: mdl-17951392

ABSTRACT

Expression of minigenes encoding tetra- or pentapeptides MXLX or MXLXV (E peptides), where X is a nonpolar amino acid, renders cells erythromycin resistant whereas expression of minigenes encoding tripeptide MXL does not. By using a 3A' reporter gene system beginning with an E-peptide-encoding sequence, we asked whether the codons UGG and GGG, which are known to promote peptidyl-tRNA drop-off at early positions in mRNA, would result in a phenotype of erythromycin resistance if located after this sequence. We find that UGG or GGG, at either position +4 or +5, without a following stop codon, is associated with an erythromycin resistance phenotype upon gene induction. Our results suggest that, while a stop codon at +4 gives a tripeptide product (MIL) and erythromycin sensitivity, UGG or GGG codons at the same position give a tetrapeptide product (MILW or MILG) and phenotype of erythromycin resistance. Thus, the drop-off event on GGG or UGG codons occurs after incorporation of the corresponding amino acid into the growing peptide chain. Drop-off gives rise to a peptidyl-tRNA where the peptide moiety functionally mimics a minigene peptide product of the type previously associated with erythromycin resistance. Several genes in Escherichia coli fulfill the requirements of high mRNA expression and an E-peptide sequence followed by UGG or GGG at position +4 or +5 and should potentially be able to give an erythromycin resistance phenotype.
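
The sequence requirement stated at the end of this abstract (an E-peptide start followed by UGG or GGG at position +4 or +5) can be checked with a short codon scan. The sketch below is one possible reading of that requirement, not the authors' screening procedure. The set of amino acids treated as nonpolar and the function name are assumptions, and the mRNA expression criterion is not modeled.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

NONPOLAR = set("AVLIMFWPG")  # assumption: one common nonpolar grouping
DROP_OFF_CODONS = {"TGG", "GGG"}  # UGG and GGG written as DNA


def has_e_peptide_drop_off_motif(orf):
    """True if the ORF starts M-X-L (X nonpolar) with UGG/GGG at +4 or +5."""
    orf = orf.upper().replace("U", "T")
    # Only the first five complete codons matter.
    codons = [orf[i:i + 3] for i in range(0, min(len(orf), 15) - 2, 3)]
    if len(codons) < 4:
        return False
    aa = [CODON_TABLE.get(codon, "?") for codon in codons]
    if codons[0] != "ATG" or aa[1] not in NONPOLAR or aa[2] != "L":
        return False
    if codons[3] in DROP_OFF_CODONS:  # drop-off codon at +4
        return True
    # Drop-off codon at +5 after a nonpolar residue at +4.
    return len(codons) == 5 and aa[3] in NONPOLAR and codons[4] in DROP_OFF_CODONS


print(has_e_peptide_drop_off_motif("AUGAUUCUGUGG"))  # M I L W
```

Applied to the first codons of each annotated gene, a check like this would only produce candidates; whether a candidate confers resistance is an experimental question.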


Subject(s)
Anti-Bacterial Agents/pharmacology , Codon/genetics , Drug Resistance, Bacterial , Erythromycin/pharmacology , Escherichia coli/drug effects , Protein Biosynthesis , RNA, Transfer, Amino Acyl/metabolism , Genes, Reporter , Oligopeptides/biosynthesis , Staphylococcal Protein A/biosynthesis , Staphylococcal Protein A/genetics