Results 1 - 20 of 31
1.
EBioMedicine ; 79: 103990, 2022 May.
Article in English | MEDLINE | ID: mdl-35405384

ABSTRACT

BACKGROUND: The sarbecovirus subgenus of betacoronaviruses is widely distributed throughout bats and other mammals globally and includes human pathogens, SARS-CoV and SARS-CoV-2. The most studied sarbecoviruses use the host protein, ACE2, to infect cells. Curiously, the majority of sarbecoviruses identified to date do not use ACE2 and cannot readily acquire ACE2 binding through point mutations. We previously screened a broad panel of sarbecovirus spikes for cell entry and observed bat-derived viruses that could infect human cells, independent of ACE2. Here we further investigate the sequence determinants of cell entry for ACE2-independent bat sarbecoviruses. METHODS: We employed a network science-based approach to visualize sequence and entry phenotype similarities across the diversity of sarbecovirus spike protein sequences. We then verified these computational results and mapped determinants of viral entry into human cells using recombinant chimeric spike proteins within an established viral pseudotype assay. FINDINGS: We show ACE2-independent viruses that can infect human and bat cells in culture have a similar putative receptor binding motif, which can impart human cell entry into other bat sarbecovirus spikes that cannot otherwise infect human cells. These sequence determinants of human cell entry map to a surface-exposed protrusion from the predicted bat sarbecovirus spike receptor binding domain structure. INTERPRETATION: Our findings provide further evidence of a group of bat-derived sarbecoviruses with zoonotic potential and demonstrate the utility of applying network science to phenotypic mapping and prediction. FUNDING: This work was supported by Washington State University and the Paul G. Allen School for Global Health.


Subjects
COVID-19 , Chiroptera , Severe Acute Respiratory Syndrome-Related Coronavirus , Angiotensin-Converting Enzyme 2/genetics , Animals , Humans , SARS-CoV-2 , Coronavirus Spike Glycoprotein/metabolism , Virus Internalization
2.
Front Bioinform ; 1: 749008, 2021.
Article in English | MEDLINE | ID: mdl-36303767

ABSTRACT

Advances in genome sequencing have accelerated the growth of sequenced genomes but at a cost in the quality of genome annotation. At the same time, computational analysis is widely used for protein annotation, but a dearth of experimental verification has contributed to inaccurate annotation as well as to annotation error propagation. Thus, a tool to help life scientists with accurate protein annotation would be useful. In this work we describe a website we have developed, the Protein Annotation Surveillance Site (PASS), which provides such a tool. This website consists of three major components: a database of homologous clusters of more than eight million protein sequences deduced from the representative genomes of bacteria, archaea, eukarya, and viruses, together with sequence information; a machine-learning software tool which periodically queries the UniprotKB database to determine whether protein function has been experimentally verified; and a query-able webpage where the FASTA headers of sequences from the cluster best matching an input sequence are returned. The user can choose from these sequences to create a sequence similarity network to assist in annotation or else use their expert knowledge to choose an annotation from the cluster sequences. Illustrations demonstrating use of this website are presented.

3.
Sci Rep ; 10(1): 11033, 2020 07 03.
Article in English | MEDLINE | ID: mdl-32620856

ABSTRACT

With the ever-increasing availability of whole-genome sequences, machine-learning approaches can be used as an alternative to traditional alignment-based methods for identifying new antimicrobial-resistance genes. Such approaches are especially helpful when pathogens cannot be cultured in the lab. In previous work, we proposed a game-theory-based feature evaluation algorithm. When using the protein characteristics identified by this algorithm, called 'features' in machine learning, our model accurately identified antimicrobial resistance (AMR) genes in Gram-negative bacteria. Here we extend our study to Gram-positive bacteria showing that coupling game-theory-identified features with machine learning achieved classification accuracies between 87% and 90% for genes encoding resistance to the antibiotics bacitracin and vancomycin. Importantly, we present a standalone software tool that implements the game-theory algorithm and machine-learning model used in these studies.


Subjects
Anti-Bacterial Agents/pharmacology , Bacteria/genetics , Computational Biology/methods , Bacterial Drug Resistance , Bacitracin/pharmacology , Bacteria/drug effects , Game Theory , Machine Learning , Microbial Sensitivity Tests , Software , Vancomycin/pharmacology , Whole Genome Sequencing
4.
Microorganisms ; 8(2), 2020 Feb 24.
Article in English | MEDLINE | ID: mdl-32102454

ABSTRACT

Reconstructing and visualizing phylogenetic relationships among living organisms is a fundamental challenge because not all organisms share the same genes. As a result, the first phylogenetic visualizations employed a single gene, e.g., rRNA genes, sufficiently conserved to be present in all organisms but divergent enough to provide discrimination between groups. As more genome data became available, researchers began concatenating different combinations of genes or proteins to construct phylogenetic trees believed to be more robust because they incorporated more information. However, the genes or proteins chosen were based on ad hoc approaches. The large number of complete genome sequences available today allows the use of whole genomes to analyze relationships among organisms rather than using an ad hoc set of genes. We present a systematic approach for constructing a phylogenetic tree based on simultaneously clustering the complete proteomes of 360 bacterial species. From the homologous clusters, we identify 49 protein sequences shared by 99% of the organisms to build a tree. Of the 49 sequences, 47 have homologous sequences in both archaea and eukarya. The clusters are also used to create a network from which bacterial species with horizontally-transferred genes from other phyla are identified.
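
The selection step described in this abstract (keeping only clusters represented in nearly all organisms) can be illustrated with a small Python sketch. The data layout, the function name `widely_shared_clusters`, and the toy organism labels are assumptions made for this example; the abstract does not show the authors' own pipeline.

```python
from typing import Dict, List, Set


def widely_shared_clusters(
    clusters: Dict[str, Set[str]],
    n_organisms: int,
    min_fraction: float = 0.99,
) -> List[str]:
    """Return IDs of clusters whose members cover at least min_fraction of organisms.

    clusters maps a (hypothetical) cluster ID to the set of organisms that
    contribute at least one sequence to that cluster.
    """
    return sorted(
        cluster_id
        for cluster_id, organisms in clusters.items()
        if len(organisms) / n_organisms >= min_fraction
    )


# Toy input: four organisms, three clusters (all names are made up).
toy_clusters = {
    "c1": {"orgA", "orgB", "orgC", "orgD"},
    "c2": {"orgA", "orgB"},
    "c3": {"orgA", "orgB", "orgC"},
}

shared_strict = widely_shared_clusters(toy_clusters, n_organisms=4)
shared_relaxed = widely_shared_clusters(toy_clusters, n_organisms=4, min_fraction=0.75)
print(shared_strict, shared_relaxed)
```

With a 99% threshold only the cluster present in every toy organism survives; relaxing the threshold admits clusters that are missing from a few organisms, which is the trade-off the 99% cut-off in the abstract addresses.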

5.
Sci Rep ; 10(1): 1846, 2020 Jan 30.
Article in English | MEDLINE | ID: mdl-31996773

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

6.
Sci Rep ; 9(1): 14487, 2019 10 09.
Article in English | MEDLINE | ID: mdl-31597945

ABSTRACT

The increasing prevalence of antimicrobial-resistant bacteria drives the need for advanced methods to identify antimicrobial-resistance (AMR) genes in bacterial pathogens. With the availability of whole genome sequences, best-hit methods can be used to identify AMR genes by comparing unknown sequences with known AMR sequences in existing online repositories. Nevertheless, these methods may not perform well when identifying resistance genes whose sequences have low identity with known sequences. We present a machine learning approach that uses protein sequences, with sequence identity ranging between 10% and 90%, as an alternative to conventional DNA sequence alignment-based approaches to identify putative AMR genes in Gram-negative bacteria. By using game theory to choose which protein characteristics to use in our machine learning model, we can predict AMR protein sequences for Gram-negative bacteria with an accuracy ranging from 93% to 99%. To obtain similar classification results with BLASTp, identity thresholds as low as 53% were required.


Subjects
Bacterial Drug Resistance/genetics , Bacterial Genes , Gram-Negative Bacteria/drug effects , Gram-Negative Bacteria/genetics , Algorithms , Amino Acid Sequence , Anti-Bacterial Agents/pharmacology , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Enterobacter/drug effects , Enterobacter/genetics , Game Theory , Gram-Negative Bacteria/pathogenicity , Humans , Machine Learning , Pseudomonas/drug effects , Pseudomonas/genetics , Support Vector Machine , Vibrio/drug effects , Vibrio/genetics , Whole Genome Sequencing
7.
Front Microbiol ; 10: 1391, 2019.
Article in English | MEDLINE | ID: mdl-31293540

ABSTRACT

Type IV secretion systems (T4SS) are used by a number of bacterial pathogens to attack the host cell. The complex protein structure of the T4SS is used to directly translocate effector proteins into host cells, often causing fatal diseases in humans and animals. Identification of effector proteins is the first step in understanding how they function to cause virulence and pathogenicity. Accurate prediction of effector proteins via a machine learning approach can assist in the process of their identification. The main goal of this study is to predict a set of candidate effectors for the tick-borne pathogen Anaplasma phagocytophilum, the causative agent of anaplasmosis in humans. To our knowledge, we present the first computational study for effector prediction with a focus on A. phagocytophilum. In a previous study, we systematically selected a set of optimal features from more than 1,000 possible protein characteristics for predicting T4SS effector candidates. This was followed by a study of the features using the proteome of Legionella pneumophila strain Philadelphia deduced from its complete genome. In this manuscript we introduce the OPT4e software package for Optimal-features Predictor for T4SS Effector proteins. An earlier version of OPT4e was verified using cross-validation tests, accuracy tests, and comparison with previous results for L. pneumophila. We use OPT4e to predict candidate effectors from the proteomes of A. phagocytophilum strains HZ and HGE-1 and predict 48 and 46 candidates, respectively, with 16 and 18 deemed most probable as effectors. These latter include the three known validated effectors for A. phagocytophilum.

8.
Front Microbiol ; 10: 383, 2019.
Article in English | MEDLINE | ID: mdl-30873148

ABSTRACT

We clustered 8.76 M protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes resulting in 707,311 clusters of one or more sequences of which 224,442 ranged in size from 2 to 2,894 sequences. To our knowledge this is the first study of this scale. We were surprised to find that no single cluster contained a representative sequence from all the organisms in the study. Given the minimal genome concept, we expected to find a shared set of proteins. To determine why the clusters did not have universal representation we chose four essential proteins, the chaperonin GroEL, DNA dependent RNA polymerase subunits beta and beta' (RpoB/RpoB'), and DNA polymerase I (PolA), representing fundamental cellular functions, and examined their cluster distribution. We found these proteins to be remarkably conserved with certain caveats. Although the groEL gene was universally conserved in all the organisms in the study, the protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB' were missing from two genomes and merged in 88, and the sequences were sufficiently divergent that they formed separate clusters for 18 RpoB proteins (seven clusters) and 14 RpoB' proteins (three clusters). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB' were predominantly endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues made our task of determining protein conservation more difficult than expected and also represent a significant obstacle for high-throughput analyses.

9.
PLoS One ; 14(1): e0202312, 2019.
Article in English | MEDLINE | ID: mdl-30682021

ABSTRACT

Type IV secretion systems exist in a number of bacterial pathogens and are used to secrete effector proteins directly into host cells, changing the host environment to make it hospitable for the bacteria. In recent years, several machine learning algorithms have been developed to predict effector proteins, potentially facilitating experimental verification. However, inconsistencies exist between their results. Previously we analysed the disparate sets of predictive features used in these algorithms to determine an optimal set of 370 features for effector prediction. This study focuses on the best way to use these optimal features by designing three machine learning classifiers, comparing our results with those of others, and obtaining de novo results. We chose the pathogen Legionella pneumophila strain Philadelphia-1, a cause of Legionnaires' disease, because it has many validated effector proteins and others have developed machine learning prediction tools for it. While all of our models give good results indicating that our optimal features are quite robust, Model 1, which uses all 370 features with a support vector machine, has slightly better accuracy. Moreover, Model 1 predicted 472 effector proteins that are deemed highly probable to be effectors and include 94% of known effectors. Although the results of our three models agree well with those of other researchers, their models only predicted 126 and 311 candidate effectors.


Subjects
Bacterial Proteins/genetics , Legionella pneumophila/genetics , Genetic Models , Support Vector Machine , Type IV Secretion Systems/genetics , Virulence Factors/genetics , Humans , Legionnaires' Disease/genetics
10.
Gene ; 721S: 100010, 2019.
Article in English | MEDLINE | ID: mdl-34530992

ABSTRACT

Anaplasmosis, the most prevalent tick-transmitted disease of cattle, is caused by the rickettsial intracellular parasite Anaplasma marginale. The pathogen replicates within a parasitophorous vacuole formed from the invagination of the erythrocyte membrane. Several strains of A. marginale form "tails" or "appendages" which are attached to, and extend out from, the cytoplasmic side of the parasitophorous vacuole. Genomic analysis of the parasite antigen distributed along the appendage led to the discovery of the aaap (Anaplasma appendage associated protein) gene family located within a highly plastic region in the genome. The aaap gene family consists of aaap and several alps (for aaap-like proteins), depending on the strain. These genes/proteins are characterized by repeat sequences. To investigate locus plasticity, different versions of the locus were cloned from the same strain as well as from different strains, sequenced and aligned to identify changes. Our findings show that repeat sequences both within and between genes facilitated rearrangement events within the locus. Structural variation of the locus in the St. Maries strain was further investigated during infection of different cellular environments, i.e., bovine erythrocytes and tick cells, with a reduction in subpopulations of the aaap locus within the tick as compared to erythrocytes. Interestingly, subpopulations bearing alternative locus structures began to arise again when the pathogen was transferred from the tick environment into a naïve calf. Additionally, the Aaap protein expression profile between blood and tick samples showed a regulatory shift, indicating a host-specific response. Alignment of the protein sequences from different species of Anaplasma reveals six similar repeating motifs that appear to be unique to a few species of Anaplasma. 
The role the aaap locus may play in the pathogenesis of the bovine host or in tick infection/transmission remains unknown; however, the changes in aaap locus subpopulations, locus structure, and protein expression indicate that these genes have a role in strain diversification.

11.
Gene X ; 2, 2019 Jun.
Article in English | MEDLINE | ID: mdl-32099970

ABSTRACT

Anaplasmosis, the most prevalent tick-transmitted disease of cattle, is caused by the rickettsial intracellular parasite Anaplasma marginale. The pathogen replicates within a parasitophorous vacuole formed from the invagination of the erythrocyte membrane. Several strains of A. marginale form "tails" or "appendages" which are attached to, and extend out from, the cytoplasmic side of the parasitophorous vacuole. Genomic analysis of the parasite antigen distributed along the appendage led to the discovery of the aaap (Anaplasma appendage associated protein) gene family located within a highly plastic region in the genome. The aaap gene family consists of aaap and several alps (for aaap-like proteins), depending on the strain. These genes/proteins are characterized by repeat sequences. To investigate locus plasticity, different versions of the locus were cloned from the same strain as well as from different strains, sequenced and aligned to identify changes. Our findings show that repeat sequences both within and between genes facilitated rearrangement events within the locus. Structural variation of the locus in the St. Maries strain was further investigated during infection of different cellular environments, i.e., bovine erythrocytes and tick cells, with a reduction in subpopulations of the aaap locus within the tick as compared to erythrocytes. Interestingly, subpopulations bearing alternative locus structures began to arise again when the pathogen was transferred from the tick environment into a naïve calf. Additionally, the Aaap protein expression profile between blood and tick samples showed a regulatory shift, indicating a host-specific response. Alignment of the protein sequences from different species of Anaplasma reveals six similar repeating motifs that appear to be unique to a few species of Anaplasma. 
The role the aaap locus may play in the pathogenesis of the bovine host or in tick infection/transmission remains unknown; however, the changes in aaap locus subpopulations, locus structure, and protein expression indicate that these genes have a role in strain diversification.

12.
PLoS One ; 13(5): e0197041, 2018.
Article in English | MEDLINE | ID: mdl-29742157

ABSTRACT

Type IV secretion systems (T4SS) are multi-protein complexes in a number of bacterial pathogens that can translocate proteins and DNA to the host. Most T4SSs function in conjugation and translocate DNA; however, approximately 13% function to secrete proteins, delivering effector proteins into the cytosol of eukaryotic host cells. Upon entry, these effectors manipulate the host cell's machinery for their own benefit, which can result in serious illness or death of the host. For this reason recognition of T4SS effectors has become an important subject. Much previous work has focused on verifying effectors experimentally, a costly endeavor in terms of money, time, and effort. Having good predictions for effectors will help to focus experimental validations and decrease testing costs. In recent years, several scoring and machine learning-based methods have been suggested for the purpose of predicting T4SS effector proteins. These methods have used different sets of features for prediction, and their predictions have been inconsistent. In this paper, an optimal set of features is presented for predicting T4SS effector proteins using a statistical approach. A thorough literature search was performed to find features that have been proposed. Feature values were calculated for datasets of known effectors and non-effectors for T4SS-containing pathogens for four genera with a sufficient number of known effectors, Legionella pneumophila, Coxiella burnetii, Brucella spp, and Bartonella spp. The features were ranked, and less important features were filtered out. Correlations between remaining features were removed, and dimensional reduction was accomplished using principal component analysis and factor analysis. Finally, the optimal features for each pathogen were chosen by building logistic regression models and evaluating each model. 
The results based on evaluation of our logistic regression models confirm the effectiveness of our four optimal sets of features, and based on these an optimal set of features is proposed for all T4SS effector proteins.


Subjects
Bacterial Proteins/genetics , Host-Pathogen Interactions/genetics , Infections/genetics , Type IV Secretion Systems/genetics , Bartonella/genetics , Bartonella/pathogenicity , Brucella/genetics , Brucella/pathogenicity , Coxiella burnetii/genetics , Coxiella burnetii/pathogenicity , Bacterial Genome , Humans , Infections/microbiology , Legionella pneumophila/genetics , Legionella pneumophila/pathogenicity , Protein Transport/genetics
13.
BMC Bioinformatics ; 19(1): 83, 2018 03 05.
Article in English | MEDLINE | ID: mdl-29506470

ABSTRACT

BACKGROUND: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. The rapid growth rate has impeded deployment of existing protein clustering/annotation tools which depend largely on pairwise sequence alignment. RESULTS: In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm. CONCLUSIONS: The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that when paired with our prior work, NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.
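
To make the idea of min-wise hashing more concrete, here is a minimal Python sketch of a generic MinHash comparison over k-mer sets. It is not the coreClust implementation and does not reproduce its (w,c)-sketch parameters; the k-mer length, the number of hash functions, the SHA-1-based hashing, and the toy sequences are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of overlapping k-mers in seq (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(item, seed):
    """Deterministic 64-bit hash of item under the hash function numbered seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Toy regions (made up): b differs from a by one residue, c is unrelated.
region_a = "MKTAYIAKQRQISFVKSHFSRQ"
region_b = "MKTAYIAKQRQISFVKSHFARQ"
region_c = "GGGGSGGGGSGGGGSGGGGSGG"

sk_a = minhash_sketch(kmers(region_a, 3))
sk_b = minhash_sketch(kmers(region_b, 3))
sk_c = minhash_sketch(kmers(region_c, 3))

print("a vs b:", estimated_jaccard(sk_a, sk_b))
print("a vs c:", estimated_jaccard(sk_a, sk_c))
```

Because only the fixed-length sketches are compared, the cost of a comparison does not depend on sequence length, which is one reason sketch-based methods of this kind suit distributed frameworks such as MapReduce.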


Subjects
Algorithms , Protein Databases , Molecular Sequence Annotation , Sequence Alignment/methods , Amino Acid Sequence , Cluster Analysis , Phylogeny , Protein Domains , Rickettsia/classification
14.
PLoS One ; 11(8): e0161338, 2016.
Article in English | MEDLINE | ID: mdl-27552220

ABSTRACT

BACKGROUND: Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. METHODS: In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. RESULTS: We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. 
Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprised of approximately 2500 sequences.
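
The core intuition in this abstract, that short exact-matching subsequences (k-mers) shared across many sequences point to conserved regions, can be sketched in a few lines of Python. This is a toy illustration only: it omits the machine-learning step and the MapReduce parallelism the abstract mentions, and the function names, the value of k, the support threshold, and the sequences are all assumptions.

```python
from collections import defaultdict


def kmer_support(sequences, k):
    """Map each k-mer to the set of indices of sequences that contain it."""
    support = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            support[seq[i:i + k]].add(idx)
    return support


def conserved_mask(seq, support, k, min_sequences):
    """Mark positions covered by a k-mer found in at least min_sequences sequences."""
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if len(support.get(seq[i:i + k], ())) >= min_sequences:
            for j in range(i, i + k):
                mask[j] = True
    return mask


# Three made-up sequences sharing the segment "TAYIAK".
toy_sequences = ["MKTAYIAKQR", "GGTAYIAKLL", "PPTAYIAKWW"]
support = kmer_support(toy_sequences, k=4)
mask = conserved_mask(toy_sequences[0], support, k=4, min_sequences=3)
conserved = "".join(res for res, keep in zip(toy_sequences[0], mask) if keep)
print(conserved)
```

Counting exact k-mer matches needs only hashing and set operations, with no pairwise alignment, which is consistent with the runtime advantage the abstract reports for an alignment-free approach.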


Subjects
Conserved Sequence/genetics , Protein Sequence Analysis , Amino Acid Sequence Homology , Algorithms , Protein Databases , Protein Domains , Tertiary Protein Structure , Sequence Alignment , Software
15.
BMC Genomics ; 17: 481, 2016 07 02.
Article in English | MEDLINE | ID: mdl-27368698

ABSTRACT

BACKGROUND: Multiple important human and livestock pathogens employ ticks as their primary host vectors. It is not currently known whether this means of infecting a host arose once or many times during evolution. RESULTS: In order to address this question, we conducted a comparative genomics analysis on a set of bacterial pathogens from seven genera - Borrelia, Rickettsia, Anaplasma, Ehrlichia, Francisella, Coxiella, and Bartonella, including species from three different host vectors - ticks, lice, and fleas. The final set of 102 genomes used in the study encoded a total of 120,046 protein sequences. We found that no genes or metabolic pathways were present in all tick-borne bacteria. However, we found some genes and pathways were present in subsets of tick-transmitted organisms while absent from bacteria transmitted by lice or fleas. CONCLUSION: Our analysis suggests that the ability of pathogens to be transmitted by ticks arose multiple times over the course of evolution. To our knowledge, this is the most comprehensive study of tick transmissibility to date.


Subjects
Metagenome , Metagenomics , Tick-Borne Diseases/microbiology , Animals , Bacteria/classification , Bacteria/genetics , Bacteria/metabolism , Cluster Analysis , Computational Biology/methods , Humans , Metabolic Networks and Pathways , Metagenomics/methods , Phthiraptera/microbiology , Phylogeny , Siphonaptera/microbiology , Tick-Borne Diseases/transmission
16.
Infect Immun ; 83(11): 4178-84, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26259814

ABSTRACT

Antigenic variation allows microbial pathogens to evade immune clearance and establish persistent infection. Anaplasma marginale utilizes gene conversion of a repertoire of silent msp2 alleles into a single active expression site to encode unique Msp2 variants. As the genomic complement of msp2 alleles alone is insufficient to generate the number of variants required for persistence, A. marginale uses segmental gene conversion, in which oligonucleotide segments from multiple alleles are recombined into the expression site to generate a novel msp2 mosaic not represented elsewhere in the genome. Whether these segmental changes are sufficient to evade a broad antibody response is unknown. We addressed this question by identifying Msp2 variants that differed in primary structure within the immunogenic hypervariable region microdomains and tested whether they represented true antigenic variants. The minimal primary structural difference between variants was a single amino acid resulting from a codon insertion, and overall, the amino acid identity among paired microdomains ranged from 18 to 92%. Collectively, 89% of the expressed structural variants were also antigenic variants across all biological replicates, independent of a specific host major histocompatibility complex haplotype. Biological relevance is supported by the following: (i) all structural variants were expressed during infection of a natural host, (ii) the structural variation observed in the microdomains corresponded to the mean length of variants generated by segmental gene conversion, and (iii) antigenic variants were identified using a broad antibody response that developed during infection of a natural host. The findings demonstrate that segmental gene conversion efficiently generates Msp2 antigenic variants.


Subjects
Anaplasma marginale/immunology , Anaplasmosis/immunology , Antigenic Variation , Bacterial Antigens/chemistry , Bacterial Antigens/immunology , Bacterial Outer Membrane Proteins/chemistry , Bacterial Outer Membrane Proteins/immunology , Amino Acid Sequence , Anaplasma marginale/chemistry , Anaplasma marginale/genetics , Anaplasmosis/microbiology , Antibacterial Antibodies/immunology , Bacterial Antigens/genetics , Bacterial Outer Membrane Proteins/genetics , Humans , Immune Evasion , Molecular Sequence Data , Tertiary Protein Structure , Sequence Alignment
17.
Biomed Res Int ; 2015: 234236, 2015.
Article in English | MEDLINE | ID: mdl-25695053

ABSTRACT

Supervised machine learning algorithms are used by life scientists for a variety of objectives. Expert-curated public gene and protein databases are major resources for gathering data to train these algorithms. While these data resources are continuously updated, generally, these updates are not incorporated into published machine learning algorithms which thereby can become outdated soon after their introduction. In this paper, we propose a new model of operation for supervised machine learning algorithms that learn from genomic data. By defining these algorithms in a pipeline in which the training data gathering procedure and the learning process are automated, one can create a system that generates a classifier or predictor using information available from public resources. The proposed model is explained using three case studies on SignalP, MemLoci, and ApicoAP in which existing machine learning models are utilized in pipelines. Given that the vast majority of the procedures described for gathering training data can easily be automated, it is possible to transform valuable machine learning algorithms into self-evolving learners that benefit from the ever-changing data available for gene products and to develop new machine learning algorithms that are similarly capable.


Subjects
Genomics/methods , Automated Pattern Recognition/methods , Algorithms , Artificial Intelligence , Genetic Databases , Theoretical Models , Software
18.
J Microbiol Methods ; 95(3): 313-9, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24095682

ABSTRACT

BACKGROUND: Computational identification of apicoplast-targeted proteins is important in drug target determination for diseases such as malaria. While there are established methods for identifying proteins with a bipartite signal in multiple species of Apicomplexa, not all apicoplast-targeted proteins possess this bipartite signature. The publication of recent experimental findings of apicoplast membrane proteins, called transmembrane proteins, that do not possess a bipartite signal has made it feasible to devise a machine learning approach for identifying this new class of apicoplast-targeted proteins computationally. METHODOLOGY/PRINCIPAL FINDINGS: In this work, we develop a method for predicting apicoplast-targeted transmembrane proteins for multiple species of Apicomplexa, whereby several classifiers trained on different feature sets and based on different algorithms are evaluated and combined in an ensemble classification model to obtain the best expected performance. The feature sets considered are the hydrophobicity and composition characteristics of amino acids over transmembrane domains, the existence of short sequence motifs over cytosolically disposed regions, and Gene Ontology (GO) terms associated with given proteins. Our model, ApicoAMP, is an ensemble classification model that combines decisions of classifiers following the majority vote principle. ApicoAMP is trained on a set of proteins from 11 apicomplexan species and achieves 91% overall expected accuracy. CONCLUSIONS/SIGNIFICANCE: ApicoAMP is the first computational model capable of identifying apicoplast-targeted transmembrane proteins in Apicomplexa. The ApicoAMP prediction software is available at http://code.google.com/p/apicoamp/ and http://bcb.eecs.wsu.edu.
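
The majority-vote principle named in this abstract can be shown with a short Python sketch. The three base classifiers below are hypothetical stand-ins, one per feature family mentioned in the abstract (transmembrane hydrophobicity, cytosolic motifs, GO terms); their thresholds, the dictionary keys, and the example protein are invented for illustration and do not come from the ApicoAMP software.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; an odd number of voters avoids ties."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical base classifiers, each using a different kind of feature.
def by_hydrophobicity(protein):
    return protein["tm_hydrophobicity"] > 1.5  # assumed threshold


def by_motif(protein):
    return protein["has_cytosolic_motif"]


def by_go_terms(protein):
    return "apicoplast" in protein["go_terms"]


def ensemble_predict(classifiers, protein):
    """Collect one vote per classifier and return the majority decision."""
    votes = [clf(protein) for clf in classifiers]
    return majority_vote(votes)


# Made-up candidate protein described by precomputed features.
candidate = {
    "tm_hydrophobicity": 1.8,
    "has_cytosolic_motif": False,
    "go_terms": {"apicoplast", "membrane"},
}

prediction = ensemble_predict([by_hydrophobicity, by_motif, by_go_terms], candidate)
print(prediction)
```

Here two of the three classifiers vote in favour, so the ensemble predicts the positive class even though the motif-based classifier disagrees.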


Subjects
Apicomplexa/genetics ; Apicoplasts/genetics ; Computational Biology/methods ; Membrane Proteins/genetics ; Protozoan Proteins/genetics ; Amino Acid Motifs ; Amino Acids/analysis ; Amino Acids/genetics ; Apicomplexa/chemistry ; Apicoplasts/chemistry ; Hydrophobic and Hydrophilic Interactions ; Membrane Proteins/chemistry ; Protein Transport ; Protozoan Proteins/chemistry
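
The majority-vote combination described in the ApicoAMP abstract above can be illustrated with a minimal Python sketch. This is not the ApicoAMP implementation: the three toy classifiers, their thresholds, the example motif, and the label names are all hypothetical and stand in for the hydrophobicity, motif, and GO-term classifiers the abstract mentions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most classifiers.

    Ties are broken in favour of the label that appears first in the list.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical stand-in classifiers; each maps a feature dict to a label.
def by_hydrophobicity(features):
    # Assumed threshold, chosen only for illustration.
    return "targeted" if features["tm_hydrophobicity"] > 1.5 else "other"


def by_motif(features):
    # Assumed example motif, not taken from the paper.
    return "targeted" if "YXXF" in features["motifs"] else "other"


def by_go_terms(features):
    return "targeted" if "apicoplast" in features["go_terms"] else "other"


def ensemble_predict(classifiers, features):
    """Collect one vote per classifier and combine them by majority."""
    return majority_vote([clf(features) for clf in classifiers])


example = {"tm_hydrophobicity": 1.8, "motifs": [], "go_terms": ["apicoplast"]}
label = ensemble_predict([by_hydrophobicity, by_motif, by_go_terms], example)
print(label)
```

With an odd number of binary classifiers a tie cannot occur; the first-seen tie-break only matters for even-sized ensembles or more than two labels.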
19.
Adv Bioinformatics ; 2013: 191586, 2013.
Article in English | MEDLINE | ID: mdl-23509450

ABSTRACT

In this paper we present a new ab initio approach for constructing an unrooted dendrogram using protein clusters, an approach that has the potential for estimating relationships among several thousands of species based on their putative proteomes. We employ an open-source software program called pClust that was developed for use in metagenomic studies. Sequence alignment is performed by pClust using the Smith-Waterman algorithm, which is known to give optimal alignment and, hence, greater accuracy than BLAST-based methods. Protein clusters generated by pClust are used to create protein profiles for each species in the dendrogram, these profiles forming a correlation filter library for use with a new taxon. To augment the dendrogram with a new taxon, a protein profile for the taxon is created using BLASTp, and this new taxon is placed into a position within the dendrogram corresponding to the highest correlation with profiles in the correlation filter library. This work was initiated because of our interest in plasmids, and each step is illustrated using proteomes from Gram-negative bacterial plasmids. Proteomes for 527 plasmids were used to generate the dendrogram, and to demonstrate the utility of the insertion algorithm twelve recently sequenced pAKD plasmids were used to augment the dendrogram.
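
The placement step in the abstract above, where a new taxon is positioned next to the library profile it correlates with most strongly, can be sketched as follows. This is only an illustration under stated assumptions: profiles are modelled as binary vectors over protein clusters, Pearson correlation is used as the similarity measure, and the names `pearson`, `best_match`, and the plasmid labels are hypothetical. The abstract does not specify the paper's actual correlation filter.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors.

    Returns 0.0 when either vector has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def best_match(new_profile, library):
    """Return the name of the library profile most correlated with new_profile."""
    return max(library, key=lambda name: pearson(new_profile, library[name]))


# Hypothetical profiles: 1 means the taxon has a protein in that cluster.
library = {
    "plasmid_A": [1, 1, 0, 0],
    "plasmid_B": [0, 0, 1, 1],
}
new_taxon = [1, 1, 0, 1]
print(best_match(new_taxon, library))
```

The new taxon would then be attached to the dendrogram next to the returned entry; how the branch is actually inserted is left out because the abstract does not describe it.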

20.
Pathogens ; 2(4): 627-35, 2013 Nov 26.
Article in English | MEDLINE | ID: mdl-25437336

ABSTRACT

Thousands of whole-genome and whole-proteome sequences have been made available through advances in sequencing technology, and sequences of millions more organisms will become available in the coming years. This wealth of genetic information will provide numerous opportunities to enhance our understanding of these organisms including a greater understanding of relationships among species. Researchers have used 16S rRNA and other gene sequences to study the evolutionary origins of bacteria, but these strategies do not provide insight into the sharing of genes among bacteria via horizontal transfer. In this work we use an open source software program called pClust to cluster proteins from the complete proteomes of twelve species of Alphaproteobacteria and generate a dendrogram from the resulting orthologous protein clusters. We compare the results with dendrograms constructed using the 16S rRNA gene and multiple sequence alignment of seven housekeeping genes. Analysis of the whole proteomes of these pathogens grouped Rickettsia typhi with three other animal pathogens whereas conventional sequence analysis failed to group these pathogens together. We conclude that whole-proteome analysis can give insight into relationships among species beyond their phylogeny, perhaps reflecting the effects of horizontal gene transfer and potentially providing insight into the functions of shared genes by means of shared phenotypes.
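
As a rough illustration of how shared protein clusters can be turned into pairwise distances for a dendrogram, the sketch below computes Jaccard distances between species and finds the closest pair, which is the first merge an agglomerative method would make. The Jaccard measure, the function names, and the toy cluster IDs are assumptions; the abstract does not state which distance or linkage the authors used, nor the pClust output format.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of cluster IDs (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def closest_pair(cluster_sets):
    """Return the pair of species names with the smallest Jaccard distance."""
    pairs = combinations(sorted(cluster_sets), 2)
    return min(
        pairs,
        key=lambda p: jaccard_distance(cluster_sets[p[0]], cluster_sets[p[1]]),
    )


# Hypothetical input: species name -> IDs of protein clusters it belongs to.
cluster_sets = {
    "species_A": {1, 2, 3},
    "species_B": {1, 2, 4},
    "species_C": {5, 6},
}
print(closest_pair(cluster_sets))
```

Repeating the merge on the remaining groups, with a chosen linkage rule, would yield the full tree.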
