1.
J Comput Biol ; 29(2): 92-105, 2022 02.
Article in English | MEDLINE | ID: mdl-35073170

ABSTRACT

Template-based modeling (TBM), including homology modeling and protein threading, is one of the most reliable techniques for protein structure prediction. It predicts protein structure by building an alignment between the query sequence under prediction and the templates with solved structures. However, it is still very challenging to build the optimal sequence-template alignment, especially when only distantly related templates are available. Here we report a novel deep learning approach ProALIGN that can predict much more accurate sequence-template alignment. Like protein sequences consisting of sequence motifs, protein alignments are also composed of frequently occurring alignment motifs with characteristic patterns. Alignment motifs are context-specific as their characteristic patterns are tightly related to sequence contexts of the aligned regions. Inspired by this observation, we represent a protein alignment as a binary matrix (in which 1 denotes an aligned residue pair) and then use a deep convolutional neural network to predict the optimal alignment from the query protein and its template. The trained neural network implicitly but effectively encodes an alignment scoring function, which reduces inaccuracies in the handcrafted scoring functions widely used by the current threading approaches. For a query protein and a template, we apply the neural network to directly infer likelihoods of all possible residue pairs in their entirety, which could effectively consider the correlations among multiple residues. We further construct the alignment with maximum likelihood, and finally build a structure model according to the alignment. Tested on three independent data sets with a total of 6688 protein alignment targets and 80 CASP13 TBM targets, our method achieved much better alignments and 3D structure models than the existing methods, including HHpred, CNFpred, CEthreader, and DeepThreader. 
These results clearly demonstrate the effectiveness of exploiting the context-specific alignment motifs by deep learning for protein threading.


Subjects
Deep Learning , Proteins/chemistry , Sequence Alignment/statistics & numerical data , Algorithms , Amino Acid Motifs , Amino Acid Sequence , Computational Biology , Models, Molecular , Neural Networks, Computer , Protein Conformation , Proteins/genetics , Sequence Analysis, Protein/statistics & numerical data , Software
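The abstract above describes representing an alignment as a binary matrix and constructing the maximum-likelihood alignment from predicted residue-pair likelihoods. The following is a minimal sketch of that last step only, assuming the likelihoods are already available as a plain matrix. The function names, the log-odds scoring and the zero-cost gaps are illustrative assumptions; the abstract does not specify how ProALIGN itself performs this step.

```python
import math


def best_alignment(likelihood):
    """Return order-preserving (query_idx, template_idx) pairs with maximal summed log-odds.

    likelihood[i][j] is an assumed probability that query residue i aligns
    with template residue j (hypothetical input format).
    """
    n, m = len(likelihood), len(likelihood[0])

    def log_odds(p, eps=1e-9):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        return math.log(p / (1.0 - p))

    # dp[i][j] = best score using the first i query and first j template residues
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],  # leave query residue i-1 unaligned
                dp[i][j - 1],  # leave template residue j-1 unaligned
                dp[i - 1][j - 1] + log_odds(likelihood[i - 1][j - 1]),
            )

    # Trace back from the bottom-right corner to recover the aligned pairs
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j]:
            i -= 1
        elif dp[i][j] == dp[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    return pairs[::-1]


def to_binary_matrix(pairs, n, m):
    """Encode an alignment as an n x m matrix in which 1 marks an aligned residue pair."""
    matrix = [[0] * m for _ in range(n)]
    for i, j in pairs:
        matrix[i][j] = 1
    return matrix


toy_likelihood = [
    [0.9, 0.1, 0.1],
    [0.1, 0.2, 0.8],
    [0.1, 0.1, 0.1],
]
toy_pairs = best_alignment(toy_likelihood)
toy_matrix = to_binary_matrix(toy_pairs, 3, 3)
```

Using log-odds rather than raw log-likelihoods means that pairs with likelihood below 0.5 lower the score, so the dynamic program leaves them unaligned instead of forcing a match.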
2.
Sci Rep ; 9(1): 16380, 2019 11 08.
Article in English | MEDLINE | ID: mdl-31704957

ABSTRACT

An amino acid substitution scoring matrix encapsulates the rates at which various amino acid residues in proteins are substituted by other amino acid residues over time. Database search methods make use of substitution scoring matrices to identify sequences with homologous relationships. However, widely used substitution scoring matrices, such as the BLOSUM series, have been developed using aligned blocks that are mostly devoid of disordered regions in proteins. Hence, these substitution scoring matrices are mostly inappropriate for homology searches involving proteins enriched with disordered regions, as the disordered regions have a distinct amino acid compositional bias and are therefore expected to have undergone amino acid substitutions that are distinct from those in the ordered regions. We therefore developed a novel series of substitution scoring matrices, referred to as EDSSMat, by exclusively considering the substitution frequencies of amino acids in the disordered regions of eukaryotic proteins. The newly developed matrices were tested for their ability to detect homologs of proteins enriched with disordered regions by means of the SSEARCH tool. The results unequivocally demonstrate that EDSSMat matrices detect more homologs than the widely used BLOSUM, PAM and other standard matrices, indicating their utility for homology searches of intrinsically disordered proteins.


Subjects
Amino Acid Substitution , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/genetics , Amino Acid Sequence , Computational Biology , Databases, Protein/statistics & numerical data , Entropy , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data , Sequence Homology, Amino Acid
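To make the idea of deriving scores from substitution frequencies concrete, here is a minimal sketch of the general BLOSUM-style log-odds recipe (observed pair frequency divided by the frequency expected from residue composition, in half-bit units). The abstract above does not give the EDSSMat construction details, so the function names, the toy alignment columns and the rounding are illustrative assumptions, not the authors' procedure.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(columns):
    """Count unordered residue pairs within each aligned column."""
    counts = Counter()
    for column in columns:
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(counts):
    """Convert pair counts into rounded half-bit log-odds scores."""
    total = sum(counts.values())
    observed = {pair: c / total for pair, c in counts.items()}

    # Background frequency of each residue, derived from the pair frequencies
    background = Counter()
    for (a, b), q in observed.items():
        if a == b:
            background[a] += q
        else:
            background[a] += q / 2
            background[b] += q / 2

    scores = {}
    for (a, b), q in observed.items():
        # Expected frequency if residues paired independently of each other
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(q / expected))
    return scores


# Toy columns (hypothetical data): three A/A pairs, one S/S pair, one A/S pair
toy_columns = ["AA", "AA", "AA", "SS", "AS"]
toy_scores = log_odds_scores(pair_counts(toy_columns))
```

Restricting the input columns to a particular class of regions changes both the observed pair frequencies and the background composition, which is why matrices built from different region types can score the same substitution differently.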
3.
J Math Biol ; 78(1-2): 441-463, 2019 01.
Article in English | MEDLINE | ID: mdl-30291366

ABSTRACT

We generalize chaos game representation (CGR) to higher dimensional spaces while maintaining its bijection, keeping the method sufficiently representative and mathematically rigorous compared to previous attempts. We first state and prove the asymptotic property of CGR and our generalized chaos game representation (GCGR) method. The prediction follows that the dissimilarity of sequences which possess identical subsequences at distinct positions would be lowered exponentially by the length of the identical subsequence; this effect was taking place unbeknownst to researchers. By shining a spotlight on it now, we show the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use the GCGR-Centroid to analyze the similarity between protein sequences by using the datasets 9 ND5, 24 TF and 50 beta-globin proteins. We obtain consistent results compared with previous studies, which proves the significance thereof. Finally, by utilizing support vector machines, we train the anticancer peptide prediction model by using both GCGR-Centroid and GCGR-Variance, and achieve a significantly higher prediction performance by employing the 3 well-studied anticancer peptide datasets.


Subjects
Game Theory , Tumor Suppressor Proteins/genetics , Amino Acid Sequence , Animals , Base Sequence , Computational Biology , Databases, Protein/statistics & numerical data , Electron Transport Complex I/genetics , Humans , Mathematical Concepts , Mitochondrial Proteins/genetics , Models, Biological , NADH Dehydrogenase/genetics , Nonlinear Dynamics , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data , Sequence Homology, Amino Acid , Support Vector Machine , Transferrin/genetics , Tumor Suppressor Proteins/classification , Tumor Suppressor Proteins/physiology , beta-Globins/genetics
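For readers unfamiliar with chaos game representation, the sketch below shows the classical two-dimensional CGR for nucleotide sequences, not the generalized higher-dimensional GCGR of the abstract above. The corner assignment is one common convention and the function name is an assumption. The sketch also illustrates why shared subsequences pull points together: each shared symbol halves the distance between two trajectories.

```python
# One common corner convention for the unit square (an assumption; others exist)
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the CGR trajectory: each point is the midpoint between the
    previous point and the corner assigned to the current symbol."""
    x, y = start
    points = []
    for symbol in sequence:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


# Two sequences with different prefixes but the same 4-symbol suffix "ACGT"
end_a = cgr_points("TTTTACGT")[-1]
end_b = cgr_points("GGACGT")[-1]

# After k shared symbols the coordinates differ by at most 2**-k
max_gap = max(abs(end_a[0] - end_b[0]), abs(end_a[1] - end_b[1]))
```

Because every step is a contraction by one half, `max_gap` here cannot exceed 2**-4 regardless of what preceded the shared suffix.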
4.
J Proteome Res ; 17(11): 3671-3680, 2018 11 02.
Article in English | MEDLINE | ID: mdl-30277077

ABSTRACT

De novo sequencing offers an alternative to database search methods for peptide identification from mass spectra. Since it does not rely on a predetermined database of expected or potential sequences in the sample, de novo sequencing is particularly appropriate for samples lacking a well-defined or comprehensive reference database. However, the low accuracy of many de novo sequence predictions has prevented the widespread use of the variety of sequencing tools currently available. Here, we present a new open-source tool, Postnovo, that postprocesses de novo sequence predictions to find high-accuracy results. Postnovo uses a predictive model to rescore and rerank candidate sequences in a manner akin to database search postprocessing tools such as Percolator. Postnovo leverages the output from multiple de novo sequencing tools in its own analyses, producing many times the length of amino acid sequence information (including both full- and partial-length peptide sequences) at an equivalent false discovery rate (FDR) compared to any individual tool. We present a methodology to reliably screen the sequence predictions to a desired FDR given the Postnovo sequence score. We validate Postnovo with multiple data sets and demonstrate its ability to identify proteins that are missed by database search even in samples with paired reference databases.


Subjects
Algorithms , Peptides/isolation & purification , Proteins/chemistry , Sequence Analysis, Protein/statistics & numerical data , Software , Animals , Bacillus subtilis/chemistry , Bees/chemistry , Desulfovibrio vulgaris/chemistry , Drosophila melanogaster/chemistry , Embryo, Nonmammalian/chemistry , Escherichia coli K12/chemistry , Humans , Solanum lycopersicum/chemistry , Methanosarcina/chemistry , Mice , Peptides/chemistry , Peptides/classification , Proteolysis , Rhodopseudomonas/chemistry , Synechococcus/chemistry
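The abstract above mentions screening sequence predictions to a desired FDR given a sequence score. As a rough illustration of that kind of screening, the sketch below assumes each score is a calibrated probability that the prediction is correct, so the expected FDR of a kept set is the mean of (1 - score). That assumption, the function name and the toy sequences are all hypothetical; the abstract does not describe how Postnovo's score maps to FDR.

```python
def select_at_fdr(scored_sequences, target_fdr):
    """Keep the highest-scoring predictions whose expected FDR stays within target.

    scored_sequences: iterable of (sequence, prob_correct) tuples, where
    prob_correct is assumed to be a calibrated probability of correctness.
    """
    ranked = sorted(scored_sequences, key=lambda item: item[1], reverse=True)
    expected_false = 0.0
    kept = []
    for sequence, prob_correct in ranked:
        expected_false += 1.0 - prob_correct
        # Scores are sorted in descending order, so the running expected FDR
        # never decreases; stop at the first prediction that breaks the target.
        if expected_false / (len(kept) + 1) > target_fdr:
            break
        kept.append((sequence, prob_correct))
    return kept


toy_predictions = [("LKR", 0.60), ("PEPTIDE", 0.99), ("SEQVENCE", 0.95)]
confident = select_at_fdr(toy_predictions, target_fdr=0.05)
```

With a real tool, the mapping from score to error rate would need to be validated empirically, for example against a reference database, before a threshold is trusted.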
5.
J Proteome Res ; 17(11): 3614-3627, 2018 11 02.
Article in English | MEDLINE | ID: mdl-30222357

ABSTRACT

Over the past decade, a suite of new mass-spectrometry-based proteomics methods has been developed that now enables the conformational properties of proteins and protein-ligand complexes to be studied in complex biological mixtures, from cell lysates to intact cells. Highlighted here are seven of the techniques in this new toolbox. These techniques include chemical cross-linking (XL-MS), hydroxyl radical footprinting (HRF), Drug Affinity Responsive Target Stability (DARTS), Limited Proteolysis (LiP), Pulse Proteolysis (PP), Stability of Proteins from Rates of Oxidation (SPROX), and Thermal Proteome Profiling (TPP). The above techniques all rely on conventional bottom-up proteomics strategies for peptide sequencing and protein identification. However, they have required the development of unconventional proteomic data analysis strategies. Discussed here are the current technical challenges associated with these different data analysis strategies as well as the relative analytical capabilities of the different techniques. The new biophysical capabilities that the above techniques bring to bear on proteomic research are also highlighted in the context of several different application areas in which these techniques have been used, including the study of protein ligand binding interactions (e.g., protein target discovery studies and protein interaction network analyses) and the characterization of biological states.


Subjects
Mass Spectrometry/methods , Protein Processing, Post-Translational , Proteins/chemistry , Proteome/chemistry , Proteomics/trends , Animals , Cross-Linking Reagents/chemistry , Databases, Protein , Deuterium Exchange Measurement/methods , Humans , Isotope Labeling/methods , Ligands , Mass Spectrometry/instrumentation , Protein Binding , Protein Folding , Protein Stability , Proteins/metabolism , Proteins/ultrastructure , Proteolysis , Proteome/ultrastructure , Proteomics/instrumentation , Proteomics/methods , Sequence Analysis, Protein/instrumentation , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/statistics & numerical data , Thermodynamics
6.
PLoS One ; 13(8): e0202547, 2018.
Article in English | MEDLINE | ID: mdl-30142178

ABSTRACT

With recent developments of data technology in biomedicine, factor data such as diagnosis codes and genomic features, which can have tens to hundreds of discrete and unorderable categorical values, have emerged. Although it is considered a fundamental problem in statistical analysis, the estimation of probability distributions for such factor variables has not been studied much, because previous studies have mainly focused on continuous variables and on discrete factor variables with a few categories, such as sex and race. In this work, we propose a nonparametric Bayesian procedure to estimate the probability distribution of factors with many categories. The proposed method was demonstrated through simulation studies under various conditions and showed significant improvements in estimation error over previous conventional methods. In addition, the method was applied to the analysis of diagnosis data of intensive care unit patients, and generated interesting medical hypotheses. The overall results indicate that the proposed method will be useful in the analysis of biomedical factor data.


Subjects
Bayes Theorem , Biomedical Research/statistics & numerical data , Models, Statistical , Computer Simulation , Genomics/statistics & numerical data , Humans , Intensive Care Units , Normal Distribution , Probability , Sequence Analysis, Protein/statistics & numerical data
7.
Sci Rep ; 8(1): 6800, 2018 05 01.
Article in English | MEDLINE | ID: mdl-29717164

ABSTRACT

Phylogenies based on entire genomes are a powerful tool for reconstructing the Tree of Life. Several methods have been proposed, most of which employ an alignment-free strategy. Average sequence similarity methods are different than most other whole-genome methods, because they are based on local alignments. However, previous average similarity methods fail to reconstruct a correct phylogeny when compared against other whole-genome trees. In this study, we developed a novel average sequence similarity method. Our method correctly reconstructs the phylogenetic tree of in silico evolved E. coli proteomes. We applied the method to reconstruct a whole-proteome phylogeny of 1,087 species from all three domains of life, Bacteria, Archaea, and Eucarya. Our tree was automatically reconstructed without any human decisions, such as the selection of organisms. The tree exhibits a concentric circle-like structure, indicating that all the organisms have similar total branch lengths from their common ancestor. Branching patterns of the members of each phylum of Bacteria and Archaea are largely consistent with previous reports. The topologies are largely consistent with those reconstructed by other methods. These results strongly suggest that this approach has sufficient taxonomic resolution and reliability to infer phylogeny, from phylum to strain, of a wide range of organisms.


Subjects
Archaea/genetics , Bacteria/genetics , Eukaryota/genetics , Genome , Phylogeny , Sequence Analysis, Protein/statistics & numerical data , Algorithms , Amino Acid Sequence , Archaea/classification , Bacteria/classification , Base Sequence , Escherichia coli/genetics , Eukaryota/classification , Humans , Sequence Alignment , Whole Genome Sequencing
8.
PLoS Comput Biol ; 14(1): e1005889, 2018 01.
Article in English | MEDLINE | ID: mdl-29293498

ABSTRACT

Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is to integrate additional contextual information into the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed a 14% increase in the number of significant BLAST hits and a 25% increase in the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that these properties are close to those of standard Pfam domains. Source code of the proposed approach and supplementary data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence.


Subjects
Proteins/chemistry , Proteins/genetics , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Computational Biology , Databases, Protein , Plasmodium falciparum/chemistry , Plasmodium falciparum/genetics , Protein Domains , Protozoan Proteins/chemistry , Protozoan Proteins/genetics , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data
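The domain co-occurrence idea in the abstract above can be illustrated with a small counting sketch: tally which domains appear together on the same protein, then treat a borderline hit as better supported when its domain frequently co-occurs with a domain already found on that protein. The data layout, the function names, the `min_count` cut-off and the domain labels are hypothetical; the authors' actual scoring is in their linked source code and is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(proteins):
    """Count how many proteins carry each unordered pair of domains.

    proteins: dict mapping a protein id to the set of domain names on it.
    """
    counts = Counter()
    for domains in proteins.values():
        for a, b in combinations(sorted(set(domains)), 2):
            counts[(a, b)] += 1
    return counts


def supported_by_context(candidate, known_domains, counts, min_count=2):
    """Return True if the candidate domain co-occurs at least min_count times
    with any domain already detected on the same protein."""
    return any(
        counts[tuple(sorted((candidate, known)))] >= min_count
        for known in known_domains
    )


# Toy annotation set (placeholder domain labels)
toy_proteins = {
    "p1": {"Kinase", "SH2"},
    "p2": {"Kinase", "SH2", "SH3"},
    "p3": {"SH3", "PDZ"},
}
toy_counts = cooccurrence_counts(toy_proteins)
```

In this toy set, a weak "SH2" hit on a protein that already carries "Kinase" would count as supported, whereas a weak "PDZ" hit on the same protein would not.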
9.
Brief Bioinform ; 19(5): 821-837, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28334258

ABSTRACT

Understanding of molecular mechanisms that govern protein-protein interactions and accurate modeling of protein-protein docking rely on accurate identification and prediction of protein-binding partners and protein-binding residues. We review over 40 methods that predict protein-protein interactions from protein sequences, including methods that predict interacting protein pairs, protein-binding residues for a pair of interacting sequences and protein-binding residues in a single protein chain. We focus on the latter methods that provide residue-level annotations and that can be broadly applied to all protein sequences. We compare their architectures, inputs and outputs, and we discuss aspects related to their assessment and availability. We also perform a first-of-its-kind comprehensive empirical comparison of representative predictors of protein-binding residues using a novel and high-quality benchmark data set. We show that the selected predictors accurately discriminate protein-binding and non-binding residues and that newer methods outperform older designs. However, these methods are unable to accurately separate residues that bind other molecules, such as DNA, RNA and small ligands, from the protein-binding residues. This cross-prediction, defined as the incorrect prediction of nucleic-acid- and small-ligand-binding residues as protein binding, is substantial for all evaluated methods and is not driven by the proximity to the native protein-binding residues. We discuss reasons for this drawback and we offer several recommendations. In particular, we postulate the need for a new generation of more accurate predictors and data sets, inclusion of a comprehensive assessment of the cross-predictions in future studies and higher standards of availability of the published methods.


Subjects
Protein Binding/genetics , Amino Acid Sequence , Binding Sites/genetics , Computational Biology/methods , Databases, Protein/statistics & numerical data , Ligands , Nucleic Acids/metabolism , Protein Interaction Domains and Motifs/genetics , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Sequence Analysis, Protein/statistics & numerical data , Software , Structural Homology, Protein
10.
Brief Bioinform ; 19(5): 971-981, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28369175

ABSTRACT

With the advent of high-throughput proteomics, the type and amount of data pose a significant challenge to statistical approaches used to validate current quantitative analysis. Whereas many studies focus on the analysis at the protein level, the analysis of peptide-level data provides insight into changes at the sub-protein level, including splice variants, isoforms and a range of post-translational modifications. Statistical evaluation of liquid chromatography-mass spectrometry/mass spectrometry peptide-based label-free differential data is most commonly performed using a t-test or analysis of variance, often after the application of data imputation to reduce the number of missing values. In high-throughput proteomics, statistical analysis methods and imputation techniques are difficult to evaluate, given the lack of gold standard data sets. Here, we use experimental and resampled data to evaluate the performance of four statistical analysis methods and the added value of imputation, for different numbers of biological replicates. We find that three or four replicates are the minimum requirement for high-throughput data analysis and confident assignment of significant changes. Data imputation does increase sensitivity in some cases, but leads to a much higher actual false discovery rate. Additionally, we find that the empirical Bayes method (limma) achieves the highest sensitivity, and we thus recommend its use for performing differential expression analysis at the peptide level.


Subjects
Peptides/genetics , Peptides/metabolism , Proteomics/methods , Bayes Theorem , Chromatography, Liquid , Computational Biology/methods , Computer Simulation , Data Interpretation, Statistical , Databases, Protein/statistics & numerical data , Humans , Protein Array Analysis/statistics & numerical data , Proteomics/statistics & numerical data , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/statistics & numerical data , Tandem Mass Spectrometry
11.
Brief Bioinform ; 19(5): 954-970, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28369237

ABSTRACT

While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments nowadays offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has been rarely evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation on the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired from different instrument types from collision-induced dissociation and higher energy collisional dissociation (HCD) fragmentation mode using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, the same tool outpaced the commercial competitor PEAKS in terms of running time speedup by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. 
Finally, we discuss the potential of de novo sequencing for now becoming more widely used in the field.


Subjects
Algorithms , Proteomics/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Animals , Computational Biology/methods , Computer Simulation , Databases, Protein/statistics & numerical data , Humans , Mice , Peptides/chemistry , Proteomics/statistics & numerical data , Pyrococcus furiosus/genetics , Saccharomyces cerevisiae/genetics , Sequence Analysis, Protein/statistics & numerical data , Sequence Tagged Sites , Software , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data
12.
PLoS Comput Biol ; 14(12): e1006237, 2018 12.
Article in English | MEDLINE | ID: mdl-30596639

ABSTRACT

Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.


Subjects
Binding Sites/physiology , Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Models, Molecular , Protein Conformation , Protein Interaction Domains and Motifs/physiology , Protein Structural Elements , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data
13.
Mol Biol Evol ; 34(8): 2085-2100, 2017 08 01.
Article in English | MEDLINE | ID: mdl-28453724

ABSTRACT

Recently described stochastic models of protein evolution have demonstrated that the inclusion of structural information in addition to amino acid sequences leads to a more reliable estimation of evolutionary parameters. We present a generative, evolutionary model of protein structure and sequence that is valid on a local length scale. The model concerns the local dependencies between sequence and structure evolution in a pair of homologous proteins. The evolutionary trajectory between the two structures in the protein pair is treated as a random walk in dihedral angle space, which is modeled using a novel angular diffusion process on the two-dimensional torus. Coupling sequence and structure evolution in our model allows for modeling both "smooth" conformational changes and "catastrophic" conformational jumps, conditioned on the amino acid changes. The model has interpretable parameters and is comparatively more realistic than previous stochastic models, providing new insights into the relationship between sequence and structure evolution. For example, using the trained model we were able to identify an apparent sequence-structure evolutionary motif present in a large number of homologous protein pairs. The generative nature of our model enables us to evaluate its validity and its ability to simulate aspects of protein evolution conditioned on an amino acid sequence, a related amino acid sequence, a related structure or any combination thereof.


Subjects
Proteins/genetics , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , Evolution, Molecular , Models, Genetic , Models, Molecular , Protein Conformation , Protein Structural Elements/genetics , Proteins/metabolism , Sequence Analysis, Protein/statistics & numerical data
14.
Nucleic Acids Res ; 44(W1): W410-5, 2016 Jul 08.
Article in English | MEDLINE | ID: mdl-27131380

ABSTRACT

The MPI Bioinformatics Toolkit (http://toolkit.tuebingen.mpg.de) is an open, interactive web service for comprehensive and collaborative protein bioinformatic analysis. It offers a wide array of interconnected, state-of-the-art bioinformatics tools to experts and non-experts alike, developed both externally (e.g. BLAST+, HMMER3, MUSCLE) and internally (e.g. HHpred, HHblits, PCOILS). While a beta version of the Toolkit was released 10 years ago, the current production-level release has been available since 2008 and has serviced more than 1.6 million external user queries. The usage of the Toolkit has continued to increase linearly over the years, reaching more than 400 000 queries in 2015. In fact, through the breadth of its tools and their tight interconnection, the Toolkit has become an excellent platform for experimental scientists as well as a useful resource for teaching bioinformatic inquiry to students in the life sciences. In this article, we report on the evolution of the Toolkit over the last ten years, focusing on the expansion of the tool repertoire (e.g. CS-BLAST, HHblits) and on infrastructural work needed to remain operative in a changing web environment.


Subjects
Computational Biology/methods , Internet , Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Computational Biology/education , Computational Biology/trends , Molecular Sequence Annotation , Protein Domains , Proteins/classification , Sequence Analysis, Protein/statistics & numerical data , Sequence Analysis, Protein/trends , Software/trends , Teaching
15.
Nucleic Acids Res ; 44(W1): W339-43, 2016 07 08.
Article in English | MEDLINE | ID: mdl-27106060

ABSTRACT

The PSI/TM-Coffee web server performs multiple sequence alignment (MSA) of proteins by combining homology extension with a consistency based alignment approach. Homology extension is performed with Position Specific Iterative (PSI) BLAST searches against a choice of redundant and non-redundant databases. The main novelty of this server is to allow databases of reduced complexity to rapidly perform homology extension. This server also gives the possibility to use transmembrane proteins (TMPs) reference databases to allow even faster homology extension on this important category of proteins. Aside from an MSA, the server also outputs topological prediction of TMPs using the HMMTOP algorithm. Previous benchmarking of the method has shown this approach outperforms the most accurate alignment methods such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. The web server is available at http://tcoffee.crg.cat/tmcoffee.


Subjects
Algorithms , Membrane Proteins/chemistry , Sequence Analysis, Protein/statistics & numerical data , User-Computer Interface , Amino Acid Sequence , Computer Graphics , Databases, Protein , Information Storage and Retrieval , Internet , Membrane Proteins/genetics , Protein Domains , Protein Structure, Secondary , Sequence Alignment , Sequence Homology, Amino Acid
16.
J Proteome Res ; 14(11): 4450-62, 2015 Nov 06.
Article in English | MEDLINE | ID: mdl-26412692

ABSTRACT

De novo sequencing of proteins and peptides is one of the most important problems in mass spectrometry-driven proteomics. A variety of methods have been developed to accomplish this task from a set of bottom-up tandem (MS/MS) mass spectra. However, a more recently emerged top-down technology, now gaining more and more popularity, opens new perspectives for protein analysis and characterization, implying a need for efficient algorithms to process this kind of MS/MS data. Here, we describe a method that allows for the retrieval, from a set of top-down MS/MS spectra, of long and accurate sequence fragments of the proteins contained in the sample. To this end, we outline a strategy for generating high-quality sequence tags from top-down spectra, and introduce the concept of a T-Bruijn graph by adapting to the case of tags the notion of an A-Bruijn graph widely used in genomics. The output of the proposed approach represents the set of amino acid strings spelled out by optimal paths in the connected components of a T-Bruijn graph. We illustrate its performance on top-down data sets acquired from carbonic anhydrase 2 (CAH2) and the Fab region of alemtuzumab.


Subjects
Algorithms ; Peptides/isolation & purification ; Proteomics/statistics & numerical data ; Sequence Analysis, Protein/statistics & numerical data ; Tandem Mass Spectrometry/statistics & numerical data ; Alemtuzumab ; Amino Acid Sequence ; Animals ; Antibodies, Monoclonal, Humanized/chemistry ; Carbonic Anhydrase II/chemistry ; Cattle ; Databases, Protein ; Humans ; Immunoglobulin Fab Fragments/chemistry ; Molecular Sequence Data ; Peptides/chemistry ; Proteomics/methods ; Staining and Labeling/methods
17.
Biostatistics ; 16(3): 480-92, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25532524

ABSTRACT

Protein sequence data arise more and more often in vaccine and infectious disease research. These types of data are discrete, high-dimensional, and complex. We propose to study the impact of protein sequences on binary outcomes using a kernel-based logistic regression model, which models the effect of protein through a random effect whose variance-covariance matrix is mostly determined by a kernel function. We propose a novel, biologically motivated, profile hidden Markov model (HMM)-based mutual information (MI) kernel. Hypothesis testing can be carried out using the maximum of the score statistics and a parametric bootstrap procedure. To improve the power of testing, we propose intuitive modifications to the test statistic. We show through simulation studies that the profile HMM-based MI kernel can be substantially more powerful than competing kernels, and that the modified test statistics bring incremental gains in power. We use these proposed methods to investigate two problems from HIV-1 vaccine research: (1) identifying segments of HIV-1 envelope (Env) protein that confer resistance to neutralizing antibody and (2) identifying segments of Env that are associated with attenuation of protective vaccine effect by antibodies of isotype A in the RV144 vaccine trial.


Subjects
Logistic Models ; Sequence Analysis, Protein/statistics & numerical data ; AIDS Vaccines/genetics ; AIDS Vaccines/immunology ; Antibodies, Neutralizing/immunology ; Biostatistics ; Computer Simulation ; HIV Antibodies/immunology ; HIV-1/genetics ; HIV-1/immunology ; Humans ; Immunoglobulin A/immunology ; Immunoglobulin G/immunology ; Markov Chains ; Models, Statistical ; env Gene Products, Human Immunodeficiency Virus/genetics ; env Gene Products, Human Immunodeficiency Virus/immunology
18.
PLoS One ; 9(1): e86703, 2014.
Article in English | MEDLINE | ID: mdl-24475169

ABSTRACT

An efficient method for identifying DNA-binding proteins is highly desirable, given their vital roles in gene regulation, because it would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features with a random forest and then applies wrapper-based feature selection with a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with a polynomial kernel, and support vector machine with a radial basis function kernel. DBPPred yielded the highest average accuracy (0.791) and average MCC (0.583) in five-fold cross-validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 were performed with the proposed model trained on the entire PDB594 dataset and with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); DBPPred yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests of DBPPred on a large dataset composed entirely of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, we observed that, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins.
All of the experimental results indicate that DBPPred can serve as an alternative predictor for large-scale determination of DNA-binding proteins.
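
The abstract reports performance as accuracy and the Matthews correlation coefficient (MCC). As a reminder of how these two figures are derived from a binary confusion matrix, here is a minimal Python sketch. The function names and the example counts are illustrative assumptions; they are not taken from the DBPPred paper or its datasets.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal sum is zero (the usual convention
    for an otherwise undefined value).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts, chosen only to illustrate the calculation.
example = dict(tp=40, tn=45, fp=5, fn=10)
print(round(accuracy(**example), 3), round(mcc(**example), 3))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the two classes are not equally represented.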


Subjects
Algorithms ; DNA-Binding Proteins/chemistry ; RNA-Binding Proteins/chemistry ; Sequence Analysis, Protein/statistics & numerical data ; Amino Acids/chemistry ; Amino Acids/metabolism ; Animals ; Bayes Theorem ; DNA/chemistry ; DNA/metabolism ; DNA-Binding Proteins/metabolism ; Datasets as Topic ; Humans ; Normal Distribution ; Position-Specific Scoring Matrices ; Protein Structure, Secondary ; RNA-Binding Proteins/metabolism ; ROC Curve
19.
J Proteome Res ; 12(6): 2571-81, 2013 Jun 07.
Article in English | MEDLINE | ID: mdl-23668635

ABSTRACT

Because of its high specificity, trypsin is the enzyme of choice in shotgun proteomics. Nonetheless, several publications do report the identification of semitryptic and nontryptic peptides. Many of these peptides are thought to be signaling peptides or to have formed during sample preparation. It is known that only a small fraction of tandem mass spectra from a trypsin-digested protein mixture can be confidently matched to tryptic peptides. If other possibilities such as post-translational modifications and single-amino acid polymorphisms are ignored, this suggests that many unidentified spectra originate from semitryptic and nontryptic peptides. Including them in database searches, however, may not improve overall peptide identification, because expanding the search space can reduce sensitivity. To circumvent this issue for E-value-based search methods, we have designed a scheme that categorizes qualified peptides (i.e., peptides whose differences in molecular weight from the parent ion are within a specified error tolerance) into three tiers: tryptic, semitryptic, and nontryptic. This classification allows peptides that belong to different tiers to have different Bonferroni correction factors. Our results show that this scheme can significantly improve retrieval performance compared with search strategies that assign equal Bonferroni correction factors to all qualified peptides.
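
The three-tier idea can be illustrated with a short Python sketch. It rests on assumptions that are not in the abstract: the cleavage rule is simplified to "trypsin cuts after K or R" (the common proline exception is ignored), protein termini are treated as valid peptide ends, and the per-tier correction factors are invented numbers. The paper's actual factors and statistical model are not reproduced here.

```python
def is_cleavage_site(protein, i):
    """True if a peptide boundary at index i is consistent with trypsin.

    Simplified rule (assumption): trypsin cuts after K or R.
    The protein's own N- and C-termini also count as valid boundaries.
    """
    if i == 0 or i == len(protein):
        return True
    return protein[i - 1] in "KR"


def tier(protein, start, end):
    """Classify the peptide protein[start:end] by how many ends are tryptic."""
    tryptic_ends = is_cleavage_site(protein, start) + is_cleavage_site(protein, end)
    return ("nontryptic", "semitryptic", "tryptic")[tryptic_ends]


def tiered_e_value(p_value, tier_name, factors):
    """Bonferroni-style E-value using a tier-specific correction factor."""
    return p_value * factors[tier_name]


# Hypothetical protein and correction factors, for illustration only.
protein = "AAKGGGRTTTK"
factors = {"tryptic": 100, "semitryptic": 1_000, "nontryptic": 10_000}

for start, end in [(3, 7), (3, 6), (4, 6)]:
    name = tier(protein, start, end)
    print(protein[start:end], name, tiered_e_value(1e-4, name, factors))
```

With a single shared factor, every candidate would pay the penalty of the largest search space; assigning a smaller factor to the tryptic tier is what lets well-supported tryptic matches keep a low E-value.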


Subjects
Algorithms ; Models, Statistical ; Molecular Sequence Annotation/statistics & numerical data ; Peptide Fragments/isolation & purification ; Sequence Analysis, Protein/statistics & numerical data ; Animals ; Humans ; Proteolysis ; Proteomics ; Sensitivity and Specificity ; Tandem Mass Spectrometry ; Trypsin/chemistry
20.
J Biosci ; 38(1): 173-7, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23385825

ABSTRACT

A palindrome is a sequence of characters that reads the same forwards and backwards. Since the discovery of palindromic peptide sequences two decades ago, little effort has been made to understand their structural, functional and evolutionary significance. An algorithm has therefore been developed to identify all perfect palindromes (excluding palindromic subsets and tandem repeats) in a single protein sequence. The proposed algorithm does not impose any restriction on the number of residues in the input sequence. It will aid in the identification of palindromic peptide sequences of varying lengths in a single protein sequence.
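
The abstract does not describe the algorithm itself, so the following Python sketch is only one plausible way to approach the task, not the authors' method. It expands around every possible center to find maximal perfect palindromes and then discards any palindrome that lies entirely inside a longer one, which is one reading of "excluding palindromic subsets". The minimum length of 3 and the function name are assumptions, and no attempt is made to filter tandem repeats because the abstract does not define that step.

```python
def maximal_palindromes(seq, min_len=3):
    """Return (start, palindrome) pairs for maximal perfect palindromes.

    Palindromes fully contained in a longer reported palindrome are dropped.
    `min_len` is an assumed cutoff, not a value from the paper.
    """
    n = len(seq)
    spans = set()
    # 2n - 1 centers: each residue (odd length) and each gap (even length).
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2
        while left >= 0 and right < n and seq[left] == seq[right]:
            left -= 1
            right += 1
        start, end = left + 1, right  # half-open interval [start, end)
        if end - start >= min_len:
            spans.add((start, end))
    # Keep only spans that are not nested inside another span.
    kept = [
        s for s in spans
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    ]
    return sorted((start, seq[start:end]) for start, end in kept)


# Made-up sequence for illustration; positions are 0-based.
print(maximal_palindromes("AKAGGLEVEL"))
```

Center expansion runs in quadratic time in the worst case, which is adequate for single protein sequences and places no limit on input length.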


Subjects
Algorithms ; Bacterial Proteins/chemistry ; Chlorobium/genetics ; Histones/chemistry ; Sequence Analysis, Protein/statistics & numerical data ; Amino Acid Motifs ; Animals ; Bacterial Proteins/genetics ; Histones/genetics ; Mice ; Molecular Sequence Data ; Rats ; Repetitive Sequences, Amino Acid