1.
Genomics Proteomics Bioinformatics ; 21(5): 913-925, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37001856

ABSTRACT

Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start from assuming a probability distribution of protein structures given a target sequence and then find the most likely structure, while computer scientists formulate protein structure prediction as an optimization problem - finding the structural conformation with the lowest energy or minimizing the difference between predicted structure and native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts for protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, the algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories interpreting the neural networks and knowledge on protein folding are still highly desired.


Subjects
Algorithms; Proteins; Protein Conformation; Proteins/chemistry; Neural Networks, Computer; Protein Folding; Computational Biology/methods
2.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36916746

ABSTRACT

MOTIVATION: Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired. RESULTS: Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein. AVAILABILITY AND IMPLEMENTATION: The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.


Subjects
Proteins; Software; Amino Acid Sequence; Proteins/chemistry
3.
J Comput Biol ; 29(2): 92-105, 2022 02.
Article in English | MEDLINE | ID: mdl-35073170

ABSTRACT

Template-based modeling (TBM), including homology modeling and protein threading, is one of the most reliable techniques for protein structure prediction. It predicts protein structure by building an alignment between the query sequence under prediction and the templates with solved structures. However, it is still very challenging to build the optimal sequence-template alignment, especially when only distantly related templates are available. Here we report a novel deep learning approach ProALIGN that can predict much more accurate sequence-template alignment. Like protein sequences consisting of sequence motifs, protein alignments are also composed of frequently occurring alignment motifs with characteristic patterns. Alignment motifs are context-specific as their characteristic patterns are tightly related to sequence contexts of the aligned regions. Inspired by this observation, we represent a protein alignment as a binary matrix (in which 1 denotes an aligned residue pair) and then use a deep convolutional neural network to predict the optimal alignment from the query protein and its template. The trained neural network implicitly but effectively encodes an alignment scoring function, which reduces inaccuracies in the handcrafted scoring functions widely used by the current threading approaches. For a query protein and a template, we apply the neural network to directly infer likelihoods of all possible residue pairs in their entirety, which could effectively consider the correlations among multiple residues. We further construct the alignment with maximum likelihood, and finally build a structure model according to the alignment. Tested on three independent data sets with a total of 6688 protein alignment targets and 80 CASP13 TBM targets, our method achieved much better alignments and 3D structure models than the existing methods, including HHpred, CNFpred, CEthreader, and DeepThreader. 
These results clearly demonstrate the effectiveness of exploiting the context-specific alignment motifs by deep learning for protein threading.
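
The last step described above, building the maximum-likelihood alignment from per-pair likelihoods, can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not ProALIGN's actual procedure: the function names, the log-odds scoring, and the zero-cost gaps are hypothetical choices made for illustration, since the abstract does not specify them. The input `P[i][j]` stands for the predicted likelihood that query residue `i` aligns with template residue `j`.

```python
import math

def max_likelihood_alignment(P):
    """Return order-preserving aligned pairs (i, j) that maximize the summed log-odds.

    Assumption (not from the paper): each aligned pair scores log(p / (1 - p)),
    and unaligned residues (gaps) cost nothing.
    """
    n, m = len(P), len(P[0])
    eps = 1e-9  # keeps the logarithm finite when p is exactly 0 or 1
    score = [[math.log((p + eps) / (1.0 - p + eps)) for p in row] for row in P]

    # best[i][j]: best total score using the first i query and first j template residues
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j],
                             best[i][j - 1],
                             best[i - 1][j - 1] + score[i - 1][j - 1])

    # Trace back from the corner to recover the aligned pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs

def to_binary_matrix(pairs, n, m):
    """Encode an alignment as a binary matrix in which 1 marks an aligned residue pair."""
    matrix = [[0] * m for _ in range(n)]
    for i, j in pairs:
        matrix[i][j] = 1
    return matrix

# Toy likelihoods for a 3-residue query against a 3-residue template
P = [[0.9, 0.1, 0.1],
     [0.1, 0.2, 0.8],
     [0.1, 0.1, 0.3]]
pairs = max_likelihood_alignment(P)
print(pairs)
print(to_binary_matrix(pairs, 3, 3))
```

Only pairs with likelihood above 0.5 have a positive log-odds score here, so low-confidence cells are left unaligned rather than forced into the path.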


Subjects
Deep Learning; Proteins/chemistry; Sequence Alignment/statistics & numerical data; Algorithms; Amino Acid Motifs; Amino Acid Sequence; Computational Biology; Models, Molecular; Neural Networks, Computer; Protein Conformation; Proteins/genetics; Sequence Analysis, Protein/statistics & numerical data; Software
4.
Bioinformatics ; 38(4): 990-996, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34849579

ABSTRACT

MOTIVATION: Accurate prediction of protein structure relies heavily on exploiting multiple sequence alignment (MSA) for residue mutations and correlations as this information specifies protein tertiary structure. The widely used prediction approaches usually transform MSA into intermediate models, say position-specific scoring matrix or profile hidden Markov model. These intermediate models, however, cannot fully represent residue mutations and correlations carried by MSA; hence, an effective way to directly exploit MSAs is highly desirable. RESULTS: Here, we report a novel sequence set network (called Seq-SetNet) to directly and effectively exploit MSA for protein structure prediction. Seq-SetNet uses an 'encoding and aggregation' strategy that consists of two key elements: (i) an encoding module that takes a component homologue in MSA as input, and encodes residue mutations and correlations into context-specific features for each residue; and (ii) an aggregation module to aggregate the features extracted from all component homologues, which are further transformed into structural properties for residues of the query protein. As Seq-SetNet encodes each homologue protein individually, it could consider both insertions and deletions, as well as long-distance correlations among residues, thus representing more information than the intermediate models. Moreover, the encoding module automatically learns effective features and thus avoids manual feature engineering. Using symmetric aggregation functions, Seq-SetNet processes the homologue proteins as a sequence set, making its prediction results invariant to the order of these proteins. On popular benchmark sets, we demonstrated the successful application of Seq-SetNet to predict secondary structure and torsion angles of residues with improved accuracy and efficiency. AVAILABILITY AND IMPLEMENTATION: The code and datasets are available through https://github.com/fusong-ju/Seq-SetNet. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
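
The order-independence claim in this abstract follows from using symmetric aggregation functions, and a toy example may make that concrete. The sketch below is not Seq-SetNet's network: the hand-written `encode` function and its two features are hypothetical stand-ins for the learned encoding module, and max and mean are assumed here as example symmetric functions.

```python
import math

def encode(query, homologue):
    """Toy per-residue features for one homologue: [matches query, is gap].

    A stand-in for a learned encoder; the real features are not given in the abstract.
    """
    return [[float(q == h), float(h == "-")] for q, h in zip(query, homologue)]

def aggregate(encoded):
    """Combine per-homologue features with symmetric functions (max and mean).

    math.fsum returns a correctly rounded sum, so the mean does not depend on
    the order in which the homologues are listed.
    """
    n_pos, n_feat = len(encoded[0]), len(encoded[0][0])
    out = []
    for p in range(n_pos):
        feats = []
        for f in range(n_feat):
            values = [e[p][f] for e in encoded]
            feats.append(max(values))
            feats.append(math.fsum(values) / len(values))
        out.append(feats)
    return out

query = "MKVAL"
msa = ["MKV-L", "MRVAL", "MKIAL"]

forward = aggregate([encode(query, h) for h in msa])
shuffled = aggregate([encode(query, h) for h in reversed(msa)])
print(forward == shuffled)  # the aggregate is the same for either ordering
```

Any function that treats its inputs as a set (sum, mean, max, min) would give the same guarantee; an order-sensitive operation such as concatenation would not.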


Subjects
Proteins; Software; Sequence Alignment; Proteins/genetics; Proteins/chemistry; Protein Structure, Secondary; Position-Specific Scoring Matrices; Algorithms
5.
Nat Commun ; 12(1): 2535, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33953201

ABSTRACT

Residue co-evolution has become the primary principle for estimating inter-residue distances of a protein, which are crucially important for predicting protein structure. Most existing approaches adopt an indirect strategy, i.e., inferring residue co-evolution based on some hand-crafted features, say, a covariance matrix, calculated from multiple sequence alignment (MSA) of target protein. This indirect strategy, however, cannot fully exploit the information carried by MSA. Here, we report an end-to-end deep neural network, CopulaNet, to estimate residue co-evolution directly from MSA. The key elements of CopulaNet include: (i) an encoder to model context-specific mutation for each residue; (ii) an aggregator to model residue co-evolution, and thereafter estimate inter-residue distances. Using CASP13 (the 13th Critical Assessment of Protein Structure Prediction) target proteins as representatives, we demonstrate that CopulaNet can predict protein structure with improved accuracy and efficiency. This study represents a step toward improved end-to-end prediction of inter-residue distances and protein tertiary structures.


Subjects
Machine Learning; Proteins/chemistry; Sequence Alignment; Caspases/chemistry; Computational Biology; Humans; Models, Molecular; Mutation; Neural Networks, Computer; Protein Structure, Tertiary; Proteins/genetics
6.
BMC Bioinformatics ; 21(1): 503, 2020 Nov 05.
Article in English | MEDLINE | ID: mdl-33153432

ABSTRACT

BACKGROUND: The formation of contacts among protein secondary structure elements (SSEs) is an important step in protein folding as it determines the topology of protein tertiary structure; hence, inferring inter-SSE contacts is crucial to protein structure prediction. One of the existing strategies infers inter-SSE contacts directly from the predicted probabilities of inter-residue contacts without any preprocessing, and thus suffers from the excessive noise in the predicted inter-residue contacts. Another strategy first defines SSEs based on protein secondary structure prediction, and then judges whether each candidate SSE pair could form a contact or not. However, it is difficult to accurately determine the boundaries of SSEs due to the errors in secondary structure prediction. The incorrectly deduced SSEs hinder subsequent prediction of the contacts among them. RESULTS: We here report an accurate approach to infer the inter-SSE contacts (hence the name ISSEC) using the deep object detection technique. The design of ISSEC is based on the observation that, in the inter-residue contact map, the contacting SSEs usually form rectangular regions with characteristic patterns. Therefore, ISSEC infers inter-SSE contacts by detecting such rectangular regions. Unlike the existing approach that directly uses the predicted probabilities of inter-residue contact, ISSEC applies the deep convolution technique to extract high-level features from the inter-residue contacts. More importantly, ISSEC does not rely on pre-defined SSEs. Instead, ISSEC enumerates multiple candidate rectangular regions in the predicted inter-residue contact map, and for each region, ISSEC calculates a confidence score to measure whether it has characteristic patterns or not. ISSEC employs a greedy strategy to select non-overlapping regions with high confidence scores, and finally infers inter-SSE contacts according to these regions. 
CONCLUSIONS: Comprehensive experimental results suggested that ISSEC outperformed the state-of-the-art approaches in predicting inter-SSE contacts. We further demonstrated the successful applications of ISSEC to improve prediction of both inter-residue contacts and tertiary structure as well.
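
The greedy selection of non-overlapping, high-confidence regions mentioned in the RESULTS can be sketched as follows. This is an illustrative sketch only, assuming a simple score-sorted selection; the tuple layout, the half-open coordinate convention, and the `min_score` threshold are hypothetical and not taken from ISSEC.

```python
def overlaps(a, b):
    """True if two rectangles (row_start, row_end, col_start, col_end, score) intersect.

    Ends are exclusive, so rectangles that merely touch do not overlap.
    """
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]

def greedy_select(candidates, min_score=0.5):
    """Pick rectangles in descending score order, skipping any that overlap a pick."""
    selected = []
    for rect in sorted(candidates, key=lambda r: r[4], reverse=True):
        if rect[4] < min_score:
            break  # remaining candidates all score lower
        if not any(overlaps(rect, kept) for kept in selected):
            selected.append(rect)
    return selected

# Candidate regions in a contact map: (row_start, row_end, col_start, col_end, score)
candidates = [
    (0, 5, 10, 15, 0.9),
    (3, 8, 12, 18, 0.8),    # overlaps the first region, so it is skipped
    (20, 25, 30, 35, 0.7),
    (40, 45, 50, 55, 0.2),  # below the confidence threshold
]
print(greedy_select(candidates))
```

This is the same pattern as non-maximum suppression in object detection, simplified to reject any overlap rather than overlap above a ratio.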


Subjects
Algorithms; Proteins/chemistry; Databases, Protein; Membrane Proteins/chemistry; Protein Conformation, beta-Strand; Protein Structure, Secondary
7.
BMC Bioinformatics ; 20(1): 616, 2019 Nov 29.
Article in English | MEDLINE | ID: mdl-31783729

ABSTRACT

Following publication of the original article [1], the author explained that there are several errors in the original article.

8.
BMC Bioinformatics ; 20(1): 537, 2019 Oct 29.
Article in English | MEDLINE | ID: mdl-31664895

ABSTRACT

BACKGROUND: Accurate prediction of inter-residue contacts of a protein is important to calculating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective in inferring inter-residue contacts. The Markov random field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is accurate but time-consuming to calculate; in contrast, approximations to the actual likelihood, say pseudo-likelihood, are efficient to calculate but inaccurate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. RESULTS: In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA using pseudo-likelihood, i.e., the product of conditional probability of individual residues, our approach uses composite-likelihood, i.e., the product of conditional probability of all residue pairs. Composite likelihood has been theoretically proved as a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, including PSICOV dataset and CASP-11 dataset, to show that: i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy. ii) When equipped with deep learning technique for refinement, the prediction accuracy of clmDCA was further significantly improved, suggesting the suitability of clmDCA for subsequent refinement procedure. We further present a successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset. CONCLUSIONS: Composite likelihood maximization algorithm can efficiently estimate the parameters of Markov Random Fields and can improve the prediction accuracy of protein inter-residue contacts.
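
The two approximations contrasted in this abstract can be written side by side. The notation below is an assumption introduced for illustration and is not taken from the paper: $x = (x_1, \dots, x_L)$ is one aligned sequence of length $L$, $\theta$ denotes the MRF parameters, and $x_{\setminus S}$ denotes all residues except those indexed by $S$. For a full alignment, each expression would additionally be multiplied over the sequences.

```latex
\mathrm{PL}(\theta) = \prod_{i=1}^{L} P_\theta\left(x_i \mid x_{\setminus \{i\}}\right),
\qquad
\mathrm{CL}(\theta) = \prod_{1 \le i < j \le L} P_\theta\left(x_i, x_j \mid x_{\setminus \{i,j\}}\right)
```

Pseudo-likelihood conditions one residue at a time on all the others, whereas composite likelihood conditions each residue pair jointly, so every factor retains the direct dependence between the two residues it covers.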


Subjects
Deep Learning; Proteins/chemistry; Algorithms; Probability
9.
BMC Bioinformatics ; 20(Suppl 3): 135, 2019 Mar 29.
Article in English | MEDLINE | ID: mdl-30925867

ABSTRACT

BACKGROUND: The ab initio approaches to protein structure prediction usually employ the Monte Carlo technique to search for the structural conformation that has the lowest energy. However, the widely used energy functions are usually ineffective for conformation search. How to construct an effective energy function remains a challenging task. RESULTS: Here, we present a framework to construct effective energy functions for protein structure prediction. Unlike existing energy functions, which only require the native structure to be the lowest one, we attempt to maximize the attraction-basin where the native structure lies in the energy landscape. The underlying rationale is that each energy function determines a specific energy landscape together with a native attraction-basin, and the larger the attraction-basin is, the more likely the Monte Carlo search procedure is to find the native structure. Following this rationale, we constructed effective energy functions as follows: i) To explore the native attraction-basin determined by a certain energy function, we performed reverse Monte Carlo sampling starting from the native structure, identifying the structural conformations on the edge of the attraction-basin. ii) To broaden the native attraction-basin, we smoothed the edge points of the attraction-basin by tuning the weights of energy terms, thus acquiring an improved energy function. Our framework alternates the broadening attraction-basin and reverse sampling steps (thus called BARS) until the native attraction-basin is sufficiently large. We present extensive experimental results to show that using the BARS framework, the constructed energy functions could greatly facilitate protein structure prediction in improving the quality of predicted structures and speeding up conformation search. 
CONCLUSION: Using the BARS framework, we constructed effective energy functions for protein structure prediction, which could improve the quality of predicted structures and speed up conformation search as well.


Subjects
Computational Biology/methods; Monte Carlo Method; Proteins/chemistry; Algorithms; Databases, Protein; Protein Conformation; Thermodynamics
10.
Bioinformatics ; 33(23): 3749-3757, 2017 Dec 01.
Article in English | MEDLINE | ID: mdl-28961795

ABSTRACT

MOTIVATION: Accurate recognition of protein fold types is a key step for template-based prediction of protein structures. The existing approaches to fold recognition mainly exploit the features derived from alignments of query protein against templates. These approaches have been shown to be successful for fold recognition at family level, but usually fail at superfamily/fold levels. To overcome this limitation, one of the key points is to explore more structurally informative features of proteins. Although residue-residue contacts carry abundant structural information, how to thoroughly exploit this information for fold recognition still remains a challenge. RESULTS: In this study, we present an approach (called DeepFR) to improve fold recognition at superfamily/fold levels. The basic idea of our approach is to extract fold-specific features from predicted residue-residue contacts of proteins using the deep convolutional neural network (DCNN) technique. Based on these fold-specific features, we calculated the similarity between the query protein and templates, and then assigned the query protein the fold type of the most similar template. DCNN has shown excellent performance in image feature extraction and image recognition; the rationale underlying the application of DCNN for fold recognition is that contact likelihood maps are essentially analogous to images, as they both display compositional hierarchy. Experimental results on the LINDAHL dataset suggest that even using the extracted fold-specific features alone, our approach achieved a success rate comparable to the state-of-the-art approaches. When further combining these features with traditional alignment-related features, the success rate of our approach increased to 92.3%, 82.5% and 78.8% at family, superfamily and fold levels, respectively, which is about 18% higher than the state-of-the-art approach at fold level, 6% higher at superfamily level and 1% higher at family level. 
An independent assessment on SCOP_TEST dataset showed consistent performance improvement, indicating robustness of our approach. Furthermore, bi-clustering results of the extracted features are compatible with fold hierarchy of proteins, implying that these features are fold-specific. Together, these results suggest that the features extracted from predicted contacts are orthogonal to alignment-related features, and the combination of them could greatly facilitate fold recognition at superfamily/fold levels and template-based prediction of protein structures. AVAILABILITY AND IMPLEMENTATION: Source code of DeepFR is freely available through https://github.com/zhujianwei31415/deepfr, and a web server is available through http://protein.ict.ac.cn/deepfr. CONTACT: zheng@itp.ac.cn or dbu@ict.ac.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
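
The assignment step described in the RESULTS, giving the query the fold type of its most similar template, amounts to a nearest-neighbour lookup over feature vectors. The sketch below assumes cosine similarity and short made-up vectors; the abstract does not state which similarity measure DeepFR uses, and the template names and fold labels are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_fold(query_features, templates):
    """Return (fold, template_name) of the template most similar to the query.

    templates: list of (template_name, fold_label, feature_vector) tuples.
    """
    name, fold, _ = max(templates, key=lambda t: cosine(query_features, t[2]))
    return fold, name

# Hypothetical fold-specific feature vectors (in practice these would be DCNN outputs)
templates = [
    ("template_1", "fold_A", [1.0, 0.0, 0.0]),
    ("template_2", "fold_B", [0.0, 1.0, 0.0]),
    ("template_3", "fold_C", [0.6, 0.8, 0.0]),
]
print(assign_fold([0.1, 0.9, 0.0], templates))
```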


Subjects
Protein Folding; Algorithms; Neural Networks, Computer; Proteins/chemistry; Software
11.
BMC Bioinformatics ; 18(Suppl 3): 70, 2017 Mar 14.
Article in English | MEDLINE | ID: mdl-28361691

ABSTRACT

BACKGROUND: Residues in a protein might be buried inside or exposed to the solvent surrounding the protein. The buried residues usually form hydrophobic cores to maintain the structural integrity of proteins while the exposed residues are tightly related to protein functions. Thus, the accurate prediction of solvent accessibility of residues will greatly facilitate our understanding of both structure and functionalities of proteins. Most of the state-of-the-art prediction approaches consider the burial state of each residue independently, thus neglecting the correlations among residues. RESULTS: In this study, we present a high-order conditional random field model that considers burial states of all residues in a protein simultaneously. Our approach exploits not only the correlation among adjacent residues but also the correlation among long-range residues. Experimental results showed that by exploiting the correlation among residues, our approach outperformed the state-of-the-art approaches in prediction accuracy. In-depth case studies also showed that by using the high-order statistical model, the errors committed by the bidirectional recurrent neural network and chain conditional random field models were successfully corrected. CONCLUSIONS: Our methods enable the accurate prediction of residue burial states, which should greatly facilitate protein structure prediction and evaluation.


Subjects
Models, Theoretical; Proteins/chemistry; Databases, Factual; Hydrophobic and Hydrophilic Interactions; Protein Conformation; Reproducibility of Results; Solvents/chemistry
12.
Biochem Biophys Res Commun ; 472(1): 217-22, 2016 Mar 25.
Article in English | MEDLINE | ID: mdl-26920058

ABSTRACT

Strategies for correlation analysis in protein contact prediction often encounter two challenges, namely, the indirect coupling among residues, and the background correlations mainly caused by phylogenetic biases. While various studies have been conducted on how to disentangle indirect coupling, the removal of background correlations still remains unresolved. Here, we present an approach for removing background correlations via low-rank and sparse decomposition (LRS) of a residue correlation matrix. The correlation matrix can be constructed using either local inference strategies (e.g., mutual information, or MI) or global inference strategies (e.g., direct coupling analysis, or DCA). In our approach, a correlation matrix was decomposed into two components, i.e., a low-rank component representing background correlations, and a sparse component representing true correlations. Finally the residue contacts were inferred from the sparse component of correlation matrix. We trained our LRS-based method on the PSICOV dataset, and tested it on both GREMLIN and CASP11 datasets. Our experimental results suggested that LRS significantly improves the contact prediction precision. For example, when equipped with the LRS technique, the prediction precision of MI and mfDCA increased from 0.25 to 0.67 and from 0.58 to 0.70, respectively (Top L/10 predicted contacts, sequence separation: 5 AA, dataset: GREMLIN). In addition, our LRS technique also consistently outperforms the popular denoising technique APC (average product correction), on both local (MI_LRS: 0.67 vs MI_APC: 0.34) and global measures (mfDCA_LRS: 0.70 vs mfDCA_APC: 0.67). Interestingly, we found out that when equipped with our LRS technique, local inference strategies performed in a comparable manner to that of global inference strategies, implying that the application of LRS technique narrowed down the performance gap between local and global inference strategies. 
Overall, our LRS technique greatly facilitates protein contact prediction by removing background correlations. An implementation of the approach called COLORS (improving COntact prediction using LOw-Rank and Sparse matrix decomposition) is available from http://protein.ict.ac.cn/COLORS/.
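
For context, the APC baseline that LRS is compared against can be written in a few lines. The sketch below shows one common formulation of average product correction, in which each score is reduced by the product of its row and column means divided by the overall mean; excluding the diagonal from the means is an assumption made here. It illustrates only the baseline and does not implement the low-rank and sparse decomposition itself.

```python
def apc_correct(C):
    """Average product correction of a symmetric residue correlation matrix.

    corrected[i][j] = C[i][j] - mean_i * mean_j / overall_mean,
    with means taken over off-diagonal entries (an assumption of this sketch).
    """
    n = len(C)
    row_mean = [sum(C[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    overall = sum(row_mean) / n
    return [[0.0 if i == j else C[i][j] - row_mean[i] * row_mean[j] / overall
             for j in range(n)]
            for i in range(n)]

# Uniform background correlation of 0.2 plus one strongly coupled pair (0, 3)
n = 4
C = [[0.0 if i == j else 0.2 for j in range(n)] for i in range(n)]
C[0][3] = C[3][0] = 0.9

corrected = apc_correct(C)
best = max(((i, j) for i in range(n) for j in range(i + 1, n)),
           key=lambda ij: corrected[ij[0]][ij[1]])
print(best)  # the coupled pair keeps the highest corrected score
```

A purely uniform background is removed completely by this correction, which is the sense in which APC treats the rank-one product of means as noise.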


Subjects
Protein Interaction Domains and Motifs; Protein Interaction Mapping/methods; Algorithms; Computer Simulation; Databases, Protein; Evolution, Molecular; Models, Molecular; Models, Statistical; Phylogeny; Principal Component Analysis; Protein Conformation; Protein Folding; Protein Interaction Mapping/statistics & numerical data; Protein Interaction Maps; Sequence Analysis, Protein
13.
Bioinformatics ; 32(3): 462-4, 2016 Feb 01.
Article in English | MEDLINE | ID: mdl-26454278

ABSTRACT

SUMMARY: The protein structure prediction approaches can be categorized into template-based modeling (including homology modeling and threading) and free modeling. However, the existing threading tools perform poorly on remote homologous proteins. Thus, improving fold recognition for remote homologous proteins remains a challenge. Besides, the proteome-wide structure prediction poses another challenge of increasing prediction throughput. In this study, we presented FALCON@home as a protein structure prediction server focusing on remote homologue identification. The design of FALCON@home is based on the observation that a structural template, especially for remote homologous proteins, consists of conserved regions interweaved with highly variable regions. The highly variable regions lead to vague alignments in threading approaches. Thus, FALCON@home first extracts conserved regions from each template and then aligns a query protein with conserved regions only rather than the full-length template directly. This helps avoid the vague alignments rooted in highly variable regions, improving remote homologue identification. We implemented FALCON@home using the Berkeley Open Infrastructure of Network Computing (BOINC) volunteer computing protocol. With computation power donated from over 20,000 volunteer CPUs, FALCON@home shows a throughput as high as processing of over 1000 proteins per day. In the Critical Assessment of protein Structure Prediction (CASP11), the FALCON@home-based prediction was ranked the 12th in the template-based modeling category. As an application, the structures of 880 mouse mitochondria proteins were predicted, which revealed the significant correlation between protein half-lives and protein structural factors. AVAILABILITY AND IMPLEMENTATION: FALCON@home is freely available at http://protein.ict.ac.cn/FALCON/. CONTACT: shuaicli@cityu.edu.hk, dbu@ict.ac.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Protein Conformation; Proteins/chemistry; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Software; Animals; Computational Biology/methods; Databases, Protein; High-Throughput Screening Assays; Mice
14.
Comput Biol Chem ; 53 Pt A: 118-24, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25213854

ABSTRACT

We extend the self-organizing approach for annotation of a bacterial genome to analyze the raw sequencing data of the human gut metagenome without sequence assembling. The original approach divides the genomic sequence of a bacterium into non-overlapping segments of equal length and assigns to each segment one of seven 'phases', among which one is for the noncoding regions, three for the direct coding regions to indicate the three possible codon positions of the segment starting site, and three for the reverse coding regions. The noncoding phase and the six coding phases are described by two frequency tables of the 64 triplet types or 'codon usages'. A set of codon usages can be used to update the phase assignment and vice versa. An iteration after an initialization leads to a convergent phase assignment to give an annotation of the genome. In the extension of the approach to a metagenome, we consider a mixture model of a number of categories described by different codon usages. The Illumina Genome Analyzer sequencing data of the total DNA from faecal samples are then examined to understand the diversity of the human gut microbiome.


Subjects
Chromosome Mapping/statistics & numerical data; Codon; Genome, Bacterial; Metagenome; Sequence Analysis, DNA/statistics & numerical data; Chromosome Mapping/methods; Feces/chemistry; Feces/microbiology; Gastrointestinal Tract/microbiology; High-Throughput Nucleotide Sequencing; Humans; Microbiota/genetics; Molecular Sequence Annotation
15.
J Bioinform Comput Biol ; 7(1): 39-54, 2009 Feb.
Article in English | MEDLINE | ID: mdl-19226659

ABSTRACT

By means of the imbedded Markov chain technique, an efficient algorithm is proposed to calculate exactly the first and second moments of word counts, and the probability that a word occurs at least once, in random texts generated by a Markov chain. A generating function is introduced directly from the imbedded Markov chain to derive asymptotic approximations for the problem. Two Z-scores, one based on the number of sequences with hits and the other on the total number of word hits in a set of sequences, are examined for discovery of motifs on a set of promoter sequences extracted from the A. thaliana genome. Source code is available at http://www.itp.ac.cn/zheng/oligo.c.
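
The simplest of the quantities named above, the first moment of a word count, can be obtained without the imbedded-chain machinery by linearity of expectation. The sketch below assumes a first-order Markov chain started from its stationary distribution; the function name and argument layout are hypothetical, and it does not cover the second moment or the at-least-once probability that the paper's algorithm computes.

```python
def expected_word_count(word, text_length, stationary, transition):
    """Expected number of (possibly overlapping) occurrences of `word`.

    Assumes a first-order Markov chain in its stationary regime, so every
    start position has the same occurrence probability:
        P(word) = pi(w_1) * product over k of T(w_k -> w_{k+1})
    and the expectation is (text_length - len(word) + 1) * P(word).
    """
    m = len(word)
    if text_length < m:
        return 0.0
    prob = stationary[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= transition[a][b]
    return (text_length - m + 1) * prob

# Uniform chain over the DNA alphabet (equivalent to independent uniform letters)
alphabet = "ACGT"
pi = {a: 0.25 for a in alphabet}
T = {a: {b: 0.25 for b in alphabet} for a in alphabet}

print(expected_word_count("ACG", 10, pi, T))  # 8 start positions, each with probability 1/64
```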


Subjects
Algorithms; Consensus Sequence/genetics; DNA/genetics; Markov Chains; Pattern Recognition, Automated/methods; Sequence Analysis, DNA/methods; Base Sequence; Data Interpretation, Statistical; Molecular Sequence Data
16.
J Bioinform Comput Biol ; 6(2): 347-66, 2008 Apr.
Article in English | MEDLINE | ID: mdl-18464327

ABSTRACT

Fast, efficient, and reliable algorithms for pairwise alignment of protein structures are in ever-increasing demand for analyzing the rapidly growing data on protein structures. CLePAPS is a tool developed for this purpose. It distinguishes itself from other existing algorithms by the use of conformational letters, which are discretized states of 3D segmental structural states. A letter corresponds to a cluster of combinations of the three angles formed by Calpha pseudobonds of four contiguous residues. A substitution matrix called CLESUM is available to measure the similarity between any two such letters. CLePAPS regards an aligned fragment pair (AFP) as an ungapped string pair with a high sum of pairwise CLESUM scores. Using CLESUM scores as the similarity measure, CLePAPS searches for AFPs by simple string comparison. The transformation which best superimposes a highly similar AFP can be used to superimpose the structure pairs under comparison. A highly scored AFP which is consistent with several other AFPs determines an initial alignment. CLePAPS then joins consistent AFPs guided by their similarity scores to extend the alignment by several "zoom-in" iteration steps. A follow-up refinement produces the final alignment. CLePAPS does not implement dynamic programming. The utility of CLePAPS is tested on various protein structure pairs.
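
Because an aligned fragment pair is defined above as an ungapped string pair with a high summed score, finding candidates reduces to comparing fixed-width windows of two conformational-letter strings. The sketch below is a brute-force illustration under stated assumptions: `toy_score` is a made-up stand-in for the CLESUM matrix (whose values are not given here), and the window width and threshold are arbitrary.

```python
def toy_score(x, y):
    """Stand-in substitution score: +2 for identical letters, -1 otherwise.

    CLESUM would supply letter-pair specific scores instead.
    """
    return 2 if x == y else -1

def find_afps(a, b, width, threshold, score=toy_score):
    """List ungapped window pairs (start_a, start_b, total) scoring at least `threshold`.

    Results are sorted with the highest-scoring pair first.
    """
    hits = []
    for i in range(len(a) - width + 1):
        for j in range(len(b) - width + 1):
            total = sum(score(x, y) for x, y in zip(a[i:i + width], b[j:j + width]))
            if total >= threshold:
                hits.append((i, j, total))
    return sorted(hits, key=lambda h: h[2], reverse=True)

# Two hypothetical conformational-letter strings sharing the fragment "CDEF"
print(find_afps("ABCDEFG", "XXCDEFY", width=4, threshold=8))
```

No coordinates are touched at this stage, which is why string comparison of this kind is cheap; superposition is only needed once promising fragment pairs have been found.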


Subjects
Algorithms; Protein Conformation; Sequence Analysis, Protein/methods; Amino Acid Sequence; Animals; Computational Biology/methods; Humans; Protein Structure, Secondary; Proteins/chemistry
17.
Proteins ; 71(2): 728-36, 2008 May 01.
Article in English | MEDLINE | ID: mdl-17979193

ABSTRACT

CLEMAPS is a tool for multiple alignment of protein structures. It distinguishes itself from other existing algorithms for multiple structure alignment by the use of conformational letters, which are discretized states of 3D segmental structural states. A letter corresponds to a cluster of combinations of three angles formed by C(alpha) pseudobonds of four contiguous residues. A substitution matrix called CLESUM is available to measure the similarity between any two such letters. The input 3D structures are first converted to sequences of conformational letters. Each string of a fixed length is then taken as the center seed to search other sequences for neighbors of the seed, which are strings similar to the seed. A seed and its neighbors form a center-star, which corresponds to a fragment set of local structural similarity shared by many proteins. The detection of center-stars using CLESUM is extremely efficient. Local similarity is a necessary, but insufficient, condition for structural alignment. Once center-stars are found, the spatial consistency between any two stars are examined to find consistent star duads using atomic coordinates. Consistent duads are later joined to create a core for multiple alignment, which is further polished to produce the final alignment. The utility of CLEMAPS is tested on various protein structure ensembles.


Subjects
Algorithms , Protein Conformation , Protein Sequence Analysis/methods , Amino Acid Sequence , Computational Biology/methods , Protein Secondary Structure , Proteins/chemistry
18.
BMC Bioinformatics ; 8: 349, 2007 Sep 19.
Article in English | MEDLINE | ID: mdl-17880681

ABSTRACT

BACKGROUND: The two main kinds of automatic gene annotation frameworks are ab initio and alignment-based, and the latter splits into two subgroups. The first subgroup is used for intra-species alignment and includes successful methods with high specificity and speed. The second contains more sensitive methods, which are usually applied to aligning inter-species sequences. RESULTS: Here we present a new algorithm called CAT (Cross-species Alignment Tool). It is designed to align mRNA sequences to mammalian-sized genomes. CAT is implemented using C scripts and is freely available on the web at http://xat.sourceforge.net/. CONCLUSIONS: Evaluated from several angles, CAT outperforms other existing alignment tools. In tests against all available mouse-human and zebrafish-human orthologs, CAT combines the specificity and speed of the best intra-species algorithms, such as BLAT and sim4, with the sensitivity of the best inter-species tools, such as GeneWise.


Subjects
Algorithms , Chromosome Mapping/methods , Messenger RNA/genetics , Sequence Alignment/methods , RNA Sequence Analysis/methods , Nucleic Acid Sequence Homology , Software , Animals , Base Sequence , Humans , Molecular Sequence Data , Species Specificity
19.
J Bioinform Comput Biol ; 4(3): 769-82, 2006 Jun.
Article in English | MEDLINE | ID: mdl-16960974

ABSTRACT

Amino acid substitution matrices play an essential role in protein sequence alignment, a fundamental task in bioinformatics. The most widely used matrices, such as PAM matrices derived from homologous sequences and BLOSUM matrices derived from aligned segments of PROSITE, do not incorporate conformational information in their construction. The few structure-based matrices that exist are derived from limited structure-alignment data. Using the PDB_SELECT and DSSP databases, we create a database of sequence-conformation blocks that explicitly represent the sequence-structure relationship. Members of a block are identical in conformation and highly similar in sequence. From this block database, we derive a conformation-specific amino acid substitution matrix, CBSM60. The matrix shows improved performance in conformational segment search and homolog detection.
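
The abstract does not spell out how the matrix is derived from the blocks. As a hedged illustration only, the sketch below applies the familiar BLOSUM-style log-odds calculation to toy blocks of aligned segments: observed pair frequencies in block columns are compared with the frequencies expected from the residue background. The function name `log_odds_matrix`, the toy block, and the half-bit scaling are assumptions, not the CBSM60 procedure.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(blocks):
    """blocks: list of blocks, each a list of equal-length aligned segments.

    Returns {(a, b): score} with a <= b, in half-bit units, for residue
    pairs that were observed at least once in some block column.
    """
    pair_counts = Counter()
    for block in blocks:
        for column in zip(*block):
            for a, b in combinations(column, 2):
                pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())

    # Background frequency of each residue implied by the pair counts.
    freq = Counter()
    for (a, b), n in pair_counts.items():
        freq[a] += n / (2 * total)
        freq[b] += n / (2 * total)

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        expected = freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = 2 * math.log2(observed / expected)
    return scores


# Toy block of three aligned two-residue segments (hypothetical data).
scores = log_odds_matrix([["AB", "AB", "AA"]])
```

A conformation-specific variant would presumably build one such table per conformational class, restricting each block to segments sharing that conformation; that step is not shown here.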


Subjects
Amino Acid Substitution , Computational Biology/methods , Protein Conformation , Proteins/chemistry , Amino Acid Sequence , Protein Databases , Molecular Sequence Data , Protein Folding , Protein Secondary Structure , Sequence Alignment , Protein Sequence Analysis
20.
Int J Bioinform Res Appl ; 1(4): 420-8, 2005.
Article in English | MEDLINE | ID: mdl-18048146

ABSTRACT

Instead of conformation states of single residues, refined conformation states of quintuplets are proposed to reflect conformation correlation. Simple hidden Markov models combined with sliding-window scores are used to predict the secondary structure of a protein from its amino acid sequence. Since the length of protein conformation segments varies within a narrow range, the duration effect of the length distribution can be ignored. The window scores for residues are a window version of the Chou-Fasman propensities, estimated under an approximation of conditional independence. Different window widths are examined, and the optimal width is found to be 17. A high accuracy of about 70% is achieved.
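
To make the sliding-window score concrete, here is a minimal sketch assuming that, under conditional independence, the score of a state at a residue is the sum of per-residue log-propensities over a window centered on that residue. The propensity values in `TOY_LOG_PROPENSITY` and the function names are hypothetical, the table ignores position within the window, and the hidden Markov model part of the method is omitted.

```python
WINDOW = 17  # optimal window width reported in the abstract

# Hypothetical log-propensities for helix (H), strand (E), and coil (C).
TOY_LOG_PROPENSITY = {
    "H": {"A": 0.4, "G": -0.5},
    "E": {"V": 0.5, "A": -0.2},
    "C": {"G": 0.5, "V": -0.3},
}


def window_scores(seq, log_propensity, width=WINDOW):
    """Return one {state: score} dict per residue, summing over a centered window."""
    half = width // 2
    result = []
    for i in range(len(seq)):
        # The window is truncated at the sequence ends.
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        result.append({
            state: sum(table.get(aa, 0.0) for aa in seq[lo:hi])
            for state, table in log_propensity.items()
        })
    return result


def predict(seq, log_propensity, width=WINDOW):
    """Assign each residue the state with the highest window score."""
    return "".join(max(s, key=s.get) for s in window_scores(seq, log_propensity, width))


helix_call = predict("AAAAA", TOY_LOG_PROPENSITY, width=3)
strand_call = predict("VVVVV", TOY_LOG_PROPENSITY, width=3)
```

In the method described, scores of this kind would feed a hidden Markov model rather than a per-residue argmax; the argmax is used here only to keep the example short.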


Subjects
Algorithms , Protein Secondary Structure , Amino Acid Sequence , Protein Conformation , Proteins/chemistry
...