1.
Bioinformatics ; 38(11): 2988-2995, 2022 05 26.
Article in English | MEDLINE | ID: mdl-35385080

ABSTRACT

MOTIVATION: A high-quality sequence alignment (SA) is the most important input feature for accurate protein structure prediction. For a protein sequence, there are many methods to generate a SA. However, when given a choice of more than one SA for a protein sequence, there are no methods to predict which SA may lead to more accurate models without actually building the models. In this work, we describe a method to predict the quality of a protein's SA. RESULTS: We created our own dataset by generating a variety of SAs for a set of 1351 representative proteins and investigated various deep learning architectures to predict the local distance difference test (lDDT) scores of distance maps predicted with SAs as the input. These lDDT scores serve as indicators of the quality of the SAs. Using two independent test datasets consisting of CASP13 and CASP14 targets, we show that our method is effective for scoring and ranking SAs when a pool of SAs is available for a protein sequence. Using an example, we further show that SA selection with our method can lead to improved structure prediction. AVAILABILITY AND IMPLEMENTATION: Code and the data underlying this article are available at https://github.com/ba-lab/Alignment-Score/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Sequence Alignment , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence
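
To make the prediction target in the abstract above more concrete, here is a minimal sketch of an lDDT-style score computed between a predicted and a reference residue-level distance map. It assumes the commonly used lDDT settings (a 15 Å inclusion radius and tolerance thresholds of 0.5, 1, 2 and 4 Å). The function name `lddt_from_distance_maps` and the `min_sep` parameter are hypothetical, and the abstract does not state which lDDT variant the authors actually used.

```python
def lddt_from_distance_maps(pred, ref, radius=15.0,
                            thresholds=(0.5, 1.0, 2.0, 4.0), min_sep=1):
    """Simplified lDDT-style score on residue-level distance maps.

    pred, ref: square nested lists of distances in angstroms.
    Only pairs closer than `radius` in the reference are scored
    (assumed convention; not taken from the abstract).
    """
    n = len(ref)
    considered = 0   # reference pairs inside the inclusion radius
    preserved = 0    # (pair, threshold) combinations that are preserved
    for i in range(n):
        for j in range(i + min_sep, n):
            if ref[i][j] >= radius:
                continue
            considered += 1
            diff = abs(pred[i][j] - ref[i][j])
            preserved += sum(1 for t in thresholds if diff <= t)
    if considered == 0:
        return 0.0
    # Average the preserved fraction over all thresholds.
    return preserved / (considered * len(thresholds))


# Toy 3-residue example with made-up distances.
reference = [[0.0, 5.0, 20.0], [5.0, 0.0, 6.0], [20.0, 6.0, 0.0]]
predicted = [[0.0, 5.3, 18.0], [5.3, 0.0, 9.0], [18.0, 9.0, 0.0]]
score = lddt_from_distance_maps(predicted, reference)
```

A score of this kind lies between 0 and 1, so a pool of alignments could in principle be ranked by the score predicted for each one.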
2.
IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3586-3594, 2022.
Article in English | MEDLINE | ID: mdl-34559660

ABSTRACT

BACKGROUND: Much of the recent success in protein structure prediction has been a result of accurate protein contact prediction, a binary classification problem. Dozens of methods, built from various types of machine learning and deep learning algorithms, have been published over the last two decades for predicting contacts. Recently, many groups, including Google DeepMind, have demonstrated that reformulating the problem as a multi-class classification problem is a more promising direction to pursue. As an alternative approach, we recently proposed real-valued distance predictions, formulating the problem as a regression problem. The nuances of protein 3D structures make this formulation appropriate, allowing predictions to reflect inter-residue distances in nature. Despite these promises, the accurate prediction of real-valued distances remains relatively unexplored, possibly because classification is better suited to machine and deep learning algorithms. METHODS: Can regression methods be designed to predict real-valued distances as precisely as binary contacts? To investigate this, we propose multiple novel methods of input label engineering, which is different from feature engineering, with the goal of optimizing the distribution of distances to cater to the loss function of the deep-learning model. Since an important utility of predicted contacts or distances is to build three-dimensional models, we also tested whether predicted distances can reconstruct more accurate models than contacts. RESULTS: Our results demonstrate, for the first time, that deep learning methods for real-valued protein distance prediction can deliver distances as precise as binary classification methods. When using an optimal distance transformation function on the standard PSICOV dataset consisting of 150 representative proteins, the precision of 'top-all' long-range contacts improves from 60.9% to 61.4% when predicting real-valued distances instead of contacts. When building three-dimensional models, we observed an average TM-score increase from 0.61 to 0.72, highlighting the advantage of predicting real-valued distances.


Subject(s)
Deep Learning , Computational Biology/methods , Proteins/chemistry , Algorithms , Machine Learning
3.
Int J Mol Sci ; 22(11), 2021 May 24.
Article in English | MEDLINE | ID: mdl-34074028

ABSTRACT

Obtaining an accurate description of protein structure is a fundamental step toward understanding the underpinnings of biology. Although recent advances in experimental approaches have greatly enhanced our ability to determine protein structures experimentally, the gap between the number of protein sequences and known protein structures is ever increasing. Computational protein structure prediction is one of the ways to fill this gap. Recently, the protein structure prediction field has witnessed many advances due to Deep Learning (DL)-based approaches, as evidenced by the success of AlphaFold2 in the most recent Critical Assessment of protein Structure Prediction (CASP14). In this article, we highlight important milestones and progress in the field of protein structure prediction due to DL-based methods as observed in CASP experiments. We describe advances in various steps of the protein structure prediction pipeline, viz. protein contact map prediction, protein distogram prediction, protein real-valued distance prediction, and Quality Assessment/refinement. We also highlight some end-to-end DL-based approaches for protein structure prediction. Additionally, as there have been some recent DL-based advances in protein structure determination using cryo-electron microscopy (cryo-EM), we also highlight some of the important progress in that field. Finally, we provide an outlook and possible future research directions for DL-based approaches in the protein structure prediction arena.


Subject(s)
Computational Biology/methods , Cryoelectron Microscopy/methods , Deep Learning , Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Databases, Protein , Models, Molecular , Neural Networks, Computer , Protein Conformation , Software
4.
BMC Bioinformatics ; 22(1): 8, 2021 Jan 06.
Article in English | MEDLINE | ID: mdl-33407077

ABSTRACT

BACKGROUND: Protein inter-residue contact and distance prediction are two key intermediate steps essential to accurate protein structure prediction. Distance prediction comes in two forms: real-valued distances and 'binned' distograms, which are a more finely grained variant of the binary contact prediction problem. The latter has been introduced as a new challenge in the 14th Critical Assessment of Techniques for Protein Structure Prediction (CASP14) 2020 experiment. Despite the recent proliferation of methods for predicting distances, few methods exist for evaluating these predictions. Currently only numerical metrics, which evaluate the entire prediction at once, are used. These give no insight into the structural details of a prediction. For this reason, new methods and tools are needed. RESULTS: We have developed a web server for evaluating predicted inter-residue distances. Our server, DISTEVAL, accepts predicted contacts, distances, and a true structure as optional inputs to generate informative heatmaps, chord diagrams, and 3D models. All of these outputs facilitate visual and qualitative assessment. The server also evaluates predictions using other metrics such as mean absolute error, root mean squared error, and contact precision. CONCLUSIONS: The visualizations generated by DISTEVAL complement each other and collectively serve as a powerful tool for both quantitative and qualitative assessments of predicted contacts and distances, even in the absence of a true 3D structure.


Subject(s)
Computational Biology/methods , Internet , Models, Molecular , Proteins , Amino Acids/chemistry , Amino Acids/metabolism , Protein Conformation , Proteins/chemistry , Proteins/metabolism
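
The numerical metrics named in the abstract above (mean absolute error and root mean squared error) can be sketched for a pair of distance maps as follows. This is only an illustration under stated assumptions, not DISTEVAL's implementation: the function name `distance_errors` and the `min_sep` filter are hypothetical, and every residue pair in the upper triangle is weighted equally.

```python
import math


def distance_errors(pred, true, min_sep=1):
    """Return (MAE, RMSE) between two square distance maps.

    pred, true: nested lists of distances in angstroms.
    min_sep: minimum sequence separation |i - j| of the pairs that
             are evaluated (hypothetical parameter for illustration).
    """
    abs_errors = []
    n = len(true)
    for i in range(n):
        for j in range(i + min_sep, n):
            abs_errors.append(abs(pred[i][j] - true[i][j]))
    if not abs_errors:
        raise ValueError("no residue pairs to evaluate")
    mae = sum(abs_errors) / len(abs_errors)
    rmse = math.sqrt(sum(e * e for e in abs_errors) / len(abs_errors))
    return mae, rmse


# Toy 3-residue example with made-up distances.
true_map = [[0.0, 4.0, 8.0], [4.0, 0.0, 4.0], [8.0, 4.0, 0.0]]
pred_map = [[0.0, 5.0, 8.0], [5.0, 0.0, 7.0], [8.0, 7.0, 0.0]]
mae, rmse = distance_errors(pred_map, true_map)
```

Because both values summarize the whole map in a single number, they illustrate the abstract's point that such metrics say little about where in the structure a prediction goes wrong.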
5.
Sci Rep ; 10(1): 13374, 2020 08 07.
Article in English | MEDLINE | ID: mdl-32770096

ABSTRACT

As deep learning algorithms drive the progress in protein structure prediction, a lot remains to be studied at this merging superhighway of deep learning and protein structure prediction. Recent findings show that inter-residue distance prediction, a more granular version of the well-known contact prediction problem, is a key to predicting accurate models. However, deep learning methods that predict these distances are still in the early stages of their development. To advance these methods and develop other novel methods, a need exists for a small and representative dataset packaged for faster development and testing. In this work, we introduce protein distance net (PDNET), a framework that consists of one such representative dataset along with the scripts for training and testing deep learning methods. The framework also includes all the scripts that were used to curate the dataset, and generate the input features and distance maps. Deep learning models can also be trained and tested in a web browser using free platforms such as Google Colab. We discuss how PDNET can be used to predict contacts, distance intervals, and real-valued distances.

6.
AIDS ; 34(5): 737-748, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31895148

ABSTRACT

OBJECTIVE: To develop a predictive model of neurocognitive trajectories in children with perinatal HIV (pHIV). DESIGN: Machine learning analysis of baseline and longitudinal predictors derived from clinical measures utilized in pediatric HIV. METHODS: Two hundred and eighty-five children (ages 2-14 years at baseline; mean age = 6.4 years) with pHIV in Southeast Asia underwent neurocognitive assessment at study enrollment and twice annually thereafter for an average of 5.4 years. Neurocognitive slopes were modeled to establish two subgroups [above-average (n = 145) and below-average (n = 140) trajectories]. Gradient-boosted multivariate regressions (GBM) with five-fold cross validation were conducted to examine baseline (pre-ART) and longitudinal predictive features derived from demographic, HIV disease, immune, mental health, and physical health indices (i.e. complete blood count [CBC]). RESULTS: The baseline GBM established a classifier of neurocognitive group designation with an average AUC of 79% built from HIV disease severity and immune markers. GBM analysis of longitudinal predictors with and without interactions improved the average AUC to 87 and 90%, respectively. Mental health problems and hematocrit levels also emerged as salient features in the longitudinal models, with novel interactions between mental health problems and both CD4 cell count and hematocrit levels. Average AUCs derived from each GBM model were higher than results obtained using logistic regression. CONCLUSION: Our findings support the feasibility of machine learning to identify children with pHIV at risk for suboptimal neurocognitive development. Results also suggest that interactions between HIV disease and mental health problems are early antecedents to neurocognitive difficulties in later childhood among youth with pHIV.


Subject(s)
Cognition/drug effects , HIV Infections/drug therapy , HIV Infections/psychology , Infectious Disease Transmission, Vertical , Machine Learning , Psychomotor Performance/drug effects , Algorithms , CD4 Lymphocyte Count , Child , Child, Preschool , Executive Function/drug effects , Female , HIV Infections/complications , Humans , Male , Mental Health , Parturition , Pregnancy
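
The "average AUC" reported in the abstract above is an area under the ROC curve averaged across cross-validation folds. As a minimal sketch of that metric only (not of the study's gradient-boosted models), the code below computes the AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, then averages it over folds. The function names and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels.
    scores: iterable of classifier scores (higher means more positive).
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(folds):
    """Average the AUC over (labels, scores) pairs, one per held-out fold."""
    aucs = [roc_auc(labels, scores) for labels, scores in folds]
    return sum(aucs) / len(aucs)


# Two made-up held-out folds, for illustration only.
toy_folds = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    ([1, 0], [0.8, 0.2]),
]
average = mean_auc(toy_folds)
```

The pairwise form is quadratic in the number of cases; it is used here for readability rather than speed.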
7.
Bioinformatics ; 36(4): 1091-1098, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31504181

ABSTRACT

MOTIVATION: Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated. RESULTS: We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced the contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. We demonstrated that the CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns comprised of multiple correlated contacts for more accurate contact prediction than the local coevolution-based methods, leading to a substantial increase of precision by 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed that deeper sequence alignments and the integration of domain-based contact prediction with the full-length contact prediction improved the performance of contact prediction. Moreover, we demonstrated that the domain-based contact prediction based on a novel ab initio approach of parsing domains from MSAs alone without using known protein structures was a simple, fast approach to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances in multiple distance intervals could capture more structural information and improve binary contact prediction. AVAILABILITY AND IMPLEMENTATION: https://github.com/multicom-toolbox/DNCON2/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Deep Learning , Algorithms , Proteins , Sequence Alignment
8.
Proteins ; 88(6): 775-787, 2020 06.
Article in English | MEDLINE | ID: mdl-31860156

ABSTRACT

Many proteins are composed of several domains that pack together into a complex tertiary structure. Multidomain proteins can be challenging for protein structure modeling, particularly those for which templates can be found for individual domains but not for the entire sequence. In such cases, homology modeling can generate high-quality models of the domains but not of the orientations between domains. Small-angle X-ray scattering (SAXS) reports the structural properties of entire proteins and has the potential for guiding homology modeling of multidomain proteins. In this article, we describe a novel multidomain protein assembly modeling method, SAXSDom, that integrates experimental knowledge from SAXS with a probabilistic Input-Output Hidden Markov Model to assemble the structures of individual domains together. Four SAXS-based scoring functions were developed and tested, and the method was evaluated on multidomain proteins from two public datasets. Incorporation of SAXS information improved the accuracy of domain assembly for 40 out of 46 Critical Assessment of protein Structure Prediction (CASP) multidomain protein targets and 45 out of 73 multidomain protein targets from the ab initio domain assembly dataset. The results demonstrate that SAXS data can provide useful information to improve the accuracy of domain-domain assembly. The source code and tool packages are available at https://github.com/jianlin-cheng/SAXSDom.


Subject(s)
Bacterial Proteins/chemistry , Caspases/chemistry , Escherichia coli Proteins/chemistry , Membrane Proteins/chemistry , Software , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Binding Sites , Caspases/genetics , Caspases/metabolism , Crystallography, X-Ray , Escherichia coli/chemistry , Escherichia coli Proteins/genetics , Escherichia coli Proteins/metabolism , Humans , Markov Chains , Membrane Proteins/genetics , Membrane Proteins/metabolism , Models, Molecular , Monte Carlo Method , Protein Binding , Protein Conformation, alpha-Helical , Protein Conformation, beta-Strand , Protein Interaction Domains and Motifs , Protein Structure, Tertiary , Rhodobacter capsulatus/chemistry , Scattering, Small Angle , Structural Homology, Protein , Thermodynamics , X-Ray Diffraction
9.
Bioinformatics ; 36(2): 470-477, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31359036

ABSTRACT

MOTIVATION: Exciting new opportunities have arisen to solve the protein contact prediction problem from the progress in neural networks and the availability of a large number of homologous sequences through high-throughput sequencing. In this work, we study how deep convolutional neural networks (ConvNets) may be best designed and developed to solve this long-standing problem. RESULTS: With publicly available datasets, we designed and trained various ConvNet architectures. We tested several recent deep learning techniques including wide residual networks, dropouts and dilated convolutions. We studied the improvements in the precision of medium-range and long-range contacts, and compared the performance of our best architectures with the ones used in existing state-of-the-art methods. The proposed ConvNet architectures predict contacts with significantly more precision than the architectures used in several state-of-the-art methods. When trained using the DeepCov dataset consisting of 3456 proteins and tested on the PSICOV dataset of 150 proteins, our architectures achieve up to 15% higher precision when the top L/2 long-range contacts are evaluated. Similarly, when trained using the DNCON2 dataset consisting of 1426 proteins and tested on 84 protein domains in the CASP12 dataset, our single network achieves 4.8% higher precision than the ensembled DNCON2 method when the top L long-range contacts are evaluated. AVAILABILITY AND IMPLEMENTATION: DEEPCON is available at https://github.com/badriadhikari/DEEPCON/.


Subject(s)
Computational Biology , Neural Networks, Computer , Proteins
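
The "L/2" and "top L" long-range precision figures quoted in the abstract above follow a common evaluation scheme: rank residue pairs by predicted contact probability, keep the top L/k pairs (L being the sequence length), and report the fraction that are true contacts. The sketch below assumes that long-range means a sequence separation of at least 24 residues, which is a widely used convention that the abstract does not state. The function name and parameters are hypothetical and this is not DEEPCON's own evaluation code.

```python
def top_k_long_range_precision(prob, true_contact, divisor=2, min_sep=24):
    """Precision of the top L/divisor long-range predicted contacts.

    prob: square nested list of predicted contact probabilities.
    true_contact: square nested list of booleans (True = contact).
    divisor: 2 evaluates the top L/2 pairs, 1 the top L pairs.
    min_sep: minimum |i - j| for a pair to count as long-range
             (assumed convention, not taken from the abstract).
    """
    length = len(prob)
    k = max(1, length // divisor)
    candidates = [
        (prob[i][j], i, j)
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    if not candidates:
        raise ValueError("sequence too short for long-range pairs")
    # Highest predicted probability first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = candidates[:k]
    hits = sum(1 for _, i, j in top if true_contact[i][j])
    # Divide by the number of pairs actually evaluated.
    return hits / len(top)


# Toy 6-residue example; min_sep is lowered so that pairs exist.
n = 6
toy_prob = [[0.0] * n for _ in range(n)]
toy_true = [[False] * n for _ in range(n)]
toy_prob[0][3], toy_prob[0][5], toy_prob[1][4], toy_prob[2][5] = 0.9, 0.8, 0.7, 0.2
toy_true[0][3] = toy_true[1][4] = toy_true[2][5] = True
precision = top_k_long_range_precision(toy_prob, toy_true, divisor=2, min_sep=3)
```

In the toy example, two of the three top-ranked pairs are true contacts, so the precision is 2/3.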
10.
Virol J ; 16(1): 7, 2019 01 11.
Article in English | MEDLINE | ID: mdl-30634979

ABSTRACT

BACKGROUND: Tospoviruses (genus Tospovirus, family Peribunyaviridae, order Bunyavirales) cause significant losses to a wide range of agronomic and horticultural crops worldwide. Identification and characterization of specific sequences and motifs that are critical for virus infection and pathogenicity could provide useful insights and targets for engineering virus resistance that is potentially both broad spectrum and durable. Tomato spotted wilt virus (TSWV), the most prolific member of the group, was used to better understand the structure-function relationships of the nucleocapsid gene (N), and the silencing suppressor gene (NSs), coded by the TSWV small RNA. METHODS: Using a global collection of orthotospoviral sequences, we identified several amino acids that were conserved across the genus and determined the potential locations of these conserved amino acid motifs in the proteins. We used state-of-the-art 3D modeling algorithms, MULTICOM-CLUSTER, MULTICOM-CONSTRUCT, MULTICOM-NOVEL, I-TASSER, ROSETTA and CONFOLD, to predict the secondary and tertiary structures of the N and the NSs proteins. RESULTS: We identified nine amino acid residues in the N protein among 31 known tospoviral species, and ten amino acid residues in the NSs protein among 27 tospoviral species, that were conserved across the genus. For the N protein, all three algorithms gave nearly identical tertiary models. While the conserved residues were distributed throughout the protein on a linear scale, at the tertiary level, three residues were consistently located in the coil in all the models. For the NSs protein models, there was no agreement among the three algorithms. However, with respect to the localization of the conserved motifs, G18 was consistently located in the coil, while H115 was localized in the coil in three models. CONCLUSIONS: This is the first report of predicting the 3D structure of any tospoviral NSs protein, and it revealed a consistent location for two of the ten conserved residues. The modelers used gave an accurate prediction for the N protein, allowing the localization of the conserved residues. The results form the basis for further work on the structure-function relationships of tospoviral proteins and could be useful in developing novel virus control strategies targeting the conserved residues.


Subject(s)
Molecular Conformation , Nucleocapsid Proteins/chemistry , Nucleoproteins/chemistry , Tospovirus/genetics , Amino Acid Motifs , Amino Acid Sequence , Conserved Sequence , Gene Silencing , Nucleocapsid Proteins/genetics , Nucleoproteins/genetics , RNA, Viral , Tospovirus/chemistry