1.
Epilepsy Behav ; 157: 109835, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38820686

ABSTRACT

INTRODUCTION: Intracerebral hemorrhage represents 15 % of all strokes and it is associated with a high risk of post-stroke epilepsy. However, there are no reliable methods to accurately predict those at higher risk for developing seizures despite their importance in planning treatments, allocating resources, and advancing post-stroke seizure research. Existing risk models have limitations and have not taken advantage of readily available real-world data and artificial intelligence. This study aims to evaluate the performance of Machine-learning-based models to predict post-stroke seizures at 1 year and 5 years after an intracerebral hemorrhage in unselected patients across multiple healthcare organizations. DESIGN/METHODS: We identified patients with intracerebral hemorrhage (ICH) without a prior diagnosis of seizures from 2015 until inception (11/01/22) in the TriNetX Diamond Network, using the International Classification of Diseases, Tenth Revision (ICD-10) I61 (I61.0, I61.1, I61.2, I61.3, I61.4, I61.5, I61.6, I61.8, and I61.9). The outcome of interest was any ICD-10 diagnosis of seizures (G40/G41) at 1 year and 5 years following the first occurrence of the diagnosis of intracerebral hemorrhage. We applied a conventional logistic regression and a Light Gradient Boosted Machine (LGBM) algorithm, and the performance of the model was assessed using the area under the receiver operating characteristics (AUROC), the area under the precision-recall curve (AUPRC), the F1 statistic, model accuracy, balanced-accuracy, precision, and recall, with and without seizure medication use in the models. RESULTS: A total of 85,679 patients had an ICD-10 code of intracerebral hemorrhage and no prior diagnosis of seizures, constituting our study cohort. Seizures were present in 4.57 % and 6.27 % of patients within 1 and 5 years after ICH, respectively. 
At 1 year, the AUROC, AUPRC, F1 statistic, accuracy, balanced-accuracy, precision, and recall were respectively 0.7051 (standard error: 0.0132), 0.1143 (0.0068), 0.1479 (0.0055), 0.6708 (0.0076), 0.6491 (0.0114), 0.0839 (0.0032), and 0.6253 (0.0216). Corresponding metrics at 5 years were 0.694 (0.009), 0.1431 (0.0039), 0.1859 (0.0064), 0.6603 (0.0059), 0.6408 (0.0119), 0.1094 (0.0037) and 0.6186 (0.0264). These values indicate moderate discrimination; the low precision and AUPRC reflect the low prevalence of seizures in the cohort. CONCLUSION: Machine learning models applied to electronic health records can improve the prediction of post-hemorrhagic stroke epilepsy, presenting a real opportunity to incorporate risk assessments into clinical decision-making in post-stroke clinical care and to improve patient selection for post-stroke epilepsy research.
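As a rough consistency check on the metrics listed in this abstract, the F1 statistic is the harmonic mean of precision and recall. The sketch below recomputes it from the quoted precision and recall values. The function name `f1_score` is illustrative, and because the abstract reports means across evaluation runs, the recomputed values should only be expected to agree approximately with the reported ones.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and F1 as quoted in the abstract (means across runs).
reported = {
    "1 year": {"precision": 0.0839, "recall": 0.6253, "f1": 0.1479},
    "5 years": {"precision": 0.1094, "recall": 0.6186, "f1": 0.1859},
}

for horizon, m in reported.items():
    recomputed = f1_score(m["precision"], m["recall"])
    print(f"{horizon}: recomputed F1 = {recomputed:.4f}, reported F1 = {m['f1']:.4f}")
```

With recall above 0.6 and precision near 0.1, the harmonic mean is pulled toward the smaller value, which is why F1 stays low even though most seizure cases are flagged.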

3.
medRxiv ; 2024 Jan 26.
Article in English | MEDLINE | ID: mdl-38343819

ABSTRACT

Objective: To develop an artificial intelligence, machine learning prediction model for estimating the risk of seizures 1 year and 5 years after ischemic stroke (IS) using a large dataset from Electronic Health Records. Background: Seizures are frequent after ischemic strokes and are associated with increased mortality, poor functional outcomes, and lower quality of life. Separating patients at high risk of seizures from those at low risk of seizures is needed for treatment and clinical trial planning, but remains challenging. Machine learning (ML) is a potential approach to address this problem. Design/Methods: We identified patients (aged ≥18 years) with IS without a prior diagnosis of seizures from 2015 until inception (08/09/22) in the TriNetX Research Network, using the International Classification of Diseases, Tenth Revision (ICD-10) I63, excluding I63.6 (venous infarction). The outcome of interest was any ICD-10 diagnosis of seizures (G40/G41) at 1 year and 5 years following the index IS. We applied a conventional logistic regression and a Light Gradient Boosted Machine algorithm to predict the risk of seizures at 1 year and 5 years. The performance of the model was assessed using the area under the receiver operating characteristics (AUROC), the area under the precision-recall curve (AUPRC), F1 statistic, model accuracy, balanced accuracy, precision, and recall, with and without anti-seizure medication use in the models. Results: Our study cohort included 430,254 IS patients. Seizures were present in 18,502 (4.3%) and 5.3% of patients within 1 and 5 years after IS, respectively. At 1 year, the AUROC, AUPRC, F1 statistic, accuracy, balanced-accuracy, precision, and recall were respectively 0.7854 (standard error: 0.0038), 0.2426 (0.0048), 0.2299 (0.0034), 0.8236 (0.001), 0.7226 (0.0049), 0.1415 (0.0021), and 0.6122 (0.0095).
Corresponding metrics at 5 years were 0.7607 (0.0031), 0.247 (0.0064), 0.2441 (0.0032), 0.8125 (0.0013), 0.7001 (0.0045), 0.155 (0.002) and 0.5745 (0.0095). Conclusion: Our findings suggest that ML models show good model performance for predicting seizures after IS.
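The 1-year metrics in this abstract can be related to one another with a few lines of arithmetic. The sketch below derives specificity from balanced accuracy and recall, recomputes precision from recall, specificity, and the 4.3% prevalence, and compares the AUPRC with the prevalence, which is the AUPRC expected from a random ranking. All function names are illustrative, and since the reported figures are means across runs, the agreement is approximate rather than exact.

```python
def specificity_from_balanced_accuracy(balanced_accuracy: float, recall: float) -> float:
    """Balanced accuracy is (recall + specificity) / 2, so solve for specificity."""
    return 2 * balanced_accuracy - recall


def precision_from_rates(recall: float, specificity: float, prevalence: float) -> float:
    """Precision implied by recall, specificity, and outcome prevalence."""
    true_pos = recall * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def auprc_lift(auprc: float, prevalence: float) -> float:
    """Ratio of the AUPRC to the prevalence (the expected AUPRC of a random ranking)."""
    return auprc / prevalence


# 1-year values quoted in the abstract.
prevalence_1y = 0.043
recall_1y = 0.6122
balanced_accuracy_1y = 0.7226
auprc_1y = 0.2426

specificity_1y = specificity_from_balanced_accuracy(balanced_accuracy_1y, recall_1y)
precision_1y = precision_from_rates(recall_1y, specificity_1y, prevalence_1y)
lift_1y = auprc_lift(auprc_1y, prevalence_1y)

print(f"implied specificity: {specificity_1y:.3f}")
print(f"implied precision:   {precision_1y:.4f} (reported 0.1415)")
print(f"AUPRC / prevalence:  {lift_1y:.1f}")
```

Read this way, a precision near 0.14 is what a model with roughly 61% recall and 83% specificity yields when only about 4% of patients have the outcome.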

4.
Sleep Health ; 9(5): 596-610, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37573208

ABSTRACT

GOAL AND AIMS: Commonly used actigraphy algorithms are designed to operate within a known in-bed interval. However, in free-living scenarios this interval is often unknown. We trained and evaluated a sleep/wake classifier that operates on actigraphy over ∼24-hour intervals, without knowledge of in-bed timing. FOCUS TECHNOLOGY: Actigraphy counts from ActiWatch Spectrum devices. REFERENCE TECHNOLOGY: Sleep staging derived from polysomnography, supplemented by observation of wakefulness outside of the staged interval. Classifications from the Oakley actigraphy algorithm were additionally used as performance reference. SAMPLE: Adults, sleeping in either a home or laboratory environment. DESIGN: Machine learning was used to train and evaluate a sleep/wake classifier in a supervised learning paradigm. The classifier is a temporal convolutional network, a form of deep neural network. CORE ANALYTICS: Performance was evaluated across ∼24 hours, and additionally restricted to only in-bed intervals, both in terms of epoch-by-epoch performance, and the discrepancy of summary statistics within the intervals. ADDITIONAL ANALYTICS AND EXPLORATORY ANALYSES: Performance of the trained model applied to the Multi-Ethnic Study of Atherosclerosis dataset. CORE OUTCOMES: Over ∼24 hours, the temporal convolutional network classifier performed as well as or better than the Oakley classifier on all measures tested. When restricting analysis to the in-bed interval, the temporal convolutional network remained favorable on several metrics. IMPORTANT SUPPLEMENTAL OUTCOMES: Performance decreased on the Multi-Ethnic Study of Atherosclerosis dataset, especially when restricting analysis to the in-bed interval. CORE CONCLUSION: A classifier using data labeled over ∼24-hour intervals allows for the continuous classification of sleep/wake without knowledge of in-bed intervals. Further development should focus on improving generalization performance.


Subject(s)
Actigraphy , Atherosclerosis , Adult , Humans , Sleep , Polysomnography , Rest
5.
IEEE J Biomed Health Inform ; 27(7): 3645-3656, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37115836

ABSTRACT

The increasing reliance on online communities for healthcare information by patients and caregivers has led to an increase in the spread of misinformation, or subjective, anecdotal and inaccurate or non-specific recommendations, which, if acted on, could cause serious harm to the patients. Hence, there is an urgent need to connect users with accurate and tailored health information in a timely manner to prevent such harm. This article proposes an innovative approach to suggesting reliable information to participants in online communities as they move through different stages in their disease or treatment. We hypothesize that patients with similar histories of disease progression or course of treatment would have similar information needs at comparable stages. Specifically, we pose the problem of predicting topic tags or keywords that describe the future information needs of users based on their profiles, traces of their online interactions within the community (past posts, replies) and the profiles and traces of online interactions of other users with similar profiles and similar traces of past interaction with the target users. The result is a variant of the collaborative information filtering or recommendation system tailored to the needs of users of online health communities. We report results of our experiments on two unique datasets from two different social media platforms, which demonstrate the superiority of the proposed approach over the state-of-the-art baselines with respect to accurate and timely prediction of topic tags (and hence information sources of interest).


Subject(s)
Consumer Health Information , Social Media , Humans
6.
Biomolecules ; 13(1), 2023 01 06.
Article in English | MEDLINE | ID: mdl-36671507

ABSTRACT

Protein-protein interactions play a ubiquitous role in biological function. Knowledge of the three-dimensional (3D) structures of the complexes they form is essential for understanding the structural basis of those interactions and how they orchestrate key cellular processes. Computational docking has become an indispensable alternative to the expensive and time-consuming experimental approaches for determining the 3D structures of protein complexes. Despite recent progress, identifying near-native models from a large set of conformations sampled by docking (the so-called scoring problem) still has considerable room for improvement. We present MetaScore, a new machine-learning-based approach to improve the scoring of docked conformations. MetaScore utilizes a random forest (RF) classifier trained to distinguish near-native from non-native conformations using their protein-protein interfacial features. The features include physicochemical properties, energy terms, interaction-propensity-based features, geometric properties, interface topology features, evolutionary conservation, and also scores produced by traditional scoring functions (SFs). MetaScore scores docked conformations by simply averaging the score produced by the RF classifier with that produced by any traditional SF. We demonstrate that (i) MetaScore consistently outperforms each of the nine traditional SFs included in this work in terms of success rate and hit rate evaluated over conformations ranked among the top 10; (ii) an ensemble method, MetaScore-Ensemble, that combines 10 variants of MetaScore obtained by combining the RF score with each of the traditional SFs outperforms each of the MetaScore variants. We conclude that the performance of traditional SFs can be improved upon by using machine learning to judiciously leverage protein-protein interfacial features and by using ensemble methods to combine multiple scoring functions.
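The score-averaging idea described in this abstract can be illustrated with a small sketch. The abstract states only that the RF score is averaged with a traditional SF score; the min-max normalization, the "higher is better" convention for both scores, the toy values, and all function names below are assumptions made for illustration and are not taken from MetaScore itself.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so two scoring schemes are comparable (assumed step)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {model: 0.5 for model in scores}
    return {model: (value - lo) / (hi - lo) for model, value in scores.items()}


def averaged_score(rf: Dict[str, float], sf: Dict[str, float]) -> Dict[str, float]:
    """Average the normalized classifier score with the normalized SF score."""
    rf_n, sf_n = min_max_normalize(rf), min_max_normalize(sf)
    return {model: (rf_n[model] + sf_n[model]) / 2 for model in rf}


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model identifiers ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_n_success(ranking: List[str], near_native: set, n: int = 10) -> bool:
    """A case counts as a success if any near-native model is ranked in the top n."""
    return any(model in near_native for model in ranking[:n])


# Hypothetical scores for four docked models (higher is better in both schemes).
rf_scores = {"m1": 0.9, "m2": 0.2, "m3": 0.6, "m4": 0.1}
sf_scores = {"m1": 0.3, "m2": 0.8, "m3": 0.7, "m4": 0.1}

ranking = rank_models(averaged_score(rf_scores, sf_scores))
print(ranking)
print(top_n_success(ranking, near_native={"m3"}, n=1))
```

In this toy case the model that both schemes rate reasonably well (`m3`) overtakes the models favored by only one of them, which is the intuition behind combining complementary scores.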


Subject(s)
Machine Learning , Proteins , Proteins/chemistry , Protein Binding , Ligands , Protein Conformation
7.
Health Place ; 77: 102891, 2022 09.
Article in English | MEDLINE | ID: mdl-35970068

ABSTRACT

Biweekly county COVID-19 data were linked with Longitudinal Employer-Household Dynamics data to analyze population risk exposures enabled by pre-pandemic, country-wide commuter networks. Results from fixed-effects, spatial, and computational statistical approaches showed that commuting network exposure to COVID-19 predicted an area's COVID-19 cases and deaths, indicating spillovers. Commuting spillovers between counties were independent from geographic contiguity, pandemic-time mobility, or social media ties. Results suggest that commuting connections form enduring social linkages with effects on health that can withstand mobility disruptions. Findings contribute to a growing relational view of health and place, with implications for neighborhood effects research and place-based policies.


Subject(s)
COVID-19 , Social Media , COVID-19/epidemiology , Humans , Pandemics , Residence Characteristics , Transportation
8.
Netw Neurosci ; 6(1): 29-48, 2022 Feb.
Article in English | MEDLINE | ID: mdl-35350584

ABSTRACT

In this critical review, we examine the application of predictive models, for example, classifiers, trained using machine learning (ML) to assist in interpretation of functional neuroimaging data. Our primary goal is to summarize how ML is being applied and critically assess common practices. Our review covers 250 studies published using ML and resting-state functional MRI (fMRI) to infer various dimensions of the human functional connectome. Holdout ("lockbox") performance was, on average, ∼13% less accurate than performance measured through cross-validation alone, highlighting the importance of lockbox data, which was included in only 16% of the studies. There was also a concerning lack of transparency across the key steps in training and evaluating predictive models. The summary of this literature underscores the importance of the use of a lockbox and highlights several methodological pitfalls that can be addressed by the imaging community. We argue that, ideally, studies are motivated both by the reproducibility and generalizability of findings as well as the potential clinical significance of the insights. We offer recommendations for principled integration of machine learning into the clinical neurosciences with the goal of advancing imaging biomarkers of brain disorders, understanding causative determinants for health risks, and parsing heterogeneous patient outcomes.
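The lockbox practice discussed in this review can be sketched as a data split in which a held-out subset is set aside before any cross-validation takes place. The following is a minimal illustration only; the 20% lockbox fraction, the five folds, and the function name `lockbox_split` are assumptions, not recommendations drawn from the review.

```python
import random
from typing import List, Tuple


def lockbox_split(
    n_samples: int,
    lockbox_fraction: float = 0.2,
    n_folds: int = 5,
    seed: int = 0,
) -> Tuple[List[int], List[List[int]]]:
    """Set aside a lockbox, then split the remaining indices into CV folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_lockbox = int(round(n_samples * lockbox_fraction))
    lockbox = indices[:n_lockbox]          # untouched until the final evaluation
    development = indices[n_lockbox:]      # used for model selection and tuning

    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [development[i::n_folds] for i in range(n_folds)]
    return lockbox, folds


lockbox, folds = lockbox_split(n_samples=100)
print(len(lockbox), [len(fold) for fold in folds])
```

The key design point is ordering: the lockbox indices are fixed before any tuning happens, so the final estimate cannot be influenced by choices made during cross-validation.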

9.
SoftwareX ; 11, 2020.
Article in English | MEDLINE | ID: mdl-35419466

ABSTRACT

Computational docking is a promising tool to model three-dimensional (3D) structures of protein-protein complexes, which provides fundamental insights into protein functions in cellular life. Singling out near-native models from the huge pool of generated docking models (referred to as the scoring problem) remains a major challenge in computational docking. We recently published iScore, a novel graph-kernel-based scoring function. iScore ranks docking models based on their interface graph similarities to the training interface graph set. iScore uses a support vector machine approach with random-walk graph kernels to classify and rank protein-protein interfaces. Here, we present the software for iScore. The software provides executable scripts that fully automate the computational workflow. In addition, the creation and analysis of the interface graph can be distributed across different processes using Message Passing Interface (MPI) and can be offloaded to GPUs thanks to dedicated CUDA kernels.

10.
Bioinformatics ; 36(1): 112-121, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31199455

ABSTRACT

MOTIVATION: Protein complexes play critical roles in many aspects of biological functions. Three-dimensional (3D) structures of protein complexes are critical for gaining insights into structural bases of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determinations of 3D protein complex structures, computational docking has evolved as a valuable tool to predict 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge. RESULTS: Here we present iScore, a novel approach to scoring docked conformations that combines HADDOCK energy terms with a score obtained using a graph representation of the protein-protein interfaces and a measure of evolutionary conservation. It achieves a scoring performance competitive with, or superior to, that of state-of-the-art scoring functions on two independent datasets: (i) docking software-specific models and (ii) the CAPRI score set generated by a wide variety of docking approaches (i.e. docking software-non-specific). iScore ranks among the top scoring approaches on the CAPRI score set (13 targets) when compared with the 37 scoring groups in CAPRI. The results demonstrate the utility of combining evolutionary, topological and energetic information for scoring docked conformations. This work represents the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes. AVAILABILITY AND IMPLEMENTATION: The iScore code is freely available from GitHub: https://github.com/DeepRank/iScore (DOI: 10.5281/zenodo.2630567), and the docking models used are available from SBGrid: https://data.sbgrid.org/dataset/684.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology , Molecular Docking Simulation , Proteins , Computational Biology/methods , Molecular Docking Simulation/methods , Protein Binding , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Software
11.
Nat Sci Sleep ; 11: 387-399, 2019.
Article in English | MEDLINE | ID: mdl-31849551

ABSTRACT

BACKGROUND: The current gold standard for measuring sleep is polysomnography (PSG), but it can be obtrusive and costly. Actigraphy is a relatively low-cost and unobtrusive alternative to PSG. Of particular interest in measuring sleep from actigraphy is prediction of sleep-wake states. Current literature on prediction of sleep-wake states from actigraphy consists of methods that use population data, which we call generalized models. However, accounting for variability of sleep patterns across individuals calls for personalized models of sleep-wake states prediction that could be potentially better suited to individual-level data and yield more accurate estimation of sleep. PURPOSE: To investigate the validity of developing personalized machine learning models, trained and tested on individual-level actigraphy data, for improved prediction of sleep-wake states and reliable estimation of nightly sleep parameters. PARTICIPANTS AND METHODS: We used a dataset including 54 participants and systematically trained and tested 5 different personalized machine learning models as well as their generalized counterparts. We evaluated model performance compared to concurrent PSG through extensive machine learning experiments and statistical analyses. RESULTS: Our experiments show the superiority of personalized models over their generalized counterparts in estimating PSG-derived sleep parameters. Personalized models of regularized logistic regression, random forest, adaptive boosting, and extreme gradient boosting achieve estimates of total sleep time, wake after sleep onset, sleep efficiency, and number of awakenings that are closer to those obtained by PSG, in absolute difference, than the same estimates from their generalized counterparts. We further show that the difference between estimates of sleep parameters obtained by personalized models and those of PSG is statistically non-significant. 
CONCLUSION: Personalized machine learning models of sleep-wake states outperform their generalized counterparts in terms of estimating sleep parameters and are indistinguishable from PSG labeled sleep-wake states. Personalized machine learning models can be used in actigraphy studies of sleep health and potentially screening for some sleep disorders.
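The nightly sleep parameters named in this abstract can be derived from an epoch-by-epoch sleep/wake sequence. The sketch below uses common working definitions that are assumptions here, since the abstract does not define them: 30-second epochs, wake after sleep onset counted from the first sleep epoch to the end of the recording, sleep efficiency as sleep epochs over all recorded epochs, and an awakening as any sleep-to-wake transition. The label sequences are invented for illustration.

```python
from typing import Dict, Sequence


def sleep_parameters(epochs: Sequence[int], epoch_seconds: float = 30.0) -> Dict[str, float]:
    """Summarize a sleep/wake sequence (1 = sleep, 0 = wake) into nightly parameters."""
    minutes_per_epoch = epoch_seconds / 60.0
    n_sleep = sum(1 for e in epochs if e == 1)

    # Wake epochs that occur after the first sleep epoch.
    if n_sleep > 0:
        onset = list(epochs).index(1)
        n_wake_after_onset = sum(1 for e in epochs[onset:] if e == 0)
    else:
        n_wake_after_onset = 0

    # Each sleep-to-wake transition counts as one awakening.
    awakenings = sum(1 for prev, cur in zip(epochs, epochs[1:]) if prev == 1 and cur == 0)

    return {
        "total_sleep_time_min": n_sleep * minutes_per_epoch,
        "wake_after_sleep_onset_min": n_wake_after_onset * minutes_per_epoch,
        "sleep_efficiency": n_sleep / len(epochs) if len(epochs) else 0.0,
        "number_of_awakenings": float(awakenings),
    }


# Hypothetical reference (PSG) labels and model predictions for the same epochs.
psg_labels = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1]
predicted = [0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

psg_params = sleep_parameters(psg_labels)
model_params = sleep_parameters(predicted)

# Absolute difference per parameter, the comparison used in the abstract.
abs_diff = {name: abs(model_params[name] - psg_params[name]) for name in psg_params}
print(psg_params)
print(abs_diff)
```

Comparing absolute differences per parameter, rather than epoch-level accuracy alone, reflects how the abstract judges closeness to PSG.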

12.
PLoS One ; 14(11): e0225382, 2019.
Article in English | MEDLINE | ID: mdl-31756219

ABSTRACT

Reliable identification of inflammatory biomarkers from metagenomics data is a promising direction for developing non-invasive, cost-effective, and rapid clinical tests for early diagnosis of inflammatory bowel disease (IBD). We present an integrative approach to Network-Based Biomarker Discovery (NBBD) which integrates network analysis methods for prioritizing potential biomarkers and machine learning techniques for assessing the discriminative power of the prioritized biomarkers. Using a large dataset of new-onset pediatric IBD metagenomics biopsy samples, we compare the performance of Random Forest (RF) classifiers trained on features selected using a representative set of traditional feature selection methods against the NBBD framework, configured using five different tools for inferring networks from metagenomics data and nine different methods for prioritizing biomarkers, as well as a hybrid approach combining the best traditional and NBBD-based feature selection. We also examine how the performance of the predictive models for IBD diagnosis varies as a function of the size of the data used for biomarker identification. Our results show that (i) NBBD is competitive with some of the state-of-the-art feature selection methods including Random Forest Feature Importance (RFFI) scores; and (ii) NBBD is especially effective in reliably identifying IBD biomarkers when the number of data samples available for biomarker discovery is small.


Subject(s)
Biomarkers/analysis , Inflammatory Bowel Diseases/microbiology , Metagenomics/methods , Algorithms , Humans , Inflammatory Bowel Diseases/metabolism , Machine Learning , Models, Theoretical
14.
Proteins ; 87(3): 198-211, 2019 03.
Article in English | MEDLINE | ID: mdl-30536635

ABSTRACT

RNA-protein interactions play essential roles in regulating gene expression. While some RNA-protein interactions are "specific", that is, the RNA-binding proteins preferentially bind to particular RNA sequence or structural motifs, others are "non-RNA specific." Deciphering the protein-RNA recognition code is essential for comprehending the functional implications of these interactions and for developing new therapies for many diseases. Because of the high cost of experimental determination of protein-RNA interfaces, there is a need for computational methods to identify RNA-binding residues in proteins. While most of the existing computational methods for predicting RNA-binding residues in RNA-binding proteins are oblivious to the characteristics of the partner RNA, there is growing interest in methods for partner-specific prediction of RNA binding sites in proteins. In this work, we assess the performance of two recently published partner-specific protein-RNA interface prediction tools, PS-PRIP, and PRIdictor, along with our own new tools. Specifically, we introduce a novel metric, RNA-specificity metric (RSM), for quantifying the RNA-specificity of the RNA binding residues predicted by such tools. Our results show that the RNA-binding residues predicted by previously published methods are oblivious to the characteristics of the putative RNA binding partner. Moreover, when evaluated using partner-agnostic metrics, RNA partner-specific methods are outperformed by the state-of-the-art partner-agnostic methods. We conjecture that either (a) the protein-RNA complexes in PDB are not representative of the protein-RNA interactions in nature, or (b) the current methods for partner-specific prediction of RNA-binding residues in proteins fail to account for the differences in RNA partner-specific versus partner-agnostic protein-RNA interactions, or both.


Subject(s)
Computational Biology , Proteins/chemistry , RNA-Binding Proteins/genetics , RNA/genetics , Amino Acid Sequence/genetics , Base Sequence/genetics , Binding Sites/genetics , Models, Molecular , Protein Binding/genetics , Protein Conformation , Proteins/genetics , RNA/chemistry , RNA-Binding Motifs/genetics , RNA-Binding Proteins/chemistry , Sequence Analysis, Protein , Software
15.
BMC Med Genomics ; 11(Suppl 3): 71, 2018 Sep 14.
Article in English | MEDLINE | ID: mdl-30255801

ABSTRACT

BACKGROUND: Large-scale collaborative precision medicine initiatives (e.g., The Cancer Genome Atlas (TCGA)) are yielding rich multi-omics data. Integrative analyses of the resulting multi-omics data, such as somatic mutation, copy number alteration (CNA), DNA methylation, miRNA, gene expression, and protein expression, offer tantalizing possibilities for realizing the promise and potential of precision medicine in cancer prevention, diagnosis, and treatment by substantially improving our understanding of underlying mechanisms as well as the discovery of novel biomarkers for different types of cancers. However, such analyses present a number of challenges, including heterogeneity, and high-dimensionality of omics data. METHODS: We propose a novel framework for multi-omics data integration using multi-view feature selection. We introduce a novel multi-view feature selection algorithm, MRMR-mv, an adaptation of the well-known Min-Redundancy and Maximum-Relevance (MRMR) single-view feature selection algorithm to the multi-view setting. RESULTS: We report results of experiments using an ovarian cancer multi-omics dataset derived from the TCGA database on the task of predicting ovarian cancer survival. Our results suggest that multi-view models outperform both view-specific models (i.e., models trained and tested using a single type of omics data) and models based on two baseline data fusion methods. CONCLUSIONS: Our results demonstrate the potential of multi-view feature selection in integrative analyses and predictive modeling from multi-omics data.
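For readers unfamiliar with the single-view MRMR algorithm that this abstract adapts, the sketch below shows a greedy selection that trades relevance to the target against redundancy with already selected features. It is not MRMR-mv and not the authors' code: absolute Pearson correlation stands in for the mutual-information measure usually used with MRMR, and the feature names and toy data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def mrmr_select(features: Dict[str, Sequence[float]], target: Sequence[float], k: int) -> List[str]:
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected: List[str] = []
    candidates = sorted(features)  # sorted for deterministic tie-breaking

    while candidates and len(selected) < k:
        best_name, best_score = None, -math.inf
        for name in candidates:
            if selected:
                redundancy = sum(
                    abs(pearson(features[name], features[s])) for s in selected
                ) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        candidates.remove(best_name)
    return selected


# Toy data: f_b nearly duplicates f_a, while f_c carries independent signal.
features = {
    "f_a": [1, 1, 1, 1, -1, -1, -1, -1],
    "f_b": [1, 1, 1, 1, -1, -1, -1, 0],
    "f_c": [1, 1, -1, -1, 1, 1, -1, -1],
}
target = [3, 3, 1, 1, -1, -1, -3, -3]  # equals 2 * f_a + f_c

print(mrmr_select(features, target, k=2))
```

Ranking by relevance alone would pick `f_a` and `f_b` here; the redundancy penalty is what lets the less correlated but complementary `f_c` be chosen second.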


Subject(s)
Algorithms , Biomarkers, Tumor/genetics , Computational Biology/methods , DNA Copy Number Variations , DNA Methylation , Ovarian Neoplasms/mortality , Transcriptome , Female , Gene Expression Profiling , High-Throughput Nucleotide Sequencing/methods , Humans , Ovarian Neoplasms/genetics , Prognosis , Survival Rate
16.
Brief Bioinform ; 18(3): 458-466, 2017 05 01.
Article in English | MEDLINE | ID: mdl-27013645

ABSTRACT

Although many advanced and sophisticated ab initio approaches for modeling protein-protein complexes have been proposed in past decades, template-based modeling (TBM) remains the most accurate and widely used approach, given a reliable template is available. However, there are many different ways to exploit template information in the modeling process. Here, we systematically evaluate and benchmark a TBM method that uses conserved interfacial residue pairs as docking distance restraints [referred to as alpha carbon-alpha carbon (CA-CA)-guided docking]. We compare it with two other template-based protein-protein modeling approaches, including a conserved non-pairwise interfacial residue restrained docking approach [referred to as the ambiguous interaction restraint (AIR)-guided docking] and a simple superposition-based modeling approach. Our results show that, for most cases, the CA-CA-guided docking method outperforms both superposition with refinement and the AIR-guided docking method. We emphasize the superiority of the CA-CA-guided docking on cases with medium to large conformational changes, and interactions mediated through loops, tails or disordered regions. Our results also underscore the importance of a proper refinement of superimposition models to reduce steric clashes. In summary, we provide a benchmarked TBM protocol that uses conserved pairwise interface distance as restraints in generating realistic 3D protein-protein interaction models, when reliable templates are available. The described CA-CA-guided docking protocol is based on the HADDOCK platform, which allows users to incorporate additional prior knowledge of the target system to further improve the quality of the resulting models.


Subject(s)
Proteins/metabolism , Models, Molecular , Protein Binding
17.
Methods Mol Biol ; 1484: 205-235, 2017.
Article in English | MEDLINE | ID: mdl-27787829

ABSTRACT

Identifying individual residues in the interfaces of protein-RNA complexes is important for understanding the molecular determinants of protein-RNA recognition and has many potential applications. Recent technical advances have led to several high-throughput experimental methods for identifying partners in protein-RNA complexes, but determining RNA-binding residues in proteins is still expensive and time-consuming. This chapter focuses on available computational methods for identifying which amino acids in an RNA-binding protein participate directly in contacting RNA. Step-by-step protocols for using three different web-based servers to predict RNA-binding residues are described. In addition, currently available web servers and software tools for predicting RNA-binding sites, as well as databases that contain valuable information about known protein-RNA complexes, RNA-binding motifs in proteins, and protein-binding recognition sites in RNA are provided. We emphasize sequence-based methods that can reliably identify interfacial residues without the requirement for structural information regarding either the RNA-binding protein or its RNA partner.


Subject(s)
Proteins/genetics , RNA-Binding Proteins/genetics , Software , Algorithms , Amino Acid Sequence/genetics , Binding Sites , Computational Biology , Protein Binding , Proteins/chemistry , RNA-Binding Proteins/chemistry
18.
Methods Mol Biol ; 1484: 255-264, 2017.
Article in English | MEDLINE | ID: mdl-27787831

ABSTRACT

Antibody-protein interactions play a critical role in the humoral immune response. B-cells secrete antibodies, which bind antigens (e.g., cell surface proteins of pathogens). The specific parts of antigens that are recognized by antibodies are called B-cell epitopes. These epitopes can be linear, corresponding to a contiguous amino acid sequence fragment of an antigen, or conformational, in which residues critical for recognition may not be contiguous in the primary sequence, but are in close proximity within the folded protein 3D structure. Identification of B-cell epitopes in target antigens is one of the key steps in epitope-driven subunit vaccine design, immunodiagnostic tests, and antibody production. In silico bioinformatics techniques offer a promising and cost-effective approach for identifying potential B-cell epitopes in a target vaccine candidate. In this chapter, we show how to utilize online B-cell epitope prediction tools to identify linear B-cell epitopes from the primary amino acid sequence of proteins.


Subject(s)
Computational Biology/methods , Epitope Mapping/methods , Proteins/genetics , Amino Acid Sequence/genetics , Antibodies/genetics , Antibodies/immunology , Antigens/genetics , Antigens/immunology , B-Lymphocytes/immunology , Computer Simulation , Epitopes/genetics , Epitopes/immunology , Proteins/chemistry , Proteins/immunology
19.
Proteomics ; 16(23): 2967-2976, 2016 12.
Article in English | MEDLINE | ID: mdl-27714937

ABSTRACT

Accurate and comprehensive identification of surface-exposed proteins (SEPs) in parasites is a key step in developing novel subunit vaccines. However, the reliability of MS-based high-throughput methods for proteome-wide mapping of SEPs continues to be limited due to high rates of false positives (i.e., proteins mistakenly identified as surface exposed) as well as false negatives (i.e., SEPs not detected due to low expression or other technical limitations). We propose a framework called PlasmoSEP for the reliable identification of SEPs using a novel semisupervised learning algorithm that combines SEPs identified by high-throughput experiments and expert annotation of high-throughput data to augment labeled data for training a predictive model. Our experiments using high-throughput data from the Plasmodium falciparum surface-exposed proteome provide several novel high-confidence predictions of SEPs in P. falciparum and also confirm expert annotations for several others. Furthermore, PlasmoSEP predicts that 25 of 37 experimentally identified SEPs in Plasmodium yoelii salivary gland sporozoites are likely to be SEPs. Finally, PlasmoSEP predicts several novel SEPs in P. yoelii and Plasmodium vivax malaria parasites that can be validated for further vaccine studies. Our computational framework can be easily adapted to improve the interpretation of data from high-throughput studies.


Subject(s)
Algorithms , Membrane Proteins/analysis , Plasmodium falciparum/chemistry , Proteomics/methods , Protozoan Proteins/analysis , High-Throughput Screening Assays/methods , Humans , Membrane Proteins/metabolism , Models, Theoretical , Plasmodium vivax/metabolism , Plasmodium vivax/pathogenicity , Plasmodium yoelii/chemistry , Protozoan Proteins/metabolism , Salivary Glands/metabolism
20.
PLoS One ; 11(7): e0158445, 2016.
Article in English | MEDLINE | ID: mdl-27383535

ABSTRACT

A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational effort needed for generating PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile and the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data as well as the accuracy of the machine learning classifier trained using these profiles). Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission.
Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces.
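Drawing a uniform random subset from a very large sequence collection, as this abstract describes for UniRef100, can be done in a single pass without loading every record into memory. The sketch below uses reservoir sampling over FASTA records; it is one possible way to obtain such a sample and is not claimed to be the procedure the authors used. The function names and the toy FASTA input are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Return k records chosen uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# Toy input: 200 short records; keep 2 of them, i.e. a 1% sample.
toy_lines = []
for n in range(200):
    toy_lines.append(f">seq{n}")
    toy_lines.append("MKT" + "A" * (n % 5))

sample = reservoir_sample(read_fasta(toy_lines), k=2)
print([header for header, _ in sample])
```

Reservoir sampling needs the target sample size rather than a percentage, so a 1% sample requires knowing or estimating the total record count beforehand; sampling each record independently with probability 0.01 is a simpler alternative when an exact size is not required.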


Subject(s)
Computational Biology , Proteins/chemistry , RNA/chemistry , Software , Algorithms , Artificial Intelligence , Computers , Databases, Protein , Models, Molecular , Position-Specific Scoring Matrices , Predictive Value of Tests , Protein Conformation , Protein Interaction Mapping , Proteins/metabolism , RNA/metabolism , Sequence Analysis, Protein