2.
Sci Rep ; 14(1): 8252, 2024 04 08.
Article in English | MEDLINE | ID: mdl-38589418

ABSTRACT

Even though in silico ligand-based drug discovery methods have been successful in predicting interactions with known target proteins, they struggle with new, unassessed targets. To address this challenge, we propose an approach that integrates structural data from AlphaFold 2 predicted protein structures into machine learning models. Our method extracts 3D structural protein fingerprints and combines them with ligand structural data to train a single machine learning model. This model captures the relationship between ligand properties and the unique structural features of various target proteins, enabling predictions for never-before-tested molecules and protein targets. To assess our model, we used a dataset of 144 human G-protein-coupled receptors (GPCRs) with over 140,000 measured inhibition constant (Ki) values. Results strongly suggest that our approach performs as well as state-of-the-art ligand-based methods. In a second modeling approach that used 129 targets for training and a separate test set of 15 different protein targets, our model correctly predicted interactions for 73% of targets, with explained variances exceeding 0.50 in 22% of cases. Our findings further verified that using experimentally determined protein structures produced models that were statistically indistinguishable from those built on the AlphaFold predicted structures. This study presents a proteo-chemometric drug screening approach that uses a simple and scalable method for extracting protein structural information for use in machine learning models capable of predicting protein-molecule interactions even for orphan targets.
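
The data layout implied by this abstract can be sketched as follows: each training row concatenates a protein fingerprint with a ligand descriptor vector, and evaluation holds out whole targets. This is only an illustrative sketch; the target names, ligand names, descriptor values, pKi numbers, and the helper functions `build_row` and `split_by_target` are all hypothetical and do not come from the paper.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy descriptors (illustrative values, not real fingerprints).
protein_fp: Dict[str, List[float]] = {
    "GPCR_A": [0.1, 0.7, 0.3],
    "GPCR_B": [0.9, 0.2, 0.5],
    "GPCR_C": [0.4, 0.4, 0.8],
}
ligand_fp: Dict[str, List[float]] = {
    "lig1": [1.0, 0.0],
    "lig2": [0.0, 1.0],
}

# Hypothetical (target, ligand, pKi) measurements.
Measurement = Tuple[str, str, float]
measurements: List[Measurement] = [
    ("GPCR_A", "lig1", 7.2),
    ("GPCR_A", "lig2", 5.1),
    ("GPCR_B", "lig1", 6.0),
    ("GPCR_C", "lig2", 8.3),
]


def build_row(target: str, ligand: str) -> List[float]:
    """Concatenate protein and ligand descriptors into one feature vector."""
    return protein_fp[target] + ligand_fp[ligand]


def split_by_target(rows: List[Measurement], held_out: Set[str]):
    """Keep every measurement of a held-out target out of the training set."""
    train = [r for r in rows if r[0] not in held_out]
    test = [r for r in rows if r[0] in held_out]
    return train, test


train, test = split_by_target(measurements, {"GPCR_C"})
X_train = [build_row(t, l) for t, l, _ in train]
y_train = [y for _, _, y in train]
X_test = [build_row(t, l) for t, l, _ in test]
y_test = [y for _, _, y in test]
```

Splitting by target rather than by row is what makes the test resemble the orphan-target setting: no measurement of a held-out protein is ever seen during training.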


Subject(s)
Machine Learning , Receptors, G-Protein-Coupled , Humans , Ligands , Receptors, G-Protein-Coupled/chemistry
3.
Cancers (Basel) ; 14(14)2022 Jul 19.
Article in English | MEDLINE | ID: mdl-35884571

ABSTRACT

The epidermal growth factor receptor (EGFR) is upregulated in glioblastoma, becoming an attractive therapeutic target. However, activation of compensatory pathways generates inputs to downstream PI3Kp110β signaling, leading to anti-EGFR therapeutic resistance. Moreover, the blood-brain barrier (BBB) limits drugs' brain penetration. We aimed to discover EGFR/PI3Kp110β pathway inhibitors for a multi-targeting approach, with favorable ADMET and BBB-permeant properties. We used quantitative structure-activity relationship models and structure-based virtual screening, and assessed ADMET properties, to identify BBB-permeant drug candidates. Predictions were validated in in vitro models of the human BBB and BBB-glioma co-cultures. The results disclosed 27 molecules (18 EGFR, 6 PI3Kp110β, and 3 dual inhibitors) for biological validation, performed in two glioblastoma cell lines (U87MG and U87MG overexpressing EGFR). Six molecules (two EGFR, two PI3Kp110β, and two dual inhibitors) decreased cell viability by 40-99%, with the greatest effect observed for the dual inhibitors. The glioma cytotoxicity was confirmed by analysis of targets' downregulation and increased apoptosis (15-85%). Safety to BBB endothelial cells was confirmed for three of those molecules (one EGFR and two PI3Kp110β inhibitors). These molecules crossed the endothelial monolayer in the BBB in vitro model and in the BBB-glioblastoma co-culture system. These results revealed novel drug candidates for glioblastoma treatment.

4.
Sci Rep ; 11(1): 22223, 2021 11 15.
Article in English | MEDLINE | ID: mdl-34782688

ABSTRACT

Cystic fibrosis (CF) is a life-threatening autosomal recessive disease caused by more than 2100 mutations in the CF transmembrane conductance regulator (CFTR) gene, generating variability in disease severity among individuals with CF sharing the same CFTR genotype. Systems biology can assist in the collection and visualization of CF data to extract additional biological significance and find novel therapeutic targets. Here, we present the CyFi-MAP, a disease map repository of CFTR molecular mechanisms and pathways involved in CF. Specifically, we represented the wild-type (wt-CFTR) and the F508del associated processes (F508del-CFTR) in separate submaps, with pathways related to protein biosynthesis, endoplasmic reticulum retention, export, activation/inactivation of channel function, and recycling/degradation after endocytosis. CyFi-MAP is an open-access resource with specific, curated and continuously updated information on CFTR-related pathways available online at https://cysticfibrosismap.github.io/ . This tool was developed as a reference CF pathway data repository to be continuously updated and used worldwide in CF research.


Subject(s)
Biomarkers , Cystic Fibrosis/etiology , Cystic Fibrosis/metabolism , Databases, Genetic , Disease Susceptibility , Signal Transduction , Cystic Fibrosis Transmembrane Conductance Regulator/genetics , Cystic Fibrosis Transmembrane Conductance Regulator/metabolism , Humans , Software , Web Browser
5.
Molecules ; 24(9)2019 Apr 30.
Article in English | MEDLINE | ID: mdl-31052325

ABSTRACT

The performance of quantitative structure-activity relationship (QSAR) models largely depends on the relevance of the selected molecular representation used as input data matrices. This work presents a thorough comparative analysis of two main categories of molecular representations (vector space and metric space) for fitting robust machine learning models in QSAR problems. For the assessment of these methods, seven different molecular representations that included RDKit descriptors, five different fingerprints types (MACCS, PubChem, FP2-based, Atom Pair, and ECFP4), and a graph matching approach (non-contiguous atom matching structure similarity; NAMS) in both vector space and metric space, were subjected to state-of-art machine learning methods that included different dimensionality reduction methods (feature selection and linear dimensionality reduction). Five distinct QSAR data sets were used for direct assessment and analysis. Results show that, in general, metric-space and vector-space representations are able to produce equivalent models, but there are significant differences between individual approaches. The NAMS-based similarity approach consistently outperformed most fingerprint representations in model quality, closely followed by Atom Pair fingerprints. To further verify these findings, the metric space-based models were fitted to the same data sets with the closest neighbors removed. These latter results further strengthened the above conclusions. The metric space graph-based approach appeared significantly superior to the other representations, albeit at a significant computational cost.
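
The distinction between the two categories of representation can be illustrated with a minimal sketch: the same fingerprints are used once as fixed-length bit vectors (vector space) and once only through pairwise Tanimoto similarities (metric space). The molecule names and bit positions are hypothetical toy values, and Tanimoto is used here as one common similarity choice, not necessarily the one used for every representation in the study.

```python
from typing import Dict, List, Set

# Hypothetical fingerprints, stored as the set of "on" bit positions.
fingerprints: Dict[str, Set[int]] = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 4, 8},
    "mol_c": {2, 3, 5},
}
N_BITS = 10  # assumed fingerprint length for this toy example


def to_bit_vector(bits: Set[int], n_bits: int) -> List[int]:
    """Vector-space view: one column per fingerprint bit."""
    return [1 if i in bits else 0 for i in range(n_bits)]


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by the union of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


names = sorted(fingerprints)

# Vector space: each molecule is a row of descriptor values.
vector_space = [to_bit_vector(fingerprints[n], N_BITS) for n in names]

# Metric space: each molecule is described only by its similarity to the others.
metric_space = [
    [tanimoto(fingerprints[p], fingerprints[q]) for q in names] for p in names
]
```

A model trained on `metric_space` sees one column per reference molecule, so its input width grows with the data set, which is one source of the computational cost mentioned in the abstract.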


Subject(s)
Models, Molecular , Quantitative Structure-Activity Relationship , Support Vector Machine , Algorithms , Computer Simulation , Machine Learning
6.
Cells ; 8(4)2019 04 14.
Article in English | MEDLINE | ID: mdl-31014000

ABSTRACT

The most common cystic fibrosis-causing mutation (F508del, present in ~85% of CF patients) leads to CFTR misfolding, which is recognized by the endoplasmic reticulum (ER) quality control (ERQC), resulting in ER retention and early degradation. It is known that CFTR exit from the ER is mediated by specific retention/sorting signals that include four arginine-framed tripeptide (AFT) retention motifs and a diacidic (DAD) exit code that controls the interaction with the COPII machinery. Here, we aim at obtaining a global view of the protein interactors that regulate CFTR exit from the ER. We used mass spectrometry-based interaction proteomics and bioinformatics analyses to identify and characterize proteins interacting with selected CFTR peptide motifs or full-length CFTR variants retained or bypassing these ERQC checkpoints. We conclude that these ERQC trafficking checkpoints rely on fundamental players in the secretory pathway, detecting key components of the protein folding machinery associated with the AFT recognition and of the trafficking machinery recognizing the diacidic code. Furthermore, a greater similarity in terms of interacting proteins is observed for variants sharing the same folding defect over those reaching the same cellular location, evidencing that folding status is dominant over ER escape in shaping the CFTR interactome.


Subject(s)
Cystic Fibrosis Transmembrane Conductance Regulator/metabolism , Cystic Fibrosis/metabolism , Cell Line , Cystic Fibrosis Transmembrane Conductance Regulator/chemistry , Endoplasmic Reticulum/metabolism , Endoplasmic Reticulum Stress , Humans , Mutation , Protein Folding , Protein Transport , Proteomics , Respiratory Mucosa/metabolism
7.
J Cheminform ; 11(1): 63, 2019 Oct 22.
Article in English | MEDLINE | ID: mdl-33430986

ABSTRACT

BACKGROUND: Molecular space visualization can help to explore the diversity of large heterogeneous chemical data, which ultimately may increase the understanding of structure-activity relationships (SAR) in drug discovery projects. Visual SAR analysis can therefore be useful for library design, chemical classification for their biological evaluation and virtual screening for the selection of compounds for synthesis or in vitro testing. As such, computational approaches for molecular space visualization have become an important issue in cheminformatics research. The proposed approach uses molecular similarity as the sole input for computing a probabilistic surface of molecular activity (PSMA). This similarity matrix is transformed into 2D using different dimension reduction algorithms (Principal Coordinates Analysis (PCooA), Kruskal multidimensional scaling, Sammon mapping and t-SNE). From this projection, a kernel density function is applied to compute the probability of activity for each coordinate in the new projected space. RESULTS: This methodology was tested over four different quantitative structure-activity relationship (QSAR) binary classification data sets and the PSMAs were computed for each. The generated maps showed internal consistency with active molecules grouped together for all data sets and all dimensionality reduction algorithms. To validate the quality of the generated maps, the 2D coordinates of test molecules were computed into the new reference space using a data transformation matrix. In total, sixteen PSMAs were built, and their performance was assessed using the Area Under Curve (AUC) and the Matthews Correlation Coefficient (MCC). For the best projections for each data set, AUC testing results ranged from 0.87 to 0.98 and the MCC scores ranged from 0.33 to 0.77, suggesting this methodology can validly capture the complexities of the molecular activity space. All four mapping functions provided generally good results, yet the overall performance of PCooA and t-SNE was slightly better than that of Sammon mapping and Kruskal multidimensional scaling. CONCLUSIONS: Our results showed that, by using an appropriate combination of metric space representation and dimensionality reduction applied over metric spaces, it is possible to produce a visual PSMA whose consistency has been validated by using the map as a classification model. The produced maps can be used as prediction tools, as it is simple to project any molecule into this new reference space as long as the similarities to the molecules used to compute the initial similarity matrix can be computed.
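
The kernel density step can be sketched as a ratio of Gaussian kernel sums over 2D coordinates: the kernel mass contributed by active molecules divided by the mass contributed by all molecules. This is one plausible estimator written under stated assumptions; the abstract does not specify the kernel, the bandwidth, or the exact formula, and the coordinates and labels below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def gaussian_kernel(p: Point, q: Point, bandwidth: float) -> float:
    """Isotropic Gaussian kernel on squared Euclidean distance."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2.0 * bandwidth ** 2))


def activity_probability(
    x: Point,
    coords: Sequence[Point],
    active: Sequence[bool],
    bandwidth: float = 1.0,
) -> float:
    """Kernel mass of actives at x divided by kernel mass of all molecules."""
    weights: List[float] = [gaussian_kernel(x, c, bandwidth) for c in coords]
    total = sum(weights)
    if total == 0.0:
        return 0.5  # assumption: no nearby evidence, so stay uninformative
    return sum(w for w, a in zip(weights, active) if a) / total


# Hypothetical projected coordinates and activity labels.
coords = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
active = [True, True, False, False]

p_near_actives = activity_probability((0.1, 0.0), coords, active)
p_near_inactives = activity_probability((3.0, 3.0), coords, active)
```

Evaluating `activity_probability` on a regular grid of points would give the surface itself; thresholding the value for projected test molecules turns the map into a classifier, which is how the abstract describes validating it.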

8.
Cell Mol Life Sci ; 75(24): 4495-4509, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30066085

ABSTRACT

Misfolded F508del-CFTR, the main molecular cause of the recessive disorder cystic fibrosis, is recognized by the endoplasmic reticulum (ER) quality control (ERQC) resulting in its retention and early degradation. The ERQC mechanisms rely mainly on molecular chaperones and on sorting motifs, whose presence and exposure determine CFTR retention or exit through the secretory pathway. Arginine-framed tripeptides (AFTs) are ER retention motifs shown to modulate CFTR retention. However, the interactions and regulatory pathways involved in this process are still largely unknown. Here, we used proteomic interaction profiling and global bioinformatic analysis to identify factors that interact differentially with F508del-CFTR and F508del-CFTR without AFTs (F508del-4RK-CFTR) as putative regulators of this specific ERQC checkpoint. Using LC-MS/MS, we identified kinesin family member C1 (KIFC1) as a stronger interactor with F508del-CFTR versus F508del-4RK-CFTR. We further validated this interaction showing that decreasing KIFC1 levels or activity stabilizes the immature form of F508del-CFTR by reducing its degradation. We conclude that the current approach is able to identify novel putative therapeutic targets that can be ultimately used to the benefit of CF patients.


Subject(s)
Cystic Fibrosis Transmembrane Conductance Regulator/metabolism , Kinesins/metabolism , Protein Interaction Maps , Proteomics/methods , Amino Acid Sequence , Cystic Fibrosis Transmembrane Conductance Regulator/chemistry , Cystic Fibrosis Transmembrane Conductance Regulator/genetics , Down-Regulation , HEK293 Cells , Humans , Kinesins/genetics , Mutation , Protein Folding , Protein Interaction Mapping/methods , Proteolysis
9.
J Cheminform ; 10(1): 1, 2018 Jan 16.
Article in English | MEDLINE | ID: mdl-29340790

ABSTRACT

BACKGROUND: In silico tools based on quantitative structure-activity relationship (QSAR) models are widely used to screen huge databases of compounds in order to determine the biological properties of chemical molecules based on their chemical structure. The exponentially growing amount of data on synthesized and known chemicals demands computationally efficient, automated QSAR modeling tools that are available to researchers who may lack extensive knowledge of machine learning modeling. Thus, a fully automated and advanced modeling platform can be an important addition to the QSAR community. RESULTS: In the presented workflow, the process from data preparation to model building and validation has been completely automated. The work focused on the most critical modeling tasks (data curation, data set characteristics evaluation, variable selection and validation), which largely influence the performance of QSAR models. The workflow also includes the ability to quickly evaluate the feasibility of modeling a given data set. The developed framework was tested on data sets of thirty different problems. The best-optimized feature selection methodology in the developed workflow is able to remove 62-99% of all redundant data. On average, feature selection reduced the prediction error by about 19% and increased the percentage of variance explained (PVE) by 49% compared to models without feature selection. For models with a modelability score above 0.6, the average PVE score was 0.71. A strong correlation was verified between the modelability scores and the PVE of the models produced with variable selection. CONCLUSIONS: We developed an extendable and highly customizable fully automated QSAR modeling framework. The workflow requires no advanced parameterization and does not depend on users' decisions or expertise in machine learning or programming. With just a given target or problem, the workflow follows an unbiased standard protocol to develop reliable QSAR models by directly accessing online manually curated databases or by using private data sets. Other distinctive features of the workflow include prior estimation of data modelability to avoid time-consuming modeling trials for non-modelable data sets, an efficient variable selection procedure, and the availability of output at each modeling task for diverse applications and for reproducing historical predictions. The results reached on a selection of thirty QSAR problems suggest that the approach is capable of building reliable models even for challenging problems.
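
For readers unfamiliar with the PVE metric quoted above, a minimal sketch follows. It assumes the common definition PVE = 1 - SS_res / SS_tot, reported here as a fraction; the abstract does not spell out the formula, and the function name `pve` and the numbers are illustrative only.

```python
from typing import Sequence


def pve(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of variance explained, assuming PVE = 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    if ss_tot == 0.0:
        raise ValueError("PVE is undefined when all observed values are equal")
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted activities.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = pve(observed, predicted)
```

Under this definition a model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean scores below 0.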

11.
Nat Commun ; 7: 12460, 2016 08 23.
Article in English | MEDLINE | ID: mdl-27549343

ABSTRACT

Rheumatoid arthritis (RA) affects millions worldwide. While anti-TNF treatment is widely used to reduce disease progression, treatment fails in approximately one-third of patients. No biomarker currently exists that identifies non-responders before treatment. A rigorous community-based assessment of the utility of SNP data for predicting anti-TNF treatment efficacy in RA patients was performed in the context of a DREAM Challenge (http://www.synapse.org/RA_Challenge). An open challenge framework enabled the comparative evaluation of predictions developed by 73 research groups using the most comprehensive available data and covering a wide range of state-of-the-art modelling methodologies. Despite a significant genetic heritability estimate of the treatment non-response trait (h² = 0.18, P value = 0.02), no significant genetic contribution to prediction accuracy is observed. Results formally confirm the expectations of the rheumatology community that SNP information does not significantly improve predictive performance relative to standard clinical traits, thereby justifying a refocusing of future efforts on collection of other data.


Subject(s)
Antibodies, Monoclonal, Humanized/therapeutic use , Arthritis, Rheumatoid/drug therapy , Genetic Predisposition to Disease/genetics , Polymorphism, Single Nucleotide , Tumor Necrosis Factor-alpha/antagonists & inhibitors , Adult , Aged , Antibodies, Monoclonal/therapeutic use , Antirheumatic Agents/therapeutic use , Arthritis, Rheumatoid/genetics , Arthritis, Rheumatoid/pathology , Certolizumab Pegol/therapeutic use , Cohort Studies , Crowdsourcing , Female , Humans , Male , Middle Aged , Prognosis , Treatment Outcome , Tumor Necrosis Factor-alpha/immunology
12.
Genomics ; 106(5): 268-77, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26225835

ABSTRACT

A meta-analysis of 13 independent microarray data sets was performed and gene expression profiles from cystic fibrosis (CF), similar disorders (COPD: chronic obstructive pulmonary disease, IPF: idiopathic pulmonary fibrosis, asthma), environmental conditions (smoking, epithelial injury), related cellular processes (epithelial differentiation/regeneration), and non-respiratory "control" conditions (schizophrenia, dieting), were compared. Similarity among differentially expressed (DE) gene lists was assessed using a permutation test, and a clustergram was constructed, identifying common gene markers. Global gene expression values were standardized using a novel approach, revealing that similarities between independent data sets run deeper than shared DE genes. Correlation of gene expression values identified putative gene regulators of the CF transmembrane conductance regulator (CFTR) gene, of potential therapeutic significance. Our study provides a novel perspective on CF epithelial gene expression in the context of other lung disorders and conditions, and highlights the contribution of differentiation/EMT and injury to gene signatures of respiratory disease.
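
A permutation test on the overlap between two differentially expressed gene lists can be sketched as below. The abstract does not describe the test statistic or null model, so this is a generic version that assumes random draws from a shared gene universe; the gene identifiers, list sizes, and the function name `overlap_permutation_test` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def overlap_permutation_test(
    list_a: Sequence[str],
    list_b: Sequence[str],
    universe: Sequence[str],
    n_perm: int = 2000,
    seed: int = 0,
) -> Tuple[int, float]:
    """Return the observed overlap and an empirical one-sided p-value."""
    rng = random.Random(seed)
    set_a, set_b = set(list_a), set(list_b)
    observed = len(set_a & set_b)
    pool: List[str] = list(universe)
    hits = 0
    for _ in range(n_perm):
        # Null model: list A replaced by a random gene set of the same size.
        sample = rng.sample(pool, len(set_a))
        if len(set_b.intersection(sample)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical gene universe and two DE lists sharing ten genes.
universe = [f"gene{i}" for i in range(200)]
de_list_a = universe[:20]
de_list_b = universe[10:30]
observed_overlap, p_value = overlap_permutation_test(de_list_a, de_list_b, universe)
```

With these toy lists the expected overlap under the null is about two genes, so an observed overlap of ten yields a very small empirical p-value.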


Subject(s)
Cystic Fibrosis Transmembrane Conductance Regulator/genetics , Cystic Fibrosis/genetics , Gene Expression Profiling , Gene Expression Regulation , Transcriptome , Asthma/genetics , Asthma/metabolism , Cystic Fibrosis/metabolism , Epithelial Cells/metabolism , Humans , Idiopathic Pulmonary Fibrosis/genetics , Idiopathic Pulmonary Fibrosis/metabolism , Pulmonary Disease, Chronic Obstructive/genetics , Pulmonary Disease, Chronic Obstructive/metabolism , Smoking
13.
J Chem Inf Model ; 54(7): 1833-49, 2014 Jul 28.
Article in English | MEDLINE | ID: mdl-24897621

ABSTRACT

Structurally similar molecules tend to have similar properties, i.e., closer molecules in the molecular space are more likely to yield similar property values while distant molecules are more likely to yield different values. Based on this principle, we propose the use of a new method that takes into account the high dimensionality of the molecular space, predicting chemical, physical, or biological properties based on the most similar compounds with measured properties. This methodology uses ordinary kriging coupled with three different molecular similarity approaches (based on molecular descriptors, fingerprints, and atom matching), which creates an interpolation map over the molecular space that is capable of predicting properties/activities for diverse chemical data sets. The proposed method was tested in two data sets of diverse chemical compounds collected from the literature and preprocessed. One of the data sets contained dihydrofolate reductase inhibition activity data, and the second contained molecules for which aqueous solubility was known. The overall predictive results using kriging for both data sets are consistent with the results obtained in the literature using typical QSPR/QSAR approaches. However, the procedure did not involve any type of descriptor selection or even minimal information about each problem, suggesting that this approach is directly applicable to a large spectrum of problems in QSAR/QSPR. Furthermore, the predictive results improve significantly with the similarity threshold between the training and testing compounds, allowing the definition of a confidence threshold of similarity and error estimation for each case inferred. The use of kriging for interpolation over the molecular metric space is independent of the training data set size, and no reparametrizations are necessary when compounds are added to or removed from the set. Increasing the size of the database will consequently improve the quality of the estimations. Finally, it is shown that this model can be used for checking the consistency of measured data and for guiding an extension of the training set by determining the regions of the molecular space for which new experimental measurements could be used to maximize the model's predictive performance.
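
Ordinary kriging itself can be sketched in a few lines: the prediction is a weighted sum of measured values, with weights obtained from a linear system that forces them to sum to one. The sketch below assumes an exponential covariance on Euclidean distance between descriptor vectors, which is an illustrative choice and not the covariance model or similarity measure reported in the paper; the data and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular kriging system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def covariance(p: Sequence[float], q: Sequence[float], length_scale: float) -> float:
    """Assumed exponential covariance; closer molecules covary more strongly."""
    return math.exp(-math.dist(p, q) / length_scale)


def ordinary_kriging(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[float],
    query: Sequence[float],
    length_scale: float = 1.0,
) -> Tuple[float, List[float]]:
    """Return the kriging prediction at query and the weights used."""
    n = len(train_x)
    # Block system [[K, 1], [1^T, 0]] [w, mu] = [k, 1]; mu is a Lagrange multiplier.
    a = [
        [covariance(train_x[i], train_x[j], length_scale) for j in range(n)] + [1.0]
        for i in range(n)
    ]
    a.append([1.0] * n + [0.0])
    b = [covariance(x, query, length_scale) for x in train_x] + [1.0]
    weights = solve_linear(a, b)[:n]
    return sum(w * y for w, y in zip(weights, train_y)), weights


# Hypothetical one-dimensional descriptors and measured property values.
train_x = [(0.0,), (1.0,), (2.5,)]
train_y = [1.0, 2.0, 0.5]
prediction, weights = ordinary_kriging(train_x, train_y, (0.5,))
```

Because the weights depend only on pairwise covariances, adding a compound means adding one row and column to the system rather than refitting model parameters, which matches the abstract's remark that no reparametrization is needed.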


Subject(s)
Models, Theoretical , Quantitative Structure-Activity Relationship , Folic Acid Antagonists/chemistry , Folic Acid Antagonists/pharmacology , Solubility , Tetrahydrofolate Dehydrogenase/metabolism , Water/chemistry
14.
J Chem Inf Model ; 53(10): 2511-24, 2013 Oct 28.
Article in English | MEDLINE | ID: mdl-24044748

ABSTRACT

Measuring similarity between molecules is a fundamental problem in cheminformatics. Given that similar molecules tend to have similar physical, chemical, and biological properties, the notion of molecular similarity plays an important role in the exploration of molecular data sets, query-retrieval in molecular databases, and in structure-property/activity modeling. Various methods to define structural similarity between molecules are available in the literature, but so far none has been used with consistent and reliable results for all situations. We propose a new similarity method based on atom alignment for the analysis of structural similarity between molecules. This method is based on the comparison of the bonding profiles of atoms on comparable molecules, including features that are seldom found in other structural or graph matching approaches, such as chirality or double bond stereoisomerism. The similarity measure is then defined on the annotated molecular graph, based on an iterative directed graph similarity procedure and optimal atom alignment between atoms using a pairwise matching algorithm. With the proposed approach, the similarities detected are more intuitively understood because similar atoms in the molecules are explicitly shown. This noncontiguous atom matching structural similarity method (NAMS) was tested and compared with one of the most widely used similarity methods (fingerprint-based similarity) using three difficult data sets with different characteristics. Despite having a higher computational cost, the method performed well, being able to distinguish either different or very similar hydrocarbons that were indistinguishable using a fingerprint-based approach. NAMS also verified the similarity principle using a data set of structurally similar steroids with differences in the binding affinity to the corticosteroid binding globulin receptor by showing that pairs of steroids with a high degree of similarity (>80%) tend to have smaller differences in the absolute value of binding activity. Using a highly diverse set of compounds with information about the monoamine oxidase inhibition level, the method was also able to recover a significantly higher average fraction of active compounds when the seed is active for different cutoff threshold values of similarity. In particular, for the cutoff threshold values of 86%, 93%, and 96.5%, NAMS was able to recover a fraction of actives of 0.57, 0.63, and 0.83, respectively, while the fingerprint-based approach was able to recover a fraction of actives of 0.41, 0.40, and 0.39, respectively. NAMS is made freely available to the whole community as a simple Web-based tool, together with the Python source code, at http://nams.lasige.di.fc.ul.pt/.
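
The retrieval statistic reported above (fraction of actives among compounds at or above a similarity cutoff to an active seed) can be sketched for a single seed as follows. The similarity values and activity labels are hypothetical, the function name is illustrative, and averaging over all active seeds, as the abstract describes, is left to the caller.

```python
from typing import Optional, Sequence


def fraction_of_actives(
    similarities: Sequence[float],
    is_active: Sequence[bool],
    cutoff: float,
) -> Optional[float]:
    """Active fraction among molecules with similarity to the seed >= cutoff.

    Returns None when no molecule passes the cutoff.
    """
    retrieved = [a for s, a in zip(similarities, is_active) if s >= cutoff]
    if not retrieved:
        return None
    return sum(retrieved) / len(retrieved)


# Hypothetical similarities of five molecules to one active seed.
sims_to_seed = [0.97, 0.95, 0.90, 0.88, 0.70]
labels = [True, True, False, True, False]

at_86 = fraction_of_actives(sims_to_seed, labels, 0.86)
at_93 = fraction_of_actives(sims_to_seed, labels, 0.93)
at_99 = fraction_of_actives(sims_to_seed, labels, 0.99)
```

A similarity measure that respects the similarity principle should show this fraction rising as the cutoff increases, which is the pattern the abstract reports for NAMS but not for the fingerprint-based approach.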


Subject(s)
Algorithms , Computer Graphics/statistics & numerical data , Models, Chemical , Software , Enzyme Inhibitors/chemistry , Humans , Internet , Molecular Imprinting , Molecular Structure , Monoamine Oxidase/chemistry , Protein Binding , Receptors, Steroid/chemistry , Research Design , Small Molecule Libraries/chemistry , Steroids/chemistry , Structure-Activity Relationship
15.
J Cheminform ; 5(1): 9, 2013 Feb 11.
Article in English | MEDLINE | ID: mdl-23399299

ABSTRACT

BACKGROUND: One of the main topics in the development of quantitative structure-property relationship (QSPR) predictive models is the identification of the subset of variables that represent the structure of a molecule and which are predictors for a given property. There are several automated feature selection methods, ranging from backward, forward or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that can predict a certain property with good performance, in a computationally efficient and more robust way, since the presence of irrelevant or redundant features can cause poor generalization capacity. In this paper, an alternative selection method, based on Random Forests to determine the variable importance, is proposed in the context of QSPR regression problems, with an application to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a ranked list based on the variable importance. RESULTS: The model generalizes well even with a high dimensional dataset and in the presence of highly correlated variables. The feature selection step was shown to yield lower prediction errors, with RMSE values 23% lower than without feature selection, albeit using only 6% of the total number of variables (89 from the original 1485). The proposed approach further compared favourably with other feature selection methods and dimension reduction of the feature space. The predictive model was selected using a 10-fold cross validation procedure and, after selection, it was validated with an independent set to assess its performance when applied to new data. The results were similar to the ones obtained for the training set, supporting the robustness of the proposed approach. CONCLUSIONS: The proposed methodology seemingly improves the prediction performance of standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors, providing faster and more cost-effective calculation of descriptors by reducing their numbers, and providing a better understanding of the underlying relationship between the molecular structure represented by descriptors and the property of interest.

16.
PLoS One ; 7(7): e40519, 2012.
Article in English | MEDLINE | ID: mdl-22848383

ABSTRACT

Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
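
The basic association rule learning baseline mentioned above can be sketched as a search for rules "term A implies term B" that pass support and confidence thresholds. This shows only the generic baseline idea; the refined algorithm evaluated in the paper is not described in the abstract. The protein identifiers, annotations, thresholds, and the function name `mine_rules` are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Hypothetical protein -> molecular function term annotations.
annotations: Dict[str, Set[str]] = {
    "P1": {"kinase activity", "ATP binding"},
    "P2": {"kinase activity", "ATP binding"},
    "P3": {"kinase activity", "ATP binding", "metal ion binding"},
    "P4": {"ATP binding"},
    "P5": {"metal ion binding"},
}

Rule = Tuple[str, str, float, float]  # (antecedent, consequent, support, confidence)


def mine_rules(
    data: Dict[str, Set[str]],
    min_support: float = 0.4,
    min_confidence: float = 0.9,
) -> List[Rule]:
    """Return rules A -> B meeting the support and confidence thresholds."""
    n = len(data)
    terms = sorted(set().union(*data.values()))
    rules: List[Rule] = []
    for a, b in permutations(terms, 2):
        with_a = [t for t in data.values() if a in t]
        if not with_a:
            continue
        both = sum(1 for t in with_a if b in t)
        support = both / n                # share of proteins annotated with A and B
        confidence = both / len(with_a)   # share of A-annotated proteins that also have B
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


rules = mine_rules(annotations)
```

A protein annotated with the antecedent of a high-confidence rule but lacking the consequent is a candidate for the kind of inconsistent or incomplete annotation the abstract discusses.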


Subject(s)
Databases, Protein , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Software
17.
J Chem Inf Model ; 52(6): 1686-97, 2012 Jun 25.
Article in English | MEDLINE | ID: mdl-22612593

ABSTRACT

The human blood-brain barrier (BBB) is a membrane that protects the central nervous system (CNS) by restricting the passage of solutes. The development of any new drug must take its existence into account, whether for designing new molecules that target components of the CNS or, on the other hand, for finding new substances that should not penetrate the barrier. Several studies in the literature have attempted to predict BBB penetration, so far with limited success and few, if any, applications to real-world drug discovery and development programs. Part of the reason is that only about 2% of small molecules can cross the BBB, and the available data sets are not representative of that reality, being generally biased with an over-representation of molecules that show an ability to permeate the BBB (BBB positives). To circumvent this limitation, the current study aims to devise and use a new approach based on Bayesian statistics, coupled with state-of-the-art machine learning methods, to produce a robust model capable of being applied in real-world drug research scenarios. The data set used, gathered from the literature, totals 1970 curated molecules, one of the largest for similar studies. Random Forests and Support Vector Machines were tested in various configurations against several chemical descriptor set combinations. Models were tested in a 5-fold cross-validation process, and the best one was tested over an independent validation set. The best fitted model produced an overall accuracy of 95%, with a mean square contingency coefficient (ϕ) of 0.74, and showed an overall capacity of 83% for predicting BBB positives and 96% for determining BBB negatives. This model was adapted into a Web-based tool made available for the whole community at http://b3pp.lasige.di.fc.ul.pt.


Subject(s)
Bayes Theorem , Blood-Brain Barrier , Models, Theoretical , Artificial Intelligence , Humans , Likelihood Functions , Probability , Support Vector Machine
18.
PLoS Comput Biol ; 5(7): e1000443, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19649320

ABSTRACT

In recent years, ontologies have become a mainstream topic in biomedical research. When biological entities are described using a common schema, such as an ontology, they can be compared by means of their annotations. This type of comparison is called semantic similarity, since it assesses the degree of relatedness between two entities by the similarity in meaning of their annotations. The application of semantic similarity to biomedical ontologies is recent; nevertheless, several studies have been published in the last few years describing and evaluating diverse approaches. Semantic similarity has become a valuable tool for validating the results drawn from biomedical studies such as gene clustering, gene expression data analysis, prediction and validation of molecular interactions, and disease gene prioritization. We review semantic similarity measures applied to biomedical ontologies and propose their classification according to the strategies they employ: node-based versus edge-based and pairwise versus groupwise. We also present comparative assessment studies and discuss the implications of their results. We survey the existing implementations of semantic similarity measures, and we describe examples of applications to biomedical research. This will clarify how biomedical researchers can benefit from semantic similarity measures and help them choose the approach most suitable for their studies. Biomedical ontologies are evolving toward increased coverage, formality, and integration, and their use for annotation is increasingly becoming a focus of both effort by biomedical experts and application of automated annotation procedures to create corpora of higher quality and completeness than are currently available. Given that semantic similarity measures are directly dependent on these evolutions, we can expect to see them gaining more relevance and even becoming as essential as sequence similarity is today in biomedical research.
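
As a concrete instance of a node-based, pairwise measure, the sketch below computes Resnik similarity: the information content (IC) of the most informative ancestor shared by two terms. The toy ontology, the annotation counts, and the function names are hypothetical and serve only to illustrate the idea; the review covers many other measures.

```python
import math
from typing import Dict, Set

# Hypothetical toy ontology: term -> direct parents.
parents: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalytic": {"root"},
    "ATP binding": {"binding"},
    "GTP binding": {"binding"},
    "kinase": {"catalytic"},
}

# Hypothetical number of direct annotations per term in some corpus.
annotation_counts: Dict[str, int] = {
    "root": 0, "binding": 2, "catalytic": 1,
    "ATP binding": 10, "GTP binding": 4, "kinase": 3,
}


def ancestors(term: str) -> Set[str]:
    """The term itself plus every term reachable through parent links."""
    result = {term}
    for parent in parents[term]:
        result |= ancestors(parent)
    return result


def information_content(term: str) -> float:
    """IC = -log(p), where p counts annotations to the term and its descendants."""
    total = sum(annotation_counts.values())
    freq = sum(c for t, c in annotation_counts.items() if term in ancestors(t))
    return -math.log(freq / total)


def resnik(term_a: str, term_b: str) -> float:
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


sim_siblings = resnik("ATP binding", "GTP binding")
sim_distant = resnik("ATP binding", "kinase")
```

An edge-based measure would instead count the links on the path between the two terms, which ignores how unevenly annotations are distributed across the ontology.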


Subject(s)
Computational Biology/methods , Semantics , Terminology as Topic , Algorithms , Biomedical Research/methods , Classification/methods , Natural Language Processing , Software
19.
BMC Bioinformatics ; 10: 231, 2009 Jul 24.
Article in English | MEDLINE | ID: mdl-19630945

ABSTRACT

BACKGROUND: Efficient and accurate prediction of protein function from sequence is one of the standing problems in Biology. The generalised use of sequence alignments for inferring function promotes the propagation of errors, and there are limits to its applicability. Several machine learning methods have been applied to predict protein function, but they lose much of the information encoded by protein sequences because they need to transform them to obtain data of fixed length. RESULTS: We have developed a machine learning methodology, called peptide programs (PPs), to deal directly with protein sequences and compared its performance with that of Support Vector Machines (SVMs) and BLAST in detailed enzyme classification tasks. Overall, the PPs and SVMs had a similar performance in terms of Matthews Correlation Coefficient, but the PPs had generally a higher precision. BLAST performed globally better than both methodologies, but the PPs had better results than BLAST and SVMs for the smaller datasets. CONCLUSION: The higher precision of the PPs in comparison to the SVMs suggests that dealing with sequences is advantageous for detailed protein classification, as precision is essential to avoid annotation errors. The fact that the PPs performed better than BLAST for the smaller datasets demonstrates the potential of the methodology, but the drop in performance observed for the larger datasets indicates that further development is required. Possible strategies to address this issue include partitioning the datasets into smaller subsets and training individual PPs for each subset, or training several PPs for each dataset and combining them using a bagging strategy.
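
The two evaluation metrics named in this abstract, precision and the Matthews Correlation Coefficient, can both be computed from a binary confusion matrix. The sketch below uses the standard textbook definitions; the function names and the example counts are illustrative assumptions and are not taken from the article.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct."""
    total = tp + fp
    return tp / total if total else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts for one enzyme class, for illustration only.
    tp, tn, fp, fn = 40, 900, 5, 15
    print(f"precision = {precision(tp, fp):.3f}")
    print(f"MCC       = {matthews_corrcoef(tp, tn, fp, fn):.3f}")
```

Precision ignores false negatives, whereas the Matthews Correlation Coefficient uses all four counts, which is why two classifiers can have similar coefficients and still differ in precision.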


Subject(s)
Computational Biology/methods , Enzymes/chemistry , Peptides/chemistry , Artificial Intelligence , Databases, Protein , Peptides/classification , Proteins/classification , Sequence Analysis, Protein/methods
20.
BMC Bioinformatics ; 9 Suppl 5: S4, 2008 Apr 29.
Article in English | MEDLINE | ID: mdl-18460186

ABSTRACT

BACKGROUND: Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which is the best approach to semantic similarity in this context, since there is no conclusive evaluation of the various measures. Another issue is whether or not electronic annotations should be used in semantic similarity calculations. RESULTS: We conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation. CONCLUSIONS: This work has provided a basis for the comparison of several semantic similarity measures, and can aid researchers in choosing the most adequate measure for their work. We have found that the hybrid simGIC was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We have also found that the average and maximum combination approaches are problematic since both are inherently influenced by the number of terms being combined. We suspect that there may be a direct influence of data circularity in the behaviour of the results including electronic annotations, as a result of functional inference from sequence similarity.
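
The two best-performing measures named in this abstract can be illustrated on a toy ontology. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the five-term graph in `PARENTS`, the information content values in `IC`, and all function names are hypothetical, and both annotation sets are assumed to be non-empty. simGIC is written here as a Jaccard index weighted by information content over ancestor-extended annotation sets, and the best-match average as the mean of the two directional averages of best pairwise Resnik scores.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a"},
}

# Hypothetical information content (IC) values; rarer terms score higher.
IC = {"root": 0.0, "a": 1.0, "b": 2.0, "c": 3.0, "d": 3.0}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(IC[t] for t in ancestors(t1) & ancestors(t2))


def best_match_average(terms1, terms2):
    """Average the best pairwise match in each direction, then average both."""
    rows = [max(resnik(a, b) for b in terms2) for a in terms1]
    cols = [max(resnik(a, b) for a in terms1) for b in terms2]
    return (sum(rows) / len(rows) + sum(cols) / len(cols)) / 2


def sim_gic(terms1, terms2):
    """Groupwise measure: IC-weighted Jaccard index of the extended term sets."""
    ext1 = set().union(*(ancestors(t) for t in terms1))
    ext2 = set().union(*(ancestors(t) for t in terms2))
    denom = sum(IC[t] for t in ext1 | ext2)
    return sum(IC[t] for t in ext1 & ext2) / denom if denom else 0.0


if __name__ == "__main__":
    protein_x = {"c"}
    protein_y = {"c", "b"}
    print("Resnik BMA:", best_match_average(protein_x, protein_y))
    print("simGIC:    ", sim_gic(protein_x, protein_y))
```

The best-match average only counts the best partner of each term, so adding unrelated terms to one protein dilutes the score less than a plain average over all term pairs would.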


Subject(s)
Computational Biology/methods , Proteins/classification , Sequence Homology , Algorithms , Artificial Intelligence , Computational Biology/standards , Databases, Protein/standards , Gene Expression Profiling/classification , Gene Expression Profiling/standards , Normal Distribution , Pattern Recognition, Automated/standards , Pattern Recognition, Automated/statistics & numerical data , Proteins/genetics , Proteins/ultrastructure , Reference Standards , Reference Values , Semantics , Sensitivity and Specificity , Structure-Activity Relationship , Vocabulary, Controlled