1.
Anal Chem ; 90(21): 12752-12760, 2018 11 06.
Article in English | MEDLINE | ID: mdl-30350614

ABSTRACT

Liquid chromatography coupled with electrospray ionization tandem mass spectrometry (LC-ESI-MS/MS) is a major analytical technique used for nontargeted identification of metabolites in biological fluids. Typically, in LC-ESI-MS/MS based database assisted structure elucidation pipelines, the exact mass of an unknown compound is used to mine a chemical structure database to acquire an initial set of possible candidates. Subsequent matching of the collision induced dissociation (CID) spectrum of the unknown to the CID spectra of candidate structures facilitates identification. However, this approach often fails because of the large numbers of potential candidates (i.e., false positives) for which CID spectra are not available. To overcome this problem, CID fragmentation prediction programs have been developed, but these also have limited success if large numbers of isomers with similar CID spectra are present in the candidate set. In this study, we investigated the use of a retention index (RI) predictive model as an orthogonal method to help improve identification rates. The model was used to eliminate candidate structures whose predicted RI values differed significantly from the experimentally determined RI value of the unknown compound. We tested this approach using a set of ninety-one endogenous metabolites and four in silico CID fragmentation algorithms: CFM-ID, CSI:FingerID, Mass Frontier, and MetFrag. Candidate sets obtained from PubChem and the Human Metabolome Database (HMDB) were ranked with and without RI filtering followed by in silico spectral matching. Upon RI filtering, 12 of the ninety-one metabolites were eliminated from their respective candidate sets, i.e., were scored incorrectly as negatives. For the remaining seventy-nine compounds, we show that RI filtering eliminated an average of 58% from PubChem candidate sets. This resulted in an approximately 2-fold improvement in average rankings when using CFM-ID, Mass Frontier, and MetFrag.
In addition, RI filtering slightly increased the occurrence of number one rankings for all 4 fragmentation algorithms. However, RI filtering did not significantly improve average rankings when HMDB was used as the candidate database, nor did it significantly improve average rankings when using CSI:FingerID. Overall, we show that the current RI model incorrectly eliminated more true positives (12) than were expected (4-5) on the basis of the filtering method. However, it slightly improved the number of correct first place rankings and improved overall average rankings when using CFM-ID, Mass Frontier, and MetFrag.
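
The RI filtering step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the `ri_filter` function, the window width, and all numeric values are hypothetical, since the abstract does not state how "differed significantly" was quantified.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    predicted_ri: float  # RI predicted by some model (hypothetical values below)


def ri_filter(candidates, experimental_ri, window):
    """Keep candidates whose predicted RI lies within +/- window of the measured RI.

    The fixed symmetric window is an assumption made for illustration.
    """
    return [c for c in candidates if abs(c.predicted_ri - experimental_ri) <= window]


# Hypothetical candidate set for one unknown with a measured RI of 600.
candidates = [Candidate("A", 512.0), Candidate("B", 640.0), Candidate("C", 905.0)]
kept = ri_filter(candidates, experimental_ri=600.0, window=100.0)
print([c.name for c in kept])
```

A narrower window removes more false candidates but raises the chance of discarding the true structure, which is the trade-off the abstract reports (12 true positives eliminated against 4-5 expected).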


Subject(s)
Databases, Chemical/statistics & numerical data , Metabolomics/methods , Models, Chemical , Neural Networks, Computer , Algorithms , Chromatography, Liquid , Computer Simulation , Molecular Structure , Spectrometry, Mass, Electrospray Ionization
2.
J Chem Inf Model ; 58(3): 591-604, 2018 03 26.
Article in English | MEDLINE | ID: mdl-29489351

ABSTRACT

The MolFind application has been developed as a nontargeted metabolomics chemometric tool to facilitate structure identification when HPLC biofluids analysis reveals a feature of interest. Here synthetic compounds are selected and measured to form the basis of a new, more accurate, HPLC retention index model for use with MolFind. We show that relatively inexpensive synthetic screening compounds with simple structures can be used to develop an artificial neural network model that is successful in making quality predictions for human metabolites. A total of 1955 compounds were obtained and measured for the model. A separate set of 202 human metabolites was used for independent validation. The new ANN model showed improved accuracy over previous models. The model, based on relatively simple compounds, was able to make quality predictions for complex compounds not similar to training data. Independent validation metabolites with feature combinations found in three or more training compounds were predicted with 97% sensitivity, while metabolites with feature combinations found in fewer than three training compounds were predicted with >90% sensitivity. The study describes the method used to select synthetic compounds and new descriptors developed to encode the relationship between lipophilic molecular subgraphs and HPLC retention. Finally, we introduce the QRI (qualitative range of interest) modification of neural network backpropagation learning to generate models simultaneously based on quantitative and qualitative data.


Subject(s)
Chromatography, High Pressure Liquid/methods , Chromatography, Reverse-Phase/methods , Metabolomics/methods , Humans , Metabolome , Neural Networks, Computer
3.
Bioanalysis ; 7(8): 939-55, 2015.
Article in English | MEDLINE | ID: mdl-25966007

ABSTRACT

BACKGROUND: Artificial Neural Networks (ANN) are extensively used to model 'omics' data. Different modeling methodologies and combinations of adjustable parameters influence model performance and complicate model optimization. METHODOLOGY: We evaluated optimization of four ANN modeling parameters (learning rate annealing, stopping criteria, data split method, network architecture) using retention index (RI) data for 390 compounds. Models were assessed by independent validation (I-Val) using newly measured RI values for 1492 compounds. CONCLUSION: The best model demonstrated an I-Val standard error of 55 RI units and was built using a Ward's clustering data split and a minimally nonlinear network architecture. Use of validation statistics for stopping and final model selection resulted in better independent validation performance than the use of test set statistics.


Subject(s)
Artificial Intelligence/standards , Chromatography, High Pressure Liquid/methods , Metabolomics , Neural Networks, Computer , Systems Biology/standards , Cluster Analysis , Databases, Factual , Tandem Mass Spectrometry/methods
5.
Curr Comput Aided Drug Des ; 10(4): 374-82, 2014.
Article in English | MEDLINE | ID: mdl-25549758

ABSTRACT

A novel approach is developed for modeling situations in which the modeled property is an algebraically transformed version of the original experimental data. In many cases such a transformation results in a data set with a significantly smaller data range. Here we explore the effects of range-of-data on modeling statistics. We illustrate a two-step method using data on the mass spectrometry collision energy (CE) that is required to decompose 50% of precursor ions to fragments (CE50). Earlier we showed that a nonlinear center-of-mass transformation, yielding Ecom50, produces values less dependent on the specific mass spectrometric experimental conditions. For this data set the Ecom50 range is 13.5% of the CE50 range. We propose a two-step modeling method. First, the original experimental data, CE50, (larger range-of-data) is modeled by a standard modeling method (PLS). Second, the calculated dependent variable resulting from the modeling is algebraically transformed (not modeled) according to the center-of-mass transformation, providing the generally more useful data, Ecom50. As shown here, use of this two-step method for predicting Ecom50 (from previously published data) produces a standard error 21% smaller and correspondingly reduces the confidence interval for prediction. Some specific implications for prediction are given for a published data set. This work is part of the ongoing development of a system of models to assist in the development of human metabolites.
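
The two-step idea (model CE50 first, then transform the prediction algebraically) can be sketched as below. The abstract does not give the transformation itself, so this sketch assumes the textbook center-of-mass relation for a collision with a stationary gas atom, E_com = E_lab · m_gas / (m_gas + m_ion). Argon as the collision gas, the function names, and the stand-in linear model are all assumptions for illustration, not the published PLS model.

```python
ARGON_MASS = 39.948  # assumed collision gas mass in Da; not stated in the abstract


def to_center_of_mass(ce_lab, precursor_mass, gas_mass=ARGON_MASS):
    """Convert a lab-frame collision energy to the center-of-mass frame."""
    return ce_lab * gas_mass / (gas_mass + precursor_mass)


def predict_ecom50(predict_ce50, descriptors, precursor_mass, gas_mass=ARGON_MASS):
    """Two-step prediction: model CE50, then transform the result (no second model)."""
    ce50 = predict_ce50(descriptors)  # step 1: prediction in the wider-range variable
    return to_center_of_mass(ce50, precursor_mass, gas_mass)  # step 2: algebra only


def toy_ce50_model(descriptors):
    """Hypothetical stand-in for a fitted CE50 model; coefficients are invented."""
    return 5.0 + 0.5 * descriptors[0]


ecom50 = predict_ecom50(toy_ce50_model, descriptors=[30.0], precursor_mass=360.052)
print(round(ecom50, 4))
```

Because step 2 is a fixed algebraic function of the precursor mass, any error in the result comes only from the CE50 model in step 1.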


Subject(s)
Models, Statistical , Humans , Mass Spectrometry
6.
Anal Chem ; 84(21): 9388-94, 2012 Nov 06.
Article in English | MEDLINE | ID: mdl-23039714

ABSTRACT

In this paper, we present MolFind, a highly multithreaded pipeline type software package for use as an aid in identifying chemical structures in complex biofluids and mixtures. MolFind is specifically designed for high-performance liquid chromatography/mass spectrometry (HPLC/MS) data inputs typical of metabolomics studies where structure identification is the ultimate goal. MolFind enables compound identification by matching HPLC/MS-based experimental data obtained for an unknown compound with computationally derived HPLC/MS values for candidate compounds downloaded from chemical databases such as PubChem. The downloaded "bins" consist of all compounds matching the monoisotopic molecular weight of the unknown. The computational HPLC/MS values predicted include retention index (RI), ECOM(50) (energy required to fragment 50% of a selected precursor ion), drift time, and collision induced dissociation (CID) spectrum. RI, ECOM(50), and drift-time models are used for filtering compounds downloaded from PubChem. The remaining candidates are then ranked based on CID spectra matching. Current RI and ECOM(50) models allow for the removal of about 28% of compounds from PubChem bins. Our estimates suggest that this could be improved to as much as 87% with additional chemical structures included in the computational models. Quantitative structure property relationship-based modeling of drift times showed a better correlation with experimentally determined drift times than did Mobcal cross-sectional areas. In 23 of 35 example cases, filtering PubChem bins with RI and ECOM(50) predictive models resulted in improved ranking of the unknown compounds compared to previous studies using CID spectra matching alone. In 19 of 35 examples, the correct candidate was ranked within the top 20 compounds in bins containing an average of 1635 compounds.
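
The "bin" construction mentioned in this abstract, in which all database compounds matching the monoisotopic mass of the unknown are collected, might look like the sketch below. The parts-per-million tolerance, the function name `mass_bin`, and the compound list are hypothetical; the abstract does not say what mass tolerance MolFind applies or how it queries PubChem.

```python
def mass_bin(compounds, measured_mass, ppm=10.0):
    """Return names of compounds whose monoisotopic mass is within a ppm tolerance.

    compounds: iterable of (name, monoisotopic_mass) pairs.
    The 10 ppm default is an illustrative assumption.
    """
    tolerance = measured_mass * ppm * 1e-6
    return [name for name, mass in compounds if abs(mass - measured_mass) <= tolerance]


# Hypothetical local table standing in for a chemical database download.
database = [
    ("cmpd_1", 224.1889),
    ("cmpd_2", 224.1900),
    ("cmpd_3", 224.2500),
]
bin_members = mass_bin(database, measured_mass=224.1889, ppm=10.0)
print(bin_members)
```

The resulting bin is what the RI, ECOM(50), and drift-time filters would then narrow before CID spectral ranking.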


Subject(s)
Chromatography, High Pressure Liquid/methods , Mass Spectrometry/methods , Software
7.
J Chem Inf Model ; 52(5): 1222-37, 2012 May 25.
Article in English | MEDLINE | ID: mdl-22489687

ABSTRACT

The goal of many metabolomic studies is to identify the molecular structure of endogenous molecules that are differentially expressed among sampled or treatment groups. The identified compounds can then be used to gain an understanding of disease mechanisms. Unfortunately, despite recent advances in a variety of analytical techniques, small molecule (<1000 Da) identification remains difficult. Rarely can a chemical structure be determined from experimental "features" such as retention time, exact mass, and collision induced dissociation spectra. Thus, without knowing structure, biological significance remains obscure. In this study, we explore an identification method in which the measured exact mass of an unknown is used to query available chemical databases to compile a list of candidate compounds. Predictions are made for the candidates using models of experimental features that have been measured for the unknown. The predicted values are used to filter the candidate list by eliminating compounds with predicted values substantially different from the unknown. The intent is to reduce the list of candidates to a reasonable number that can be obtained and measured for confirmation. To facilitate this exploration, we measured data and created models for two experimental features: MS Ecom50 (the energy in electronvolts required to fragment 50% of a selected precursor ion) and HPLC retention index. Using a data set of 52 compounds, Ecom50 models were developed based on both Molconn and CODESSA structural descriptors. These models gave r² values of 0.89 to 0.94 depending on the number of inputs, the modeling algorithm chosen, and whether neutral or protonated structures were used. The retention index model was developed with 400 compounds using a back-propagation artificial neural network and 33 Molconn structure descriptors. External validation gave a v² = 0.87 and standard error of 38 retention index units.
As a test of the validity of the filtering approach, the Ecom50 and retention index models, along with exact mass and collision induced dissociation spectra matching, were used to identify 1,3-dicyclohexylurea in human plasma. This compound was not previously known to exist in human biofluids and its elemental formula was identical to 315 other candidate compounds downloaded from PubChem. These results suggest that the use of Ecom50 and retention index predictive models can improve nontargeted metabolite structure identification using HPLC/MS derived structural features.


Subject(s)
Chromatography, High Pressure Liquid , Mass Spectrometry , Metabolomics/methods , Models, Biological , Urea/analogs & derivatives , Databases, Factual , Humans , Urea/blood , Urea/chemistry
8.
J Chem Inf Model ; 49(4): 788-99, 2009 Apr.
Article in English | MEDLINE | ID: mdl-19309176

ABSTRACT

A back-propagation artificial neural network (ANN) was used to create a 10-fold leave-10%-out cross-validated ensemble model of high performance liquid chromatography retention index (HPLC-RI) for a data set of 498 diverse druglike compounds. A 10-fold multiple linear regression (MLR) ensemble model of the same data was developed for comparison. Molecular structure was described using IGroup E-state indices, a novel set of structure-information representation (SIR) descriptors, along with molecular connectivity chi and kappa indices and other SIR descriptors previously reported. The same input descriptors were used to develop models by both learning algorithms. The MLR model yielded marginally acceptable statistics with training correlation r(2) = 0.65, mean absolute error (MAE) = 83 RI units. External validation of 104 compounds not used for model development yielded validation v(2) = 0.49 and MAE = 73 RI units. The distribution of residuals for the fit and validate data sets suggest a nonlinear relationship between retention index and molecular structure as described by the SIR indices. Not surprisingly, the ANN model was significantly more accurate for both training and validation with training set r(2) = 0.93, MAE = 30 RI units and validation v(2) = 0.84, MAE = 41 RI units. For the ANN model, a total of 91% of validation predictions were within 100 RI units of the experimental value.
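
The validation statistics quoted in this abstract (mean absolute error and the share of predictions within 100 RI units) can be computed as in the sketch below. The observed and predicted values are invented for illustration and are not taken from the study.

```python
def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over all compounds."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def fraction_within(observed, predicted, tolerance):
    """Fraction of predictions whose absolute error is at most the tolerance."""
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= tolerance)
    return hits / len(observed)


# Hypothetical RI values for four validation compounds.
observed_ri = [300.0, 450.0, 620.0, 800.0]
predicted_ri = [320.0, 430.0, 750.0, 790.0]

mae = mean_absolute_error(observed_ri, predicted_ri)
within_100 = fraction_within(observed_ri, predicted_ri, tolerance=100.0)
print(mae, within_100)
```

Reporting both figures is useful because a single large miss (here, 130 RI units) inflates the MAE while leaving the within-tolerance fraction easy to interpret.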


Subject(s)
Chromatography, High Pressure Liquid/statistics & numerical data , Neural Networks, Computer , Algorithms , Artificial Intelligence , Cluster Analysis , Databases, Factual , Forecasting , Linear Models , Models, Chemical , Quantitative Structure-Activity Relationship , Reproducibility of Results , Subject Headings
9.
Bioanalysis ; 1(9): 1627-43, 2009 Dec.
Article in English | MEDLINE | ID: mdl-21083108

ABSTRACT

MS and HPLC are commonly used for compound characterization and obtaining structural information; in the field of metabonomics, these two analytical techniques are often combined to characterize unknown endogenous or exogenous metabolites present in complex biological samples. Since the structures of a majority of these metabolites are not actually identified, the result of most metabonomic studies is a list of m/z values and retention times. However, without knowing actual structures, the biological significance of these 'features' cannot be determined. The process of identifying the structures of unknown compounds can be time intensive, costly and frequently requires the use of multiple orthogonal analytical techniques - this laborious procedure seems insurmountable for the long lists of unknowns that must be identified for each study. In addition, the limited sample volume and the extremely low concentration of most endogenous analytes frequently make purification and identification by other instrumentation nearly impossible. This review is intended to explore the problems and progress with current tools that are available for MS-based structure identification for both endogenous and exogenous metabolites.


Subject(s)
Body Fluids/chemistry , Body Fluids/metabolism , Databases, Factual , Mass Spectrometry/methods , Metabolomics/methods , Humans , Molecular Structure
10.
J Med Chem ; 49(24): 7169-81, 2006 Nov 30.
Article in English | MEDLINE | ID: mdl-17125269

ABSTRACT

Four modeling techniques, using topological descriptors to represent molecular structure, were employed to produce models of human serum protein binding (% bound) on a data set of 1008 experimental values, carefully screened from publicly available sources. To our knowledge, this data is the largest set on human serum protein binding reported for QSAR modeling. The data was partitioned into a training set of 808 compounds and an external validation test set of 200 compounds. Partitioning was accomplished by clustering the compounds in a structure descriptor space so that random sampling of 20% of the whole data set produced an external test set that is a good representative of the training set with respect to both structure and protein binding values. The four modeling techniques include multiple linear regression (MLR), artificial neural networks (ANN), k-nearest neighbors (kNN), and support vector machines (SVM). With the exception of the MLR model, the ANN, kNN, and SVM QSARs were ensemble models. Training set correlation coefficients and mean absolute error ranged from r2=0.90 and MAE=7.6 for ANN to r2=0.61 and MAE=16.2 for MLR. Prediction results from the validation set yielded correlation coefficients and mean absolute errors which ranged from r2=0.70 and MAE=14.1 for ANN to a low of r2=0.59 and MAE=18.3 for the SVM model. Structure descriptors that contribute significantly to the models are discussed and compared with those found in other published models. For the ANN model, structure descriptor trends with respect to their effects on predicted protein binding can assist the chemist in structure modification during the drug design process.


Subject(s)
Blood Proteins/metabolism , Models, Molecular , Pharmaceutical Preparations/metabolism , Quantitative Structure-Activity Relationship , Drug Design , Humans , Linear Models , Neural Networks, Computer , Protein Binding
11.
Chem Biodivers ; 1(11): 1829-41, 2004 Nov.
Article in English | MEDLINE | ID: mdl-17191819

ABSTRACT

Several QSPR models were developed for predicting intrinsic aqueous solubility, S(o). A data set of 5,964 neutral compounds was sub-divided into two classes, aromatic and non-aromatic compounds. Three models were created with different methods on both data sets: two regression models (multiple linear regression and partial least squares) and an artificial neural network model. These models were based on 3343 aromatic and 1674 non-aromatic compounds for training sets; 938 compounds were used in external validation testing. The range in -log S(o) is -1.6 to 10. Topological structure descriptors were used with all models. A genetic algorithm was used for descriptor selection for regression models. For the artificial neural network (ANN) model, descriptor selection was done with a backward elimination process. All models performed well with r2 values ranging 0.72 to 0.84 in external validation testing. The mean absolute errors in validation ranged from 0.44 to 0.80 for the classes of compounds for all the models. These statistical results indicate a sound ANN model. Furthermore, in a comparison with eight other available models, based on predictions using a validation test set (442 compounds), the artificial neural network model presented in this work (CSLogWS) was clearly superior based on both the mean absolute error and the percentage of residuals less than one log unit. In the ANN model both E-State and hydrogen E-State descriptors were found to be important.


Subject(s)
Databases, Factual , Models, Molecular , Quantitative Structure-Activity Relationship , Water/chemistry , Molecular Structure , Predictive Value of Tests , Solubility
12.
J Chem Inf Comput Sci ; 43(6): 2120-8, 2003.
Article in English | MEDLINE | ID: mdl-14632464

ABSTRACT

The binding affinity to human serum albumin for 94 drugs was modeled with topological descriptors of molecular structure, using as experimental data the HPLC chromatographic retention index [logk(HSA)] on immobilized albumin. The electrotopological state (E-State) along with the molecular connectivity chi indices provided the basis for a satisfactory model: r(2) = 0.77, s = 0.29, q(2) = 0.70, s(press) = 0.33. The 10% leave-group-out (LGO) cross-validation method yielded q(2) (= r(2)(press)) = 0.69. Further, the model was tested on a 10 compound external validation set, yielding a mean absolute error, MAE = 0.31; q(2) (= r(2)(press)) = 0.74. MDL QSAR software was used for setting up the data set, creation of combination descriptors, modeling, and database management. All the statistical tests indicate that the topological model is useful for property estimation. Internal and external validation methods were used, and the results indicate that the model is useful for prediction. Randomizations of the activity values also indicate statistically sound models are very different from random statistics. The model indicates that positive factors for binding affinity include electron accessibility and the number of aromatic rings, aliphatic CH groups (-CH(3), -CH(2)-, >CH-), halogens (fluorine and chlorine), and -OH groups. Five-membered heteroatomic rings present a negative factor, whereas six-membered heteroatomic rings present a positive factor. The specific information described can be used as an aid to the drug design process.
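
The leave-group-out statistic q² (= r² press) reported here can be illustrated with the sketch below, which uses q² = 1 − PRESS / SS_total. It is a simplified stand-in: a one-descriptor least-squares line replaces the multi-descriptor topological model, groups are assigned by index modulo the number of groups, and SS_total is taken about the overall mean. All three choices are assumptions, as the abstract does not specify them.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def lgo_q2(xs, ys, n_groups=10):
    """Leave-group-out q2: refit with each group held out, then sum squared errors."""
    n = len(xs)
    press = 0.0
    for group in range(n_groups):
        held_out = [i for i in range(n) if i % n_groups == group]
        training = [i for i in range(n) if i % n_groups != group]
        slope, intercept = fit_line([xs[i] for i in training], [ys[i] for i in training])
        press += sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out)
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total


# Hypothetical, perfectly linear data: every held-out point is predicted exactly.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
q2 = lgo_q2(xs, ys, n_groups=10)
print(round(q2, 6))
```

With real data q² falls below the training r², and the size of that gap is one indication of how far the model overfits.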


Subject(s)
Albumins/chemistry , Albumins/metabolism , Blood Proteins/metabolism , Models, Molecular , Protein Binding , Protein Conformation , Quantitative Structure-Activity Relationship , Terminology as Topic
13.
J Comput Aided Mol Des ; 17(2-4): 103-18, 2003.
Article in English | MEDLINE | ID: mdl-13677479

ABSTRACT

The binding of beta-lactams to human serum proteins was modeled with topological descriptors of molecular structure. Experimental data was the concentration of protein-bound drug expressed as a percent of the total plasma concentration (percent fraction bound, PFB) for 87 penicillins and for 115 beta-lactams. The electrotopological state indices (E-State) and the molecular connectivity chi indices were found to be the basis of two satisfactory models. A data set of 74 penicillins from a drug design series was successfully modeled with statistics: r2 = 0.80, s = 12.1, q2 = 0.76, spress = 13.4. This model was then used to predict protein binding (PFB) for 13 commercial penicillins, resulting in a very good mean absolute error, MAE = 12.7 and correlation coefficient, q2 = 0.84. A group of 28 cephalosporins were combined with the penicillin data to create a dataset of 115 beta-lactams that was successfully modeled: r2 = 0.82, s = 12.7, q2 = 0.78, spress = 13.7. A ten-fold 10% leave-group-out (LGO) cross-validation procedure was implemented, leading to very good statistics: MAE = 10.9, spress = 14.0, q2 (or r2press) = 0.78. The models indicate a combination of general and specific structure features that are important for estimating protein binding in this class of antibiotics. For the beta-lactams, significant factors that increase binding are presence and electron accessibility of aromatic rings, halogens, methylene groups, and =N- atoms. Significant negative influence on binding comes from amine groups and carbonyl oxygen atoms.


Subject(s)
Blood Proteins/metabolism , Models, Chemical , beta-Lactams/metabolism , Humans , Molecular Structure , Penicillins/metabolism , Penicillins/pharmacokinetics , Protein Binding , Quantitative Structure-Activity Relationship , beta-Lactams/pharmacokinetics