Results 1 - 20 of 47
1.
Mol Inform ; 42(10): e2200275, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37488968

ABSTRACT

Conjugated QSPR models for reactions integrate fundamental chemical laws expressed by mathematical equations with machine learning algorithms. Herein we present a methodology for building conjugated QSPR models integrated with the Arrhenius equation. Conjugated QSPR models were used to predict kinetic characteristics of cycloaddition reactions related by the Arrhenius equation: rate constant $\log k$, pre-exponential factor $\log A$, and activation energy $E_{\rm a}$. They were benchmarked against single-task (individual and equation-based models) and multi-task models. In individual models, all characteristics were modeled separately, while in multi-task models $\log k$, $\log A$ and $E_{\rm a}$ were treated cooperatively. An equation-based model assessed $\log k$ using the Arrhenius equation and $\log A$ and $E_{\rm a}$ values predicted by individual models. It has been demonstrated that the conjugated QSPR models can accurately predict the reaction rate constants at extreme temperatures, at which reaction rate constants can hardly be measured experimentally. Also, in the case of small training sets, conjugated models are more robust than related single-task approaches.
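
The equation-based baseline described in this abstract can be illustrated with a minimal sketch. It assumes the decadic form of the Arrhenius equation, log k = log A - Ea / (ln(10) R T), with Ea in J/mol and T in kelvin; the abstract does not state units, and the function name and inputs below are hypothetical placeholders for values that individual models would predict.

```python
import math

# Gas constant in J/(mol*K); the unit choice for Ea is an assumption of this sketch.
R = 8.314462618


def log_k_arrhenius(log_a: float, ea_j_per_mol: float, temperature_k: float) -> float:
    """Return log10(k) from log10(A), Ea (J/mol) and T (K) via the Arrhenius equation."""
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return log_a - ea_j_per_mol / (math.log(10) * R * temperature_k)


if __name__ == "__main__":
    # Hypothetical predicted values, used only to show the temperature dependence.
    for t in (200.0, 300.0, 600.0):
        print(t, log_k_arrhenius(log_a=10.0, ea_j_per_mol=80_000.0, temperature_k=t))
```

Because log A and Ea do not depend on temperature in this form, a single pair of predicted values yields log k at any temperature, including ones outside the measured range.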

2.
J Chem Inf Model ; 62(22): 5471-5484, 2022 11 28.
Article in English | MEDLINE | ID: mdl-36332178

ABSTRACT

In order to better formalize it, the notorious inverse-QSAR problem (finding structures of given QSAR-predicted properties) is considered in this paper as a two-step process including (i) finding "seed" descriptor vectors corresponding to user-constrained QSAR model output values and (ii) identifying the chemical structures best matching the "seed" vectors. The main development effort here was focused on the latter stage, proposing a new attention-based conditional variational autoencoder neural-network architecture based on recent developments in attention-based methods. The obtained results show that this workflow was capable of generating compounds predicted to display desired activity while being completely novel compared to the training database (ChEMBL). Moreover, the generated compounds show acceptable druglikeness and synthetic accessibility. Both pharmacophore and docking studies were carried out as "orthogonal" in silico validation methods, proving that some of the de novo structures are, beyond being predicted active by 2D-QSAR models, clearly able to match binding 3D pharmacophores and bind the protein pocket.


Subject(s)
Quantitative Structure-Activity Relationship , Molecular Docking Simulation
3.
J Chem Inf Model ; 61(10): 4913-4923, 2021 10 25.
Article in English | MEDLINE | ID: mdl-34554736

ABSTRACT

Modern QSAR approaches have wide practical applications in drug discovery for designing potentially bioactive molecules. If such models are based on the use of 2D descriptors, important information contained in the spatial structures of molecules is lost. The major problem in constructing models using 3D descriptors is the choice of a putative bioactive conformation, which affects the predictive performance. The multi-instance (MI) learning approach considering multiple conformations in model training could be a reasonable solution to the above problem. In this study, we implemented several multi-instance algorithms, both conventional and based on deep learning, and investigated their performance. We compared the performance of MI-QSAR models with those based on the classical single-instance QSAR (SI-QSAR) approach in which each molecule is encoded by either 2D descriptors computed for the corresponding molecular graph or 3D descriptors issued for a single lowest energy conformation. The calculations were carried out on 175 data sets extracted from the ChEMBL23 database. It is demonstrated that (i) MI-QSAR outperforms SI-QSAR in numerous cases and (ii) MI algorithms can automatically identify plausible bioactive conformations.


Subject(s)
Algorithms , Quantitative Structure-Activity Relationship , Databases, Factual , Drug Discovery , Molecular Conformation
4.
Mol Inform ; 40(11): e2060030, 2021 11.
Article in English | MEDLINE | ID: mdl-34342944

ABSTRACT

The most widely used QSAR approaches are mainly based on 2D molecular representation, which ignores stereoconfiguration and conformational flexibility of compounds. 3D QSAR uses a single conformer of each compound, which is difficult to choose reasonably. 4D QSAR uses multiple conformers to overcome the issues of 2D and 3D methods. However, many existing 4D QSAR models suffer from the necessity to pre-align conformers, while alignment-independent approaches often ignore stereoconfiguration of compounds. In this study we propose a QSAR modeling approach based on transforming chirality-aware 3D pharmacophore descriptors of individual conformers into a set of latent variables representing the whole conformer set of a molecule. This is achieved by clustering together all conformers of all training set compounds. The final representation of a compound is a bit string encoding cluster membership of its conformers. In our study we used Random Forest, but this representation can be used in combination with any machine learning method. We compared this approach with conventional 2D and 3D approaches using multiple data sets and investigated the sensitivity of the proposed approach to tuning parameters: number of conformers and clusters.
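
The cluster-membership bit string described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the cluster centroids are taken as already computed from the training conformers, conformers are assigned to the nearest centroid by squared Euclidean distance, and the names `nearest_cluster` and `molecule_bit_string` are hypothetical. The abstract does not specify the clustering algorithm or the distance measure.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_cluster(conformer: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to one conformer's descriptor vector."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda i: sq_dist(conformer, centroids[i]))


def molecule_bit_string(conformers: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Set bit i if at least one conformer of the molecule falls into cluster i."""
    bits = [0] * len(centroids)
    for conformer in conformers:
        bits[nearest_cluster(conformer, centroids)] = 1
    return bits


if __name__ == "__main__":
    # Toy 2D descriptor vectors; real pharmacophore descriptors would be much longer.
    centroids = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]
    conformers = [[1.0, 0.0], [19.0, 21.0]]
    print(molecule_bit_string(conformers, centroids))
```

The resulting fixed-length vector has one bit per cluster regardless of how many conformers a molecule has, which is what lets it feed any standard learner such as Random Forest.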


Subject(s)
Quantitative Structure-Activity Relationship , Molecular Conformation
5.
Sci Rep ; 11(1): 3178, 2021 02 04.
Article in English | MEDLINE | ID: mdl-33542271

ABSTRACT

The "creativity" of Artificial Intelligence (AI) in terms of generating de novo molecular structures opened a novel paradigm in compound design, weaknesses (stability & feasibility issues of such structures) notwithstanding. Here we show that "creative" AI may be as successfully taught to enumerate novel chemical reactions that are stoichiometrically coherent. Furthermore, when coupled to reaction space cartography, de novo reaction design may be focused on the desired reaction class. A sequence-to-sequence autoencoder with bidirectional Long Short-Term Memory layers was trained on on-purpose developed "SMILES/CGR" strings, encoding reactions of the USPTO database. The autoencoder latent space was visualized on a generative topographic map. Novel latent space points were sampled around a map area populated by Suzuki reactions and decoded to corresponding reactions. These can be critically analyzed by the expert, cleaned of irrelevant functional groups and eventually experimentally attempted, herewith enlarging the synthetic purpose of popular synthetic pathways.

6.
Expert Opin Drug Discov ; 16(9): 929-931, 2021 09.
Article in English | MEDLINE | ID: mdl-33605818
7.
Int J Mol Sci ; 21(15)2020 Aug 03.
Article in English | MEDLINE | ID: mdl-32756326

ABSTRACT

Nowadays, the problem of the model's applicability domain (AD) definition is an active research topic in chemoinformatics. Although many AD definitions for the models predicting properties of molecules (Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) models) were described in the literature, none for chemical reactions (Quantitative Reaction-Property Relationships (QRPR)) has been reported to date. The point is that a chemical reaction is a much more complex object than an individual molecule, and its yield, thermodynamic and kinetic characteristics depend not only on the structures of reactants and products but also on experimental conditions. The QRPR models' performance largely depends on the way that chemical transformation is encoded. In this study, various AD definition methods extensively used in QSAR/QSPR studies of individual molecules, as well as several novel approaches suggested in this work for reactions, were benchmarked on several reaction datasets. The ability to exclude wrong reaction types, increase coverage, improve the model performance and detect Y-outliers was tested. As a result, several "best" AD definitions for the QRPR models predicting reaction characteristics have been revealed and tested on a previously published external dataset with a clear AD definition problem.


Subject(s)
Cheminformatics/trends , Protein Domains , Quantitative Structure-Activity Relationship , Thermodynamics , Chemical Phenomena , Kinetics , Models, Molecular
8.
Chem Soc Rev ; 49(11): 3525-3564, 2020 06 07.
Article in English | MEDLINE | ID: mdl-32356548

ABSTRACT

Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure-activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modelling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.


Subject(s)
Chemistry, Pharmaceutical/methods , Drug-Related Side Effects and Adverse Reactions/metabolism , Pharmaceutical Preparations/chemistry , Algorithms , Animals , Artificial Intelligence , Databases, Factual , Drug Design , History, 20th Century , History, 21st Century , Humans , Models, Molecular , Quantitative Structure-Activity Relationship , Quantum Theory , Reproducibility of Results
9.
10.
Mol Inform ; 39(12): e2000009, 2020 12.
Article in English | MEDLINE | ID: mdl-32347666

ABSTRACT

Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame set (FS) of compounds used for the manifold construction must well cover a given chemical space. Intuitively, the FS size must rise with the size and diversity of the target library. At the same time, the GTM training can be very slow or even become technically impossible at FS sizes of the order of $10^5$ compounds, which is a very small number compared to today's commercially accessible compounds, and, especially, to the theoretically feasible molecules. In order to solve this problem, we propose a Parallel GTM algorithm based on the merging of "intermediate" manifolds constructed in parallel for different subsets of molecules. An ensemble of these subsets forms a FS for the "final" manifold. In order to assess the efficiency of the new algorithm, 80 GTMs were built on the FSs of different sizes ranging from 10 to 1.8 M compounds selected from the ChEMBL database. Each GTM was challenged to build classification models for up to 712 biological activities (depending on the FS size). With the novel parallel GTM procedure, we could thus cover the entire spectrum of possible FS sizes, whereas previous studies were forced to rely on the working hypothesis that FS sizes of a few thousand compounds are sufficient to describe the ChEMBL chemical space. In fact, this study formally proves this to be true: a FS containing only 5000 randomly picked compounds is sufficient to represent the entire ChEMBL collection (1.8 M molecules), in the sense that a further increase of FS compound numbers has no beneficial impact on the predictive propensity of the above-mentioned 712 activity classification models.
Parallel GTM may, however, be required to generate maps based on very large FS, which might improve chemical space cartography of big commercial and virtual libraries, approaching billions of compounds.


Subject(s)
Algorithms , Big Data , Benchmarking , Databases, Chemical , Entropy
11.
Expert Opin Drug Discov ; 15(7): 755-764, 2020 07.
Article in English | MEDLINE | ID: mdl-32228116

ABSTRACT

INTRODUCTION: Deep discriminative and generative neural-network models are becoming an integral part of the modern approach to ligand-based novel drug discovery. The variety of different architectures of neural networks, the methods of their training, and the procedures of generating new molecules require expert knowledge to choose the most suitable approach. AREAS COVERED: Three different approaches to deep learning use in ligand-based drug discovery are considered: virtual screening, neural generative models, and mutation-based structure generation. Several architectures of neural networks for building either discriminative or generative models are considered in this paper, including deep multilayer neural networks, different kinds of convolutional neural networks, recurrent neural networks, and several types of autoencoders. Several kinds of learning frameworks are also considered, including adversarial learning and reinforcement learning. Different types of representations for generating molecules, including SMILES, graphs, and several alternative string representations are also considered. EXPERT OPINION: Two kinds of problem should be solved in order to make the models built using deep neural networks, especially generative models, a valuable option in ligand-based drug discovery: the issue of interpretability and explainability of deep-learning models and the issue of synthetic accessibility of novel compounds designed by deep-learning algorithms.


Subject(s)
Deep Learning , Drug Discovery/methods , Neural Networks, Computer , Algorithms , Drug Design , Humans , Ligands
12.
Mol Inform ; 39(6): e1900170, 2020 06.
Article in English | MEDLINE | ID: mdl-32090493

ABSTRACT

Generative Topographic Mapping (GTM) is a dimensionality reduction method, which is widely used for both data visualization and structure-activity modeling. Large dimensionality of the initial data space may require significant computational resources and slow down the GTM construction. Therefore, it may be meaningful to reduce the number of descriptors used for encoding molecular structures. The Principal Component Analysis (PCA), a standard preprocessing tool, suffers from information loss upon dimensionality reduction. As an alternative, we propose to use substructure vector embedding provided by the mol2vec technique. In addition to the data dimensionality reduction, this technology also accounts for proximity of substructures in molecular graphs. In this study, the dimensionality of large descriptor spaces of ISIDA fragment descriptors or Morgan fingerprints was reduced using either the PCA or the mol2vec method. The latter significantly speeds up GTM training without compromising its predictive power in bioactivity classification tasks.


Subject(s)
Algorithms , Data Analysis , Data Visualization , Principal Component Analysis
13.
J Chem Inf Model ; 59(11): 4569-4576, 2019 11 25.
Article in English | MEDLINE | ID: mdl-31638794

ABSTRACT

Here, we describe a concept of conjugated models for several properties (activities) linked by a strict mathematical relationship. This relationship can be directly integrated analytically into the ridge regression (RR) algorithm or accounted for in a special case of "twin" neural networks (NN). Developed approaches were applied to the modeling of the logarithm of the prototropic tautomeric constant (logKT) which can be expressed as the difference between the acidity constants (pKa) of two related tautomers. Both conjugated and individual RR and NN models for logKT and pKa were developed. The modeling set included 639 tautomeric constants and 2371 acidity constants of organic molecules in various solvents. A descriptor vector for each reaction resulted from the concatenation of structural descriptors and some parameters for reaction conditions. For the former, atom-centered substructural fragments describing acid sites in tautomer molecules were used. The latter were automatically identified using the condensed graph of reaction approach. Conjugated models performed similarly to the best individual models for logKT and pKa. At the same time, the physically grounded relationship between logKT and pKa was respected only for conjugated but not individual models.
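
The constraint that conjugated models enforce can be shown with a small sketch. It assumes the sign convention logKT = pKa(first tautomer) - pKa(second tautomer), which the abstract does not fix, and the function names and numeric values are hypothetical. It illustrates the relationship only, not the ridge regression or "twin" neural network implementations of the paper.

```python
def log_kt_from_pka(pka_first: float, pka_second: float) -> float:
    """logKT implied by two tautomer pKa values (sign convention is an assumption)."""
    return pka_first - pka_second


def constraint_violation(log_kt_pred: float, pka_first_pred: float, pka_second_pred: float) -> float:
    """Absolute gap between a directly predicted logKT and the pKa-implied value."""
    return abs(log_kt_pred - log_kt_from_pka(pka_first_pred, pka_second_pred))


if __name__ == "__main__":
    # Hypothetical predictions from three independently trained models.
    print(constraint_violation(log_kt_pred=2.0, pka_first_pred=9.5, pka_second_pred=7.0))
    # A conjugated scheme derives logKT from the pKa outputs, so the gap is zero.
    print(constraint_violation(log_kt_from_pka(9.5, 7.0), 9.5, 7.0))
```

Independently trained models have no reason to make this gap vanish, which is the inconsistency the abstract reports for individual models.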


Subject(s)
Organic Chemicals/chemistry , Pharmaceutical Preparations/chemistry , Acids/chemistry , Algorithms , Drug Discovery , Models, Chemical , Molecular Structure , Neural Networks, Computer , Quantitative Structure-Activity Relationship , Solvents/chemistry , Stereoisomerism
14.
Future Med Chem ; 11(20): 2701-2713, 2019 10.
Article in English | MEDLINE | ID: mdl-31596146

ABSTRACT

The analysis of information on the spatial structure of molecules and the physical fields of their interactions with biological targets is extremely important for solving various problems in drug discovery. This mini-review article surveys the main features of the continuous molecular fields approach and its use for analyzing structure-activity relationships in 3D space, building 3D quantitative structure-activity models and conducting similarity based virtual screening. Particular attention is paid to the consideration of the concept of molecular co-fields and their use for the interpretation of 3D structure-activity models. The principles of molecular design based on the overlapping and the similarity of molecular fields with corresponding co-fields are formulated.


Subject(s)
Molecular Structure , Hydrogen Bonding , Hydrophobic and Hydrophilic Interactions , Models, Molecular , Structure-Activity Relationship
16.
J Chem Inf Model ; 59(3): 1182-1196, 2019 03 25.
Article in English | MEDLINE | ID: mdl-30785751

ABSTRACT

Here we show that Generative Topographic Mapping (GTM) can be used to explore the latent space of the SMILES-based autoencoders and generate focused molecular libraries of interest. We have built a sequence-to-sequence neural network with Bidirectional Long Short-Term Memory layers and trained it on the SMILES strings from ChEMBL23. Very high reconstruction rates of the test set molecules were achieved (>98%), which are comparable to the ones reported in related publications. Using GTM, we have visualized the autoencoder latent space on the two-dimensional topographic map. Targeted map zones can be used for generating novel molecular structures by sampling associated latent space points and decoding them to SMILES. The sampling method based on a genetic algorithm was introduced to optimize compound properties "on the fly". The generated focused molecular libraries were shown to contain original and a priori feasible compounds which, pending actual synthesis and testing, showed encouraging behavior in independent structure-based affinity estimation procedures (pharmacophore matching, docking).


Subject(s)
Deep Learning , Drug Design , Catalytic Domain , Drug Evaluation, Preclinical , Ligands , Molecular Docking Simulation , Receptor, Adenosine A2A/chemistry , Receptor, Adenosine A2A/metabolism , Small Molecule Libraries/metabolism , Small Molecule Libraries/pharmacology
17.
Mol Inform ; 37(9-10): e1800056, 2018 09.
Article in English | MEDLINE | ID: mdl-30039933

ABSTRACT

The Generative Topographic Mapping (GTM) approach was successfully used to visualize, analyze and model the equilibrium constants (KT) of tautomeric transformations as a function of both structure and experimental conditions. The modeling set contained 695 entries corresponding to 350 unique transformations of 10 tautomeric types, for which KT values were measured in different solvents and at different temperatures. Two types of GTM-based classification models were trained: first, a "structural" approach focused on separating tautomeric classes, irrespective of reaction conditions, then a "general" approach accounting for both structure and conditions. In both cases, the cross-validated Balanced Accuracy was close to 1 and the clusters, assembling equilibria of particular classes, were well separated in the 2-dimensional GTM latent space. Data points corresponding to similar transformations measured under different experimental conditions are well separated on the maps. Additionally, GTM-driven regression models were found to have their predictive performance dependent on different scenarios of the selection of local fragment descriptors involving special marked atoms (proton donors or acceptors). The application of local descriptors significantly improves the model performance in 5-fold cross-validation: RMSE=0.63 and 0.82 logKT units with and without local descriptors, respectively. This trend was also observed for SVR calculations, performed for comparison purposes.


Subject(s)
Algorithms , Molecular Dynamics Simulation , Organic Chemicals/chemistry , Isomerism , Solvents/chemistry
18.
Methods Mol Biol ; 1800: 119-139, 2018.
Article in English | MEDLINE | ID: mdl-29934890

ABSTRACT

Various methods of machine learning, supervised and unsupervised, linear and nonlinear, classification and regression, in combination with various types of molecular descriptors, both "handcrafted" and "data-driven," are considered in the context of their use in computational toxicology. The use of multiple linear regression, variants of naïve Bayes classifier, k-nearest neighbors, support vector machine, decision trees, ensemble learning, random forest, several types of neural networks, and deep learning is the focus of attention of this review. The role of fragment descriptors, graph mining, and graph kernels is highlighted. The application of unsupervised methods, such as Kohonen's self-organizing maps and related approaches, which allow for combining predictions with data analysis and visualization, is also considered. The necessity of applying a wide range of machine learning methods in computational toxicology is underlined.


Subject(s)
Computer Simulation , Machine Learning , Toxicology/methods , Algorithms , Deep Learning , Linear Models , Neural Networks, Computer , Quantitative Structure-Activity Relationship , Support Vector Machine
19.
J Comput Aided Mol Des ; 31(8): 701-714, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28688089

ABSTRACT

Generative topographic mapping (GTM) approach is used to visualize the chemical space of organic molecules (L) with respect to binding a wide range of 41 different metal cations (M) and also to build predictive models for stability constants (logK) of 1:1 (M:L) complexes using "density maps," "activity landscapes," and "selectivity landscapes" techniques. A two-dimensional map describing the entire set of 2962 metal binders reveals the selectivity and promiscuity zones with respect to individual metals or groups of metals with similar chemical properties (lanthanides, transition metals, etc). The GTM-based global (for entire set) and local (for selected subsets) models demonstrate a good predictive performance in the cross-validation procedure. It is also shown that the data likelihood could be used as a definition of the applicability domain of GTM-based models. Thus, the GTM approach represents an efficient tool for the predictive cartography of metal binders, which can both visualize their chemical space and predict the affinity profile of metals for new ligands.


Subject(s)
Chelating Agents/chemistry , Coordination Complexes/chemistry , Metals/chemistry , Algorithms , Computer Simulation , Ligands , Likelihood Functions , Molecular Structure , Structure-Activity Relationship , Thermodynamics
20.
Mol Inform ; 36(11)2017 11.
Article in English | MEDLINE | ID: mdl-28627811

ABSTRACT

In Energy-Based Neural Networks (EBNNs), relationships between variables are captured by means of a scalar function conventionally called "energy". In this article, we introduce a procedure of "harmony search", which looks for compounds providing the lowest energies for the EBNNs trained on active compounds. It can be considered as a special kind of similarity search that takes into account regularities in the structures of active compounds. In this paper, we show that harmony search can be used for performing virtual screening. The performance of the harmony search based on two types of EBNNs, the Hopfield Networks (HNs) and the Restricted Boltzmann Machines (RBMs), was compared with the performance of the similarity search based on Tanimoto coefficient with "data fusion". The AUC measure for ROC curves and 1 %-enrichment rates for 20 targets were used in the benchmarking. Five different scores were computed: the energy for HNs, the free energy and the reconstruction error for RBMs, the mean and the maximum values of Tanimoto coefficients. The performance of the harmony search was shown to be comparable or even superior (significantly for several targets) to the performance of the similarity search. Important advantages of using the harmony search for virtual screening are very high computational efficiency of prediction, the ability to reveal and take into account regularities in active structures, flexibility and interpretability of models, etc.
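
The similarity-search baseline mentioned in this abstract can be sketched as below. The sketch assumes fingerprints are represented as sets of "on" bit indices and that "data fusion" means aggregating a query's Tanimoto coefficients over all reference actives by mean and by maximum, matching the two Tanimoto-based scores the abstract lists. The function names are hypothetical, and the energy-based harmony search itself is not reproduced here.

```python
from typing import Iterable, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fused_scores(query: Set[int], actives: Iterable[Set[int]]) -> Tuple[float, float]:
    """Return (mean, max) Tanimoto similarity of a query to the reference actives."""
    sims = [tanimoto(query, ref) for ref in actives]
    if not sims:
        raise ValueError("at least one reference active is required")
    return sum(sims) / len(sims), max(sims)


if __name__ == "__main__":
    # Toy fingerprints; real ones would come from a cheminformatics toolkit.
    actives = [{1, 2, 3}, {2, 3, 4}]
    print(fused_scores({1, 2, 3}, actives))
```

Ranking a screening library by either fused score and reading off the top 1 % gives the kind of enrichment figure the abstract uses for benchmarking.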


Subject(s)
Neural Networks, Computer , Algorithms