Results 1 - 14 of 14
1.
J Cheminform ; 15(1): 11, 2023 Feb 02.
Article in English | MEDLINE | ID: mdl-36732800

ABSTRACT

In this work, we provide further development of the junction tree variational autoencoder (JT VAE) architecture in terms of implementation and application of the internal feature space of the model. Pretraining of JT VAE on a large dataset and further optimization with a regression model led to a latent space that can solve several tasks simultaneously: prediction, generation, and optimization. We use the ZINC database as a source of molecules for the JT VAE pretraining and the QM9 dataset with its HOMO values to show the application case. We evaluate our model on multiple tasks such as property (value) prediction, generation of new molecules with predefined properties, and structure modification toward the property. Across these tasks, our model shows improvements in generation and optimization tasks while preserving the precision of state-of-the-art models.

2.
Angew Chem Int Ed Engl ; 62(11): e202218659, 2023 Mar 06.
Article in English | MEDLINE | ID: mdl-36688354

ABSTRACT

Catalyst optimization processes typically rely on inductive and qualitative assumptions of chemists based on screening data. While machine learning models using molecular properties or calculated 3D structures enable quantitative data evaluation, costly quantum chemical calculations are often required. In contrast, readily available binary fingerprint descriptors are time- and cost-efficient, but their predictive performance remains insufficient. Here, we describe a machine learning model based on fragment descriptors, which are fine-tuned for asymmetric catalysis and represent cyclic or polyaromatic hydrocarbons, enabling robust and efficient virtual screening. Using training data with only moderate selectivities, we designed theoretically and validated experimentally new catalysts showing higher selectivities in a challenging asymmetric tetrahydropyran synthesis.

3.
J Chem Inf Model ; 62(9): 2015-2020, 2022 05 09.
Article in English | MEDLINE | ID: mdl-34843251

ABSTRACT

This work introduces CGRdb2.0, an open-source database management system for molecules, reactions, and chemical data. CGRdb2.0 is a Python package connecting to a PostgreSQL database that enables native searches for molecules and reactions without complicated SQL syntax. The library provides out-of-the-box implementations for similarity and substructure searches for molecules, as well as similarity and substructure searches for reactions in two ways: based on reaction components and based on the Condensed Graph of Reaction approach, the latter significantly accelerating the performance. In benchmarking studies with the RDKit database cartridge, we demonstrate that CGRdb2.0 performs searches faster for smaller data sets, while allowing for interactive access to the retrieved data.


Subject(s)
Benchmarking , Database Management Systems , Databases, Factual
4.
Mol Inform ; 41(4): e2100138, 2022 04.
Article in English | MEDLINE | ID: mdl-34726834

ABSTRACT

In this paper, we compare the most popular Atom-to-Atom Mapping (AAM) tools: ChemAxon,[1] Indigo,[2] RDTool,[3] NameRXN (NextMove),[4] and RXNMapper,[5] which implement different AAM algorithms. An open-source RDTool program was optimized, and its modified version ("new RDTool") was considered together with several consensus mapping strategies. The Condensed Graph of Reaction approach was used to calculate chemical distances and develop the "AAM fixer" algorithm for automated correction of erroneous mapping. The benchmarking calculations were performed on a Golden dataset containing 1851 manually mapped and curated reactions. The best performing RXNMapper program together with the AAM fixer was applied to map the USPTO database. The Golden dataset, mapped USPTO and optimized RDTool are available in the GitHub repository https://github.com/Laboratoire-de-Chemoinformatique.


Subject(s)
Benchmarking , Biochemical Phenomena , Algorithms , Databases, Factual
5.
Mol Inform ; 40(12): e2100119, 2021 12.
Article in English | MEDLINE | ID: mdl-34427989

ABSTRACT

The quality of experimental data for chemical reactions is a critical consideration for any reaction-driven study. However, the curation of reaction data has not been extensively discussed in the literature so far. Here, we suggest a four-step protocol that includes the curation of individual structures (reactants and products), chemical transformations, reaction conditions and endpoints. Its implementation in Python 3 using the CGRtools toolkit has been used to clean three popular reaction databases: Reaxys, USPTO and Pistachio. The curated USPTO database is available in the GitHub repository (Laboratoire-de-Chemoinformatique/Reaction_Data_Cleaning).


Subject(s)
Data Curation , Databases, Factual , Reference Standards
6.
Sci Rep ; 11(1): 3178, 2021 02 04.
Article in English | MEDLINE | ID: mdl-33542271

ABSTRACT

The "creativity" of Artificial Intelligence (AI) in terms of generating de novo molecular structures opened a novel paradigm in compound design, weaknesses (stability & feasibility issues of such structures) notwithstanding. Here we show that "creative" AI may just as successfully be taught to enumerate novel chemical reactions that are stoichiometrically coherent. Furthermore, when coupled to reaction space cartography, de novo reaction design may be focused on the desired reaction class. A sequence-to-sequence autoencoder with bidirectional Long Short-Term Memory layers was trained on purpose-built "SMILES/CGR" strings, encoding reactions of the USPTO database. The autoencoder latent space was visualized on a generative topographic map. Novel latent space points were sampled around a map area populated by Suzuki reactions and decoded to corresponding reactions. These can be critically analyzed by the expert, cleaned of irrelevant functional groups and eventually experimentally attempted, thereby enlarging the synthetic scope of popular synthetic pathways.

7.
J Chem Inf Model ; 61(2): 554-559, 2021 02 22.
Article in English | MEDLINE | ID: mdl-33502186

ABSTRACT

Presently, quantum chemical calculations are widely used to generate extensive data sets for machine learning applications; however, these sets generally include information only on equilibrium structures and some close conformers. Exploration of potential energy surfaces provides important information on ground and transition states, but analysis of such data is complicated by the number of possible reaction pathways. Here, we present RePathDB, a database system for managing 3D structural data for both ground and transition states resulting from quantum chemical calculations. Our tool allows one to store, assemble, and analyze reaction pathway data. It combines the relational database CGR DB, which handles compounds and reactions as molecular graphs, with a graph database architecture for pathway analysis by graph algorithms. The original condensed graph of reaction technology is used to store any chemical reaction as a single graph.


Subject(s)
Algorithms , Database Management Systems , Databases, Factual
8.
Int J Mol Sci ; 21(15), 2020 Aug 03.
Article in English | MEDLINE | ID: mdl-32756326

ABSTRACT

The definition of a model's applicability domain (AD) is an active research topic in chemoinformatics. Although many AD definitions for models predicting properties of molecules (Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) models) have been described in the literature, none for chemical reactions (Quantitative Reaction-Property Relationships (QRPR)) has been reported to date. The point is that a chemical reaction is a much more complex object than an individual molecule, and its yield, thermodynamic and kinetic characteristics depend not only on the structures of reactants and products but also on experimental conditions. The QRPR models' performance largely depends on the way that the chemical transformation is encoded. In this study, various AD definition methods extensively used in QSAR/QSPR studies of individual molecules, as well as several novel approaches suggested in this work for reactions, were benchmarked on several reaction datasets. The ability to exclude wrong reaction types, increase coverage, improve the model performance and detect Y-outliers was tested. As a result, several "best" AD definitions for the QRPR models predicting reaction characteristics have been revealed and tested on a previously published external dataset with a clear AD definition problem.


Subject(s)
Cheminformatics/trends , Protein Domains , Quantitative Structure-Activity Relationship , Thermodynamics , Chemical Phenomena , Kinetics , Models, Molecular
9.
J Chem Inf Model ; 59(11): 4569-4576, 2019 11 25.
Article in English | MEDLINE | ID: mdl-31638794

ABSTRACT

Here, we describe a concept of conjugated models for several properties (activities) linked by a strict mathematical relationship. This relationship can be directly integrated analytically into the ridge regression (RR) algorithm or accounted for in a special case of "twin" neural networks (NN). Developed approaches were applied to the modeling of the logarithm of the prototropic tautomeric constant (logKT) which can be expressed as the difference between the acidity constants (pKa) of two related tautomers. Both conjugated and individual RR and NN models for logKT and pKa were developed. The modeling set included 639 tautomeric constants and 2371 acidity constants of organic molecules in various solvents. A descriptor vector for each reaction resulted from the concatenation of structural descriptors and some parameters for reaction conditions. For the former, atom-centered substructural fragments describing acid sites in tautomer molecules were used. The latter were automatically identified using the condensed graph of reaction approach. Conjugated models performed similarly to the best individual models for logKT and pKa. At the same time, the physically grounded relationship between logKT and pKa was respected only for conjugated but not individual models.
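
The idea of building the logKT = pKa difference directly into ridge regression can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the shared weight vector with a single bias, and the choice of which tautomer is subtracted from which are all assumptions made for illustration (the sign depends on how KT is defined). Both properties are fitted with one weight vector, so the bias cancels in the difference and the relationship holds by construction.

```python
import numpy as np

def fit_conjugated_ridge(x_pka, y_pka, x_taut_a, x_taut_b, y_logkt, alpha=1.0):
    """Fit one weight vector w and bias b shared by both properties.

    Assumed model (illustrative only):
        pKa   = w . x + b
        logKT = pKa(a) - pKa(b) = w . (x_a - x_b)   # the bias cancels
    """
    n_pka = x_pka.shape[0]
    n_kt = x_taut_a.shape[0]
    # pKa rows carry a bias column of ones; logKT rows use descriptor
    # differences with a zero in the bias column.
    rows_pka = np.hstack([x_pka, np.ones((n_pka, 1))])
    rows_kt = np.hstack([x_taut_a - x_taut_b, np.zeros((n_kt, 1))])
    design = np.vstack([rows_pka, rows_kt])
    target = np.concatenate([y_pka, y_logkt])
    # Closed-form ridge solution; the bias term is left unpenalised.
    penalty = alpha * np.eye(design.shape[1])
    penalty[-1, -1] = 0.0
    coef = np.linalg.solve(design.T @ design + penalty, design.T @ target)
    return coef[:-1], coef[-1]

def predict_pka(x, w, b):
    """Predict pKa for each row of the descriptor matrix x."""
    return x @ w + b

def predict_logkt(x_a, x_b, w, b):
    """Predict logKT as the difference of the two predicted pKa values."""
    return predict_pka(x_a, w, b) - predict_pka(x_b, w, b)
```

Because both predictions come from the same coefficients, a predicted logKT always equals the difference of the two predicted pKa values, which two separately trained models cannot guarantee.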


Subject(s)
Organic Chemicals/chemistry , Pharmaceutical Preparations/chemistry , Acids/chemistry , Algorithms , Drug Discovery , Models, Chemical , Molecular Structure , Neural Networks, Computer , Quantitative Structure-Activity Relationship , Solvents/chemistry , Stereoisomerism
10.
J Chem Inf Model ; 59(6): 2516-2521, 2019 06 24.
Article in English | MEDLINE | ID: mdl-31063394

ABSTRACT

CGRtools is an open-source Python library for handling molecular and reaction information. It is the only library developed so far that can handle condensed graphs of reaction (CGR). CGR provides the possibility for advanced operations with reaction information and could be used for reaction descriptor calculation, structure-reactivity modeling, atom-to-atom mapping comparison and correction, reaction center extraction, reaction balancing, and some other related tasks. Unlike other popular libraries, CGRtools is written entirely in Python, has only minor dependencies on other libraries, and is cross-platform. Reaction, molecule, and CGR objects in CGRtools support native Python methods and are comparable with the help of operations "equal to", "less than", and "bigger than". CGRtools supports common structural formats. CGRtools is distributed under an LGPL license and available at https://github.com/cimm-kzn/CGRtools .
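
To make the CGR idea concrete, the following toy sketch merges the bonds of reactants and products into a single graph and then extracts the reaction center. It uses plain Python dictionaries and deliberately does not use the CGRtools API; the function names, the bond-table layout and the atom-map numbers are all hypothetical.

```python
def build_cgr(reactant_bonds, product_bonds):
    """Merge two bond tables into one condensed graph of reaction.

    Each table maps a frozenset of two atom-map numbers to a bond order.
    The result maps every bond to (order_before, order_after); None means
    the bond is absent on that side of the reaction.
    """
    cgr = {}
    for bond in set(reactant_bonds) | set(product_bonds):
        cgr[bond] = (reactant_bonds.get(bond), product_bonds.get(bond))
    return cgr

def dynamic_bonds(cgr):
    """Return only the bonds whose order changes (the reaction center)."""
    return {bond: orders for bond, orders in cgr.items() if orders[0] != orders[1]}

# Toy substitution with hypothetical atom-map numbers:
# 1 = carbon, 2 = leaving group, 3 = nucleophile, 4 = hydrogen on carbon
reactants = {frozenset({1, 2}): 1, frozenset({1, 4}): 1}
products = {frozenset({1, 3}): 1, frozenset({1, 4}): 1}

cgr = build_cgr(reactants, products)
center = dynamic_bonds(cgr)  # the broken 1-2 bond and the formed 1-3 bond
```

Unchanged bonds keep identical before and after orders, so comparing two atom-to-atom mappings of the same reaction reduces to comparing the resulting graphs.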


Subject(s)
Cheminformatics/methods , Small Molecule Libraries/chemistry , Software , Chemical Phenomena , Models, Chemical
11.
Mol Inform ; 38(1-2): e1800077, 2019 01.
Article in English | MEDLINE | ID: mdl-30134047

ABSTRACT

This paper reports SVR (Support Vector Regression) and GTM (Generative Topographic Mapping) modeling of three kinetic properties of cycloaddition reactions: rate constant (logk), activation energy (Ea) and pre-exponential factor (logA). A data set of 1849 reactions, comprising (4+2), (3+2) and (2+2) cycloadditions (CA), was studied in different solvents and at different temperatures. The reactions were encoded by the ISIDA fragment descriptors generated for the Condensed Graph of Reaction (CGR). For a given reaction, a CGR condenses the structures of all the reactants and products into one single molecular graph, described both by conventional chemical bonds and "dynamical" bonds characterizing chemical transformations. Different scenarios of logk assessment were explored: direct modeling, application of the Arrhenius equation and temperature-scaled GTM landscapes. The logk models with optimal cross-validated statistics (Q² = 0.78-0.94, RMSE = 0.45-0.86) have been challenged to predict rates for the external test set of 200 reactions, comprising both reactions that were not present in the training set, and training set transformations performed under different reaction conditions. The models are freely available on our web-server: http://cimm.kpfu.ru/models.
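
The Arrhenius scenario can be written out as a short function: given logA and Ea, logk at a temperature T follows from log10 k = log10 A - Ea / (ln(10) R T). This is a sketch of the textbook equation only. The function name and the use of kJ/mol for Ea are assumptions, since the abstract does not state the units used in the models.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)

def arrhenius_logk(log_a, ea_kj_per_mol, temperature_k):
    """Return log10(k) from log10(A), Ea (assumed kJ/mol) and T (kelvin)."""
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    ea_j_per_mol = ea_kj_per_mol * 1000.0
    return log_a - ea_j_per_mol / (math.log(10) * R * temperature_k)
```

With a positive activation energy the returned logk rises with temperature, and with Ea = 0 it reduces to logA.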


Subject(s)
Cycloaddition Reaction/methods , Models, Chemical , Kinetics
12.
Mol Inform ; 38(4): e1800104, 2019 04.
Article in English | MEDLINE | ID: mdl-30468317

ABSTRACT

Here, we report the data visualization, analysis and modeling for a large set of 4830 SN2 reactions whose rate constants (logk) were measured under different experimental conditions (solvent, temperature). The reactions were encoded by a single molecular graph, the Condensed Graph of Reaction, which allowed us to use conventional chemoinformatics techniques developed for individual molecules. Thus, the Matched Reaction Pairs approach was suggested and used to analyze substituent effects on substrate and nucleophile reactivity. The data were visualized with the help of the Generative Topographic Mapping approach. A consensus Support Vector Regression (SVR) model for the rate constant was prepared. Unbiased estimation of the model's performance was made in cross-validation on reactions measured on unique structural transformations. The model's performance in cross-validation (RMSE=0.61 logk units) and on the external test set (RMSE=0.80) is close to the noise in the data. Performances of the local models obtained for selected subsets of reactions proceeding in particular solvents or with particular types of nucleophiles were similar to that of the model built on the entire set. Finally, four different definitions of the model's applicability domain for reactions were examined.


Subject(s)
Models, Chemical , Support Vector Machine , Hydrocarbons, Cyclic/chemistry , Kinetics , Oxidation-Reduction
13.
Mol Inform ; 37(9-10): e1800056, 2018 09.
Article in English | MEDLINE | ID: mdl-30039933

ABSTRACT

The Generative Topographic Mapping (GTM) approach was successfully used to visualize, analyze and model the equilibrium constants (KT) of tautomeric transformations as a function of both structure and experimental conditions. The modeling set contained 695 entries corresponding to 350 unique transformations of 10 tautomeric types, for which KT values were measured in different solvents and at different temperatures. Two types of GTM-based classification models were trained: first, a "structural" approach focused on separating tautomeric classes, irrespective of reaction conditions, then a "general" approach accounting for both structure and conditions. In both cases, the cross-validated Balanced Accuracy was close to 1 and the clusters, assembling equilibria of particular classes, were well separated in 2-dimensional GTM latent space. Data points corresponding to similar transformations measured under different experimental conditions are well separated on the maps. Additionally, the predictive performance of GTM-driven regression models was found to depend on how local fragment descriptors involving specially marked atoms (proton donors or acceptors) were selected. The application of local descriptors significantly improves the model performance in 5-fold cross-validation: RMSE=0.63 and 0.82 logKT units with and without local descriptors, respectively. This trend was also observed for SVR calculations performed for comparison purposes.


Subject(s)
Algorithms , Molecular Dynamics Simulation , Organic Chemicals/chemistry , Isomerism , Solvents/chemistry
14.
J Comput Aided Mol Des ; 31(9): 829-839, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28752345

ABSTRACT

We describe a novel approach of reaction representation as a combination of two mixtures: a mixture of reactants and a mixture of products. In turn, each mixture can be encoded using an earlier reported approach involving simplex descriptors (SiRMS). The feature vector representing these two mixtures results from either concatenated product and reactant descriptors or the difference between descriptors of products and reactants. This reaction representation doesn't need an explicit labeling of a reaction center. A rigorous "product-out" cross-validation (CV) strategy has been suggested. Unlike the naïve "reaction-out" CV approach based on a random selection of items, the proposed one provides a more realistic estimate of prediction accuracy for reactions resulting in novel products. The new methodology has been applied to model rate constants of E2 reactions. It has been demonstrated that the use of the fragment control applicability domain approach significantly increases the prediction accuracy of the models. The models obtained with the new "mixture" approach performed better than those requiring either explicit (Condensed Graph of Reaction) or implicit (reaction fingerprints) reaction center labeling.
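
One plausible reading of the "product-out" strategy is that all reactions leading to the same product are kept in the same fold, so every test fold contains only products unseen during training. The sketch below implements that reading in plain Python. The function name, the product identifiers and the greedy balancing of fold sizes are assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict

def product_out_folds(products, n_folds=5, seed=0):
    """Split reaction indices into folds that never share a product.

    products: one hashable product identifier per reaction
              (for example a canonical SMILES string).
    Returns a list of n_folds lists of reaction indices.
    """
    by_product = defaultdict(list)
    for index, product in enumerate(products):
        by_product[product].append(index)
    groups = list(by_product.values())
    random.Random(seed).shuffle(groups)
    folds = [[] for _ in range(n_folds)]
    # Place the largest product groups first, each into the smallest fold.
    for group in sorted(groups, key=len, reverse=True):
        min(folds, key=len).extend(group)
    return folds
```

A naive "reaction-out" split would instead shuffle individual reactions, allowing the same product to appear on both sides of the train/test boundary.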


Subject(s)
Models, Molecular , Organic Chemicals/chemistry , Kinetics , Molecular Structure , Quantitative Structure-Activity Relationship