1.
J Cheminform ; 13(1): 76, 2021 Oct 02.
Article in English | MEDLINE | ID: mdl-34600576

ABSTRACT

Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, which should be composed meticulously because the calculations are highly time-consuming. Previously we have seen that the best-known quantum chemical dataset, QM9, lacks chemical diversity. As a consequence, ML models trained on QM9 showed generalizability shortcomings. In this paper we present (i) a fast and generic method to evaluate chemical diversity, (ii) a new quantum chemical dataset of 435k molecules, OD9, that includes QM9 and new molecules generated with a diversity objective, and (iii) an analysis of the impact of diversity on unconstrained and goal-directed molecular generation, using QED optimization as an example. Our approach makes it possible to estimate individually the impact of a solution on the diversity of a set, allowing for effective incremental evaluation. In the first application, we show how the diversity constraint allows us to generate more than a million molecules that would efficiently complete the reference datasets. The compounds were calculated with DFT thanks to a collaborative effort through the QuChemPedIA@home BOINC project. With regard to goal-directed molecular generation, getting a high QED score is not complicated, but adding a little diversity can cut the number of calls to the evaluation function by a factor of ten.
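
The idea of estimating one item's impact on the diversity of a set, and updating that estimate incrementally, can be illustrated with a minimal sketch. It assumes diversity is measured as the number of distinct features covered by the set. This measure, the class name `FeatureDiversity`, and the bond-type strings used as features are all hypothetical choices made for illustration; the abstract does not specify the estimator used in the paper.

```python
from collections import Counter


class FeatureDiversity:
    """Assumed measure: diversity = number of distinct features in the set."""

    def __init__(self):
        # How many items in the set carry each feature.
        self.counts = Counter()

    def score(self):
        """Current diversity of the whole set."""
        return len(self.counts)

    def gain(self, features):
        """Distinct features an item would add, without adding the item."""
        return sum(1 for f in set(features) if f not in self.counts)

    def add(self, features):
        """Add an item; cost depends on the item, not on the set size."""
        self.counts.update(set(features))

    def remove(self, features):
        """Remove a previously added item and forget features nobody holds."""
        for f in set(features):
            self.counts[f] -= 1
            if self.counts[f] == 0:
                del self.counts[f]


# Hypothetical items described by made-up feature labels.
div = FeatureDiversity()
div.add({"C-C", "C=O"})
print(div.gain({"C=O", "C-N"}))  # only "C-N" would be new
```

Because `gain` inspects only the candidate's own features, a generator could rank candidates by their contribution to diversity without recomputing a score over the whole set.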

2.
Anal Chem ; 92(13): 8793-8801, 2020 Jul 07.
Article in English | MEDLINE | ID: mdl-32479074

ABSTRACT

Whether chemists or biologists, researchers dealing with metabolomics require tools to decipher complex mixtures. As a part of metabolomics, and initially dedicated to identifying bioactive natural products, dereplication aims to shorten the usually time-consuming process of isolating known compounds. Mass spectrometry and nuclear magnetic resonance are the most commonly reported analytical tools in dereplication analysis. Though it has low sensitivity, 13C NMR has many advantages for such a study. Notably, it is nonspecific, allowing simultaneous high-resolution analysis of any organic compounds, including stereoisomers. Since NMR spectrometers nowadays provide useful data sets in a reasonable time frame, we have embarked upon writing software dedicated to 13C NMR dereplication. The present study describes the development of a freely distributed algorithm, MixONat, and its ability to help researchers decipher complex mixtures. Based on Python 3.5, MixONat analyses a {1H}-13C NMR spectrum, optionally combined with DEPT-135 and DEPT-90 data to distinguish carbon types (i.e., CH3, CH2, CH, and C), as well as MW filtering. The software requires databases of predicted or experimental carbon chemical shifts (δc) and displays results that can be refined through user interaction. As a proof of concept, this 13C NMR dereplication strategy was evaluated on mixtures of increasing complexity with pharmaceutical (poppy alkaloids), nutritional (rosemary extracts) or cosmetic (mangosteen peel extract) applications. The results were compared with those of other methods commonly used for dereplication. MixONat gave coherent results that rapidly oriented the user toward the correct structural types of secondary metabolites, allowing the user to distinguish between structurally close natural products, including stereoisomers.


Subject(s)
Biological Products/chemistry , Magnetic Resonance Spectroscopy/methods , Software , Algorithms , Alkaloids/chemistry , Carbon Isotopes/chemistry , Databases, Chemical , Garcinia mangostana/chemistry , Garcinia mangostana/metabolism , Papaver/chemistry , Papaver/metabolism , Plant Extracts/chemistry , Rosmarinus/chemistry , Rosmarinus/metabolism
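
The general principle of shift-based dereplication described in this abstract, comparing database carbon shifts against the peaks of a mixture spectrum, can be sketched as follows. This is not MixONat's actual algorithm or API: the function names, the greedy matching strategy, the 1.0 ppm tolerance, and the shift values are all assumptions made for illustration, and the numbers are not real spectra.

```python
def match_fraction(predicted, observed, tol=1.0):
    """Fraction of predicted shifts matched by a distinct observed peak.

    Greedy matching: each predicted shift takes the closest still-unused
    observed peak within `tol` ppm (tolerance value is an assumption).
    """
    remaining = list(observed)
    hits = 0
    for shift in sorted(predicted):
        in_range = [p for p in remaining if abs(p - shift) <= tol]
        if in_range:
            best = min(in_range, key=lambda p: abs(p - shift))
            remaining.remove(best)  # a peak is assigned at most once
            hits += 1
    return hits / len(predicted) if predicted else 0.0


def rank_candidates(database, observed, tol=1.0):
    """Sort candidate compounds by decreasing match fraction, then by name."""
    scored = [(match_fraction(shifts, observed, tol), name)
              for name, shifts in database.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Illustrative numbers only: a hypothetical peak list and two hypothetical
# database entries with predicted 13C shifts in ppm.
observed_peaks = [14.1, 22.7, 31.9, 128.5, 170.2]
database = {
    "compound_A": [14.0, 22.5, 32.0],
    "compound_B": [55.0, 128.0, 200.0],
}
for fraction, name in rank_candidates(database, observed_peaks):
    print(f"{name}: {fraction:.2f}")
```

A fuller treatment would also compare carbon types from DEPT data and filter by molecular weight, as the abstract mentions; those steps are left out of this sketch.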
3.
J Cheminform ; 12(1): 55, 2020 Sep 16.
Article in English | MEDLINE | ID: mdl-33431049

ABSTRACT

The objective of this work is to design a molecular generator capable of exploring known as well as unfamiliar areas of the chemical space. Our method must be flexible enough to adapt to very different problems. Therefore, it has to be able to work with or without the influence of prior data and knowledge. Moreover, regardless of its success, it should be as interpretable as possible to allow for diagnosis and improvement. We propose here a new open source generation method, EvoMol, that uses an evolutionary algorithm to sequentially build molecular graphs. It is independent of starting data and can generate totally unseen compounds. To be able to search a large part of the chemical space, we define an original set of 7 generic mutations close to the atomic level. Our method achieves excellent performance, and even record scores, on QED, penalised logP, SAscore and CLscore, as well as on the set of goal-directed functions defined in GuacaMol. To demonstrate its flexibility, we tackle a very different objective drawn from the domain of organic molecular materials. We show that EvoMol can generate sets of optimised molecules having high-energy HOMO or low-energy LUMO, starting only from methane. We can also set constraints on a synthesizability score and structural features. Finally, the interpretability of EvoMol allows for the visualisation of its exploration process as a chemically relevant tree.
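
The loop structure of an evolutionary generator that starts from a single carbon and applies small atom-level mutations can be sketched as below. This is a toy under stated assumptions, not EvoMol's implementation: individuals are plain tuples of element symbols rather than molecular graphs, only three hypothetical mutations are used instead of the 7 defined in the paper, and `toy_score` is an invented objective standing in for a real property such as QED.

```python
import random

ATOMS = ["C", "N", "O", "F"]  # element set chosen for illustration


def mutate(individual, rng, max_atoms=9):
    """Apply one random atom-level edit: substitute, append or remove."""
    ops = ["substitute"]
    if len(individual) < max_atoms:
        ops.append("append")
    if len(individual) > 1:
        ops.append("remove")
    op = rng.choice(ops)
    atoms = list(individual)
    if op == "append":
        atoms.append(rng.choice(ATOMS))
    elif op == "remove":
        atoms.pop(rng.randrange(len(atoms)))
    else:
        atoms[rng.randrange(len(atoms))] = rng.choice(ATOMS)
    return tuple(atoms)


def toy_score(individual):
    """Hypothetical objective: favour oxygens and a size close to 6 atoms."""
    return individual.count("O") - abs(len(individual) - 6)


def evolve(score, steps=200, pop_size=10, seed=0):
    """Replace the worst individual whenever a mutated child beats it."""
    rng = random.Random(seed)
    population = [("C",)] * pop_size  # every individual starts as one carbon
    best_history = []
    for _ in range(steps):
        child = mutate(rng.choice(population), rng)
        worst = min(range(pop_size), key=lambda i: score(population[i]))
        if score(child) > score(population[worst]):
            population[worst] = child
        best_history.append(max(score(p) for p in population))
    return population, best_history


population, best_history = evolve(toy_score)
print(max(population, key=toy_score), best_history[-1])
```

Because a child only ever replaces the current worst individual, the best score in the population never decreases, and recording each accepted parent-child step would give the kind of exploration tree the abstract mentions.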

4.
J Cheminform ; 11(1): 69, 2019 Nov 12.
Article in English | MEDLINE | ID: mdl-33430991

ABSTRACT

The QM9 dataset has become the gold standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have recently been published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. This article presents PC9, a new QM9-equivalent dataset (only H, C, N, O and F, and up to 9 "heavy" atoms) drawn from the PubChemQC project. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the Neural Network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset.
