1.
J Chem Inf Model ; 64(4): 1107-1111, 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-38346241

ABSTRACT

There has been a growing recognition of the need for diversity and inclusion in scientific fields. This trend is reflected in the Journal of Chemical Information and Modeling (JCIM), where there has been a gradual increase in the number of papers that embrace this diversity. In this viewpoint, we analyze the evolution of the profile of papers published in JCIM from 1996 to 2022 addressing three diversity criteria, namely interdisciplinarity, geographic distribution, and gender distribution, and their impact on citation patterns. We used natural language processing tools for the classification of main areas and gender, as well as metadata, to analyze a total of 7384 articles published in the categories of research articles, reviews, and brief reports. Our analyses reveal that the relative number of articles and citation patterns are similar across the main areas within the scope of JCIM, and that international collaborations and publications encompassing two to three research areas attract more citations. The percentage of female authors has increased from 1996 (less than 20%) to 2022 (more than 32%), indicating a positive trend toward gender diversity in almost all geographic regions, although the percentage of publications by single female authors remains lower than 20%. Most JCIM citations come from Europe and the Americas, with a tendency for JCIM papers to cite articles from the same continent. Furthermore, citation patterns correlate with author gender: JCIM manuscripts authored by females are more likely to be cited by other JCIM manuscripts authored by females.


Subject(s)
Models, Chemical , Natural Language Processing , Female , Humans
2.
J Chem Inf Model ; 63(17): 5446-5456, 2023 Sep 11.
Article in English | MEDLINE | ID: mdl-37625081

ABSTRACT

A key aspect of producing accurate and reliable machine learning models for the prediction of properties of quantum chemistry (QC) data is identifying possible data characteristics that may negatively influence model training. In previous work, we identified that molecules and materials with a low volume of the convex hull (VCH) of atomic positions may be harmful in model training and a source of prediction outliers. In this paper, we extend this analysis further and develop a biased sampling study to evaluate the influence of VCH on the training data of a model using different structures of molecules and materials. Our study confirms that VCH influences model training and shows the importance of using homogeneous geometric characteristics of structures when building new data sets or selecting training sets from larger QC data sets.


Subject(s)
Machine Learning , Molecular Structure
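
The abstract above centers on the volume of the convex hull (VCH) of atomic positions. As a rough illustration only, the sketch below computes that volume with a brute-force facet search in pure Python. The function name `convex_hull_volume` and the tolerance `tol` are assumptions made for this example and are not taken from the paper. The sketch also assumes that no four hull vertices are coplanar unless the whole structure is planar; a production workflow would more likely rely on an established computational-geometry library.

```python
from itertools import combinations


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def convex_hull_volume(points, tol=1e-9):
    """Brute-force convex hull volume of 3D points (illustrative sketch).

    Assumes general position: no four hull vertices share a facet plane,
    except when all points are coplanar (in which case the volume is 0).
    """
    n = len(points)
    if n < 4:
        return 0.0
    # The centroid of the points always lies inside (or on) the hull.
    centroid = tuple(sum(p[k] for p in points) / n for k in range(3))
    volume = 0.0
    for i, j, k in combinations(range(n), 3):
        a, b, c = points[i], points[j], points[k]
        normal = _cross(_sub(b, a), _sub(c, a))
        if _dot(normal, normal) < tol:
            continue  # collinear triple, defines no plane
        sides = [_dot(normal, _sub(p, a)) for p in points]
        # A triple is a hull facet if every point lies on one side of its plane.
        if all(s <= tol for s in sides) or all(s >= -tol for s in sides):
            # Add the tetrahedron spanned by the centroid and this facet.
            volume += abs(_dot(normal, _sub(centroid, a))) / 6.0
    return volume


# Hypothetical coordinates (arbitrary units), not data from the paper.
tetrahedral = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
planar = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(convex_hull_volume(tetrahedral), convex_hull_volume(planar))
```

Planar structures give a VCH of zero and nearly flat ones give small values, which is the kind of low-VCH case the abstract flags as a possible source of prediction outliers.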
3.
J Chem Inf Model ; 61(9): 4210-4223, 2021 Sep 27.
Article in English | MEDLINE | ID: mdl-34387994

ABSTRACT

Most machine learning applications to quantum-chemistry (QC) data sets rely on a single statistical error parameter, such as the mean squared error (MSE), to evaluate their performance. However, this approach has limitations and can even yield incorrect interpretations. Here, we report a systematic investigation of the two components of the MSE, i.e., the bias and variance, using the QM9 data set. To this end, we experiment with three descriptors, namely (i) symmetry functions (SF, with two-body and three-body functions), (ii) many-body tensor representation (MBTR, with two- and three-body terms), and (iii) smooth overlap of atomic positions (SOAP), to evaluate the performance of the prediction process using different numbers of molecules in training samples and the effect of bias and variance on the final MSE. Overall, low sample sizes are related to higher MSE. Moreover, the bias component strongly influences the larger MSEs. Furthermore, there is little agreement among molecules with higher errors (outliers) across different descriptors. However, there is a strong association between the intersection set of the outliers and the convex hull volume of geometric coordinates (VCH). Based on the distribution of the MSE (and its bias and variance components) and the appearance of outliers, we suggest using ensembles of low-bias models to minimize the MSE, particularly when the training set contains a small number of molecules.


Subject(s)
Algorithms , Machine Learning , Bias
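
The split of the MSE into bias and variance described in the abstract above can be checked numerically. The sketch below is a minimal illustration, assuming noise-free reference values and a set of models trained on different training samples; the function name `bias_variance_decomposition` and the toy numbers are hypothetical and do not come from the paper.

```python
from statistics import fmean


def bias_variance_decomposition(predictions, targets):
    """Return (mse, squared_bias, variance), each averaged over samples.

    predictions[m][i] is the prediction of model m for sample i, where each
    model is assumed to come from a different training sample.
    targets[i] is the reference value for sample i.
    """
    n_models = len(predictions)
    n_samples = len(targets)
    mse = fmean([(predictions[m][i] - targets[i]) ** 2
                 for m in range(n_models) for i in range(n_samples)])
    squared_bias = 0.0
    variance = 0.0
    for i in range(n_samples):
        preds_i = [predictions[m][i] for m in range(n_models)]
        mean_pred = fmean(preds_i)
        # Bias: how far the average prediction is from the reference value.
        squared_bias += (mean_pred - targets[i]) ** 2
        # Variance: spread of the individual models around their average.
        variance += fmean([(p - mean_pred) ** 2 for p in preds_i])
    return mse, squared_bias / n_samples, variance / n_samples


# Toy numbers: two models, two molecules.
mse, bias2, var = bias_variance_decomposition(
    predictions=[[1.0, 2.0], [3.0, 2.0]],
    targets=[1.0, 3.0],
)
print(mse, bias2, var)  # mse equals bias2 + var
```

With these definitions the identity MSE = bias² + variance holds exactly, so a large MSE can be traced to either component instead of being read as a single number.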
4.
J Chem Inf Model ; 61(3): 1125-1135, 2021 Mar 22.
Article in English | MEDLINE | ID: mdl-33685128

ABSTRACT

The amount of quantum chemistry (QC) data is increasing year by year due to the continuous increase of computational power and the development of new algorithms. However, in most cases, our atom-level knowledge of molecular systems has been obtained by manual data analyses based on selected descriptors. In this work, we introduce a data mining framework to accelerate the extraction of insights from QC datasets, which starts with a featurization process that converts atomic features into molecular properties (AtoMF). Then, it employs correlation coefficients (Pearson, Spearman, and Kendall) to investigate the relationship between the AtoMF features and a target property. We applied our framework to investigate three nanocluster systems, namely, PtnTM55-n, CenZr15-nO30, and (CHn + mH)/TM13. We found several interesting and consistent insights using the Spearman and Kendall correlation coefficients, indicating that they are suitable for our approach; however, our results indicate that the Pearson coefficient is very sensitive to outliers and should not be used. Moreover, we highlight problems that can occur during this analysis and discuss how to handle them. Finally, we make available a new Python package that implements the proposed QC data mining framework, which can be used as is or modified to include new features.


Subject(s)
Algorithms , Data Mining , Research Design
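
The abstract above reports that the Pearson coefficient is very sensitive to outliers while rank-based coefficients are not. The sketch below illustrates that contrast with plain-Python Pearson and Spearman implementations. It is not the authors' package; the function names and the toy data are assumptions chosen for the example, and the Kendall coefficient is left out for brevity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical feature/target pairs: monotonic, with one extreme value.
feature = [1, 2, 3, 4, 5]
target = [1, 4, 9, 16, 1000]
print(pearson(feature, target), spearman(feature, target))
```

Because Spearman only looks at the ordering, the single extreme target value leaves it at a perfect monotonic correlation, whereas the Pearson value is pulled well below 1.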
5.
J Phys Chem A ; 124(47): 9854-9866, 2020 Nov 25.
Article in English | MEDLINE | ID: mdl-33174750

ABSTRACT

Machine learning (ML) models can potentially accelerate the discovery of tailored materials by learning a function that maps chemical compounds into their respective target properties. In this realm, a crucial step is encoding the molecular systems into the ML model, in which the molecular representation plays a central role. Most representations are based on atomic coordinates (structure); however, this can increase the computational cost of ML training and prediction. Herein, we investigate the impact of choosing coordinate-free descriptors based on the Simplified Molecular Input Line Entry System (SMILES) representation, which can substantially reduce the computational cost of ML predictions. To this end, we evaluate the prediction performance of a feed-forward neural network (FNN) model over five feature selection methods and nine ground-state properties (including energetic, electronic, and thermodynamic properties) from a public data set composed of ∼130k organic molecules. Our best results reached a mean absolute error of ∼0.05 eV, close to chemical accuracy, for the atomization energies (internal energy at 0 K, internal energy at 298.15 K, enthalpy at 298.15 K, and free energy at 298.15 K). Moreover, for the atomization energies, the out-of-sample error was nine times lower than that of the same FNN model trained with the Coulomb matrix, a traditional coordinate-based descriptor. Furthermore, our results show how the model's accuracy is limited by employing such a low-computational-cost representation, which carries less information about the molecular structure than most state-of-the-art methods.
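
To make the idea of a coordinate-free descriptor concrete, the sketch below derives a small count-based feature vector directly from a SMILES string, with no atomic coordinates involved. It is a toy example only: the feature set, the name `smiles_features`, and the simplified tokenization (organic-subset atoms, single-digit ring closures, no special handling of bracket atoms or charges) are assumptions for illustration and are not the descriptors used in the paper.

```python
import re

# Toy tokenizer: two-letter halogens first, then organic-subset atoms
# (uppercase = aliphatic, lowercase = aromatic). Hydrogens stay implicit.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOSPFI]|[bcnosp]")

FEATURE_NAMES = ["n_C", "n_N", "n_O", "n_F",
                 "n_double_bonds", "n_triple_bonds", "n_ring_closures"]


def smiles_features(smiles):
    """Return a count-based feature vector ordered as in FEATURE_NAMES."""
    atoms = [token.capitalize() for token in ATOM_PATTERN.findall(smiles)]
    ring_digits = sum(ch.isdigit() for ch in smiles)
    return [
        atoms.count("C"),
        atoms.count("N"),
        atoms.count("O"),
        atoms.count("F"),
        smiles.count("="),   # explicit double bonds
        smiles.count("#"),   # explicit triple bonds
        ring_digits // 2,    # each ring closure uses a pair of digits
    ]


for smi in ["CCO", "c1ccccc1", "C#N"]:
    print(smi, dict(zip(FEATURE_NAMES, smiles_features(smi))))
```

Such features cost almost nothing to compute, but they discard the three-dimensional arrangement of the atoms, which is consistent with the accuracy limit noted at the end of the abstract.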

6.
Phys Chem Chem Phys ; 21(48): 26637-26646, 2019 Dec 11.
Article in English | MEDLINE | ID: mdl-31774074

ABSTRACT

Mixed CeO2-ZrO2 nanoclusters have the potential to play a crucial role in nanocatalysis; however, the atomistic understanding of those nanoclusters is far from satisfactory. In this work, we report a density functional theory investigation combined with Spearman rank correlation analysis of the energetic, structural, and electronic properties of mixed CenZr15-nO30 nanoclusters as a function of the composition (n = 0, 1,…,14, 15). For instance, we found a negative excess energy for all putative global minimum CenZr15-nO30 configurations, with a minimum at about n = 6 (i.e., nearly 40% Ce), in which both the oxygen anion surroundings and cation radii play a crucial role in the stability and distribution of the chemical species. We found a strong energetic preference of Zr4+ cations to occupy larger coordination number sites, i.e., the nanocluster core region, while the Ce4+ cations are located near vacuum-exposed O-rich regions. As expected, we obtained an almost linear decrease of the average bond lengths by replacing Ce4+ by Zr4+ cations in the (ZrO2)15 nanoclusters towards the formation of mixed CenZr15-nO30 nanoclusters, which resulted in a shift towards higher vibrational frequencies. We also observed that the relative stability of the mixed oxides is directly correlated with the increase (decrease) of the Zr d-state (Ce f-state) contribution to the highest occupied molecular orbital with the increase of the Zr content, hence driving the gap energy towards higher values.
