Results 1 - 20 of 57
1.
J Chem Inf Model ; 2024 Jul 12.
Article in English | MEDLINE | ID: mdl-38995078

ABSTRACT

Machine learning-driven computer-aided synthesis planning (CASP) tools have become important for idea generation in the design of complex molecule synthesis but do not adequately address the stereochemical features of the target compounds. A novel approach to the automated extraction of CASP templates that retains the stereochemical information present in the US Patent and Trademark Office (USPTO) data set and in an internal AstraZeneca database containing reactions from Reaxys, Pistachio, and AstraZeneca electronic lab notebooks is implemented in the freely available AiZynthFinder software. Three hundred sixty-seven templates covering reagent- and substrate-controlled as well as stereospecific reactions were extracted from the USPTO data set, while 20,724 templates were extracted from the AstraZeneca database. The performance of these templates in multistep CASP is evaluated for 936 targets from the ChEMBL database and an in-house selection of 791 AZ designs. The potential and limitations are discussed for four case studies from ChEMBL and examples of FDA-approved drugs.

2.
J Cheminform ; 16(1): 57, 2024 May 23.
Article in English | MEDLINE | ID: mdl-38778382

ABSTRACT

We present an updated overview of the AiZynthFinder package for retrosynthesis planning. Since the first version was released in 2020, we have added a substantial number of new features based on user feedback. Feature enhancements include policies for filter reactions, support for any one-step retrosynthesis model, a scoring framework and several additional search algorithms. To exemplify the typical use-cases of the software and highlight some learnings, we perform a large-scale analysis on several hundred thousand target molecules from diverse sources. This analysis looks at, for instance, route shape, stock usage and exploitation of reaction space, and points out strengths and weaknesses of our retrosynthesis approach. The software is released as open-source for educational purposes as well as to provide a reference implementation of the core algorithms for synthesis prediction. We hope that releasing the software as open-source will further facilitate innovation in developing novel methods for synthetic route prediction. AiZynthFinder is a fast, robust and extensible open-source software and can be downloaded from https://github.com/MolecularAI/aizynthfinder .

3.
J Cheminform ; 16(1): 39, 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38576047

ABSTRACT

Stakeholders of machine learning models desire explainable artificial intelligence (XAI) to produce human-understandable and consistent interpretations. In computational toxicity, augmentation of text-based molecular representations has been used successfully for transfer learning on downstream tasks. Augmentations of molecular representations can also be used at inference to compare differences between multiple representations of the same ground-truth. In this study, we investigate the robustness of eight XAI methods using test-time augmentation for a molecular-representation model in the field of computational toxicity prediction. We report significant differences between explanations for different representations of the same ground-truth, and show that randomized models have similar variance. We hypothesize that text-based molecular representations in this and past research reflect tokenization more than learned parameters. Furthermore, we see a greater variance between in-domain predictions than out-of-domain predictions, indicating XAI measures something other than learned parameters. Finally, we investigate the relative importance given to expert-derived structural alerts and find similar importance given regardless of applicability domain, randomization and varying training procedures. We therefore caution future research to validate their methods using a similar comparison to human intuition without further investigation. SCIENTIFIC CONTRIBUTION: In this research we critically investigate XAI through test-time augmentation, contrasting previous assumptions about using expert validation and showing inconsistencies within models for identical representations. SMILES augmentation has been used to increase model accuracy, but was here adapted from the field of image test-time augmentation to be used as an independent indication of the consistency within SMILES-based molecular representation models.

4.
J Chem Inf Model ; 64(8): 3021-3033, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38602390

ABSTRACT

Synthesis planning of new pharmaceutical compounds is a well-known bottleneck in modern drug design. Template-free methods, such as transformers, have recently been proposed as an alternative to template-based methods for single-step retrosynthetic predictions. Here, we trained and evaluated a transformer model, called the Chemformer, for retrosynthesis predictions within drug discovery. The proprietary data set used for training comprised ∼18 M reactions from literature, patents, and electronic lab notebooks. Chemformer was evaluated for the purpose of both single-step and multistep retrosynthesis. We found that the single-step performance of Chemformer was especially good on reaction classes common in drug discovery, with most reaction classes showing a top-10 round-trip accuracy above 0.97. Moreover, Chemformer reached a higher round-trip accuracy compared to that of a template-based model. By analyzing multistep retrosynthesis experiments, we observed that Chemformer found synthetic routes, leading to commercial starting materials for 95% of the target compounds, an increase of more than 20% compared to the template-based model on a proprietary compound data set. In addition to this, we discovered that Chemformer suggested novel disconnections corresponding to reaction templates, which are not included in the template-based model. These findings were further supported by a publicly available ChEMBL compound data set. The conclusions drawn from this work allow for the design of a synthesis planning tool where template-based and template-free models work in harmony to optimize retrosynthetic recommendations.


Subject(s)
Drug Discovery , Drug Discovery/methods , Organic Chemicals/chemistry , Organic Chemicals/chemical synthesis , Models, Chemical
5.
J Chem Inf Model ; 64(1): 42-56, 2024 Jan 08.
Article in English | MEDLINE | ID: mdl-38116926

ABSTRACT

Machine Learning (ML) techniques face significant challenges when predicting advanced chemical properties, such as yield, feasibility of chemical synthesis, and optimal reaction conditions. These challenges stem from the high-dimensional nature of the prediction task and the myriad essential variables involved, ranging from reactants and reagents to catalysts, temperature, and purification processes. Successfully developing a reliable predictive model not only holds the potential for optimizing high-throughput experiments but can also elevate existing retrosynthetic predictive approaches and bolster a plethora of applications within the field. In this review, we systematically evaluate the efficacy of current ML methodologies in chemoinformatics, shedding light on their milestones and inherent limitations. Additionally, a detailed examination of a representative case study provides insights into the prevailing issues related to data availability and transferability in the discipline.


Subject(s)
Cheminformatics , Machine Learning
6.
Mol Inform ; 42(11): e202300128, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37679293

ABSTRACT

The multi-step retrosynthesis problem can be solved by a search algorithm, such as Monte Carlo tree search (MCTS). The performance of multistep retrosynthesis, as measured by a trade-off in search time and route solvability, therefore depends on the hyperparameters of the search algorithm. In this paper, we demonstrated the effect of three MCTS hyperparameters (number of iterations, tree depth, and tree width) on metrics such as the linear integrated speed-accuracy score (LISAS) and the inverse efficiency score, which consider both route solvability and search time. This exploration was conducted by employing three data-driven approaches, namely a systematic grid search, Bayesian optimization over an ensemble of molecules to obtain static MCTS hyperparameters, and a machine learning approach to dynamically predict optimal MCTS hyperparameters given an input target molecule. With the obtained results on the internal dataset, we demonstrated that it is possible to identify a hyperparameter set which outperforms the current AiZynthFinder default setting. It appeared optimal across a variety of target input molecules, both on proprietary and public datasets. The settings identified with the in-house dataset reached a solvability of 93 % and median search time of 151 s for the in-house dataset, and a 74 % solvability and 114 s for the ChEMBL dataset. These numbers can be compared to the current default settings which solved 85 % and 73 % during a median time of 110 s and 84 s, for in-house and ChEMBL, respectively.


Subject(s)
Algorithms , Benchmarking , Bayes Theorem , Machine Learning , Monte Carlo Method
9.
J Chem Inf Model ; 63(7): 1841-1846, 2023 04 10.
Article in English | MEDLINE | ID: mdl-36959737

ABSTRACT

We introduce the AiZynthTrain Python package for training synthesis models in a robust, reproducible, and extensible way. It contains two pipelines that create a template-based one-step retrosynthesis model and a RingBreaker model that can be straightforwardly integrated in retrosynthesis software. We train such models on the publicly available reaction data set from the U.S. Patent and Trademark Office (USPTO), and these are the first retrosynthesis models created in a completely reproducible end-to-end fashion, starting with the original reaction data source and ending with trained machine-learning models. In particular, we show that employing new heuristics implemented in the pipeline greatly improves the ability of the RingBreaker model for disconnecting ring systems. Furthermore, we demonstrate the robustness of the pipeline by training on a more diverse but proprietary data set. We envisage that this framework will be extended with other synthesis models in the future.


Subject(s)
Machine Learning , Software
10.
Mol Inform ; 41(8): e2100294, 2022 08.
Article in English | MEDLINE | ID: mdl-35122702

ABSTRACT

We present machine learning models for predicting the chemical context for Buchwald-Hartwig coupling reactions, i.e., what chemicals to add to the reactants to give a productive reaction. Using reaction data from in-house electronic lab notebooks, we train two models: one based on single-label data and one based on multi-label data. Both models show excellent top-3 accuracy of approximately 90 %, which suggests strong predictivity. Furthermore, there seems to be an advantage of including multi-label data because the multi-label model shows higher accuracy and better sensitivity for the individual contexts than the single-label model. Although the models are performant, we also show that such models need to be re-trained periodically as there is a strong temporal characteristic to the usage of different contexts. Therefore, a model trained on historical data will decrease in usefulness with time as newer and better contexts emerge and replace older ones. We hypothesize that such significant transitions in the context-usage will likely affect any model predicting chemical contexts trained on historical data. Consequently, training context prediction models warrants careful planning of what data is used for training and how often the model needs to be re-trained.


Subject(s)
Machine Learning
11.
Sci Rep ; 11(1): 17333, 2021 08 30.
Article in English | MEDLINE | ID: mdl-34462478

ABSTRACT

The use of lignocellulosic-based fermentation media will be a necessary part of the transition to a circular bio-economy. These media contain many inhibitors to microbial growth, including acetic acid. Under industrially relevant conditions, acetic acid enters the cell predominantly through passive diffusion across the plasma membrane. The lipid composition of the membrane determines the rate of uptake of acetic acid, and thicker, more rigid membranes impede passive diffusion. We hypothesized that the elongation of glycerophospholipid fatty acids would lead to thicker and more rigid membranes, reducing the influx of acetic acid. Molecular dynamics simulations were used to predict the changes in membrane properties. Heterologous expression of Arabidopsis thaliana genes fatty acid elongase 1 (FAE1) and glycerol-3-phosphate acyltransferase 5 (GPAT5) increased the average fatty acid chain length. However, despite this successful strain engineering, the net uptake rate of acetic acid did not decrease. We suggest that changes in the relative abundance of certain membrane lipid headgroups could mitigate the effect of longer fatty acid chains, resulting in a higher net uptake rate of acetic acid.


Subject(s)
Cell Membrane/metabolism , Fatty Acids/metabolism , Metabolic Engineering/methods , Saccharomyces cerevisiae/physiology , 1-Acylglycerol-3-Phosphate O-Acyltransferase/metabolism , Acetic Acid/chemistry , Acetic Acid/metabolism , Arabidopsis/enzymology , Arabidopsis Proteins/metabolism , Diffusion , Fatty Acid Elongases/metabolism , Fermentation , Glycerophospholipids/chemistry , Kinetics , Lignin/chemistry , Lipid Metabolism , Lipidomics , Lipids/chemistry , Molecular Dynamics Simulation , Plasmids/metabolism
12.
J Chem Inf Model ; 61(8): 3899-3907, 2021 08 23.
Article in English | MEDLINE | ID: mdl-34342428

ABSTRACT

We present a novel algorithm to compute the distance between synthetic routes based on tree edit distances. Such distances can be used to cluster synthesis routes generated using a retrosynthesis prediction tool. We show that the clustering of selected routes from a retrosynthesis analysis is performed in less than 10 s on average and only constitutes seven percent of the total time (prediction + clustering). Furthermore, we are able to show that representative routes from each cluster can be used to reduce the set of predicted routes. Finally, we show with a number of examples that the algorithm gives intuitive clusters that can be easily rationalized and that the routes in a cluster tend to use similar chemistry. The algorithm is included in the latest version of open-source AiZynthFinder software (https://github.com/MolecularAI/aizynthfinder) and as a separate package (https://github.com/MolecularAI/route-distances).
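To make the idea of a tree edit distance between routes concrete, the following is a minimal sketch of a generic unit-cost edit distance between ordered, labeled trees. It is not the algorithm from the paper or the code in the route-distances package; the tuple-based tree encoding, the function names, and the toy routes are all assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical encoding: a tree is (label, children), where children is a
# tuple of trees. A forest is a tuple of trees.


def size(forest):
    """Count the nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children move up into the forest.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + int(v_label != w_label)
    )
    return min(delete, insert, match)


def route_distance(a, b):
    """Edit distance between two routes encoded as trees."""
    return forest_distance((a,), (b,))


# Toy routes with made-up node labels.
route_a = ("target", (("acid", ()), ("amine", ())))
route_b = ("target", (("acid", ()), ("aniline", ())))
route_c = ("target", (("acid", (("ester", ()),)), ("amine", ())))
```

A matrix of such pairwise distances could then be passed to any clustering method that accepts precomputed distances, which is the use the abstract describes.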


Subject(s)
Software , Algorithms , Cluster Analysis
13.
J Cheminform ; 12(1): 70, 2020 Nov 17.
Article in English | MEDLINE | ID: mdl-33292482

ABSTRACT

We present the open-source AiZynthFinder software that can be readily used in retrosynthetic planning. The algorithm is based on a Monte Carlo tree search that recursively breaks down a molecule to purchasable precursors. The tree search is guided by an artificial neural network policy that suggests possible precursors by utilizing a library of known reaction templates. The software is fast and can typically find a solution in less than 10 s and perform a complete search in less than 1 min. Moreover, the development of the code was guided by a range of software engineering principles such as automatic testing, system design and continuous integration leading to robust software with high maintainability. Finally, the software is well documented to make it suitable for beginners. The software is available at http://www.github.com/MolecularAI/aizynthfinder .
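The selection step of a Monte Carlo tree search can be sketched with the textbook UCB1 rule, which balances the average reward of a child node against how rarely it has been visited. This is a generic illustration and not AiZynthFinder's implementation; the dictionary layout, the function names, and the exploration constant `c=1.4` are assumptions.

```python
import math


def ucb_score(total_value, visits, parent_visits, c=1.4):
    """Textbook UCB1 score; unvisited children are always tried first."""
    if visits == 0:
        return math.inf
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, c=1.4):
    """Pick the child with the highest UCB1 score."""
    parent_visits = max(1, sum(child["visits"] for child in children))
    return max(
        children,
        key=lambda child: ucb_score(
            child["value"], child["visits"], parent_visits, c
        ),
    )


# Hypothetical children of one search node.
children = [
    {"name": "a", "value": 3.0, "visits": 4},
    {"name": "b", "value": 1.0, "visits": 1},
]
chosen = select_child(children)
```

In a retrosynthesis setting each child would correspond to a candidate set of precursors, and the search would expand the selected child until the precursors are purchasable or a depth limit is reached.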

14.
J Phys Chem B ; 123(17): 3679-3687, 2019 05 02.
Article in English | MEDLINE | ID: mdl-30964287

ABSTRACT

The sugar molecule trehalose has been proven to be an excellent stabilizing cosolute for the preservation of biological materials. However, the stabilizing mechanism of trehalose has been much debated during the previous decades, and it is still not fully understood, partly because it has not been completely established how trehalose molecules structure around proteins. Here, we present a molecular model of a protein-water-trehalose system, based on neutron scattering results obtained from neutron diffraction, quasielastic neutron scattering, and different computer modeling techniques. The structural data clearly show how the proteins are preferentially hydrated, and analysis of the dynamical properties shows that the protein residues are slowed down because of reduced dynamics of the protein hydration shell, rather than because of direct trehalose-protein interactions. These findings, thereby, strongly support previous models related to the preferential hydration model and contradict other models based on water replacement at the protein surface. Furthermore, the results are important for understanding the specific role of trehalose in biological stabilization and, more generally, for providing a likely mechanism of how cosolutes affect the dynamics of proteins.


Subject(s)
Proteins/chemistry , Trehalose/chemistry , Models, Molecular , Molecular Dynamics Simulation , Neutron Diffraction , Particle Size , Protein Stability , Scattering, Radiation , Surface Properties , Water/chemistry
15.
Drug Discov Today Technol ; 32-33: 65-72, 2019 Dec.
Article in English | MEDLINE | ID: mdl-33386096

ABSTRACT

Application of AI technologies in synthesis prediction has developed very rapidly in recent years. We attempt here to give a comprehensive summary of the latest advances in retro-synthesis planning, forward synthesis prediction as well as quantum chemistry-based reaction prediction models. Besides an introduction to the AI/ML models for addressing various synthesis-related problems, the sources of the reaction datasets used in model building are also covered. In addition to the predictive models, robotics-based high-throughput experimentation technology will be another crucial factor for conducting synthesis in an automated fashion. Some state-of-the-art high-throughput experimentation practices carried out in the pharmaceutical industry are highlighted in this chapter to give the reader a sense of how future chemistry will be conducted to make compounds faster and cheaper.


Subject(s)
Artificial Intelligence , Computer-Aided Design , Synthetic Drugs/chemistry , Humans
16.
J Chem Inf Model ; 57(11): 2865-2873, 2017 11 27.
Article in English | MEDLINE | ID: mdl-29076739

ABSTRACT

We have investigated whether alchemical free-energy perturbation calculations of relative binding energies can be sped up by simulating a truncated protein. Previous studies with spherical nonperiodic systems showed that the number of simulated atoms could be reduced by a factor of 26 without affecting the calculated binding free energies by more than 0.5 kJ/mol on average ( Genheden, S.; Ryde, U. J. Chem. Theory Comput. 2012 , 8 , 1449 ), leading to a 63-fold decrease in the time consumption. However, such simulations are rather slow, owing to the need for a large cutoff radius for the nonbonded interactions. Periodic simulations with the electrostatics treated by Ewald summation are much faster. Therefore, we have investigated if a similar speed-up can be obtained also for periodic simulations. Unfortunately, our results show that it is harder to truncate periodic systems and that the truncation errors are larger for these systems. In particular, residues need to be removed from the calculations, which means that atoms have to be restrained to prevent them from moving in an unrealistic manner. The results strongly depend on the strength of this restraint. For the binding of seven ligands to dihydrofolate reductase and ten inhibitors of blood-clotting factor Xa, the best results are obtained with a small restraining force constant. However, the truncation errors were still significant (e.g., 1.5-2.9 kJ/mol at a truncation radius of 10 Å). Moreover, the gain in computer time was only modest. On the other hand, if the snapshots are truncated after the MD simulations, the truncation errors are small (below 0.9 kJ/mol even for a truncation radius of 10 Å). This indicates that postprocessing with a more accurate energy function (e.g., with quantum chemistry) on truncated snapshots may be a viable approach.


Subject(s)
Molecular Dynamics Simulation , Static Electricity , Factor Xa/chemistry , Factor Xa/metabolism , Ligands , Protein Conformation , Tetrahydrofolate Dehydrogenase/chemistry , Tetrahydrofolate Dehydrogenase/metabolism , Thermodynamics
17.
J Comput Aided Mol Des ; 31(10): 867-876, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28875361

ABSTRACT

We present the estimation of solvation free energies of small solutes in water, n-octanol and hexane using molecular dynamics simulations with two MARTINI models at different resolutions, viz. the coarse-grained (CG) and the hybrid all-atom/coarse-grained (AA/CG) models. From these estimates, we also calculate the water/hexane and water/octanol partition coefficients. More than 150 small, organic molecules were selected from the Minnesota solvation database and parameterized in a semi-automatic fashion. Using either the CG or hybrid AA/CG models, we find considerable deviations between the estimated and experimental solvation free energies in all solvents with mean absolute deviations larger than 10 kJ/mol, although the correlation coefficient is between 0.55 and 0.75 and significant. There is also no difference between the results when using the non-polarizable and polarizable water model, although we identify some improvements when using the polarizable model with the AA/CG solutes. In contrast to the estimated solvation energies, the estimated partition coefficients are generally excellent with both the CG and hybrid AA/CG models, giving mean absolute deviations between 0.67 and 0.90 log units and correlation coefficients larger than 0.85. We analyze the error distribution further and suggest avenues for improvements.
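The step from solvation free energies to partition coefficients follows the standard thermodynamic relation log P = (ΔG_water − ΔG_organic) / (RT ln 10). The sketch below applies it in Python; the function name, the temperature of 298.15 K, and the example free energies are assumptions for illustration and are not values from the paper.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def log_partition(dg_water, dg_organic, temperature=298.15):
    """log10 of the organic/water partition coefficient.

    Both arguments are solvation free energies in kJ/mol.
    """
    rt_ln10 = R_KJ_PER_MOL_K * temperature * math.log(10)
    return (dg_water - dg_organic) / rt_ln10


# Made-up solvation free energies in kJ/mol.
reference = log_partition(-20.0, -30.0)
# The same solute with both estimates shifted by +10 kJ/mol.
shifted = log_partition(-10.0, -20.0)
```

Because only the difference between the two free energies enters the formula, an error that is shared by both solvents cancels. This is one way the partition coefficients can be accurate even when the individual solvation free energies deviate substantially from experiment.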


Subject(s)
Molecular Dynamics Simulation , Solvents/chemistry , 1-Octanol/chemistry , Hexanes/chemistry , Hydrophobic and Hydrophilic Interactions , Solutions/chemistry , Thermodynamics , Water/chemistry
18.
Microb Cell ; 5(1): 42-55, 2017 Dec 01.
Article in English | MEDLINE | ID: mdl-29354649

ABSTRACT

Microbial cell factories with the ability to maintain high productivity in the presence of weak organic acids, such as acetic acid, are required in many industrial processes. For example, fermentation media derived from lignocellulosic biomass are rich in acetic acid and other weak acids. The rate of diffusional entry of acetic acid is one parameter determining the ability of microorganisms to tolerate the acid. The present study demonstrates that the rate of acetic acid diffusion in S. cerevisiae is strongly affected by the alcohols ethanol and n-butanol. Ethanol of 40 g/L and n-butanol of 8 g/L both caused a 65% increase in the rate of acetic acid diffusion, and higher alcohol concentrations caused even greater increases. Molecular dynamics simulations of membrane dynamics in the presence of alcohols demonstrated that the partitioning of alcohols to the head group region of the lipid bilayer causes a considerable increase in the membrane area, together with reduced membrane thickness and lipid order. These changes in physicochemical membrane properties lead to an increased number of water molecules in the membrane interior, providing biophysical mechanisms for the alcohol-induced increase in acetic acid diffusion rate. n-butanol affected S. cerevisiae and the cell membrane properties at lower concentrations than ethanol, due to greater and deeper partitioning in the membrane. This study demonstrates that the rate of acetic acid diffusion can be strongly affected by compounds that partition into the cell membrane, and highlights the need for considering interaction effects between compounds in the design of microbial processes.

19.
Biochim Biophys Acta Biomembr ; 1859(2): 268-281, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27919726

ABSTRACT

G protein coupled receptors (GPCRs) are located in membranes rich in cholesterol. The membrane spanning surfaces of GPCRs contain exposed backbone carbonyl groups and residue side chains potentially capable of forming hydrogen bonds to cholesterol molecules buried deep within the hydrophobic core of the lipid bilayer. Coarse-grained molecular dynamics (CGMD) simulations allow the observation of GPCRs in cholesterol-containing lipid bilayers for long times (50 µs), sufficient to ensure equilibration of the system. We have detected a number of deep cholesterol binding sites on β2 adrenergic and A2A adenosine receptors, and shown changes in these sites on agonist binding. The requirements for binding are modest, just a potential hydrogen bond partner close to a cleft or hole in the surface. This makes it likely that similar binding sites for cholesterol will exist on other classes of membrane protein.


Subject(s)
Cholesterol/chemistry , Cholesterol/metabolism , Lipid Bilayers/chemistry , Membrane Proteins/metabolism , Membranes/chemistry , Membranes/metabolism , Receptors, G-Protein-Coupled/chemistry , Receptors, G-Protein-Coupled/metabolism , Binding Sites , Hydrogen Bonding , Hydrophobic and Hydrophilic Interactions , Lipid Bilayers/metabolism , Molecular Dynamics Simulation , Protein Binding/physiology , Receptors, Adrenergic, beta-2/metabolism
20.
J Mol Graph Model ; 71: 80-87, 2017 01.
Article in English | MEDLINE | ID: mdl-27855339

ABSTRACT

We probe the dynamics of the BPTI and Galectin-3 proteins using molecular dynamics simulations employing three water models at different levels of resolution, viz. the atomistic TIP4P-Ewald, the coarse-grained Elba and an implicit generalised Born model. The dynamics are quantified indirectly by model-free order parameters, S², of the backbone NH and selected side-chain bond vectors, which also have been determined experimentally through NMR relaxation measurements. For the backbone, the order parameters produced with the three solvent models agree to a large extent with experiments, giving average unsigned deviations between 0.03 and 0.06. For the side-chains, for which the experimental data is incomplete, the deviations are considerably larger with mean deviations between 0.13 and 0.17. However, for both backbone and side-chains, it is difficult to pick a winner, as all models perform equally well overall. For a more complete set of side-chain vectors, we resort to analysing the variation among the estimates from different solvent models. Unfortunately, the variations are found to be sizeable with mean deviations between 0.11 and 0.15. Implications for computational assessment of protein dynamics are discussed.


Subject(s)
Galectin 3/chemistry , Protein Conformation , Water/chemistry , Aprotinin/chemistry , Magnetic Resonance Spectroscopy , Models, Molecular , Molecular Dynamics Simulation , Nuclear Magnetic Resonance, Biomolecular , Solvents/chemistry