1.
Nat Commun ; 15(1): 3408, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38649351

ABSTRACT

De novo drug design aims to generate molecules from scratch that possess specific chemical and pharmacological properties. We present a computational approach utilizing interactome-based deep learning for ligand- and structure-based generation of drug-like molecules. This method capitalizes on the unique strengths of both graph neural networks and chemical language models, offering an alternative to the need for application-specific reinforcement, transfer, or few-shot learning. It enables the "zero-shot" construction of compound libraries tailored to possess specific bioactivity, synthesizability, and structural novelty. In order to proactively evaluate the deep interactome learning framework for protein structure-based drug design, potential new ligands targeting the binding site of the human peroxisome proliferator-activated receptor (PPAR) subtype gamma are generated. The top-ranking designs are chemically synthesized and computationally, biophysically, and biochemically characterized. Potent PPAR partial agonists are identified, demonstrating favorable activity and the desired selectivity profiles for both nuclear receptors and off-target interactions. Crystal structure determination of the ligand-receptor complex confirms the anticipated binding mode. This successful outcome positively advocates interactome-based de novo design for application in bioorganic and medicinal chemistry, enabling the creation of innovative bioactive molecules.


Subject(s)
Deep Learning , Drug Design , PPAR gamma , Humans , Ligands , PPAR gamma/metabolism , PPAR gamma/agonists , PPAR gamma/chemistry , Binding Sites , Protein Binding
2.
Chembiochem ; : e202400095, 2024 Apr 29.
Article in English | MEDLINE | ID: mdl-38682398

ABSTRACT

Machine learning models support computer-aided molecular design and compound optimization. However, the initial phases of drug discovery often face a scarcity of training data for these models. Meta-learning has emerged as a potentially promising strategy, harnessing the wealth of structure-activity data available for known targets to facilitate efficient few-shot model training for the specific target of interest. In this study, we assessed the effectiveness of two different meta-learning methods, namely model-agnostic meta-learning (MAML) and adaptive deep kernel fitting (ADKF), specifically in the regression setting. We investigated how factors such as dataset size and the similarity of training tasks impact predictability. The results indicate that ADKF significantly outperformed both MAML and a single-task baseline model on the inhibition data. However, the performance of ADKF varied across different test tasks. Our findings suggest that considerable enhancements in performance can be anticipated primarily when the task of interest is similar to the tasks incorporated in the meta-learning process.

3.
Alzheimers Dement (N Y) ; 10(1): e12445, 2024.
Article in English | MEDLINE | ID: mdl-38528988

ABSTRACT

INTRODUCTION: Janus kinase (JAK) inhibitors were recently identified as promising drug candidates for repurposing in Alzheimer's disease (AD) due to their capacity to suppress inflammation via modulation of JAK/STAT signaling pathways. Besides interaction with primary therapeutic targets, JAK inhibitor drugs frequently interact with unintended, often unknown, biological off-targets, leading to associated effects. Nevertheless, the relevance of JAK inhibitors' off-target interactions in the context of AD remains unclear. METHODS: Putative off-targets of baricitinib and tofacitinib were predicted using a machine learning (ML) approach. After screening scientific literature, off-targets were filtered based on their relevance to AD. Targets that had not been previously identified as off-targets of baricitinib or tofacitinib were subsequently tested using biochemical or cell-based assays. From those, active concentrations were compared to bioavailable concentrations in the brain predicted by physiologically based pharmacokinetic (PBPK) modeling. RESULTS: With the aid of ML and in vitro activity assays, we identified two enzymes previously unknown to be inhibited by baricitinib, namely casein kinase 2 subunit alpha 2 (CK2-α2) and dual leucine zipper kinase (MAP3K12), both with binding constant (Kd) values of 5.8 µM. Predicted maximum concentrations of baricitinib in brain tissue using PBPK modeling range from 1.3 to 23 nM, which is two to three orders of magnitude below the corresponding binding constant. CONCLUSION: In this study, we extended the list of baricitinib off-targets that are potentially relevant for AD progression and predicted drug distribution in the brain. The results suggest a low likelihood of successful repurposing in AD due to low brain permeability, even at the maximum recommended daily dose. While additional research is needed to evaluate the potential impact of the off-target interaction on AD, the combined approach of ML-based target prediction, in vitro confirmation, and PBPK modeling may help prioritize drugs with a high likelihood of being effectively repurposed for AD. Highlights: This study explored JAK inhibitors' off-targets in AD using a multidisciplinary approach. We combined machine learning, in vitro tests, and PBPK modelling to predict and validate new off-target interactions of tofacitinib and baricitinib in AD. Previously unknown inhibition of two enzymes (CK2-α2 and MAP3K12) by baricitinib was confirmed using in vitro experiments. Our PBPK model indicates that baricitinib's low brain permeability limits AD repurposing. The proposed multidisciplinary approach optimizes drug repurposing efforts in AD research.
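The gap between the reported binding constant and the predicted brain concentrations can be checked with a short calculation. The sketch below uses only the numbers quoted in the abstract (Kd = 5.8 µM, predicted maximum brain concentrations of 1.3 to 23 nM); the function and variable names are illustrative and not taken from the study.

```python
import math

# Values quoted in the abstract, converted to mol/L.
KD_M = 5.8e-6                    # binding constant for CK2-α2 and MAP3K12
BRAIN_CMAX_M = (1.3e-9, 23e-9)   # predicted range of maximum brain concentrations


def orders_below(kd: float, concentration: float) -> float:
    """Return how many orders of magnitude the concentration lies below kd."""
    return math.log10(kd / concentration)


# One gap per end of the predicted concentration range (low end first).
gaps = [orders_below(KD_M, c) for c in BRAIN_CMAX_M]
for conc, gap in zip(BRAIN_CMAX_M, gaps):
    print(f"{conc * 1e9:.1f} nM is {gap:.2f} log units below Kd")
```

The two gaps come out at roughly 3.6 and 2.4 log units, that is, at least two to three full orders of magnitude, in line with the statement in the abstract.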

4.
RSC Adv ; 14(7): 4492-4502, 2024 Jan 31.
Article in English | MEDLINE | ID: mdl-38312732

ABSTRACT

Rational structure-based drug design relies on accurate predictions of protein-ligand binding affinity from structural molecular information. Although deep learning-based methods for predicting binding affinity have shown promise in computational drug design, certain approaches have faced criticism for their potential to inadequately capture the fundamental physical interactions between ligands and their macromolecular targets or for being susceptible to dataset biases. Herein, we propose to include bond-critical points based on the electron density of a protein-ligand complex as a fundamental physical representation of protein-ligand interactions. Employing a geometric deep learning model, we explore the usefulness of these bond-critical points to predict absolute binding affinities of protein-ligand complexes, benchmark model performance against existing methods, and provide a critical analysis of this new approach. The models achieved root-mean-squared errors of 1.4-1.8 log units on the PDBbind dataset, and 1.0-1.7 log units on the PDE10A dataset, not indicating significant advantages over benchmark methods, and thus rendering the utility of electron density for deep learning models context-dependent. The relationship between intermolecular electron density and corresponding binding affinity was analyzed, and Pearson correlation coefficients r > 0.7 were obtained for several macromolecular targets.
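For readers unfamiliar with the two metrics quoted above, the following minimal sketch computes a root-mean-squared error and a Pearson correlation coefficient using their standard textbook definitions. The affinity values are made up for illustration and are not data from the study.

```python
import math


def rmse(predicted, observed):
    """Root-mean-squared error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted vs. measured binding affinities (log units).
predicted = [6.1, 7.4, 5.0, 8.2]
measured = [5.5, 7.9, 6.3, 8.0]
print(f"RMSE = {rmse(predicted, measured):.2f} log units")
print(f"r = {pearson_r(predicted, measured):.2f}")
```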

5.
Nat Rev Drug Discov ; 23(2): 141-155, 2024 02.
Article in English | MEDLINE | ID: mdl-38066301

ABSTRACT

Quantitative structure-activity relationship (QSAR) modelling, an approach that was introduced 60 years ago, is widely used in computer-aided drug design. In recent years, progress in artificial intelligence techniques, such as deep learning, the rapid growth of databases of molecules for virtual screening and dramatic improvements in computational power have supported the emergence of a new field of QSAR applications that we term 'deep QSAR'. Marking a decade from the pioneering applications of deep QSAR to tasks involved in small-molecule drug discovery, we herein describe key advances in the field, including deep generative and reinforcement learning approaches in molecular design, deep learning models for synthetic planning and the application of deep QSAR models in structure-based virtual screening. We also reflect on the emergence of quantum computing, which promises to further accelerate deep QSAR applications and the need for open-source and democratized resources to support computer-aided drug design.


Subject(s)
Deep Learning , Quantitative Structure-Activity Relationship , Humans , Artificial Intelligence , Computing Methodologies , Quantum Theory , Drug Discovery/methods , Drug Design
6.
Nat Chem ; 16(2): 239-248, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37996732

ABSTRACT

Late-stage functionalization is an economical approach to optimize the properties of drug candidates. However, the chemical complexity of drug molecules often makes late-stage diversification challenging. To address this problem, a late-stage functionalization platform based on geometric deep learning and high-throughput reaction screening was developed. Considering borylation as a critical step in late-stage functionalization, the computational model predicted reaction yields for diverse reaction conditions with a mean absolute error margin of 4-5%, while the reactivity of novel reactions with known and unknown substrates was classified with a balanced accuracy of 92% and 67%, respectively. The regioselectivity of the major products was accurately captured with a classifier F-score of 67%. When applied to 23 diverse commercial drug molecules, the platform successfully identified numerous opportunities for structural diversification. The influence of steric and electronic information on model performance was quantified, and a comprehensive simple user-friendly reaction format was introduced that proved to be a key enabler for seamlessly integrating deep learning and high-throughput experimentation for late-stage functionalization.
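Balanced accuracy and the F-score, the two classification metrics named above, can be illustrated for the simple binary case. The sketch assumes binary labels (1 = reaction works, 0 = it does not) and made-up predictions; how the study aggregated its regioselectivity classifier is not stated in the abstract, so this is not a reproduction of its evaluation.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; robust to class imbalance."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fn, _, fp = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical outcomes for six reactions.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

Balanced accuracy is a natural choice when successful and failed reactions are unevenly represented, since plain accuracy would reward always predicting the majority class.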


Subject(s)
Deep Learning , High-Throughput Screening Assays
7.
Commun Chem ; 6(1): 256, 2023 Nov 20.
Article in English | MEDLINE | ID: mdl-37985850

ABSTRACT

Enhancing the properties of advanced drug candidates is aided by the direct incorporation of specific chemical groups, avoiding the need to construct the entire compound from the ground up. Nevertheless, their chemical intricacy often poses challenges in predicting reactivity for C-H activation reactions and planning their synthesis. We adopted a reaction screening approach that combines high-throughput experimentation (HTE) at a nanomolar scale with computational graph neural networks (GNNs). This approach aims to identify suitable substrates for late-stage C-H alkylation using Minisci-type chemistry. GNNs were trained using experimentally generated reactions derived from in-house HTE and literature data. These trained models were then used to predict, in a forward-looking manner, the coupling of 3180 advanced heterocyclic building blocks with a diverse set of sp3-rich carboxylic acids. This predictive approach aimed to explore the substrate landscape for Minisci-type alkylations. Promising candidates were chosen, their production was scaled up, and they were subsequently isolated and characterized. This process led to the creation of 30 novel, functionally modified molecules that hold potential for further refinement. These results positively advocate the application of HTE-based machine learning to virtual reaction screening.

8.
Mol Inform ; 42(6): e2300059, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37164908

ABSTRACT

Several binary molecular fingerprints were compressed using an autoencoder neural network. We analyzed the impact of compression on fingerprint performance in downstream classification and regression tasks. Classifiers trained on compressed fingerprints were negligibly affected. Regression models benefitted from compression, especially of long fingerprints (Morgan, RDK). However, their performance dropped rapidly for compression levels exceeding 90 %. Property co-learning positively influenced the predictive power of the compressed fingerprints, with a mean score improvement up to 20 %, suggesting that autoencoder compression with property co-learning biases the molecular representation toward the predicted target, facilitating downstream training.


Subject(s)
Algorithms , Neural Networks, Computer , Machine Learning
9.
Biochem Pharmacol ; 211: 115504, 2023 05.
Article in English | MEDLINE | ID: mdl-36921634

ABSTRACT

Integrins are a family of cell surface receptors well-recognized for their therapeutic potential in a wide range of diseases. However, the development of integrin targeting medications has been impacted by unexpected downstream effects, reflecting originally unforeseen interference with the bidirectional signalling and cross-communication of integrins. We here selected one of the most severely affected target integrins, the integrin lymphocyte function-associated antigen-1 (LFA-1, αLβ2, CD11a/CD18), as a prototypic integrin to systematically assess and overcome these known shortcomings. We employed a two-tiered ligand-based virtual screening approach to identify a novel class of allosteric small molecule inhibitors targeting this integrin's αI domain. The newly discovered chemical scaffold was derivatized, yielding potent bis- and tris-aryl-bicyclic-succinimides which inhibit LFA-1 in vitro at low nanomolar concentrations. The characterisation of these compounds in comparison to earlier LFA-1 targeting modalities established that the allosteric LFA-1 inhibitors (i) are devoid of partial agonism, (ii) selectively bind LFA-1 versus other integrins, (iii) do not trigger internalization of LFA-1 itself or other integrins and (iv) display oral availability. This profile differentiates the new generation of allosteric LFA-1 inhibitors from previous ligand mimetic-based LFA-1 inhibitors and anti-LFA-1 antibodies, and is projected to support novel immune regulatory regimens selectively targeting the integrin LFA-1. The rigorous computational and experimental assessment schedule described here is designed to be adaptable to the preclinical discovery and development of novel allosterically acting compounds targeting integrins other than LFA-1, providing an exemplary approach for the early characterisation of next generation integrin inhibitors.


Subject(s)
Lymphocyte Function-Associated Antigen-1 , Signal Transduction , Lymphocyte Function-Associated Antigen-1/chemistry , Lymphocyte Function-Associated Antigen-1/metabolism , Ligands , Intercellular Adhesion Molecule-1/metabolism
10.
Curr Opin Struct Biol ; 79: 102548, 2023 04.
Article in English | MEDLINE | ID: mdl-36842415

ABSTRACT

Structure-based drug design uses three-dimensional geometric information of macromolecules, such as proteins or nucleic acids, to identify suitable ligands. Geometric deep learning, an emerging concept of neural-network-based machine learning, has been applied to macromolecular structures. This review provides an overview of the recent applications of geometric deep learning in bioorganic and medicinal chemistry, highlighting its potential for structure-based drug discovery and design. Emphasis is placed on molecular property prediction, ligand binding site and pose prediction, and structure-based de novo molecular design. The current challenges and opportunities are highlighted, and a forecast of the future of geometric deep learning for drug discovery is presented.


Subject(s)
Deep Learning , Drug Design , Neural Networks, Computer , Drug Discovery/methods , Machine Learning , Ligands
11.
ACS Omega ; 8(2): 2046-2056, 2023 Jan 17.
Article in English | MEDLINE | ID: mdl-36687099

ABSTRACT

Lipophilicity, as measured by the partition coefficient between octanol and water (log P), is a key parameter in early drug discovery research. However, measuring log P experimentally is difficult for specific compounds and log P ranges. The resulting lack of reliable experimental data impedes development of accurate in silico models for such compounds. In certain discovery projects at Novartis focused on such compounds, a quantum mechanics (QM)-based tool for log P estimation has emerged as a valuable supplement to experimental measurements and as a preferred alternative to existing empirical models. However, this QM-based approach incurs a substantial computational cost, limiting its applicability to small series and prohibiting quick, interactive ideation. This work explores a set of machine learning models (Random Forest, Lasso, XGBoost, Chemprop, and Chemprop3D) to learn calculated log P values on both a public data set and an in-house data set to obtain a computationally affordable, QM-based estimation of drug lipophilicity. The message-passing neural network model Chemprop emerged as the best performing model with mean absolute errors of 0.44 and 0.34 log units for scaffold split test sets of the public and in-house data sets, respectively. Analysis of learning curves suggests that a further decrease in the test set error can be achieved by increasing the training set size. While models directly trained on experimental data perform better at approximating experimentally determined log P values than models trained on calculated values, we discuss the potential advantages of using calculated log P values going beyond the limits of experimental quantitation. We analyze the impact of the data set splitting strategy and gain insights into model failure modes. Potential use cases for the presented models include pre-screening of large compound collections and prioritization of compounds for full QM calculations.

12.
Nat Commun ; 14(1): 114, 2023 01 07.
Article in English | MEDLINE | ID: mdl-36611029

ABSTRACT

Generative chemical language models (CLMs) can be used for de novo molecular structure generation by learning from a textual representation of molecules. Here, we show that hybrid CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), a collection of virtual molecules was created with a generative CLM. This virtual compound library was refined using a CLM-based classifier for bioactivity prediction. This second hybrid CLM was pretrained with patented molecular structures and fine-tuned with known PI3Kγ ligands. Several of the computer-generated molecular designs were commercially available, enabling fast prescreening and preliminary experimental validation. A new PI3Kγ ligand with sub-micromolar activity was identified, highlighting the method's scaffold-hopping potential. Chemical synthesis and biochemical testing of two of the top-ranked de novo designed molecules and their derivatives corroborated the model's ability to generate PI3Kγ ligands with medium to low nanomolar activity for hit-to-lead expansion. The most potent compounds led to pronounced inhibition of PI3K-dependent Akt phosphorylation in a medulloblastoma cell model, demonstrating efficacy of PI3Kγ ligands in PI3K/Akt pathway repression in human tumor cells. The results positively advocate hybrid CLMs for virtual compound screening and activity-focused molecular design.


Subject(s)
Phosphatidylinositol 3-Kinases , Proto-Oncogene Proteins c-akt , Humans , Molecular Structure , Ligands , Drug Design , Phosphatidylinositol 3-Kinase
13.
Cell Oncol (Dordr) ; 46(2): 331-356, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36495366

ABSTRACT

PURPOSE: Aberrant activation of the fibroblast growth factor receptor (FGFR) family of receptor tyrosine kinases drives oncogenic signaling through its proximal adaptor protein FRS2. Precise disruption of this disease-causing signal transmission in metastatic cancers could stall tumor growth and progression. The purpose of this study was to identify a small molecule ligand of FRS2 to interrupt oncogenic signal transmission from activated FGFRs. METHODS: We used pharmacophore-based computational screening to identify potential small molecule ligands of the PTB domain of FRS2, which couples FRS2 to FGFRs. We confirmed PTB domain binding of molecules identified with biophysical binding assays and validated compound activity in cell-based functional assays in vitro and in an ovarian cancer model in vivo. We used thermal proteome profiling to identify potential off-targets of the lead compound. RESULTS: We describe a small molecule ligand of the PTB domain of FRS2 that prevents FRS2 activation and interrupts FGFR signaling. This PTB-domain ligand displays on-target activity in cells and stalls FGFR-dependent matrix invasion in various cancer models. The small molecule ligand is detectable in the serum of mice at the effective concentration for prolonged time and reduces growth of the ovarian cancer model in vivo. Using thermal proteome profiling, we furthermore identified potential off-targets of the lead compound that will guide further compound refinement and drug development. CONCLUSIONS: Our results illustrate a phenotype-guided drug discovery strategy that identified a novel mechanism to repress FGFR-driven invasiveness and growth in human cancers. The here identified bioactive leads targeting FGF signaling and cell dissemination provide a novel structural basis for further development as a tumor agnostic strategy to repress FGFR- and FRS2-driven tumors.


Subject(s)
Drug Discovery , Ovarian Neoplasms , Animals , Female , Humans , Mice , Adaptor Proteins, Signal Transducing/chemistry , Adaptor Proteins, Signal Transducing/metabolism , Ligands , Membrane Proteins/chemistry , Membrane Proteins/metabolism , Ovarian Neoplasms/drug therapy , Proteome/metabolism , Receptors, Fibroblast Growth Factor/metabolism , Signal Transduction/physiology , Drug Discovery/methods
14.
Methods Mol Biol ; 2576: 477-493, 2023.
Article in English | MEDLINE | ID: mdl-36152211

ABSTRACT

Computational methods in medicinal chemistry facilitate drug discovery and design. In particular, machine learning methodologies have recently gained increasing attention. This chapter provides a structured overview of the current state of computational chemistry and its applications for the interrogation of the endocannabinoid system (ECS), highlighting methods in structure-based drug design, virtual screening, ligand-based quantitative structure-activity relationship (QSAR) modeling, and de novo molecular design. We emphasize emerging methods in machine learning and anticipate a forecast of future opportunities of computational medicinal chemistry for the ECS.


Subject(s)
Computational Chemistry , Endocannabinoids , Drug Design , Ligands , Machine Learning , Quantitative Structure-Activity Relationship
15.
Nat Comput Sci ; 3(11): 922-933, 2023 Nov.
Article in English | MEDLINE | ID: mdl-38177601

ABSTRACT

Autoencoders are versatile tools in molecular informatics. These unsupervised neural networks serve diverse tasks such as data-driven molecular representation and constructive molecular design. This Review explores their algorithmic foundations and applications in drug discovery, highlighting the most active areas of development and the contributions autoencoder networks have made in advancing this field. We also explore the challenges and prospects concerning the utilization of autoencoders and the various adaptations of this neural network architecture in molecular design.


Subject(s)
Drug Discovery , Neural Networks, Computer
16.
Sci Data ; 9(1): 273, 2022 06 07.
Article in English | MEDLINE | ID: mdl-35672335

ABSTRACT

Machine learning approaches in drug discovery, as well as in other areas of the chemical sciences, benefit from curated datasets of physical molecular properties. However, there currently is a lack of data collections featuring large bioactive molecules alongside first-principle quantum chemical information. The open-access QMugs (Quantum-Mechanical Properties of Drug-like Molecules) dataset fills this void. The QMugs collection comprises quantum mechanical properties of more than 665 k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling ~2 M conformers. QMugs contains optimized molecular geometries and thermodynamic data obtained via the semi-empirical method GFN2-xTB. Atomic and molecular properties are provided on both the GFN2-xTB and on the density-functional levels of theory (DFT, ωB97X-D/def2-SVP). QMugs features molecules of significantly larger size than previously-reported collections and comprises their respective quantum mechanical wave functions, including DFT density and orbital matrices. This dataset is intended to facilitate the development of models that learn from molecular data on different levels of theory while also providing insight into the corresponding relationships between molecular structure and biological activity.


Subject(s)
Drug Discovery , Machine Learning , Thermodynamics
17.
Sci Rep ; 12(1): 7843, 2022 05 12.
Article in English | MEDLINE | ID: mdl-35551258

ABSTRACT

As there are no clear on-target mechanisms that explain the increased risk for thrombosis and viral infection or reactivation associated with JAK inhibitors, the observed elevated risk may be a result of an off-target effect. Computational approaches combined with in vitro studies can be used to predict and validate the potential for an approved drug to interact with additional (often unwanted) targets and identify potential safety-related concerns. Potential off-targets of the JAK inhibitors baricitinib and tofacitinib were identified using two established machine learning approaches based on ligand similarity. The identified targets related to thrombosis or viral infection/reactivation were subsequently validated using in vitro assays. Inhibitory activity was identified for four drug-target pairs (PDE10A [baricitinib], TRPM6 [tofacitinib], PKN2 [baricitinib, tofacitinib]). Previously unknown off-target interactions of the two JAK inhibitors were identified. As the proposed pharmacological effects of these interactions include attenuation of pulmonary vascular remodeling, modulation of HCV response, and hypomagnesemia, the newly identified off-target interactions cannot explain an increased risk of thrombosis or viral infection/reactivation. While further evidence is required to explain both the elevated thrombosis and viral infection/reactivation risk, our results add to the evidence that these JAK inhibitors are promiscuous binders and highlight the potential for repurposing.


Subject(s)
Antirheumatic Agents , Janus Kinase Inhibitors , Thrombosis , Virus Diseases , Antirheumatic Agents/adverse effects , Azetidines , Humans , Janus Kinase Inhibitors/adverse effects , Machine Learning , Phosphoric Diester Hydrolases , Piperidines , Purines , Pyrazoles , Pyrimidines , Sulfonamides , Thrombosis/chemically induced
18.
Mol Inform ; 41(10): e2200059, 2022 10.
Article in English | MEDLINE | ID: mdl-35577762

ABSTRACT

Identifying druggable ligand-binding sites on the surface of the macromolecular targets is an important process in structure-based drug discovery. Deep-learning models have been shown to successfully predict ligand-binding sites of proteins. As a step toward predicting binding sites in RNA and RNA-protein complexes, we employ three-dimensional convolutional neural networks. We introduce a dataset splitting approach to minimize structure-related bias in training data, and investigate the influence of protein-based neural network pre-training before fine-tuning on RNA structures. Models that were pre-trained on proteins considerably outperformed the models that were trained exclusively on RNA structures. Overall, 71 % of the known RNA binding sites were correctly located within 4 Å of their true centres.
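A distance-based success criterion of this kind can be sketched as follows. The code assumes that each predicted binding-site centre is compared with the corresponding true centre by Euclidean distance and counted as a hit when it lies within the cutoff; the inclusive cutoff, the one-to-one pairing, and the coordinates are assumptions made for illustration, not details given in the abstract.

```python
import math


def hit_rate(predicted_centres, true_centres, cutoff=4.0):
    """Fraction of sites whose predicted centre lies within `cutoff` Å of the true one.

    Both arguments are equally long sequences of (x, y, z) coordinates in Å,
    paired by position.
    """
    hits = sum(
        1
        for pred, true in zip(predicted_centres, true_centres)
        if math.dist(pred, true) <= cutoff
    )
    return hits / len(true_centres)


# Hypothetical coordinates for three binding sites.
predicted = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 0.0, 3.0)]
true = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(f"{hit_rate(predicted, true):.0%} of sites located within 4 Å")
```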


Subject(s)
Neural Networks, Computer , Proteins , Binding Sites , Ligands , Proteins/chemistry , RNA/metabolism
19.
Phys Chem Chem Phys ; 24(18): 10775-10783, 2022 May 11.
Article in English | MEDLINE | ID: mdl-35470831

ABSTRACT

Many molecular design tasks benefit from fast and accurate calculations of quantum-mechanical (QM) properties. However, the computational cost of QM methods applied to drug-like molecules currently renders large-scale applications of quantum chemistry challenging. Aiming to mitigate this problem, we developed DelFTa, an open-source toolbox for the prediction of electronic properties of drug-like molecules at the density functional (DFT) level of theory, using Δ-machine-learning. Δ-Learning corrects the prediction error (Δ) of a fast but inaccurate property calculation. DelFTa employs state-of-the-art three-dimensional message-passing neural networks trained on a large dataset of QM properties. It provides access to a wide array of quantum observables on the molecular, atomic and bond levels by predicting approximations to DFT values from a low-cost semiempirical baseline. Δ-Learning outperformed its direct-learning counterpart for most of the considered QM endpoints. The results suggest that predictions for non-covalent intra- and intermolecular interactions can be extrapolated to larger biomolecular systems. The software is fully open-sourced and features documented command-line and Python APIs.
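The idea behind Δ-learning can be shown with a deliberately small example: a model is trained on the difference between an accurate reference value and a cheap baseline value, and its prediction is added back to the baseline. In the sketch below a one-variable least-squares line stands in for the three-dimensional message-passing networks mentioned in the abstract; all names and numbers are hypothetical and none of this is DelFTa's API.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_delta_model(features, baseline, reference):
    """Learn the correction Δ = reference - baseline as a function of a feature."""
    deltas = [r - b for r, b in zip(reference, baseline)]
    return fit_line(features, deltas)


def predict(model, feature, baseline_value):
    """Corrected estimate: cheap baseline plus the learned Δ."""
    slope, intercept = model
    return baseline_value + slope * feature + intercept


# Toy data: the baseline method misses the reference by 0.5 * feature + 1.
features = [1.0, 2.0, 3.0, 4.0]
baseline = [10.0, 20.0, 30.0, 40.0]    # stand-in for the low-cost method
reference = [11.5, 22.0, 32.5, 43.0]   # stand-in for the expensive method

model = train_delta_model(features, baseline, reference)
print(predict(model, 5.0, 50.0))
```

Learning only the correction is often an easier problem than learning the property directly, because the baseline already captures most of the signal.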


Subject(s)
Chemistry, Pharmaceutical , Quantum Theory , Machine Learning , Neural Networks, Computer , Software
20.
J Chem Inf Model ; 62(5): 1199-1206, 2022 03 14.
Article in English | MEDLINE | ID: mdl-35191696

ABSTRACT

Chemical language models (CLMs) can be employed to design molecules with desired properties. CLMs generate new chemical structures in the form of textual representations, such as the simplified molecular input line entry system (SMILES) strings. However, the quality of these de novo generated molecules is difficult to assess a priori. In this study, we apply the perplexity metric to determine the degree to which the molecules generated by a CLM match the desired design objectives. This model-intrinsic score allows identifying and ranking the most promising molecular designs based on the probabilities learned by the CLM. Using perplexity to compare "greedy" (beam search) with "explorative" (multinomial sampling) methods for SMILES generation, certain advantages of multinomial sampling become apparent. Additionally, perplexity scoring is performed to identify undesired model biases introduced during model training and allows the development of a new ranking system to remove those undesired biases.
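Perplexity scoring of generated SMILES strings can be sketched in a few lines. The code uses the standard definition, the exponential of the mean negative log-probability per token, and ranks candidates so that the molecule the model considers most likely comes first. The per-token probabilities are invented for illustration, and the exact normalisation used in the study is an assumption here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sequence from its per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def rank_by_perplexity(scored):
    """Sort (smiles, token_log_probs) pairs, lowest perplexity first."""
    return sorted(
        ((smiles, perplexity(log_probs)) for smiles, log_probs in scored),
        key=lambda item: item[1],
    )


# Hypothetical token probabilities assigned by a chemical language model.
candidates = [
    ("CCO", [math.log(0.5)] * 3),
    ("c1ccccc1", [math.log(0.9)] * 8),
]
ranking = rank_by_perplexity(candidates)
for smiles, score in ranking:
    print(f"{smiles}: perplexity {score:.3f}")
```

Because the score is averaged per token, molecules with long and short SMILES strings can be compared on the same scale.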


Subject(s)
Language , Models, Chemical , Probability