Results 1 - 15 of 15
1.
J Chem Inf Model ; 60(12): 6065-6073, 2020 12 28.
Article in English | MEDLINE | ID: mdl-33118813

ABSTRACT

Identifying and purchasing new small molecules to test in biological assays are enabling for ligand discovery, but as purchasable chemical space continues to grow into the tens of billions based on inexpensive make-on-demand compounds, simply searching this space becomes a major challenge. We have therefore developed ZINC20, a new version of ZINC with two major new features: billions of new molecules and new methods to search them. As a fully enumerated database, ZINC can be searched precisely using explicit atomic-level graph-based methods, such as SmallWorld for similarity and Arthor for pattern and substructure search, as well as 3D methods such as docking. Analysis of the new make-on-demand compound sets by these and related tools reveals startling features. For instance, over 97% of the core Bemis-Murcko scaffolds in make-on-demand libraries are unavailable from "in-stock" collections. Correspondingly, the number of new Bemis-Murcko scaffolds is rising almost as a linear fraction of the elaborated molecules. Thus, an 88-fold increase in the number of molecules in the make-on-demand versus the in-stock sets is built upon a 16-fold increase in the number of Bemis-Murcko scaffolds. The make-on-demand library is also more structurally diverse than physical libraries, with a massive increase in disc- and sphere-like shaped molecules. The new system is freely available at zinc20.docking.org.
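
The fold increases quoted in the abstract imply a change in how densely each scaffold is elaborated. A minimal sketch of that arithmetic follows; the function name is hypothetical, and only the two fold increases (88 and 16) come from the abstract.

```python
def molecules_per_scaffold_ratio(molecule_fold: float, scaffold_fold: float) -> float:
    """Factor by which the average number of molecules per scaffold changes
    when molecules grow by molecule_fold and scaffolds by scaffold_fold."""
    return molecule_fold / scaffold_fold


# Values taken from the abstract: 88-fold more molecules, 16-fold more scaffolds.
print(molecules_per_scaffold_ratio(88, 16))  # 5.5
```

Under this reading, the make-on-demand set carries on average about 5.5 times as many molecules per Bemis-Murcko scaffold as the in-stock set.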


Subject(s)
Databases, Chemical , Databases, Factual , Ligands
2.
J Cheminform ; 9: 10, 2017.
Article in English | MEDLINE | ID: mdl-28286573

ABSTRACT

The symbols for the new IUPAC elements named in November 2016 can introduce subtle ambiguities within cheminformatics software. The ambiguities are described and demonstrated by highlighting inconsistencies between software when handling existing element symbols.

3.
J Cheminform ; 8: 36, 2016.
Article in English | MEDLINE | ID: mdl-27382417

ABSTRACT

BACKGROUND: The concept of molecular similarity is one of the central ideas in cheminformatics, despite the fact that it is ill-defined and rather difficult to assess objectively. Here we propose a practical definition of molecular similarity in the context of drug discovery: molecules A and B are similar if a medicinal chemist would be likely to synthesise and test them around the same time as part of the same medicinal chemistry program. The attraction of such a definition is that it matches one of the key uses of similarity measures in early-stage drug discovery. If we make the assumption that molecules in the same compound activity table in a medicinal chemistry paper were considered similar by the authors of the paper, we can create a dataset of similar molecules from the medicinal chemistry literature. Furthermore, molecules with decreasing levels of similarity to a reference can be found by either ordering molecules in an activity table by their activity, or by considering activity tables in different papers which have at least one molecule in common. RESULTS: Using this procedure with activity data from ChEMBL, we have created two benchmark datasets for structural similarity that can be used to guide the development of improved measures. Compared to similar results from a virtual screen, these benchmarks are an order of magnitude more sensitive to differences between fingerprints both because of their size and because they avoid loss of statistical power due to the use of mean scores or ranks. We measure the performance of 28 different fingerprints on the benchmark sets and compare the results to those from the Riniker and Landrum (J Cheminf 5:26, 2013. doi:10.1186/1758-2946-5-26) ligand-based virtual screening benchmark. CONCLUSIONS: Extended-connectivity fingerprints of diameter 4 and 6 are among the best performing fingerprints when ranking diverse structures by similarity, as is the topological torsion fingerprint. 
However, when ranking very close analogues, the atom pair fingerprint outperforms the others tested. When ranking diverse structures or carrying out a virtual screen, we find that the performance of the ECFP fingerprints significantly improves if the bit-vector length is increased from 1024 to 16,384.

Graphical abstract: An example series from one of the benchmark datasets. Each fingerprint is assessed on its ability to reproduce a specific series order.
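
One plausible reason a longer bit vector helps is that folding distinct features onto the same bit inflates apparent similarity. The sketch below illustrates this with Tanimoto similarity on folded feature sets. It assumes features are plain integer identifiers folded by modulo; the identifiers are made up for illustration and the functions `fold` and `tanimoto` are hypothetical helpers, not the code used in the paper.

```python
def fold(features, n_bits):
    """Fold integer feature identifiers into a bit vector of length n_bits,
    represented here as the set of bit positions that are switched on."""
    return {f % n_bits for f in features}


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


# Hypothetical feature identifiers for two molecules sharing one feature (2000).
mol_a = {3, 2000, 17000}
mol_b = {1027, 2000, 50000}

print(tanimoto(fold(mol_a, 1024), fold(mol_b, 1024)))    # 0.5 (3 and 1027 collide)
print(tanimoto(fold(mol_a, 16384), fold(mol_b, 16384)))  # 0.2 (no collisions)
```

With 1024 bits, features 3 and 1027 land on the same position and the similarity rises from 0.2 to 0.5; with 16,384 bits the folded result matches the unfolded one.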

4.
Article in English | MEDLINE | ID: mdl-27060160

ABSTRACT

Awareness of the adverse effects of chemicals is important in biomedical research and healthcare. Text mining can allow timely and low-cost extraction of this knowledge from the biomedical literature. We extended our text mining solution, LeadMine, to identify diseases and chemical-induced disease relationships (CIDs). LeadMine is a dictionary/grammar-based entity recognizer and was used to recognize and normalize both chemicals and diseases to Medical Subject Headings (MeSH) IDs. The disease lexicon was obtained from three sources: MeSH, the Disease Ontology and Wikipedia. The Wikipedia dictionary was derived from pages with a disease/symptom box, or those where the page title appeared in the lexicon. Composite entities (e.g. heart and lung disease) were detected and mapped to their composite MeSH IDs. For CIDs, we developed a simple pattern-based system to find relationships within the same sentence. Our system was evaluated in the BioCreative V Chemical-Disease Relation task and achieved very good results for both disease concept ID recognition (F1-score: 86.12%) and CIDs (F1-score: 52.20%) on the test set. As our system was over an order of magnitude faster than other solutions evaluated on the task, we were able to apply the same system to the entirety of MEDLINE allowing us to extract a collection of over 250 000 distinct CIDs.
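
The same-sentence, pattern-based idea described above can be sketched roughly as follows. This is not LeadMine: the two tiny lexicons, the two patterns and the function name `find_cids` are all assumptions made for illustration, and no normalization to MeSH IDs is attempted.

```python
import re

# Hypothetical miniature lexicons; a real system would use large curated dictionaries.
CHEMICALS = {"cisplatin", "lithium"}
DISEASES = {"nephrotoxicity", "tremor"}


def find_cids(text):
    """Return (chemical, disease) pairs linked by a simple pattern
    within a single sentence."""
    pairs = []
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        for chem in sorted(CHEMICALS):
            for dis in sorted(DISEASES):
                # Matches "<chem>-induced <dis>", "<chem> induced <dis>"
                # and "<chem> caused <dis>".
                pattern = (
                    rf"\b{re.escape(chem)}(-induced|\s+(induced|caused))"
                    rf"\s+{re.escape(dis)}\b"
                )
                if re.search(pattern, lowered):
                    pairs.append((chem, dis))
    return pairs


text = (
    "Cisplatin-induced nephrotoxicity was observed. "
    "Lithium caused tremor in two patients. "
    "Tremor was unrelated to cisplatin."
)
print(find_cids(text))
# [('cisplatin', 'nephrotoxicity'), ('lithium', 'tremor')]
```

The third sentence mentions both a chemical and a disease but matches no pattern, so no relationship is reported for it.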


Subject(s)
Computational Biology/methods , Data Mining/methods , Databases, Chemical , Hazardous Substances/toxicity , Search Engine , Animals , Databases, Factual , Disease/etiology , Disease Models, Animal , Drug-Related Side Effects and Adverse Reactions , Humans , Internet , Medical Subject Headings , Pattern Recognition, Automated
5.
J Med Chem ; 59(9): 4385-402, 2016 05 12.
Article in English | MEDLINE | ID: mdl-27028220

ABSTRACT

Multiple recent studies have focused on unraveling the content of the medicinal chemist's toolbox. Here, we present an investigation of chemical reactions and molecules retrieved from U.S. patents over the past 40 years (1976-2015). We used a sophisticated text-mining pipeline to extract 1.15 million unique whole reaction schemes, including reaction roles and yields, from pharmaceutical patents. The reactions were assigned to well-known reaction types such as Wittig olefination or Buchwald-Hartwig amination using an expert system. Analyzing the evolution of reaction types over time, we observe the previously reported bias toward reaction classes like amide bond formations or Suzuki couplings. Our study also shows a steady increase in the number of different reaction types used in pharmaceutical patents but a trend toward lower median yield for some of the reaction classes. Finally, we found that today's typical product molecule is larger, more hydrophobic, and more rigid than 40 years ago.


Subject(s)
Chemistry, Pharmaceutical , Drug Industry , Patents as Topic , History, 20th Century , History, 21st Century , Workforce
6.
J Chem Inf Model ; 55(10): 2111-20, 2015 Oct 26.
Article in English | MEDLINE | ID: mdl-26441310

ABSTRACT

Finding a canonical ordering of the atoms in a molecule is a prerequisite for generating a unique representation of the molecule. The canonicalization of a molecule is usually accomplished by applying some sort of graph relaxation algorithm, the most common of which is the Morgan algorithm. There are known issues with that algorithm that lead to noncanonical atom orderings as well as problems when it is applied to large molecules like proteins. Furthermore, each cheminformatics toolkit or software provides its own version of a canonical ordering, most based on unpublished algorithms, which also complicates the generation of a universal unique identifier for molecules. We present an alternative canonicalization approach that uses a standard stable-sorting algorithm instead of a Morgan-like index. Two new invariants that allow canonical ordering of molecules with dependent chirality as well as those with highly symmetrical cyclic graphs have been developed. The new approach proved to be robust and fast when tested on the 1.45 million compounds of the ChEMBL 20 data set in different scenarios like random renumbering of input atoms or SMILES round tripping. Our new algorithm is able to generate a canonical order of the atoms of protein molecules within a few milliseconds. The novel algorithm is implemented in the open-source cheminformatics toolkit RDKit. With this paper, we provide a reference Python implementation of the algorithm that could easily be integrated in any cheminformatics toolkit. This provides a first step toward a common standard for canonical atom ordering to generate a universal unique identifier for molecules other than InChI.
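
The general idea of ranking atoms by sorting invariants, then refining ties using neighbour ranks, can be sketched as below. This is a simplified illustration only, not the algorithm implemented in RDKit or the reference implementation from the paper: it uses just element and degree as initial invariants, omits tie-breaking for symmetric atoms, and ignores chirality. The function names are hypothetical.

```python
def _rank(keys):
    """Map each key to its position among the sorted distinct keys."""
    order = {key: rank for rank, key in enumerate(sorted(set(keys)))}
    return [order[key] for key in keys]


def refine_ranks(elements, bonds):
    """Rank atoms independently of input numbering by repeatedly sorting
    (own rank, sorted neighbour ranks) until no more ties are resolved."""
    n = len(elements)
    neighbours = [[] for _ in range(n)]
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Initial invariant: element symbol and number of neighbours.
    ranks = _rank([(elements[i], len(neighbours[i])) for i in range(n)])
    while True:
        keys = [
            (ranks[i], tuple(sorted(ranks[j] for j in neighbours[i])))
            for i in range(n)
        ]
        new_ranks = _rank(keys)
        # Stop once a refinement round splits no further classes.
        if len(set(new_ranks)) == len(set(ranks)):
            return new_ranks
        ranks = new_ranks


# Propan-1-ol numbered from the carbon end, then from the oxygen end.
print(refine_ranks(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]))  # [0, 1, 2, 3]
print(refine_ranks(["O", "C", "C", "C"], [(0, 1), (1, 2), (2, 3)]))  # [3, 2, 1, 0]
```

Both numberings assign the same rank to the same chemical atom (the terminal carbon gets 0, the oxygen gets 3), which is the property a canonical ordering needs. Symmetric atoms would remain tied in this sketch.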


Subject(s)
Algorithms , Models, Molecular , Small Molecule Libraries/chemistry , Software , Stereoisomerism
7.
J Cheminform ; 7(Suppl 1 Text mining for chemistry and the CHEMDNER track): S2, 2015.
Article in English | MEDLINE | ID: mdl-25810773

ABSTRACT

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.

8.
J Cheminform ; 7(Suppl 1 Text mining for chemistry and the CHEMDNER track): S5, 2015.
Article in English | MEDLINE | ID: mdl-25810776

ABSTRACT

BACKGROUND: Chemical entity recognition has traditionally been performed by machine learning approaches. Here we describe an approach using grammars and dictionaries. This approach has the advantage that the entities found can be directly related to a given grammar or dictionary, which allows the type of an entity to be known and, if an entity is misannotated, indicates which resource should be corrected. As recognition is driven by what is expected, if spelling errors occur, they can be corrected. Correcting such errors is highly useful when attempting to look up an entity in a database or, in the case of chemical names, to convert them to structures. RESULTS: Our system uses a mixture of expertly curated grammars and dictionaries, as well as dictionaries automatically derived from public resources. We show that the heuristics developed to filter our dictionary of trivial chemical names (from PubChem) yield a better performing dictionary than the previously published Jochem dictionary. Our final system performs post-processing steps to modify the boundaries of entities and to detect abbreviations. These steps are shown to significantly improve performance (2.6% and 4.0% F1-score respectively). Our complete system, with incremental post-BioCreative workshop improvements, achieves 89.9% precision and 85.4% recall (87.6% F1-score) on the CHEMDNER test set. CONCLUSIONS: Grammar and dictionary approaches can produce results at least as good as the current state of the art in machine learning approaches. While machine learning approaches are commonly thought of as "black box" systems, our approach directly links the output entities to the input dictionaries and grammars. Our approach also allows correction of errors in detected entities, which can assist with entity resolution.
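
The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. A minimal sketch follows; the function name is illustrative and only the two input percentages come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the CHEMDNER test set.
print(round(100 * f1_score(0.899, 0.854), 1))  # 87.6
```

The result agrees with the 87.6% F1-score stated in the abstract.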

10.
J Chem Inf Model ; 55(1): 39-53, 2015 Jan 26.
Article in English | MEDLINE | ID: mdl-25541888

ABSTRACT

Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties. We validated the fingerprints on a large data set of reactions text-mined from granted United States patents from the last 40 years that have been classified using a substructure-based expert system. We applied machine learning to build a 50-class predictive model for reaction-type classification that correctly predicts 97% of the reactions in an external test set. Impressive accuracies were also observed when applying the classifier to reactions from an in-house electronic laboratory notebook. The performance of the novel fingerprint for assessing reaction similarity was evaluated by a cluster analysis that recovered 48 out of 50 of the reaction classes with a median F-score of 0.63 for the clusters. The data sets used for training and primary validation as well as all python scripts required to reproduce the analysis are provided in the Supporting Information.
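
The core construction, a reaction fingerprint formed as the difference between product and reactant fingerprints, can be sketched with count dictionaries. The feature labels and counts below are invented placeholders rather than real atom-pair features, the agent-derived physicochemical properties are omitted, and `difference_fingerprint` is a hypothetical name.

```python
from collections import Counter


def difference_fingerprint(reactant_fps, product_fps):
    """Sum feature counts over products, subtract the summed counts over
    reactants, and keep only features whose count changed."""
    reactants, products = Counter(), Counter()
    for fp in reactant_fps:
        reactants.update(fp)
    for fp in product_fps:
        products.update(fp)
    features = set(reactants) | set(products)
    # Counter returns 0 for missing features, so no key checks are needed.
    diff = {f: products[f] - reactants[f] for f in features}
    return {f: count for f, count in diff.items() if count != 0}


# Made-up count fingerprints for two reactants and one product.
acid = {"C~O": 2, "C~C": 1}
amine = {"C~N": 1}
amide = {"C~O": 1, "C~C": 1, "C~N": 2}

print(difference_fingerprint([acid, amine], [amide]))
```

Features that are unchanged by the reaction (here "C~C") cancel out, so the result, `{"C~O": -1, "C~N": 1}` in this toy case, describes only what the transformation made or removed.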


Subject(s)
Artificial Intelligence , Databases, Chemical , Models, Chemical , Cluster Analysis , Organic Chemistry Phenomena , Patents as Topic , Reproducibility of Results
11.
J Med Chem ; 57(6): 2704-13, 2014 Mar 27.
Article in English | MEDLINE | ID: mdl-24601597

ABSTRACT

A matched molecular series is the general form of a matched molecular pair and refers to a set of two or more molecules with the same scaffold but different R groups at the same position. We describe Matsy, a knowledge-based method that uses matched series to predict R groups likely to improve activity given an observed activity order for some R groups. We compare the Matsy predictions based on activity data from ChEMBLdb to the recommendations of the Topliss tree and carry out a large scale retrospective test to measure performance. We show that the basis for predictive success is preferred orders in matched series and that this preference is stronger for longer series. The Matsy algorithm allows medicinal chemists to integrate activity trends from diverse medicinal chemistry programs and apply them to problems of interest as a Topliss-like recommendation or as a hypothesis generator to aid compound design.
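
The idea of using preferred orders in matched series to suggest R groups can be sketched as follows. This is an illustration under stated assumptions, not the Matsy scoring scheme: each database series is assumed to be a list of R groups sorted from most to least active, the toy series are invented, and `predict_r_groups` is a hypothetical function that simply counts how often another R group outranks the best observed one.

```python
from collections import Counter


def predict_r_groups(observed_order, series_db):
    """For series that contain the observed R groups in the same activity
    order, report how often each other R group beats the best observed one."""
    matches = 0
    better = Counter()
    for series in series_db:
        position = {r: i for i, r in enumerate(series)}
        if not all(r in position for r in observed_order):
            continue
        indices = [position[r] for r in observed_order]
        if indices != sorted(indices):
            continue  # same R groups, but a different activity order
        matches += 1
        # Everything ranked above the most active observed R group.
        for r in series[: indices[0]]:
            better[r] += 1
    return sorted(
        ((r, count / matches) for r, count in better.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Invented series, each ordered from most to least active.
series_db = [
    ["Br", "Cl", "F", "H"],
    ["Cl", "Br", "F", "H"],
    ["Br", "Me", "Cl", "F", "H"],
    ["H", "F", "Cl", "Br"],
]
for r_group, fraction in predict_r_groups(["Cl", "F", "H"], series_db):
    print(r_group, round(fraction, 2))
# Br 0.67
# Me 0.33
```

Three of the four toy series share the observed order Cl > F > H; Br outranks Cl in two of them and Me in one, so Br would be the first suggestion in this sketch.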


Subject(s)
Algorithms , Drug Design , Structure-Activity Relationship , Alkanes/chemical synthesis , Alkanes/chemistry , Computational Biology , Computer Simulation , Databases, Chemical , Molecular Structure , Predictive Value of Tests
12.
Acta Crystallogr D Biol Crystallogr ; 68(Pt 8): 1003-9, 2012 Aug.
Article in English | MEDLINE | ID: mdl-22868766

ABSTRACT

In protein crystallization, as well as in many other fields, it is known that the pH at which experiments are performed is often the key factor in the success or failure of the trials. With the trend towards plate-based high-throughput experimental techniques, measuring the pH values of solutions one by one becomes prohibitively time- and reagent-expensive. As part of an HT crystallization facility, a colour-based pH assay that is rapid, uses very little reagent and is suitable for 96-well or higher density plates has been developed.


Subject(s)
Coloring Agents/chemistry , Indicators and Reagents/chemistry , Biochemistry/methods , Calibration , Colorimetry/methods , Coloring Agents/standards , Crystallization/standards , Crystallography, X-Ray/methods , Hydrogen-Ion Concentration , Indicators and Reagents/standards , Proteins/chemistry , Solutions , Time Factors
13.
Article in English | MEDLINE | ID: mdl-22442216

ABSTRACT

When crystallization screening is conducted many outcomes are observed but typically the only trial recorded in the literature is the condition that yielded the crystal(s) used for subsequent diffraction studies. The initial hit that was optimized and the results of all the other trials are lost. These missing results contain information that would be useful for an improved general understanding of crystallization. This paper provides a report of a crystallization data exchange (XDX) workshop organized by several international large-scale crystallization screening laboratories to discuss how this information may be captured and utilized. A group that administers a significant fraction of the world's crystallization screening results was convened, together with chemical and structural data informaticians and computational scientists who specialize in creating and analysing large disparate data sets. The development of a crystallization ontology for the crystallization community was proposed. This paper (by the attendees of the workshop) provides the thoughts and rationale leading to this conclusion. This is brought to the attention of the wider audience of crystallographers so that they are aware of these early efforts and can contribute to the process going forward.


Subject(s)
Crystallography, X-Ray , Crystallization , Databases, Factual
14.
J Comput Aided Mol Des ; 24(6-7): 485-96, 2010 Jun.
Article in English | MEDLINE | ID: mdl-20309607

ABSTRACT

It appears so simple at first glance, "tautomers are isomers of organic compounds that readily interconvert, usually by the migration of hydrogen from one atom to another". If a chemist can describe the problem so succinctly, one might question why the complication of tautomerism remains a considerable challenge to cheminformatics and computer-assisted drug design. With a half-century of experience with representing molecules in computers, and almost limitless modern computational power, the problem should have been solved by now. The unfortunate answer is that the frustration and inconvenience of a database search failing to find matches due to differences in the tautomeric forms of the query and registered compounds is but the tip of an iceberg. Prototropic tautomerism, the movement of hydrogens around a molecule, is just one aspect of an interconnected web of complications. These include mesomerism, aromaticity, protonation state, stereochemistry, conformation, polymerization, photostability, hydrolysis, metabolism and EOCWR (explodes on contact with reality). The common theme is that valence theory, which underlies all modern chemical informatics systems, is an approximate theoretical model for representing molecules mathematically, and, as with all models, it has limitations and domains of applicability. In the physical environments that chemists care about, small organic molecules are often dynamic, existing in multiple equivalent or interconvertible forms. A single connection table can at best represent a snapshot or sample from these populations. Although partial algorithmic solutions exist for handling the most common cases of tautomerism, this perspective hopes to argue that the underlying problems perhaps make tautomerism more complex than it might first appear.


Subject(s)
Hydrocarbons, Aromatic/chemistry , Ions/chemistry , Isomerism , Protons , Thermodynamics
15.
J Chem Inf Model ; 46(5): 1912-8, 2006.
Article in English | MEDLINE | ID: mdl-16995721

ABSTRACT

We apply a recently published method of text-based molecular similarity searching (LINGO) to standard data sets for the purpose of quantifying the accuracy of the approach. Our implementation is based on a pattern-matching finite state machine (FSM) which results in fast search times. The accuracy of LINGO is demonstrated to be comparable to that of a path-based fingerprint and offers a simple yet effective method for similarity searching.
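
Text-based similarity of this kind compares molecules through overlapping substrings of their SMILES strings. The sketch below is an assumption-laden illustration: it uses substrings of length 4, a multiset Tanimoto coefficient, and a plain `Counter` rather than the finite state machine described in the abstract, and it skips any SMILES normalization. The function names are hypothetical.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count all overlapping substrings of length q in a SMILES string."""
    return Counter(smiles[i : i + q] for i in range(len(smiles) - q + 1))


def lingo_similarity(smiles_a: str, smiles_b: str, q: int = 4) -> float:
    """Multiset Tanimoto: shared substring counts over combined counts."""
    a, b = lingos(smiles_a, q), lingos(smiles_b, q)
    shared = sum((a & b).values())    # element-wise minimum of counts
    combined = sum((a | b).values())  # element-wise maximum of counts
    return shared / combined if combined else 0.0


print(lingo_similarity("CCCCO", "CCCCO"))  # 1.0
print(lingo_similarity("CCCCO", "CCCCN"))  # one shared substring out of three
```

"CCCCO" yields the substrings CCCC and CCCO, while "CCCCN" yields CCCC and CCCN, so the second comparison shares one of three distinct substrings and scores about 0.33.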


Subject(s)
Molecular Structure , Algorithms , DNA/chemistry , Finite Element Analysis , Proteins/chemistry