Search | VHL Regional Portal

1.

PubChem synonym filtering process using crowdsourcing.

Kim, Sunghwan; Yu, Bo; Li, Qingliang; Bolton, Evan E.

J Cheminform ; 16(1): 69, 2024 Jun 16.

Article in English | MEDLINE | ID: mdl-38880887

ABSTRACT

PubChem ( https://pubchem.ncbi.nlm.nih.gov ) is a public chemical information resource containing more than 100 million unique chemical structures. One of the most requested tasks in PubChem and other chemical databases is to search chemicals by name (also commonly called a "chemical synonym"). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. In addition, these synonyms are used for many purposes, including creating links between chemicals and PubMed articles (using Medical Subject Headings (MeSH) terms). However, these depositor-provided name-structure associations are subject to substantial discrepancies within and between depositors, making it difficult to unambiguously map a chemical name to a specific chemical structure. The present paper describes PubChem's crowdsourcing-based synonym filtering strategy, which resolves inter- and intra-depositor discrepancies in synonym-structure associations as well as in the chemical-MeSH associations. The PubChem synonym filtering process was developed based on the analysis of four crowd-voting strategies, which differ in the consistency threshold value employed (60% vs 70%) and how to resolve intra-depositor discrepancies (a single vote vs. multiple votes per depositor) prior to inter-depositor crowd-voting. The agreement of voting was determined at six levels of chemical equivalency, which considers varying isotopic composition, stereochemistry, and connectivity of chemical structures and their primary components. While all four strategies showed comparable results, Strategy I (one vote per depositor with a 60% consistency threshold) resulted in the most synonyms assigned to a single chemical structure as well as the most synonym-structure associations disambiguated at the six chemical equivalency contexts. Based on the results of this study, Strategy I was implemented in PubChem's filtering process that cleans up synonym-structure associations as well as chemical-MeSH associations. This consistency-based filtering process is designed to look for a consensus in name-structure associations but cannot attest to their correctness. As a result, it can fail to recognize correct name-structure associations (or incorrect ones), for example, when a synonym is provided by only one depositor or when many contributors are incorrect. However, this filtering process is an important starting point for quality control in name-structure associations in large chemical databases like PubChem.

2.

Can Small Molecules Provide Clues on Disease Progression in Cerebrospinal Fluid from Mild Cognitive Impairment and Alzheimer's Disease Patients?

Talavera Andújar, Begoña; Mary, Arnaud; Venegas, Carmen; Cheng, Tiejun; Zaslavsky, Leonid; Bolton, Evan E; Heneka, Michael T; Schymanski, Emma L.

Environ Sci Technol ; 58(9): 4181-4192, 2024 Mar 05.

Article in English | MEDLINE | ID: mdl-38373301

ABSTRACT

Alzheimer's disease (AD) is a complex and multifactorial neurodegenerative disease, which is currently diagnosed via clinical symptoms and nonspecific biomarkers (such as Aß1-42, t-Tau, and p-Tau) measured in cerebrospinal fluid (CSF), which alone do not provide sufficient insights into disease progression. In this pilot study, these biomarkers were complemented with small-molecule analysis using non-target high-resolution mass spectrometry coupled with liquid chromatography (LC) on the CSF of three groups: AD, mild cognitive impairment (MCI) due to AD, and a non-demented (ND) control group. An open-source cheminformatics pipeline based on MS-DIAL and patRoon was enhanced using CSF- and AD-specific suspect lists to assist in data interpretation. Chemical Similarity Enrichment Analysis revealed a significant increase of hydroxybutyrates in AD, including 3-hydroxybutanoic acid, which was found at higher levels in AD compared to MCI and ND. Furthermore, a highly sensitive target LC-MS method was used to quantify 35 bile acids (BAs) in the CSF, revealing several statistically significant differences including higher dehydrolithocholic acid levels and decreased conjugated BA levels in AD. This work provides several promising small-molecule hypotheses that could be used to help track the progression of AD in CSF samples.

Subject(s)

Alzheimer Disease , Cognitive Dysfunction , Neurodegenerative Diseases , Humans , Alzheimer Disease/cerebrospinal fluid , Alzheimer Disease/diagnosis , Alzheimer Disease/psychology , tau Proteins/cerebrospinal fluid , Amyloid beta-Peptides/cerebrospinal fluid , Pilot Projects , Cognitive Dysfunction/cerebrospinal fluid , Cognitive Dysfunction/diagnosis , Cognitive Dysfunction/psychology , Biomarkers , Disease Progression

3.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Beck, Jeff; Bolton, Evan E; Brister, J Rodney; Chan, Jessica; Comeau, Donald C; Connor, Ryan; DiCuccio, Michael; Farrell, Catherine M; Feldgarden, Michael; Fine, Anna M; Funk, Kathryn; Hatcher, Eneida; Hoeppner, Marilu; Kane, Megan; Kannan, Sivakumar; Katz, Kenneth S; Kelly, Christopher; Klimke, William; Kim, Sunghwan; Kimchi, Avi; Landrum, Melissa; Lathrop, Stacy; Lu, Zhiyong; Malheiro, Adriana; Marchler-Bauer, Aron; Murphy, Terence D; Phan, Lon; Prasad, Arjun B; Pujar, Shashikant; Sawyer, Amanda; Schmieder, Erin; Schneider, Valerie A; Schoch, Conrad L; Sharma, Shobha; Thibaud-Nissen, Françoise; Trawick, Barton W; Venkatapathi, Thilakam; Wang, Jiyao; Pruitt, Kim D; Sherry, Stephen T.

Nucleic Acids Res ; 52(D1): D33-D43, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37994677

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, the NIH Comparative Genomics Resource (CGR), NCBI Virus, SRA, RefSeq, foreign contamination screening tools, Taxonomy, iCn3D, ClinVar, GTR, MedGen, dbSNP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

Subject(s)

Databases, Genetic , National Library of Medicine (U.S.) , Biotechnology/instrumentation , Databases, Nucleic Acid , Internet , United States

4.

A workflow for deriving chemical entities from crystallographic data and its application to the Crystallography Open Database.

Vaitkus, Antanas; Merkys, Andrius; Sander, Thomas; Quirós, Miguel; Thiessen, Paul A; Bolton, Evan E; Grazulis, Saulius.

J Cheminform ; 15(1): 123, 2023 Dec 19.

Article in English | MEDLINE | ID: mdl-38115123

ABSTRACT

Knowledge about the 3-dimensional structure, orientation and interaction of chemical compounds is important in many areas of science and technology. X-ray crystallography is one of the experimental techniques capable of providing a large amount of structural information for a given compound, and it is widely used for characterisation of organic and metal-organic molecules. The method provides precise 3D coordinates of atoms inside crystals, however, it does not directly deliver information about certain chemical characteristics such as bond orders, delocalization, charges, lone electron pairs or lone electrons. These aspects of a molecular model have to be derived from crystallographic data using refined information about interatomic distances and atom types as well as employing general chemical knowledge. This publication describes a curated automatic pipeline for the derivation of chemical attributes of molecules from crystallographic models. The method is applied to build a catalogue of chemical entities in an open-access crystallographic database, the Crystallography Open Database (COD). The catalogue of such chemical entities is provided openly as a derived database. The content of this catalogue and the problems arising in the fully automated pipeline are discussed, along with the possibilities to introduce manual data curation into the process.

5.

Per- and Polyfluoroalkyl Substances (PFAS) in PubChem: 7 Million and Growing.

Schymanski, Emma L; Zhang, Jian; Thiessen, Paul A; Chirsir, Parviel; Kondic, Todor; Bolton, Evan E.

Environ Sci Technol ; 57(44): 16918-16928, 2023 11 07.

Article in English | MEDLINE | ID: mdl-37871188

ABSTRACT

Per- and polyfluoroalkyl substances (PFAS) are of high concern, with calls to regulate them as a class. In 2021, the Organisation for Economic Co-operation and Development (OECD) revised the definition of PFAS to include any chemical containing at least one saturated CF2 or CF3 moiety. The consequence is that one of the largest open chemical collections, PubChem, with 116 million compounds, now contains over 7 million PFAS under this revised definition. These numbers are several orders of magnitude higher than previously established PFAS lists (typically thousands of entries) and pose an incredible challenge to researchers and computational workflows alike. This article describes a dynamic, openly accessible effort to navigate and explore the >7 million PFAS and >21 million fluorinated compounds (September 2023) in PubChem by establishing the "PFAS and Fluorinated Compounds in PubChem" Classification Browser (or "PubChem PFAS Tree"). A total of 36500 nodes support browsing of the content according to several categories, including classification, structural properties, regulatory status, or presence in existing PFAS suspect lists. Additional annotation and associated data can be used to create subsets (and thus manageable suspect lists or databases) of interest for a wide range of environmental, regulatory, exposomics, and other applications.

Subject(s)

Fluorocarbons , Water Pollutants, Chemical , Databases, Factual , Trees

6.

ShinyTPs: Curating Transformation Products from Text Mining Results.

Palm, Emma H; Chirsir, Parviel; Krier, Jessy; Thiessen, Paul A; Zhang, Jian; Bolton, Evan E; Schymanski, Emma L.

Environ Sci Technol Lett ; 10(10): 865-871, 2023 Oct 10.

Article in English | MEDLINE | ID: mdl-37840815

ABSTRACT

Transformation product (TP) information is essential to accurately evaluate the hazards compounds pose to human health and the environment. However, information about TPs is often limited, and existing data is often not fully Findable, Accessible, Interoperable, and Reusable (FAIR). FAIRifying existing TP knowledge is a relatively easy path toward improving access to data for identification workflows and for machine-learning-based algorithms. ShinyTPs was developed to curate existing transformation information derived from text-mined data within the PubChem database. The application (available as an R package) visualizes the text-mined chemical names to facilitate the user validation of the automatically extracted reactions. ShinyTPs was applied to a case study using 436 tentatively identified compounds to prioritize TP retrieval. This resulted in the extraction of 645 reactions (associated with 496 compounds), of which 319 were not previously available in PubChem. The curated reactions were added to the PubChem Transformations library, which was used as a TP suspect list for identification of TPs using the open-source workflow patRoon. In total, 72 compounds from the library were tentatively identified, 18% of which were curated using ShinyTPs, showing that the app can help support TP identification in non-target analysis workflows.

7.

Adding open spectral data to MassBank and PubChem using open source tools to support non-targeted exposomics of mixtures.

Elapavalore, Anjana; Kondic, Todor; Singh, Randolph R; Shoemaker, Benjamin A; Thiessen, Paul A; Zhang, Jian; Bolton, Evan E; Schymanski, Emma L.

Environ Sci Process Impacts ; 25(11): 1788-1801, 2023 Nov 15.

Article in English | MEDLINE | ID: mdl-37431591

ABSTRACT

The term "exposome" is defined as a comprehensive study of life-course environmental exposures and the associated biological responses. Humans are exposed to many different chemicals, which can pose a major threat to the well-being of humanity. Targeted or non-targeted mass spectrometry techniques are widely used to identify and characterize various environmental stressors when linking exposures to human health. However, identification remains challenging due to the huge chemical space applicable to exposomics, combined with the lack of sufficient relevant entries in spectral libraries. Addressing these challenges requires cheminformatics tools and database resources to share curated open spectral data on chemicals to improve the identification of chemicals in exposomics studies. This article describes efforts to contribute spectra relevant for exposomics to the open mass spectral library MassBank (https://www.massbank.eu) using various open source software efforts, including the R packages RMassBank and Shinyscreen. The experimental spectra were obtained from ten mixtures containing toxicologically relevant chemicals from the US Environmental Protection Agency (EPA) Non-Targeted Analysis Collaborative Trial (ENTACT). Following processing and curation, 5582 spectra from 783 of the 1268 ENTACT compounds were added to MassBank, and through this to other open spectral libraries (e.g., MoNA, GNPS) for community benefit. Additionally, an automated deposition and annotation workflow was developed with PubChem to enable the display of all MassBank mass spectra in PubChem, which is rerun with each MassBank release. The new spectral records have already been used in several studies to increase the confidence in identification in non-target small molecule identification workflows applied to environmental and exposomics research.

Subject(s)

Environmental Exposure , Software , Humans , Mass Spectrometry/methods , Environmental Exposure/analysis , Databases, Factual

8.

Bridging glycoinformatics and cheminformatics: integration efforts between GlyCosmos and PubChem.

Cheng, Tiejun; Ono, Tamiko; Shiota, Masaaki; Yamada, Issaku; Aoki-Kinoshita, Kiyoko F; Bolton, Evan E.

Glycobiology ; 33(6): 454-463, 2023 06 21.

Article in English | MEDLINE | ID: mdl-37129482

ABSTRACT

The GlyCosmos Glycoscience Portal (https://glycosmos.org) and PubChem (https://pubchem.ncbi.nlm.nih.gov/) are major portals for glycoscience and chemistry, respectively. GlyCosmos is a portal for glycan-related repositories, including GlyTouCan, GlycoPOST, and UniCarb-DR, as well as for glycan-related data resources that have been integrated from a variety of 'omics databases. Glycogenes, glycoproteins, lectins, pathways, and disease information related to glycans are accessible from GlyCosmos. PubChem, on the other hand, is a chemistry-based portal at the National Center for Biotechnology Information. PubChem provides information not only on chemicals, but also genes, proteins, pathways, as well as patents, bioassays, and more, from hundreds of data resources from around the world. In this work, these 2 portals have made substantial efforts to integrate their complementary data to allow users to cross between these 2 domains. In addition to glycan structures, key information, such as glycan-related genes, relevant diseases, glycoproteins, and pathways, was integrated and cross-linked with one another. The interfaces were designed to enable users to easily find, access, download, and reuse data of interest across these resources. Use cases are described illustrating and highlighting the type of content that can be investigated. In total, these integrations provide life science researchers improved awareness and enhanced access to glycan-related information.

Subject(s)

Databases, Chemical , Polysaccharides , Glycosylation , Workflow , Informatics , Polysaccharides/chemistry , Glycoconjugates/chemistry

9.

PubChem 2023 update.

Kim, Sunghwan; Chen, Jie; Cheng, Tiejun; Gindulyte, Asta; He, Jia; He, Siqian; Li, Qingliang; Shoemaker, Benjamin A; Thiessen, Paul A; Yu, Bo; Zaslavsky, Leonid; Zhang, Jian; Bolton, Evan E.

Nucleic Acids Res ; 51(D1): D1373-D1380, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36305812

ABSTRACT

PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource that serves a wide range of use cases. In the past two years, a number of changes were made to PubChem. Data from more than 120 data sources was added to PubChem. Some major highlights include: the integration of Google Patents data into PubChem, which greatly expanded the coverage of the PubChem Patent data collection; the creation of the Cell Line and Taxonomy data collections, which provide quick and easy access to chemical information for a given cell line and taxon, respectively; and the update of the bioassay data model. In addition, new functionalities were added to the PubChem programmatic access protocols, PUG-REST and PUG-View, including support for target-centric data download for a given protein, gene, pathway, cell line, and taxon and the addition of the 'standardize' option to PUG-REST, which returns the standardized form of an input chemical structure. A significant update was also made to PubChemRDF. The present paper provides an overview of these changes.

Subject(s)

Databases, Chemical , Drug Discovery , Drug Discovery/methods , Biological Assay , Proteins , Cheminformatics

10.

Database resources of the National Center for Biotechnology Information in 2023.

Sayers, Eric W; Bolton, Evan E; Brister, J Rodney; Canese, Kathi; Chan, Jessica; Comeau, Donald C; Farrell, Catherine M; Feldgarden, Michael; Fine, Anna M; Funk, Kathryn; Hatcher, Eneida; Kannan, Sivakumar; Kelly, Christopher; Kim, Sunghwan; Klimke, William; Landrum, Melissa J; Lathrop, Stacy; Lu, Zhiyong; Madden, Thomas L; Malheiro, Adriana; Marchler-Bauer, Aron; Murphy, Terence D; Phan, Lon; Pujar, Shashikant; Rangwala, Sanjida H; Schneider, Valerie A; Tse, Tony; Wang, Jiyao; Ye, Jian; Trawick, Barton W; Pruitt, Kim D; Sherry, Stephen T.

Nucleic Acids Res ; 51(D1): D29-D38, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36370100

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. New resources include the Comparative Genome Resource (CGR) and the BLAST ClusteredNR database. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, IgBLAST, GDV, RefSeq, NCBI Virus, GenBank type assemblies, iCn3D, ClinVar, GTR, dbGaP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

Subject(s)

Databases, Genetic , Databases, Nucleic Acid , United States , National Library of Medicine (U.S.) , Sequence Alignment , Biotechnology , Internet

11.

The NORMAN Suspect List Exchange (NORMAN-SLE): facilitating European and worldwide collaboration on suspect screening in high resolution mass spectrometry.

Mohammed Taha, Hiba; Aalizadeh, Reza; Alygizakis, Nikiforos; Antignac, Jean-Philippe; Arp, Hans Peter H; Bade, Richard; Baker, Nancy; Belova, Lidia; Bijlsma, Lubertus; Bolton, Evan E; Brack, Werner; Celma, Alberto; Chen, Wen-Ling; Cheng, Tiejun; Chirsir, Parviel; Cirka, Lubos; D'Agostino, Lisa A; Djoumbou Feunang, Yannick; Dulio, Valeria; Fischer, Stellan; Gago-Ferrero, Pablo; Galani, Aikaterini; Geueke, Birgit; Glowacka, Natalia; Glüge, Juliane; Groh, Ksenia; Grosse, Sylvia; Haglund, Peter; Hakkinen, Pertti J; Hale, Sarah E; Hernandez, Felix; Janssen, Elisabeth M-L; Jonkers, Tim; Kiefer, Karin; Kirchner, Michal; Koschorreck, Jan; Krauss, Martin; Krier, Jessy; Lamoree, Marja H; Letzel, Marion; Letzel, Thomas; Li, Qingliang; Little, James; Liu, Yanna; Lunderberg, David M; Martin, Jonathan W; McEachran, Andrew D; McLean, John A; Meier, Christiane; Meijer, Jeroen.

Environ Sci Eur ; 34(1): 104, 2022.

Article in English | MEDLINE | ID: mdl-36284750

ABSTRACT

Background: The NORMAN Association (https://www.norman-network.com/) initiated the NORMAN Suspect List Exchange (NORMAN-SLE; https://www.norman-network.com/nds/SLE/) in 2015, following the NORMAN collaborative trial on non-target screening of environmental water samples by mass spectrometry. Since then, this exchange of information on chemicals that are expected to occur in the environment, along with the accompanying expert knowledge and references, has become a valuable knowledge base for "suspect screening" lists. The NORMAN-SLE now serves as a FAIR (Findable, Accessible, Interoperable, Reusable) chemical information resource worldwide. Results: The NORMAN-SLE contains 99 separate suspect list collections (as of May 2022) from over 70 contributors around the world, totalling over 100,000 unique substances. The substance classes include per- and polyfluoroalkyl substances (PFAS), pharmaceuticals, pesticides, natural toxins, high production volume substances covered under the European REACH regulation (EC: 1272/2008), priority contaminants of emerging concern (CECs) and regulatory lists from NORMAN partners. Several lists focus on transformation products (TPs) and complex features detected in the environment with various levels of provenance and structural information. Each list is available for separate download. The merged, curated collection is also available as the NORMAN Substance Database (NORMAN SusDat). Both the NORMAN-SLE and NORMAN SusDat are integrated within the NORMAN Database System (NDS). The individual NORMAN-SLE lists receive digital object identifiers (DOIs) and traceable versioning via a Zenodo community (https://zenodo.org/communities/norman-sle), with a total of > 40,000 unique views, > 50,000 unique downloads and 40 citations (May 2022). NORMAN-SLE content is progressively integrated into large open chemical databases such as PubChem (https://pubchem.ncbi.nlm.nih.gov/) and the US EPA's CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/), enabling further access to these lists, along with the additional functionality and calculated properties these resources offer. PubChem has also integrated significant annotation content from the NORMAN-SLE, including a classification browser (https://pubchem.ncbi.nlm.nih.gov/classification/#hid=101). Conclusions: The NORMAN-SLE offers a specialized service for hosting suspect screening lists of relevance for the environmental community in an open, FAIR manner that allows integration with other major chemical resources. These efforts foster the exchange of information between scientists and regulators, supporting the paradigm shift to the "one substance, one assessment" approach. New submissions are welcome via the contacts provided on the NORMAN-SLE website (https://www.norman-network.com/nds/SLE/). Supplementary Information: The online version contains supplementary material available at 10.1186/s12302-022-00680-6.

12.

Studying the Parkinson's disease metabolome and exposome in biological samples through different analytical and cheminformatics approaches: a pilot study.

Talavera Andújar, Begoña; Aurich, Dagny; Aho, Velma T E; Singh, Randolph R; Cheng, Tiejun; Zaslavsky, Leonid; Bolton, Evan E; Mollenhauer, Brit; Wilmes, Paul; Schymanski, Emma L.

Anal Bioanal Chem ; 414(25): 7399-7419, 2022 Oct.

Article in English | MEDLINE | ID: mdl-35829770

ABSTRACT

Parkinson's disease (PD) is the second most prevalent neurodegenerative disease, with an increasing incidence in recent years due to the aging population. Genetic mutations alone only explain <10% of PD cases, while environmental factors, including small molecules, may play a significant role in PD. In the present work, 22 plasma (11 PD, 11 control) and 19 feces samples (10 PD, 9 control) were analyzed by non-target high-resolution mass spectrometry (NT-HRMS) coupled to two liquid chromatography (LC) methods (reversed-phase (RP) and hydrophilic interaction liquid chromatography (HILIC)). A cheminformatics workflow was optimized using open software (MS-DIAL and patRoon) and open databases (all public MSP-formatted spectral libraries for MS-DIAL, PubChemLite for Exposomics, and the LITMINEDNEURO list for patRoon). Furthermore, five disease-specific databases and three suspect lists (on PD and related disorders) were developed, using PubChem functionality to identifying relevant unknown chemicals. The results showed that non-target screening with the larger databases generally provided better results compared with smaller suspect lists. However, two suspect screening approaches with patRoon were also good options to study specific chemicals in PD. The combination of chromatographic methods (RP and HILIC) as well as two ionization modes (positive and negative) enhanced the coverage of chemicals in the biological samples. While most metabolomics studies in PD have focused on blood and cerebrospinal fluid, we found a higher number of relevant features in feces, such as alanine betaine or nicotinamide, which can be directly metabolized by gut microbiota. This highlights the potential role of gut dysbiosis in PD development.

Subject(s)

Exposome , Neurodegenerative Diseases , Parkinson Disease , Aged , Alanine , Betaine , Cheminformatics , Humans , Metabolome , Metabolomics/methods , Niacinamide , Pilot Projects

13.

PubChem Protein, Gene, Pathway, and Taxonomy Data Collections: Bridging Biology and Chemistry through Target-Centric Views of PubChem Data.

Kim, Sunghwan; Cheng, Tiejun; He, Siqian; Thiessen, Paul A; Li, Qingliang; Gindulyte, Asta; Bolton, Evan E.

J Mol Biol ; 434(11): 167514, 2022 06 15.

Article in English | MEDLINE | ID: mdl-35227770

ABSTRACT

PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical database at the U.S. National Institutes of Health. Visited by millions of users every month, it plays a role as a key chemical information resource for biomedical research communities. Data in PubChem is from hundreds of contributors and organized into multiple collections by record type. Among these are the Protein, Gene, Pathway, and Taxonomy data collections. Records in these collections contain information on chemicals related to a given biological target (i.e., protein, gene, pathway, or taxon), helping users to analyze and interpret the biological activity data of molecules. In addition, annotations about the biological targets are collected from authoritative or curated data sources and integrated into the four collections. The content can be programmatically accessed through PubChem's web service interfaces (including PUG View). A machine-readable representation of this content is also provided within PubChemRDF.

Subject(s)

Databases, Chemical , Biology , Drug Discovery , Proteins/genetics

14.

Plant Reactome and PubChem: The Plant Pathway and (Bio)Chemical Entity Knowledgebases.

Gupta, Parul; Naithani, Sushma; Preece, Justin; Kim, Sunghwan; Cheng, Tiejun; D'Eustachio, Peter; Elser, Justin; Bolton, Evan E; Jaiswal, Pankaj.

Methods Mol Biol ; 2443: 511-525, 2022.

Article in English | MEDLINE | ID: mdl-35037224

ABSTRACT

Plant Reactome (https://plantreactome.gramene.org) and PubChem ( https://pubchem.ncbi.nlm.nih.gov ) are two reference data portals and resources for curated plant pathways, small molecules, metabolites, gene products, and macromolecular interactions. Plant Reactome knowledgebase, a conceptual plant pathway network, is built by biocuration and integrating (bio)chemical entities, gene products, and macromolecular interactions. It provides manually curated pathways for the reference species Oryza sativa (rice) and gene orthology-based projections that extend pathway knowledge to 106 plant species. Currently, it hosts 320 reference pathways for plant metabolism, hormone signaling, transport, genetic regulation, plant organ development and differentiation, and biotic and abiotic stress responses. In addition to the pathway browsing and search functions, the Plant Reactome provides the analysis tools for pathway comparison between reference and projected species, pathway enrichment in gene expression data, and overlay of gene-gene interaction data on pathways. PubChem, a popular reference database of (bio)chemical entities, provides information on small molecules and other types of chemical entities, such as siRNAs, miRNAs, lipids, carbohydrates, and chemically modified nucleotides. The data in PubChem is collected from hundreds of data sources, including Plant Reactome. This chapter provides a brief overview of the Plant Reactome and the PubChem knowledgebases, their association to other public resources providing accessory information, and how users can readily access the contents.

Subject(s)

Knowledge Bases , Metabolic Networks and Pathways , Databases, Factual , Plants/genetics , Plants/metabolism , Proteins/metabolism

15.

Database resources of the national center for biotechnology information.

Sayers, Eric W; Bolton, Evan E; Brister, J Rodney; Canese, Kathi; Chan, Jessica; Comeau, Donald C; Connor, Ryan; Funk, Kathryn; Kelly, Chris; Kim, Sunghwan; Madej, Tom; Marchler-Bauer, Aron; Lanczycki, Christopher; Lathrop, Stacy; Lu, Zhiyong; Thibaud-Nissen, Francoise; Murphy, Terence; Phan, Lon; Skripchenko, Yuri; Tse, Tony; Wang, Jiyao; Williams, Rebecca; Trawick, Barton W; Pruitt, Kim D; Sherry, Stephen T.

Nucleic Acids Res ; 50(D1): D20-D26, 2022 01 07.

Article in English | MEDLINE | ID: mdl-34850941

ABSTRACT

The National Center for Biotechnology Information (NCBI) produces a variety of online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for the most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, RefSeq, SRA, Virus, dbSNP, dbVar, ClinicalTrials.gov, MMDB, iCn3D and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

Subject(s)

Biotechnology/trends , Databases, Genetic/trends , Databases, Chemical , Databases, Nucleic Acid , Databases, Protein , Humans , Internet , National Library of Medicine (U.S.) , PubMed , United States

16.

Discovering pesticides and their TPs in Luxembourg waters using open cheminformatics approaches.

Krier, Jessy; Singh, Randolph R; Kondic, Todor; Lai, Adelene; Diderich, Philippe; Zhang, Jian; Thiessen, Paul A; Bolton, Evan E; Schymanski, Emma L.

Environ Int ; 158: 106885, 2022 01.

Article in English | MEDLINE | ID: mdl-34560325

ABSTRACT

The diversity of hundreds of thousands of potential organic pollutants and the lack of (publicly available) information about many of them is a huge challenge for environmental sciences, engineering, and regulation. Suspect screening based on high-resolution liquid chromatography-mass spectrometry (LC-HRMS) has enormous potential to help characterize the presence of these chemicals in our environment, enabling the detection of known and newly emerging pollutants, as well as their potential transformation products (TPs). Here, suspect list creation (focusing on pesticides relevant for Luxembourg, incorporating data sources in 4 languages) was coupled to an automated retrieval of related TPs from PubChem based on high confidence suspect hits, to screen for pesticides and their TPs in Luxembourgish river samples. A computational workflow was established to combine LC-HRMS analysis and pre-screening of the suspects (including automated quality control steps), with spectral annotation to determine which pesticides and, in a second step, their related TPs may be present in the samples. The data analysis with Shinyscreen (https://gitlab.lcsb.uni.lu/eci/shinyscreen/), an open source software developed in house, coupled with custom-made scripts, revealed the presence of 162 potential pesticide masses and 96 potential TP masses in the samples. Further identification of these mass matches was performed using the open source approach MetFrag (https://msbi.ipb-halle.de/MetFrag/). Eventual target analysis of 36 suspects resulted in 31 pesticides and TPs confirmed at Level-1 (highest confidence), and five pesticides and TPs not confirmed due to different retention times. Spatio-temporal analysis of the results showed that TPs and pesticides followed similar trends, with a maximum number of potential detections in July. The highest detections were in the rivers Alzette and Mess and the lowest in the Sûre and Eisch. This study (a) added pesticides, classification information and related TPs into the open domain, (b) developed automated open source retrieval methods - both enhancing FAIRness (Findability, Accessibility, Interoperability and Reusability) of the data and methods; and (c) will directly support "L'Administration de la Gestion de l'Eau" on further monitoring steps in Luxembourg.

Subject(s)

Pesticides , Water Pollutants, Chemical , Cheminformatics , Luxembourg , Pesticides/analysis , Rivers , Water Pollutants, Chemical/analysis

17.

Discovering and Summarizing Relationships Between Chemicals, Genes, Proteins, and Diseases in PubChem.

Zaslavsky, Leonid; Cheng, Tiejun; Gindulyte, Asta; He, Siqian; Kim, Sunghwan; Li, Qingliang; Thiessen, Paul; Yu, Bo; Bolton, Evan E.

Front Res Metr Anal ; 6: 689059, 2021.

Article in English | MEDLINE | ID: mdl-34322655

ABSTRACT

The literature knowledge panels developed and implemented in PubChem are described. These help to uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in biomedical literature abstracts. Named entities in PubMed records are matched with chemical names in PubChem, disease names in Medical Subject Headings (MeSH), and gene/protein names in popular gene/protein information resources, and the most closely related entities are identified using statistical analysis and relevance-based sampling. Knowledge panels for the co-occurrence of chemical, disease, and gene/protein entities are included in PubChem Compound, Protein, and Gene pages, summarizing these in a compact form. Statistical methods for removing redundancy and estimating relevance scores are discussed, along with benefits and pitfalls of relying on automated (i.e., not human-curated) methods operating on data from multiple heterogeneous sources.

18.

PubChem Periodic Table and Element Pages: Improving Access to Information on Chemical Elements from Authoritative Sources.

Kim, Sunghwan; Gindulyte, Asta; Zhang, Jian; Thiessen, Paul A; Bolton, Evan E.

Chem Teach Int ; 3(1): 57-65, 2021 Mar.

Article in English | MEDLINE | ID: mdl-34268481

ABSTRACT

PubChem (https://pubchem.ncbi.nlm.nih.gov) is one of the top five most visited chemistry web sites in the world, with more than five million unique users per month (as of March 2020). Many of these users are educators, undergraduate students, and graduate students at academic institutions. Therefore, PubChem has a great potential as an online resource for chemical education. This paper describes the PubChem Periodic Table and Element pages, which were recently introduced to celebrate the 150th anniversary of the periodic table. These services help users navigate the abundant chemical element data available within PubChem, while providing a convenient entry point to explore additional chemical content, such as biological activities and health and safety data available in PubChem Compound pages for specific elements and their isotopes. The PubChem Periodic Table and Element pages are also available as widgets, which enable web developers to display PubChem's element data on web pages they design. The elemental data can be downloaded in common file formats and imported into data analysis programs (e.g., spreadsheet software, like Microsoft Excel and Google Sheets, and computer scripts, such as python and R). Overall, the PubChem Periodic Table and Element pages improve access to chemical element data from authoritative sources.

19.

FAIR chemical structures in the Journal of Cheminformatics.

Schymanski, Emma L; Bolton, Evan E.

J Cheminform ; 13(1): 50, 2021 Jul 07.

Article in English | MEDLINE | ID: mdl-34229711

ABSTRACT

The ability to access chemical information openly is an essential part of many scientific disciplines. The Journal of Cheminformatics is leading the way for rigorous, open cheminformatics in many ways, but there remains room for improvement in primary areas. This letter discusses how both authors and the journal alike can help increase the FAIRness (Findability, Accessibility, Interoperability, Reusability) of the chemical structural information in the journal. A proposed chemical structure template can serve as an interoperable Additional File format (already accessible), made more findable by linking the DOI of this data file to the article DOI metadata, supporting further reuse.

20.

Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag.

Schymanski, Emma L; Kondic, Todor; Neumann, Steffen; Thiessen, Paul A; Zhang, Jian; Bolton, Evan E.

J Cheminform ; 13(1): 19, 2021 Mar 08.

Article in English | MEDLINE | ID: mdl-33685519

ABSTRACT

Compound (or chemical) databases are an invaluable resource for many scientific disciplines. Exposomics researchers need to find and identify relevant chemicals that cover the entirety of potential (chemical and other) exposures over entire lifetimes. This daunting task, with over 100 million chemicals in the largest chemical databases, coupled with broadly acknowledged knowledge gaps in these resources, leaves researchers faced with too much-yet not enough-information at the same time to perform comprehensive exposomics research. Furthermore, the improvements in analytical technologies and computational mass spectrometry workflows coupled with the rapid growth in databases and increasing demand for high throughput "big data" services from the research community present significant challenges for both data hosts and workflow developers. This article explores how to reduce candidate search spaces in non-target small molecule identification workflows, while increasing content usability in the context of environmental and exposomics analyses, so as to profit from the increasing size and information content of large compound databases, while increasing efficiency at the same time. In this article, these methods are explored using PubChem, the NORMAN Network Suspect List Exchange and the in silico fragmentation approach MetFrag. A subset of the PubChem database relevant for exposomics, PubChemLite, is presented as a database resource that can be (and has been) integrated into current workflows for high resolution mass spectrometry. Benchmarking datasets from earlier publications are used to show how experimental knowledge and existing datasets can be used to detect and fill gaps in compound databases to progressively improve large resources such as PubChem, and topic-specific subsets such as PubChemLite. PubChemLite is a living collection, updating as annotation content in PubChem is updated, and exported to allow direct integration into existing workflows such as MetFrag. The source code and files necessary to recreate or adjust this are jointly hosted between the research parties (see data availability statement). This effort shows that enhancing the FAIRness (Findability, Accessibility, Interoperability and Reusability) of open resources can mutually enhance several resources for whole community benefit. The authors explicitly welcome additional community input on ideas for future developments.

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL