Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites.

Huckvale, Erik D; Powell, Christian D; Jin, Huan; Moseley, Hunter N B.

Metabolites ; 13(11), 2023 Nov 01.

Article in English | MEDLINE | ID: mdl-37999216

ABSTRACT

Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.
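The two metrics reported above can be illustrated with a minimal sketch. It computes an F1 score and a Matthews correlation coefficient (MCC) from the confusion counts of one binary classifier per pathway category, then averages across categories. The counts, the category names, and the choice to weight by the number of positive entries are all assumptions made for illustration; the abstract does not state how the authors weighted their average.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_average(scores, weights):
    """Average of scores, each weighted by the matching weight."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical confusion counts for two pathway categories (not from the paper).
counts = {
    "category_a": {"tp": 80, "tn": 900, "fp": 10, "fn": 10},
    "category_b": {"tp": 30, "tn": 950, "fp": 5, "fn": 15},
}

# Assumed weighting: number of true positive-class entries per category.
weights = [c["tp"] + c["fn"] for c in counts.values()]
f1_values = [f1_score(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
mcc_values = [mcc(**c) for c in counts.values()]

print("weighted F1 :", round(weighted_average(f1_values, weights), 4))
print("weighted MCC:", round(weighted_average(mcc_values, weights), 4))
```

Unlike F1, MCC also uses the true negatives, which is one reason the two are often reported together for imbalanced binary tasks.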

Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites.

Huckvale, Erik D; Powell, Christian D; Jin, Huan; Moseley, Hunter N B.

bioRxiv ; 2023 Oct 09.

Article in English | MEDLINE | ID: mdl-37873272

ABSTRACT

Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1-score of 0.8180 and Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.

The metabolomics workbench file status website: a metadata repository promoting FAIR principles of metabolomics data.

Powell, Christian D; Moseley, Hunter N B.

BMC Bioinformatics ; 24(1): 299, 2023 Jul 24.

Article in English | MEDLINE | ID: mdl-37482620

ABSTRACT

BACKGROUND: An updated version of the mwtab Python package for programmatic access to the Metabolomics Workbench (MetabolomicsWB) data repository was released at the beginning of 2021. Along with updating the package to match the changes to MetabolomicsWB's 'mwTab' file format specification and enhancing the package's functionality, the included validation facilities were used to detect and catalog file inconsistencies and errors across all publicly available datasets in MetabolomicsWB. RESULTS: The MetabolomicsWB File Status website was developed to provide continuous validation of MetabolomicsWB data files and a useful interface to all found inconsistencies and errors. This list of detectable issues/errors includes format parsing errors, format compliance issues, access problems via MetabolomicsWB's REST interface, and other small inconsistencies that can hinder reusability. The website uses the mwtab Python package to pull down and validate each available analysis file and then generates an HTML report. The website is updated on a weekly basis. Moreover, the Python website design utilizes GitHub and GitHub.io, providing an easy-to-replicate template for implementing other metadata, virtual, and meta-repositories. CONCLUSIONS: The MetabolomicsWB File Status website provides a metadata repository of validation metadata to promote the FAIR use of existing metabolomics datasets from the MetabolomicsWB data repository.

Subject(s)

Metadata , Software , Metabolomics , Information Storage and Retrieval

Identifying and sharing per- and polyfluoroalkyl substances hot-spot areas and exposures in drinking water.

Ojha, Sweta; Thompson, P Travis; Powell, Christian D; Moseley, Hunter N B; Pennell, Kelly G.

Sci Data ; 10(1): 388, 2023 06 16.

Article in English | MEDLINE | ID: mdl-37328532

ABSTRACT

Exposure to per- and polyfluoroalkyl substances (PFAS) in drinking water is widely recognized as a public health concern. Decision-makers who are responsible for managing PFAS drinking water risks lack the tools to acquire the information they need. In response to this need, we provide a detailed description of a Kentucky dataset that allows decision-makers to visualize potential hot-spot areas and evaluate drinking water systems that may be susceptible to PFAS contamination. The dataset includes information extracted from publicly available sources to create five different maps in ArcGIS Online and highlights potential sources of PFAS contamination in the environment in relation to drinking water systems. As datasets of PFAS drinking water sampling continue to grow as part of evolving regulatory requirements, we used this Kentucky dataset as an example to promote the reuse of this dataset and others like it. We incorporated the FAIR (Findable, Accessible, Interoperable, and Reusable) principles by creating a Figshare item that includes all data and associated metadata with these five ArcGIS maps.

Subject(s)

Drinking Water , Fluorocarbons , Water Pollutants, Chemical , Drinking Water/analysis , Water Pollutants, Chemical/analysis , Fluorocarbons/analysis , Public Health , Base Sequence

A proposed FAIR approach for disseminating geospatial information system maps.

Thompson, P Travis; Ojha, Sweta; Powell, Christian D; Pennell, Kelly G; Moseley, Hunter N B.

Sci Data ; 10(1): 389, 2023 06 16.

Article in English | MEDLINE | ID: mdl-37328607

ABSTRACT

We present a draft Minimum Information About Geospatial Information System (MIAGIS) standard for facilitating public deposition of geospatial information system (GIS) datasets that follows the FAIR (Findable, Accessible, Interoperable and Reusable) principles. The draft MIAGIS standard includes a deposition directory structure and a minimum javascript object notation (JSON) metadata formatted file that is designed to capture critical metadata describing GIS layers and maps as well as their sources of data and methods of generation. The associated miagis Python package facilitates the creation of this MIAGIS metadata file and directly supports metadata extraction from both Esri JSON and GEOJSON GIS data formats plus options for extraction from user-specified JSON formats. We also demonstrate their use in crafting two example depositions of ArcGIS generated maps. We hope this draft MIAGIS standard along with the supporting miagis Python package will assist in establishing a GIS standards group that will develop the draft into a full standard for the wider GIS community as well as a future public repository for GIS datasets.
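The kind of metadata extraction described above can be sketched with the Python standard library alone. The example reads a GeoJSON document and summarizes its layer-level facts. It does not use the miagis package, and the output keys (`source`, `feature_count`, `geometry_types`, `property_names`) are hypothetical names chosen for illustration, not fields of the MIAGIS standard.

```python
import json


def summarize_geojson(text, source_name):
    """Collect basic descriptive metadata from a GeoJSON string.

    The returned keys are illustrative only and are not the MIAGIS schema.
    """
    data = json.loads(text)
    if data.get("type") == "FeatureCollection":
        features = data.get("features", [])
    elif data.get("type") == "Feature":
        features = [data]
    else:
        features = []

    # Distinct geometry types, skipping features with a null geometry.
    geometry_types = sorted(
        {f["geometry"]["type"] for f in features if f.get("geometry")}
    )
    # Union of property (attribute) names across all features.
    property_names = sorted(
        {key for f in features for key in (f.get("properties") or {})}
    )
    return {
        "source": source_name,
        "feature_count": len(features),
        "geometry_types": geometry_types,
        "property_names": property_names,
    }


# Made-up two-feature layer used only to exercise the function.
SAMPLE = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-84.5, 38.0]},
         "properties": {"name": "Site A"}},
        {"type": "Feature",
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
         "properties": {"name": "Area B", "system_id": "X1"}},
    ],
})

summary = summarize_geojson(SAMPLE, "example.geojson")
print(json.dumps(summary, indent=2))
```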

Subject(s)

Information Systems , Metadata

Academic Tracker: Software for tracking and reporting publications associated with authors and grants.

Thompson, P Travis; Powell, Christian D; Moseley, Hunter N B.

PLoS One ; 17(11): e0277834, 2022.

Article in English | MEDLINE | ID: mdl-36399468

ABSTRACT

In recent years, United States federal funding agencies, including the National Institutes of Health (NIH) and the National Science Foundation (NSF), have implemented public access policies to make research supported by funding from these federal agencies freely available to the public. Enforcement is primarily through annual and final reports submitted to these funding agencies, where all peer-reviewed publications must be registered through the appropriate mechanism as required by the specific federal funding agency. Unreported and/or incorrectly reported papers can result in delayed acceptance of annual and final reports and even funding delays for current and new research grants. So, it's important to make sure every peer-reviewed publication is reported properly and in a timely manner. For large collaborative research efforts, the tracking and proper registration of peer-reviewed publications along with generation of accurate annual and final reports can create a large administrative burden. With large collaborative teams, it is easy for these administrative tasks to be overlooked, forgotten, or lost in the shuffle. In order to help with this reporting burden, we have developed the Academic Tracker software package, implemented in the Python 3 programming language and supporting Linux, Windows, and Mac operating systems. Academic Tracker helps with publication tracking and reporting by comprehensively searching major peer-reviewed publication tracking web portals, including PubMed, Crossref, ORCID, and Google Scholar, given a list of authors. Academic Tracker provides highly customizable reporting templates so information about the resulting publications is easily transformed into appropriate formats for tracking and reporting purposes. 
The source code and extensive documentation are hosted on GitHub (https://moseleybioinformaticslab.github.io/academic_tracker/) and are also available on the Python Package Index (https://pypi.org/project/academic_tracker) for easy installation.

Subject(s)

Financing, Organized , National Institutes of Health (U.S.) , United States , PubMed , Software , Peer Review

The mwtab Python Library for RESTful Access and Enhanced Quality Control, Deposition, and Curation of the Metabolomics Workbench Data Repository.

Powell, Christian D; Moseley, Hunter N B.

Metabolites ; 11(3), 2021 Mar 12.

Article in English | MEDLINE | ID: mdl-33808985

ABSTRACT

The Metabolomics Workbench (MW) is a public scientific data repository consisting of experimental data and metadata from metabolomics studies collected with mass spectrometry (MS) and nuclear magnetic resonance (NMR) analyses. MW has been constantly evolving: updating its 'mwTab' text file format, adding a JavaScript Object Notation (JSON) file format, implementing a REpresentational State Transfer (REST) interface, and nearly quadrupling the number of datasets hosted on the repository within the last three years. In order to keep up with the quickly evolving state of the MW repository, the 'mwtab' Python library and package have been continuously updated to mirror the changes in the 'mwTab' and JSONized formats and contain many new enhancements, including methods for interacting with the MW REST interface, enhanced format validation features, and advanced features for parsing and searching for specific metabolite data and metadata. We used the enhanced format validation features to evaluate all available datasets in MW to facilitate improved curation and FAIRness of the repository. The 'mwtab' Python package is now officially released as version 1.0.1 and is freely available on GitHub and the Python Package Index (PyPI) under a Clear Berkeley Software Distribution (BSD) license with documentation available on ReadTheDocs.
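As a rough illustration of what RESTful access to the repository looks like, the sketch below assembles a request URL from path segments. It does not use the mwtab package. The base URL and the segment order (context, input item, input value, output item, optional output format) are assumptions based on the publicly documented Metabolomics Workbench URL pattern and should be checked against the official REST documentation. The function name `build_rest_url` is hypothetical.

```python
from urllib.parse import quote

# Assumed base URL of the Metabolomics Workbench REST service.
BASE_URL = "https://www.metabolomicsworkbench.org/rest"


def build_rest_url(context, input_item, input_value, output_item,
                   output_format=None):
    """Join REST path segments into a request URL (hypothetical helper)."""
    parts = [context, input_item, input_value, output_item]
    if output_format:
        parts.append(output_format)
    # Percent-encode each segment so that unusual identifiers stay valid.
    return BASE_URL + "/" + "/".join(quote(str(p), safe="") for p in parts)


# Example: the mwTab text file for one analysis identifier.
url = build_rest_url("study", "analysis_id", "AN000002", "mwtab", "txt")
print(url)
```

Building the URL separately from the network call keeps the request logic easy to test offline.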

Entropy based analysis of vertebrate sperm protamines sequences: evidence of potential dityrosine and cysteine-tyrosine cross-linking in sperm protamines.

Powell, Christian D; Kirchoff, Daniel C; DeRouchey, Jason E; Moseley, Hunter N B.

BMC Genomics ; 21(1): 277, 2020 Apr 03.

Article in English | MEDLINE | ID: mdl-32245406

ABSTRACT

BACKGROUND: Spermatogenesis is the process by which germ cells develop into spermatozoa in the testis. Sperm protamines are small, arginine-rich nuclear proteins which replace somatic histones during spermatogenesis, allowing a hypercondensed DNA state that leads to a smaller nucleus and facilitating sperm head formation. In eutherian mammals, the protamine-DNA complex is achieved through a combination of intra- and intermolecular cysteine cross-linking and possibly histidine-cysteine zinc ion binding. Most metatherian sperm protamines lack cysteine but perform the same function. This lack of dicysteine cross-linking has made the mechanism behind metatherian protamines folding unclear. RESULTS: Protamine sequences from UniProt's databases were pulled down and sorted into homologous groups. Multiple sequence alignments were then generated and a gap weighted relative entropy score calculated for each position. For the eutherian alignments, the cysteine containing positions were the most highly conserved. For the metatherian alignment, the tyrosine containing positions were the most highly conserved and corresponded to the cysteine positions in the eutherian alignment. CONCLUSIONS: High conservation indicates likely functionally/structurally important residues at these positions in the metatherian protamines and the correspondence with cysteine positions within the eutherian alignment implies a similarity in function. One possible explanation is that the metatherian protamine structure relies upon dityrosine cross-linking between these highly conserved tyrosines. Also, the human protamine P1 sequence has a tyrosine substitution in a position expecting eutherian dicysteine cross-linking. Similarly, some members of the metatherian Planigales genus contain cysteine substitutions in positions expecting plausible metatherian dityrosine cross-linking. Rare cysteine-tyrosine cross-linking could explain both observations.
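A per-position conservation score of the kind described in RESULTS can be sketched as follows. The abstract does not give the exact formula, so this is one plausible reading: the relative entropy (Kullback-Leibler divergence) of the column's residue frequencies against a background distribution, scaled by the fraction of non-gap characters in the column. The uniform background and the toy alignment are assumptions for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed background: uniform over the 20 standard amino acids.
UNIFORM_BACKGROUND = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def gap_weighted_relative_entropy(column, background=UNIFORM_BACKGROUND,
                                  gap="-"):
    """Relative entropy (bits) of one column, scaled by its non-gap fraction."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = 0.0
    for residue, count in Counter(residues).items():
        p = count / n
        entropy += p * math.log2(p / background[residue])
    # Down-weight columns that are mostly gaps.
    return entropy * n / len(column)


# Toy alignment (made-up sequences, equal length); columns come from zip(*...).
alignment = ["MARYRC", "MARYR-", "MAKYRC"]
scores = [gap_weighted_relative_entropy(col) for col in zip(*alignment)]
for position, score in enumerate(scores, start=1):
    print(position, round(score, 3))
```

With a uniform background, a fully conserved, gap-free column reaches the maximum score of log2(20) bits, while gaps and mixed residues both lower the score.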

Subject(s)

Computational Biology/methods , DNA/metabolism , Protamines/chemistry , Protamines/metabolism , Spermatozoa/metabolism , Amino Acid Sequence , Animals , Binding Sites , Conserved Sequence , Cysteine/metabolism , Entropy , Eutheria , Male , Protamines/genetics , Protein Binding , Protein Folding , Sequence Alignment , Tyrosine/analogs & derivatives , Tyrosine/metabolism
