Search | BVS Regional Portal (test)

Predicting the Pathway Involvement of Metabolites Based on Combined Metabolite and Pathway Features.

Huckvale, Erik D; Moseley, Hunter N B.

Metabolites ; 14(5)2024 May 07.

Article in English | MEDLINE | ID: mdl-38786743

ABSTRACT

A major limitation of most metabolomics datasets is the sparsity of pathway annotations for detected metabolites. It is common for fewer than half of the identified metabolites in these datasets to have a known metabolic pathway involvement. To address this limitation, machine learning models have been developed to predict the association of a metabolite with a "pathway category", as defined by a metabolic knowledge base like KEGG. Past models were implemented as a single binary classifier specific to a single pathway category, so generating predictions for multiple pathway categories required a set of binary classifiers. This approach multiplied the computational resources necessary for training while diluting the positive entries in the gold standard datasets needed for training. To address these limitations, we propose a generalization of the metabolic pathway prediction problem: a single binary classifier that accepts features representing both a metabolite and a pathway category and then predicts whether the given metabolite is involved in the corresponding pathway category. We demonstrate that this metabolite-pathway feature-pair approach not only outperforms the combined performance of separately trained binary classifiers but also demonstrates an order of magnitude improvement in robustness: a Matthews correlation coefficient of 0.784 ± 0.013 versus 0.768 ± 0.154.
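As a rough illustration of the feature-pair idea described in this abstract, the sketch below builds one training row per (metabolite, pathway category) pair by concatenating the two feature vectors, and computes the Matthews correlation coefficient from a confusion matrix. The identifiers, feature values, and function names are hypothetical placeholders, not the authors' data or code; the abstract does not say how features are encoded or which classifier is used.

```python
from itertools import product
from math import sqrt


def build_pair_dataset(metabolite_features, pathway_features, known_pairs):
    """Return (X, y) with one row per (metabolite, pathway) pair.

    Each row is the metabolite vector followed by the pathway vector;
    the label is 1 when the pair is a known association, else 0.
    """
    X, y = [], []
    for (m_id, m_vec), (p_id, p_vec) in product(
        sorted(metabolite_features.items()), sorted(pathway_features.items())
    ):
        X.append(list(m_vec) + list(p_vec))
        y.append(1 if (m_id, p_id) in known_pairs else 0)
    return X, y


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy inputs (not real metabolite or pathway features).
metabolites = {"m1": [1, 0], "m2": [0, 1]}
pathways = {"p1": [1, 0, 0], "p2": [0, 1, 0]}
known = {("m1", "p1"), ("m2", "p2")}

X, y = build_pair_dataset(metabolites, pathways, known)
```

With this layout, a single binary classifier can be trained on `X` and `y` for all pathway categories at once, rather than fitting one classifier per category.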

A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement.

Huckvale, Erik D; Moseley, Hunter N B.

PLoS One ; 19(5): e0299583, 2024.

Article in English | MEDLINE | ID: mdl-38696410

ABSTRACT

The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several previously published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting previously published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
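The leakage mechanism described in this abstract can be sketched in a few lines: when a dataset contains duplicate entries, k-fold splitting can place one copy in the test fold and another in the training folds. The sketch below assumes that a "duplicate" means an exactly identical (SMILES string, label) tuple and uses a simple interleaved fold assignment; both are illustrative assumptions, and the toy entries and labels are made up rather than taken from the KEGG-SMILES dataset.

```python
def duplicate_fraction(entries):
    """Fraction of entries that are redundant copies of an earlier entry."""
    return 1 - len(set(entries)) / len(entries)


def deduplicate(entries):
    """Drop exact duplicates while preserving first-seen order."""
    return list(dict.fromkeys(entries))


def kfold_indices(n, k):
    """Assign indices 0..n-1 to k folds in interleaved order."""
    return [list(range(i, n, k)) for i in range(k)]


def leaked_test_entries(entries, k):
    """Count test entries that also occur in the training part of their split."""
    leaked = 0
    for test_idx in kfold_indices(len(entries), k):
        test_set = set(test_idx)
        train = {entries[i] for i in range(len(entries)) if i not in test_set}
        leaked += sum(1 for i in test_idx if entries[i] in train)
    return leaked


# Hypothetical toy dataset: (SMILES string, made-up pathway label).
entries = [
    ("CCO", "A"),
    ("CCO", "A"),        # exact duplicate of the first entry
    ("CC(=O)O", "B"),
    ("C1CCCCC1", "A"),
]

before = leaked_test_entries(entries, k=2)
after = leaked_test_entries(deduplicate(entries), k=2)
```

Any nonzero `before` count means the model is partly evaluated on examples it has already seen, which is one way cross-validation scores become inflated; de-duplicating before splitting removes that source of leakage.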

Subjects

Metabolic Networks and Pathways , Supervised Machine Learning , Humans , Datasets as Topic

Predicting The Pathway Involvement Of Metabolites Based on Combined Metabolite and Pathway Features.

Huckvale, Erik D; Moseley, Hunter N B.

bioRxiv ; 2024 Apr 02.

Article in English | MEDLINE | ID: mdl-38617261

ABSTRACT

A major limitation of most metabolomics datasets is the sparsity of pathway annotations for detected metabolites. It is common for fewer than half of the identified metabolites in these datasets to have a known metabolic pathway involvement. To address this limitation, machine learning models have been developed to predict the association of a metabolite with a "pathway category", as defined by a metabolic knowledge base like the Kyoto Encyclopedia of Genes and Genomes. Most of these models are implemented as a single binary classifier specific to a single pathway category, so generating predictions for multiple pathway categories requires a set of binary classifiers. This one-classifier-per-category approach multiplies the computational resources necessary for training while diluting the positive entries in the gold standard datasets needed for training. To address the limitations of training separate classifiers, we propose a generalization of the metabolic pathway prediction problem: a single binary classifier that accepts both features representing a metabolite and features representing a generic pathway category and then predicts whether the given metabolite is involved in the corresponding pathway category. We demonstrate that this metabolite-pathway feature-pair approach is not only competitive with the combined performance of separately trained binary classifiers but also outperforms the previous benchmark models.

Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites.

Huckvale, Erik D; Powell, Christian D; Jin, Huan; Moseley, Hunter N B.

Metabolites ; 13(11)2023 Nov 01.

Article in English | MEDLINE | ID: mdl-37999216

ABSTRACT

Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, with metabolites being both the reactants and products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old KEGG-derived metabolite datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG, following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.

Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites.

Huckvale, Erik D; Powell, Christian D; Jin, Huan; Moseley, Hunter N B.

bioRxiv ; 2023 Oct 09.

Article in English | MEDLINE | ID: mdl-37873272

ABSTRACT

Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, with metabolites being both the reactants and products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old KEGG-derived metabolite datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG, following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.

Pairwise Correlation Analysis of the Alzheimer's Disease Neuroimaging Initiative (ADNI) Dataset Reveals Significant Feature Correlation.

Huckvale, Erik D; Hodgman, Matthew W; Greenwood, Brianna B; Stucki, Devorah O; Ward, Katrisa M; Ebbert, Mark T W; Kauwe, John S K; Miller, Justin B.

Genes (Basel) ; 12(11)2021 Oct 21.

Article in English | MEDLINE | ID: mdl-34828267

ABSTRACT

The Alzheimer's Disease Neuroimaging Initiative (ADNI) contains extensive patient measurements (e.g., magnetic resonance imaging [MRI], biometrics, RNA expression) from Alzheimer's disease (AD) cases and controls that have recently been used by machine learning algorithms to evaluate AD onset and progression. While using a variety of biomarkers is essential to AD research, highly correlated input features can significantly decrease machine learning model generalizability and performance. Additionally, redundant features unnecessarily increase the computational time and resources necessary to train predictive models. Therefore, we used 49,288 biomarkers and 793,600 extracted MRI features to assess feature correlation within the ADNI dataset and to determine the extent to which this issue might impact large-scale analyses using these data. We found that 93.457% of biomarkers, 92.549% of the gene expression values, and 100% of MRI features were strongly correlated with at least one other feature in ADNI based on our Bonferroni-corrected α (p-value ≤ 1.40754 × 10⁻¹³). We provide a comprehensive mapping of all ADNI biomarkers to highly correlated features within the dataset. Additionally, we show that significant correlation within the ADNI dataset should be resolved before performing bulk data analyses, and we provide recommendations to address these issues. We anticipate that these recommendations and resources will help guide researchers utilizing the ADNI dataset to increase model performance and reduce the cost and complexity of their analyses.
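For readers who want to see how a threshold of this size can arise, the sketch below applies a Bonferroni correction over all pairwise comparisons among the 49,288 + 793,600 features mentioned in the abstract. The starting significance level of 0.05 is an assumption (the abstract does not state it), although the result is numerically consistent with the reported value. A plain Pearson correlation helper is included for illustration; the abstract does not specify the correlation measure or p-value method, so that helper is a stand-in, not the authors' procedure.

```python
from math import comb, sqrt


def bonferroni_alpha(n_features, alpha=0.05):
    """Per-test threshold when every pair of features is tested once.

    alpha=0.05 is an assumed family-wise significance level.
    """
    n_tests = comb(n_features, 2)  # number of unordered feature pairs
    return alpha / n_tests


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Feature counts taken from the abstract.
n_features = 49_288 + 793_600
threshold = bonferroni_alpha(n_features)
print(f"{threshold:.5e}")
```

Under these assumptions the threshold rounds to 1.40754 × 10⁻¹³, matching the value quoted in the abstract.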

Subjects

Alzheimer Disease/diagnosis , Alzheimer Disease/genetics , Genetic Association Studies , Neuroimaging , Transcriptome , Alzheimer Disease/epidemiology , Alzheimer Disease/therapy , Biomarkers/analysis , Datasets as Topic/statistics & numerical data , Genetic Association Studies/statistics & numerical data , Humans , Machine Learning , Magnetic Resonance Imaging/methods , Neuroimaging/methods , Neuroimaging/statistics & numerical data
