Search | VHL Regional Portal

A general procedure for finding potentially erroneous entries in the database of retention indices.

Khrisanfov, Mikhail D; Matyushin, Dmitriy D; Samokhin, Andrey S.

Anal Chim Acta ; 1297: 342375, 2024 Apr 08.

Article in English | MEDLINE | ID: mdl-38438243

ABSTRACT

BACKGROUND: The NIST retention index database is one of the most widely used sources of retention indices. In both untargeted analysis and machine learning studies, filtering for potential errors is largely lacking or nonexistent. According to our estimates, about 80% of the compounds from both the NIST 17 and NIST 20 retention index databases have only one RI value per stationary phase, which makes searching for erroneous values with statistical methods impossible. Manual inspection is also impractical because the database contains more than 300 000 entries. RESULTS: We suggest a two-step procedure to find potentially erroneous retention indices based on machine learning. The first step is to use five predictive models to obtain predicted retention index values for the whole database. The second is to compare these predicted values against the experimental ones. We consider a retention index erroneous if its accuracy (the difference between the predicted and experimental values) is in the bottom 5% for each of the five models simultaneously. Using this method, we were able to detect 2093 outlier entries for standard and semi-standard non-polar stationary phases in the NIST 17 retention index database; 566 of those were corrected or removed by the developers in NIST 20. SIGNIFICANCE: This is a novel approach to finding potentially erroneous entries in a large-scale database with mostly unique entries, and it can be applied not only to retention indices. The procedure can help filter and report mishandled data to improve the quality of the dataset for machine learning applications and experimental use.

How searching against multiple libraries can lead to biased results in GC/MS-based metabolomics.

Samokhin, Andrey S; Matyushin, Dmitriy D.

Rapid Commun Mass Spectrom ; 37(3): e9437, 2023 Feb 15.

Article in English | MEDLINE | ID: mdl-36409456

ABSTRACT

RATIONALE: Databases of electron ionization mass spectra are often used in GC/MS-based untargeted metabolomics analysis. The results of the library search depend on several factors, such as the size and quality of the database and the library search algorithm. We found that the list of considered m/z values is another important parameter. Unfortunately, this information is not usually specified by software developers and is hidden from the end user. METHODS: We created synthetic data sets and determined how several popular software products (AMDIS, ChromaTOF, MS Search, and Xcalibur) select the list of m/z values for the library search. Moreover, we considered data sets of real mass spectra (present in both the NIST and FiehnLib libraries) and compared the library search results obtained with different software products. All programs under consideration use the NIST MS Search binaries to perform the library search using the Identity algorithm. RESULTS: We found that AMDIS and ChromaTOF can give biased library search results under particular conditions. In untargeted metabolomics, this can happen when NIST and FiehnLib libraries are used simultaneously, the scan range of the instrument is less than 85, and the correct answer is present only in the FiehnLib library. CONCLUSIONS: The main reason for biased results is that the information about the scan range is not stored in the metadata of library records. As a result, in the case of AMDIS and ChromaTOF software, some unrecorded peaks are considered as missing during the library search, the respective compound is penalized, and the correct answer falls outside the top five or even top 10 hits. At the same time, the default algorithm for selecting the list of considered m/z values implemented in MS Search is free from such unexpected behavior.
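The penalty for unrecorded peaks can be illustrated with a simplified sketch. It uses a plain cosine similarity, not the NIST Identity algorithm, and the spectra, the m/z 85 cutoff applied to the reference record, and the function name `cosine` are assumptions chosen only for illustration.

```python
import math
from typing import Iterable


def cosine(query: dict[int, float], reference: dict[int, float],
           mz_values: Iterable[int]) -> float:
    """Cosine similarity computed only over the listed m/z values.
    A peak absent from a spectrum is treated as zero intensity."""
    mz_list = list(mz_values)
    q = [query.get(mz, 0.0) for mz in mz_list]
    r = [reference.get(mz, 0.0) for mz in mz_list]
    dot = sum(a * b for a, b in zip(q, r))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in r))
    return dot / norm if norm else 0.0


# Made-up query spectrum acquired with a scan range starting below m/z 85.
query = {73: 999.0, 147: 400.0, 205: 300.0, 319: 150.0}
# Hypothetical library record of the same compound, recorded from m/z 85 upward,
# so the intense m/z 73 peak was never stored.
reference = {147: 400.0, 205: 300.0, 319: 150.0}

all_mz = sorted(set(query) | set(reference))
score_all = cosine(query, reference, all_mz)           # m/z 73 counted as missing
score_cut = cosine(query, reference, [mz for mz in all_mz if mz >= 85])

print(f"all m/z considered: {score_all:.3f}")
print(f"only m/z >= 85:     {score_cut:.3f}")
```

When the unrecorded m/z 73 peak is included in the comparison, the correct record scores far lower than when the list of m/z values is restricted to the range the record actually covers.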

Subject(s)

Algorithms , Software , Gas Chromatography-Mass Spectrometry/methods , Mass Spectrometry/methods , Metabolomics/methods

Multivariate Prognostic Model for Predicting the Outcome of Critically Ill Patients Using the Aromatic Metabolites Detected by Gas Chromatography-Mass Spectrometry.

Pautova, Alisa K; Samokhin, Andrey S; Beloborodova, Natalia V; Revelsky, Alexander I.

Molecules ; 27(15), 2022 Jul 26.

Article in English | MEDLINE | ID: mdl-35897959

ABSTRACT

A number of aromatic metabolites of tyrosine and phenylalanine have been investigated as promising new markers of infectious complications in critically ill patients of intensive care units (ICUs). The goal of our research was to build a multivariate model for predicting the outcome of critically ill patients, regardless of the main pathology, on the day of admission to the ICU. Eight aromatic metabolites were detected in serum using gas chromatography-mass spectrometry. Samples were obtained from critically ill patients (n = 79), including survivors (n = 44) and non-survivors (n = 35), and from healthy volunteers (n = 52). The concentrations of aromatic metabolites were statistically different in the critically ill patients and healthy volunteers. A univariate model for predicting the outcome of the critically ill patients was based on 3-(4-hydroxyphenyl)lactic acid (p-HPhLA). Two multivariate classification models were built based on aromatic metabolites using the SIMCA method. The predictive models were compared with the clinical APACHE II scale using ROC analysis. For all of the predictive models, the areas under the ROC curve were close to one. The aromatic metabolites (one or a number of them) can be used in clinical practice for the prognosis of the outcome of critically ill patients on the day of admission to the ICU.
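For readers unfamiliar with ROC analysis, the area under the ROC curve can be sketched as the probability that a randomly chosen non-survivor receives a higher model score than a randomly chosen survivor. The function name `roc_auc` and the scores below are invented for illustration and are not data from the study.

```python
from typing import Sequence


def roc_auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison:
    fraction of (positive, negative) pairs ranked correctly, ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up model scores (higher = predicted worse outcome).
non_survivors = [0.9, 0.8, 0.7, 0.4]
survivors = [0.6, 0.3, 0.2, 0.1]
print(roc_auc(non_survivors, survivors))  # 0.9375
```

An area close to one, as reported in the abstract, means that almost every such pair is ranked in the correct order.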

Subject(s)

Critical Illness , Sepsis , APACHE , Gas Chromatography-Mass Spectrometry , Humans , Intensive Care Units , Prognosis , ROC Curve
