Unveiling the Robustness of Machine Learning Models in Classifying COVID-19 Spike Sequences (preprint)

Sarwan Ali; Pin-Yu Chen; Murray Patterson

bioRxiv; 2023.

Preprint in English | bioRxiv | ID: ppzbmed-10.1101.2023.08.24.554651

ABSTRACT

The global COVID-19 pandemic has made a wealth of data available to researchers, presenting a unique opportunity to investigate the behavior of the virus. This research aims to facilitate the design of efficient vaccinations and proactive measures against future pandemics by using machine learning (ML) models in decision-making processes. Ensuring the reliability of ML predictions in these critical and rapidly evolving scenarios is therefore of utmost importance. Studies of genomic sequences from individuals infected with the coronavirus have revealed that the majority of variations occur within a specific region known as the spike (or S) protein. Previous research has analyzed spike proteins using various ML techniques, including classification and clustering of variants. However, spike protein sequences may contain errors, which could lead to misleading outcomes and misguide decision-making authorities. Hence, a comprehensive examination of the robustness of ML and deep learning models in classifying spike sequences is essential. In this paper, we propose a framework for evaluating and benchmarking the robustness of diverse ML methods in spike sequence classification. Through extensive evaluation of a wide range of ML algorithms, from classical methods such as naive Bayes and logistic regression to advanced approaches such as deep neural networks, we demonstrate that using k-mers to create the feature vector representation of spike proteins is more effective than traditional one-hot encoding-based embedding methods. Our findings also indicate that deep neural networks exhibit superior accuracy and robustness compared to non-deep-learning baselines. To the best of our knowledge, this study is the first to benchmark the accuracy and robustness of machine-learning classification models against various types of random corruptions in COVID-19 spike protein sequences. The benchmarking framework established in this research could help future researchers gain a deeper understanding of the behavior of the coronavirus, enabling proactive measures and the prevention of similar pandemics.
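To make the contrast between the two representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 20-letter amino acid alphabet, a k-mer length of 3, and a simple random substitution as the corruption model; the abstract does not specify any of these, and the function names (`kmer_vector`, `one_hot_vector`, `corrupt`) are hypothetical.

```python
import random
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(seq, k=3, alphabet=AMINO_ACIDS):
    """Count every length-k substring; return counts in a fixed k-mer order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def one_hot_vector(seq, alphabet=AMINO_ACIDS):
    """Concatenate one indicator block per position in the sequence."""
    index = {a: i for i, a in enumerate(alphabet)}
    vec = [0] * (len(seq) * len(alphabet))
    for pos, residue in enumerate(seq):
        if residue in index:  # unknown symbols leave an all-zero block
            vec[pos * len(alphabet) + index[residue]] = 1
    return vec


def corrupt(seq, rate, rng, alphabet=AMINO_ACIDS):
    """Assumed corruption model: each residue is independently replaced,
    with probability `rate`, by a different residue chosen at random."""
    out = []
    for residue in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in alphabet if a != residue]))
        else:
            out.append(residue)
    return "".join(out)


# Short toy sequence for illustration only, not a full spike sequence.
toy = "MFVFLVLLPLVSSQ"
rng = random.Random(0)
noisy = corrupt(toy, rate=0.1, rng=rng)

clean_kmers = kmer_vector(toy)      # fixed length 20**3, independent of len(toy)
noisy_kmers = kmer_vector(noisy)
clean_one_hot = one_hot_vector(toy)  # length grows with len(toy)
```

In this sketch the k-mer vector has the same length for every input, whereas the one-hot vector depends on sequence length and position. A robustness benchmark along the lines described above would train a classifier on clean vectors and compare its accuracy on vectors built from corrupted sequences at several values of `rate`.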

Subject(s)

COVID-19; Learning Disabilities

Full text: Available Collection: Preprints Database: bioRxiv Main subject: COVID-19 / Learning Disabilities Language: English Year: 2023 Document Type: Preprint
