Benchmarking machine learning robustness in Covid-19 genome sequence classification.

Ali, Sarwan; Sahoo, Bikram; Zelikovsky, Alexander; Chen, Pin-Yu; Patterson, Murray

Ali S; Department of Computer Science, Georgia State University, Atlanta, GA, USA. sali85@student.gsu.edu.
Sahoo B; Department of Computer Science, Georgia State University, Atlanta, GA, USA.
Zelikovsky A; Department of Computer Science, Georgia State University, Atlanta, GA, USA.
Chen PY; IBM T. J. Watson Research Center, Yorktown Heights, Yorktown, NY, USA.
Patterson M; Department of Computer Science, Georgia State University, Atlanta, GA, USA.

Sci Rep; 13(1): 4154, 2023-03-13.

Article in English | MEDLINE | ID: covidwho-2249038

ABSTRACT

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
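As a rough illustration of what perturbing a sequence under a "perturbation budget" could look like, the sketch below substitutes a fixed fraction of positions in a nucleotide string with a different base. This is only an assumption-laden toy: the function names (`perturb_sequence`, `hamming`), the uniform substitution model, and the example string are hypothetical and are not taken from the paper, whose simulations mimic platform-specific error profiles (Illumina, PacBio) that this sketch does not attempt to reproduce.

```python
import random

NUCLEOTIDES = "ACGT"


def perturb_sequence(seq, budget, seed=0):
    """Return a copy of `seq` with a fraction `budget` of positions substituted.

    Assumption: errors are uniform random substitutions. Real sequencing
    platforms have non-uniform error profiles (and indels), which are not
    modeled here.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    n_errors = round(len(seq) * budget)
    positions = rng.sample(range(len(seq)), n_errors)  # distinct positions
    chars = list(seq)
    for pos in positions:
        # Always pick a base different from the current one,
        # so every chosen position really changes.
        alternatives = [b for b in NUCLEOTIDES if b != chars[pos]]
        chars[pos] = rng.choice(alternatives)
    return "".join(chars)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy 40-base nucleotide string (hypothetical example input).
original = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTC"
noisy = perturb_sequence(original, budget=0.10)
print(hamming(original, noisy), "of", len(original), "positions changed")
```

In a robustness benchmark of the kind the abstract describes, one would presumably feed both `original` and `noisy` through the same embedding and classifier and compare the predictions as the budget grows; that step is omitted here.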

Subject(s)

Computer Simulation; Genome, Viral; Machine Learning; Research Design; SARS-CoV-2; Machine Learning/standards; SARS-CoV-2/classification; SARS-CoV-2/genetics; Genome, Viral/genetics; Viral Proteins/genetics; COVID-19/virology; Sequence Analysis, RNA

Full text: Available
Collection: International databases
Database: MEDLINE
Main subject: Research Design / Computer Simulation / Genome, Viral / Machine Learning / SARS-CoV-2
Type of study: Experimental study / Prognostic study / Randomized controlled trial
Language: English
Journal: Sci Rep
Year: 2023
Document type: Article
Affiliation country: S41598-023-31368-3