NGS data vectorization, clustering, and finding key codons in SARS-CoV-2 variations.

Kim, Juhyeon; Cheon, Saeyeon; Ahn, Insung

Kim J; Department of Data-Centric Problem Solving Research, Korea Institute of Science and Technology Information, Yuseong-gu, Daejeon, Korea.
Cheon S; Center for Convergent Research of Emerging Virus Infection, Korea Research Institute of Chemical Technology, Yuseong-gu, Daejeon, Korea.
Ahn I; Department of Industrial Engineering, Ajou University, Suwon, South Korea.

BMC Bioinformatics ; 23(1): 187, 2022 May 17.

Article in English | MEDLINE | ID: covidwho-1846792

ABSTRACT

The rapid global spread of SARS-CoV-2 has given the virus numerous opportunities to develop variants. It is therefore critical to determine the degree of variation and the parts of the virus in which it occurs. This study proposes machine learning methods to vectorize sequence data, perform clustering analysis, and visualize the results. A total of 224,073 SARS-CoV-2 sequences were collected from NCBI and GISAID, and the data were visualized using dimensionality reduction and clustering models such as t-SNE and DBSCAN. In the clustering results, the virus first detected in Wuhan, China, in December 2019 was distinguished from later variants, including Omicron and Delta, and the clusters formed by the various variants could be confirmed in a two-dimensional graph. Furthermore, feature importance extraction methods such as Random Forest or Shapley values made it possible to examine which codon changes in the spike protein played a major role in distinguishing the variants. Compared with existing tree-based sequence analysis, the proposed method has the advantage of analysing and visualizing a large amount of data at once. Because it can handle a variety of sequence data, it can be applied to a wide range of diseases, including influenza and SARS-CoV-2. The proposed method therefore has the potential to become widely used for the effective analysis of disease variations.
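The pipeline described in the abstract (vectorize, cluster, then rank positions) can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' implementation. The toy sequences are invented, and the helper names (`one_hot`, `hamming`, `dbscan`, `position_scores`) are hypothetical. Several simplifications are assumed: sequences are aligned amino-acid strings rather than codons, the t-SNE projection is omitted, DBSCAN runs directly on Hamming distance between one-hot vectors, and a simple information-gain score stands in for Random Forest or Shapley-value importance.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Flatten an aligned protein sequence into a 0/1 vector (20 slots per position)."""
    vec = []
    for residue in seq:
        vec.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vec

def hamming(a, b):
    """Number of coordinates at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def dbscan(points, eps, min_pts, dist):
    """Textbook DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)  # None = not yet visited

    def neighbours(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:  # only core points expand the cluster
                queue.extend(j_nbrs)
    return labels

def entropy(items):
    """Shannon entropy (bits) of the label distribution in items."""
    n = len(items)
    return -sum(c / n * log2(c / n) for c in Counter(items).values())

def position_scores(seqs, labels):
    """Information gain of each sequence position with respect to the cluster labels.
    This is a stand-in for Random Forest / Shapley importance, not the same method."""
    base = entropy(labels)
    scores = []
    for pos in range(len(seqs[0])):
        groups = {}
        for seq, label in zip(seqs, labels):
            groups.setdefault(seq[pos], []).append(label)
        conditional = sum(len(g) / len(seqs) * entropy(g) for g in groups.values())
        scores.append(base - conditional)
    return scores

# Invented 8-residue fragments: two groups that differ at positions 2 and 5,
# plus one within-group change at position 7.
seqs = ["MFVFLVLL", "MFVFLVLL", "MFVFLVLI",
        "MFKFLYLL", "MFKFLYLL", "MFKFLYLI"]

vectors = [one_hot(s) for s in seqs]
labels = dbscan(vectors, eps=2, min_pts=2, dist=hamming)
scores = position_scores(seqs, labels)
ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)

print("cluster labels:", labels)
print("positions ranked by score:", ranked)
```

On this toy input the two groups fall into separate clusters, and the positions that differ between the groups receive the highest scores, while the position that varies only within each group scores near zero. A real analysis at the scale reported in the abstract would need an optimized library rather than this quadratic-time sketch.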

Subject(s)

COVID-19; Magnoliopsida; Cluster Analysis; Codon; Machine Learning; SARS-CoV-2/genetics

Keywords

Clustering; Density based spatial clustering of applications with noise; Feature selection; Protein sequence analysis; Random forest; SARS-CoV-2; Sequence data pre-process; Shapley value; t-Stochastic neighbour embedding

Full text: Available
Collection: International databases
Database: MEDLINE
Main subject: Magnoliopsida / COVID-19
Type of study: Randomized controlled trials
Topics: Variants
Language: English
Journal: BMC Bioinformatics
Journal subject: Medical Informatics
Year: 2022
Document Type: Article