Your browser doesn't support javascript.
Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches.
Miao, Miao; De Clercq, Erik; Li, Guangdi.
  • Miao M; Hunan Provincial Key Laboratory of Clinical Epidemiology, Xiangya School of Public Health, Central South University, Changsha 410078, China.
  • De Clercq E; Department of Microbiology, Immunology and Transplantation, Rega Institute for Medical Research, KU Leuven, 3000 Leuven, Belgium.
  • Li G; Hunan Provincial Key Laboratory of Clinical Epidemiology, Xiangya School of Public Health, Central South University, Changsha 410078, China.
Microorganisms ; 10(9)2022 Sep 04.
Article in English | MEDLINE | ID: covidwho-2010211
ABSTRACT
Despite the active development of SARS-CoV-2 surveillance methods (e.g., Nextstrain, GISAID, Pangolin), the global emergence of various SARS-CoV-2 viral lineages that potentially cause antiviral and vaccine failure has driven the need for accurate and efficient SARS-CoV-2 genome sequence classifiers. This study presents an optimized method that accurately identifies the viral lineages of SARS-CoV-2 genome sequences using existing schemes. For Nextstrain and GISAID clades, a template matching-based method is proposed to quantify the differences between viral clades and to play an important role in classification evaluation. Furthermore, to improve the typing accuracy of SARS-CoV-2 genome sequences, an ensemble model that integrates a combination of machine learning-based methods (such as Random Forest and Catboost) with optimized weights is proposed for Nextstrain, Pangolin, and GISAID clades. Cross-validation is applied to optimize the parameters of the machine learning-based method and the weight settings of the ensemble model. To improve the efficiency of the model, in addition to the one-hot encoding method, we have proposed a nucleotide site mutation-based data structure that requires less computational resources and performs better in SARS-CoV-2 genome sequence typing. Based on an accumulated database of >1 million SARS-CoV-2 genome sequences, performance evaluations show that the proposed system has a typing accuracy of 99.879%, 97.732%, and 96.291% for Nextstrain, Pangolin, and GISAID clades, respectively. A single prediction only takes an average of <20 ms on a portable laptop. Overall, this study provides an efficient and accurate SARS-CoV-2 genome sequence typing system that benefits current and future surveillance of SARS-CoV-2 variants.
Keywords

Full text: Available Collection: International databases Database: MEDLINE Type of study: Experimental Studies / Prognostic study / Randomized controlled trials Topics: Vaccines / Variants Language: English Year: 2022 Document Type: Article Affiliation country: Microorganisms10091785

Similar

MEDLINE

...
LILACS

LIS


Full text: Available Collection: International databases Database: MEDLINE Type of study: Experimental Studies / Prognostic study / Randomized controlled trials Topics: Vaccines / Variants Language: English Year: 2022 Document Type: Article Affiliation country: Microorganisms10091785