Accurate and fast clade assignment via deep learning and frequency chaos game representation.
Avila Cartes, Jorge; Anand, Santosh; Ciccolella, Simone; Bonizzoni, Paola; Della Vedova, Gianluca.
  • Avila Cartes J; Department of Computer Science, Systems and Communications, University of Milano-Bicocca, Milan 20125, Italy.
  • Anand S; Department of Computer Science, Systems and Communications, University of Milano-Bicocca, Milan 20125, Italy.
  • Ciccolella S; Department of Computer Science, Systems and Communications, University of Milano-Bicocca, Milan 20125, Italy.
  • Bonizzoni P; Department of Computer Science, Systems and Communications, University of Milano-Bicocca, Milan 20125, Italy.
  • Della Vedova G; Department of Computer Science, Systems and Communications, University of Milano-Bicocca, Milan 20125, Italy.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: covidwho-2313424
ABSTRACT

BACKGROUND:

Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is continuously expanding. A challenging task is the development of fast and accurate tools that can distinguish between the different SARS-CoV-2 variants and assign them to a clade.

RESULTS:

In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences; we implement it in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieved a $96.29\%$ overall accuracy, while a similar tool, Covidex, obtained a $77.12\%$ overall accuracy. As far as we know, our method is the first to use deep learning and FCGR for intraspecies classification. Furthermore, by using feature importance methods, CouGaR-g makes it possible to identify k-mers that match SARS-CoV-2 marker variants.
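
To illustrate the representation, an FCGR of order k can be viewed as a $2^k \times 2^k$ grid in which every possible k-mer owns one cell and the cell stores how often that k-mer occurs in the sequence. The following is a minimal sketch under stated assumptions, not the CouGaR-g implementation: the corner layout in `CORNER`, the function names `kmer_to_cell` and `fcgr`, and the choice to skip k-mers containing ambiguous bases are all illustrative, and the layout used by CouGaR-g may differ.

```python
# Assumed corner layout as (x_bit, y_bit); one common convention,
# not necessarily the one used by CouGaR-g.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to its (row, col) cell in a 2^k x 2^k grid."""
    row = col = 0
    # In the chaos game the most recent nucleotide selects the coarsest
    # quadrant, so the last base must end up in the most significant bit.
    for base in reversed(kmer):
        x, y = CORNER[base]
        col = (col << 1) | x
        row = (row << 1) | y
    return row, col


def fcgr(sequence, k):
    """Count k-mer occurrences and place each count in its grid cell."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-mers with N or other ambiguity codes are skipped.
        if all(base in CORNER for base in kmer):
            row, col = kmer_to_cell(kmer)
            grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for line in fcgr("ACGTNACG", k=2):
        print(line)
```

Under these assumptions every sequence, whatever its length, becomes a fixed-size matrix, which is what lets an image-style model such as a CNN consume it directly.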

CONCLUSIONS:

By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, with comparable running times; the improvement is also due to our training on a much larger dataset. Our method, implemented in CouGaR-g, is able to detect k-mers that capture the relevant biological information distinguishing the clades, known as marker variants.
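
Because each FCGR cell corresponds to exactly one k-mer, a per-cell importance score can be translated back into a list of k-mers. The sketch below shows that translation only. It is a hypothetical illustration: the abstract does not name the feature importance methods, so `scores` is simply an assumed $2^k \times 2^k$ grid of numbers, and the corner layout `CORNER` and the helper names `cell_to_kmer` and `top_kmers` are assumptions rather than part of CouGaR-g.

```python
# Assumed corner layout as (x_bit, y_bit); the layout used by CouGaR-g
# may differ, in which case only this table changes.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
BASE_AT = {bits: base for base, bits in CORNER.items()}


def cell_to_kmer(row, col, k):
    """Recover the k-mer that owns cell (row, col) of a 2^k x 2^k grid.

    Assumes the first base of the k-mer sits in the least significant bit
    of row and col, and the last base in the most significant bit.
    """
    bases = []
    for i in range(k):
        x = (col >> i) & 1
        y = (row >> i) & 1
        bases.append(BASE_AT[(x, y)])
    return "".join(bases)


def top_kmers(scores, k, n):
    """Return the k-mers of the n cells with the highest scores."""
    size = 2 ** k
    ranked = sorted(
        ((scores[r][c], r, c) for r in range(size) for c in range(size)),
        reverse=True,
    )
    return [cell_to_kmer(r, c, k) for _, r, c in ranked[:n]]


if __name__ == "__main__":
    # Hypothetical importance scores for k = 2 (a 4 x 4 grid).
    scores = [[0.0] * 4 for _ in range(4)]
    scores[2][0] = 0.9
    scores[0][3] = 0.5
    print(top_kmers(scores, k=2, n=2))
```

The k-mers returned this way could then be compared against known marker variants of each clade, which is the kind of check the abstract describes.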

AVAILABILITY:

The trained models can be tested online by providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.

Full text: Available | Collection: International databases | Database: MEDLINE | Main subject: Puma / Deep Learning / COVID-19 | Type of study: Prognostic study / Randomized controlled trials | Topics: Variants | Limits: Animals | Language: English | Year: 2022 | Document Type: Article | Affiliation country: Gigascience
