Compression-Complexity Measures for Analysis and Classification of Coronaviruses.

Munagala, Naga Venkata Trinath Sai; Amanchi, Prem Kumar; Balasubramanian, Karthi; Panicker, Athira; Nagaraj, Nithin

Munagala NVTS; Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India.
Amanchi PK; Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India.
Balasubramanian K; Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India.
Panicker A; Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India.
Nagaraj N; Consciousness Studies Programme, National Institute of Advanced Studies, Bengaluru 560012, Karnataka, India.

Entropy (Basel); 25(1), 2022 Dec 31.

Article in English | MEDLINE | ID: covidwho-2233544

ABSTRACT

Finding a vaccine or specific antiviral treatment for a pandemic viral disease (such as the ongoing COVID-19) requires rapid analysis, annotation, and evaluation of metagenomic libraries to enable quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable for this purpose, and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In this study, we investigate the use of distance measures based on compression-complexity (Effort-to-Compress, or ETC, and Lempel-Ziv, or LZ, complexity) for analyzing genomic sequences. The proposed distance measures are used to successfully reproduce the phylogenetic trees for three datasets: a mammalian dataset consisting of eight species clusters; a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1; and a set of coronaviruses comprising those that cause COVID-19 (SARS-CoV-2) and those that do not. Having demonstrated the usefulness of these compression-complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Two flavors of SVM (linear and quadratic), along with linear discriminant and fine k-nearest neighbors classifiers, are used for classification. Using a dataset comprising 1001 coronavirus sequences (those causing COVID-19 and those not causing COVID-19), a classification accuracy of 98% is achieved, with a sensitivity of 95% and a specificity of 99.8%. This work could be extended further to enable medical practitioners to automatically identify and characterize coronavirus strains and their rapidly growing mutants in a fast and efficient fashion.
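
To illustrate what a compression-complexity based distance can look like, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact form of the proposed distance, so the concatenation-based, normalized distance below is an assumption chosen only for illustration, and the names lz76_complexity and lz_distance are hypothetical. The complexity function counts phrases in a Lempel-Ziv (1976 style) exhaustive parsing of the sequence; ETC is not shown.

```python
def lz76_complexity(seq: str) -> int:
    """Count phrases in a Lempel-Ziv (1976 style) parsing of seq.

    Each phrase is the shortest substring starting at the current
    position that has not already occurred earlier in the sequence
    (overlap with the phrase itself is allowed, except its last symbol).
    """
    n = len(seq)
    i = 0          # start of the current phrase
    count = 0      # number of phrases found so far
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s: str, q: str) -> float:
    """Illustrative normalized distance (an assumption, not the paper's).

    Measures how many extra phrases are needed to describe one sequence
    after the other has been seen, normalized by the larger complexity.
    """
    cs, cq = lz76_complexity(s), lz76_complexity(q)
    if max(cs, cq) == 0:
        return 0.0  # both sequences empty
    extra_sq = lz76_complexity(s + q) - cs
    extra_qs = lz76_complexity(q + s) - cq
    return max(extra_sq, extra_qs) / max(cs, cq)


if __name__ == "__main__":
    # Toy nucleotide fragments, made up for demonstration only.
    a = "ACGTACGTACGTTTGA"
    b = "ACGTACGAACGTTTGA"
    c = "GGCCTTAAGGCCTATA"
    print(lz_distance(a, b), lz_distance(a, c))
```

A matrix of such pairwise distances could then be passed to a standard tree-building or clustering routine, or used to derive features for a classifier; which of these the study does, and how, is described in the full text rather than in this abstract.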

Keywords

COVID-19; Effort-to-Compress complexity; Lempel-Ziv complexity; compression-complexity measures; distance measure; machine learning

Full text: Available | Collection: International databases | Database: MEDLINE | Type of study: Experimental Studies | Topics: Vaccines | Language: English | Year: 2022 | Document Type: Article | Affiliation country: E25010081
