A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data.

Pechlivanis, Nikolaos; Togkousidis, Anastasios; Tsagiopoulou, Maria; Sgardelis, Stefanos; Kappas, Ilias; Psomopoulos, Fotis

Pechlivanis N; Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece.
Togkousidis A; Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece.
Tsagiopoulou M; Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece.
Sgardelis S; Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece.
Kappas I; Department of Ecology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece.
Psomopoulos F; Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece.

Front Genet ; 12: 618170, 2021.

Article in English | MEDLINE | ID: covidwho-1389158

ABSTRACT

The exponential growth of available genome sequences has spurred research on pattern detection with the aim of extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology to reconstruct the phylogenetic history of taxa. Yet mining information from the plethora of biological data and delineating species on a genetic basis remains an extremely difficult problem. Multiple algorithms and techniques have been developed to approach the problem multidimensionally. Here, we propose a computational framework for identifying potentially meaningful features based on k-mers retrieved from unaligned sequence data. Specifically, we have developed a process that uses unsupervised learning techniques to identify characteristic k-mers of the input dataset across a range of k-values and within a reasonable time frame. We use these k-mers as features for clustering the input sequences and for identifying differences between the distributions of k-mers across the dataset. The developed algorithm is part of an innovative and promising approach both to the problem of grouping sequence data based on their inherent characteristic features and to the study of changes in the distributions of k-mers as the k-value varies within a range of values. Our framework is fully developed in Python as open source software licensed under the MIT License, and is freely available at https://github.com/BiodataAnalysisGroup/kmerAnalyzer.
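
As a rough illustration of the idea of using k-mers from unaligned sequences as features across a range of k-values, the following minimal Python sketch counts overlapping k-mers, turns them into relative-frequency profiles, and compares sequences by Euclidean distance. It is not the kmerAnalyzer implementation: the function names (`count_kmers`, `kmer_profile`, `euclidean`), the toy sequences, the choice of distance, and the range of k are all assumptions made for this example, and the unsupervised feature selection and clustering steps described above are not reproduced here.

```python
from collections import Counter
from itertools import product
from math import sqrt


def count_kmers(sequence, k):
    """Count overlapping k-mers in one sequence, skipping any that contain non-ACGT symbols."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_profile(sequence, k):
    """Return relative k-mer frequencies in a fixed order over all 4**k possible k-mers."""
    counts = count_kmers(sequence, k)
    total = sum(counts.values())
    vocabulary = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy, made-up sequences of unequal length (hypothetical data; no alignment is needed).
sequences = {
    "seq_a": "ACGTACGTACGT",
    "seq_b": "ACGTACGTACGA",
    "seq_c": "TTTTGGGGCCCCAAAA",
}

# Repeat the comparison over a small, arbitrarily chosen range of k-values.
for k in range(2, 5):
    profiles = {name: kmer_profile(seq, k) for name, seq in sequences.items()}
    d_ab = euclidean(profiles["seq_a"], profiles["seq_b"])
    d_ac = euclidean(profiles["seq_a"], profiles["seq_c"])
    print(f"k={k}: d(a,b)={d_ab:.3f}  d(a,c)={d_ac:.3f}")
```

Because each profile has a fixed length of 4**k regardless of sequence length, the resulting vectors could be passed to any standard clustering routine; the dense vocabulary used here is only practical for small k.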

Keywords

SARS-CoV-2; feature selection; k-mers; phylogenetics; unsupervised learning

Full text: Available | Collection: International databases | Database: MEDLINE | Type of study: Prognostic study | Language: English | Journal: Front Genet | Year: 2021 | Document Type: Article | Affiliation country: FGENE.2021.618170
