
Seedability: optimizing alignment parameters for sensitive sequence comparison.

Ayad, Lorraine A K; Chikhi, Rayan; Pissis, Solon P.

Bioinform Adv ; 3(1): vbad108, 2023.

Article in English | MEDLINE | ID: mdl-37621456

ABSTRACT

Motivation: Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. Results: The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments. Availability and implementation: https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
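
To make the dependence on k concrete, the sketch below counts the distinct exact k-mers shared by two sequences for several values of k. It is only an illustration of why the choice of k matters for divergent sequences. It is not Seedability's estimation procedure, and the function names `kmers` and `shared_seed_counts` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_seed_counts(a, b, k_values):
    """Map each k to the number of distinct k-mers that a and b share.

    Illustrative assumption: a "seed" here is any exact k-mer present
    in both sequences, with no minimizer sampling or chaining.
    """
    return {k: len(kmers(a, k) & kmers(b, k)) for k in k_values}


if __name__ == "__main__":
    a = "ACGTACGT"
    b = "ACGTTCGT"  # one substitution relative to a
    for k, count in shared_seed_counts(a, b, range(3, 6)).items():
        print(f"k={k}: {count} shared k-mers")
```

In this toy pair a single substitution already leaves no shared 5-mers, which is the kind of situation in which a smaller k may be needed to obtain any seed at all.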

IsoXpressor: A Tool to Assess Transcriptional Activity within Isochores.

Ayad, Lorraine A K; Dourou, Athanasia-Maria; Arhondakis, Stilianos; Pissis, Solon P.

Genome Biol Evol ; 12(9): 1573-1578, 2020 09 01.

Article in English | MEDLINE | ID: mdl-32857856

ABSTRACT

Genomes are characterized by large regions of homogeneous base compositions known as isochores. The latter are divided into GC-poor and GC-rich classes linked to distinct functional and structural properties. Several studies have addressed how isochores shape function and structure. To aid in this important subject, we present IsoXpressor, a tool designed for the analysis of the functional property of transcription within isochores. IsoXpressor allows users to process RNA-Seq data in relation to the isochores, and it can be employed to investigate any biological question of interest for any species. The results presented herein as proof of concept are focused on the preimplantation process in Homo sapiens (human) and Macaca mulatta (rhesus monkey).

Subject(s)

Genomics/methods , Isochores , Software , Transcription, Genetic , Animals , Humans , Macaca mulatta , Sequence Analysis, RNA

SMART: SuperMaximal approximate repeats tool.

Ayad, Lorraine A K; Charalampopoulos, Panagiotis; Pissis, Solon P.

Bioinformatics ; 36(8): 2589-2591, 2020 04 15.

Article in English | MEDLINE | ID: mdl-31873724

ABSTRACT

SUMMARY: State-of-the-art repeat analysis tools rely on extending maximal repeated pairs to enumerate maximal k-mismatch repeats. These pairs can be quadratic in n, the length of the input sequence, and thus greedy heuristics are applied to speed up the extension. Here, we introduce supermaximal k-mismatch repeats, which are linear in n and capture all maximal k-mismatch repeats: every maximal k-mismatch repeat is a substring of some supermaximal k-mismatch repeat. We present SMART, a tool based on recent algorithmic advances implemented in C++ to compute supermaximal k-mismatch repeats directly, and show that these elements are statistically much more significant than the output of the state-of-the-art. AVAILABILITY AND IMPLEMENTATION: http://github.com/lorrainea/smart (GNU GPL v3.0). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

CNEFinder: finding conserved non-coding elements in genomes.

Ayad, Lorraine A K; Pissis, Solon P; Polychronopoulos, Dimitris.

Bioinformatics ; 34(17): i743-i747, 2018 09 01.

Article in English | MEDLINE | ID: mdl-30423090

ABSTRACT

Motivation: Conserved non-coding elements (CNEs) represent an enigmatic class of genomic elements which, despite being extremely conserved across evolution, do not encode proteins. Their functions are still largely unknown. Thus, there exists a need to systematically investigate their roles in genomes. To this end, identifying sets of CNEs in a wide range of organisms is an important first step. Currently, there are no tools published in the literature for systematically identifying CNEs in genomes. Results: We fill this gap by presenting CNEFinder, a tool for identifying CNEs between two given DNA sequences with user-defined criteria. The results presented here show the tool's ability to identify CNEs accurately and efficiently. CNEFinder is based on a k-mer technique for computing maximal exact matches. The tool thus does not require or compute whole-genome alignments or indexes, such as the suffix array or the Burrows-Wheeler Transform (BWT), which makes it flexible to use on a wide scale. Availability and implementation: Free software under the terms of the GNU GPL (https://github.com/lorrainea/CNEFinder).
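
The general idea of computing maximal exact matches from k-mers, without a suffix array or BWT, can be sketched as follows. This is a minimal illustration under simplifying assumptions (a plain hash index of every k-mer of the first sequence, then character-by-character extension). It is not CNEFinder's implementation, and `maximal_exact_matches` is a hypothetical name.

```python
from collections import defaultdict


def maximal_exact_matches(a, b, k):
    """Return sorted (start_a, start_b, length) triples for every maximal
    exact match of length >= k between a and b, seeded by shared k-mers."""
    # Index every k-mer of a by its starting positions.
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    matches = set()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            # Extend the seed to the left while the characters agree.
            s, t = i, j
            while s > 0 and t > 0 and a[s - 1] == b[t - 1]:
                s -= 1
                t -= 1
            # Extend the seed to the right while the characters agree.
            e, f = i + k, j + k
            while e < len(a) and f < len(b) and a[e] == b[f]:
                e += 1
                f += 1
            # Seeds inside the same match collapse to one triple.
            matches.add((s, t, e - s))
    return sorted(matches)


if __name__ == "__main__":
    print(maximal_exact_matches("TTACGTACGG", "CCACGTACAA", 4))
```

Every seed is extended independently here, which is simple but repeats work for long matches. A practical tool would skip seeds that fall inside a match it has already reported.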

Subject(s)

Genome , RNA, Untranslated/genetics , Sequence Analysis/methods , Software , Conserved Sequence/genetics , Humans

MARS: improving multiple circular sequence alignment using refined sequences.

Ayad, Lorraine A K; Pissis, Solon P.

BMC Genomics ; 18(1): 86, 2017 01 14.

Article in English | MEDLINE | ID: mdl-28088189

ABSTRACT

BACKGROUND: A fundamental assumption of all widely-used multiple sequence alignment techniques is that the left- and right-most positions of the input sequences are relevant to the alignment. However, the position where a sequence starts or ends can be totally arbitrary due to a number of reasons: arbitrariness in the linearisation (sequencing) of a circular molecular structure; or inconsistencies introduced into sequence databases due to different linearisation standards. These scenarios are relevant, for instance, in the process of multiple sequence alignment of mitochondrial DNA, viroid, viral or other genomes, which have a circular molecular structure. A solution for these inconsistencies would be to identify a suitable rotation (cyclic shift) for each sequence; these refined sequences may in turn lead to improved multiple sequence alignments using the preferred multiple sequence alignment program. RESULTS: We present MARS, a new heuristic method for improving Multiple circular sequence Alignment using Refined Sequences. MARS was implemented in the C++ programming language as a program to compute the rotations (cyclic shifts) required to best align a set of input sequences. Experimental results, using real and synthetic data, show that MARS improves the alignments, with respect to standard genetic measures and the inferred maximum-likelihood-based phylogenies, and outperforms state-of-the-art methods both in terms of accuracy and efficiency. Our results show, among others, that the average pairwise distance in the multiple sequence alignment of a dataset of widely-studied mitochondrial DNA sequences is reduced by around 5% when MARS is applied before a multiple sequence alignment is performed. CONCLUSIONS: Analysing multiple sequences simultaneously is fundamental in biological research and multiple sequence alignment has been found to be a popular method for this task. Conventional alignment techniques cannot be used effectively when the position where sequences start is arbitrary. We present here a method, which can be used in conjunction with any multiple sequence alignment program, to address this problem effectively and efficiently.
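
The notion of a suitable rotation (cyclic shift) can be illustrated with a brute-force sketch that tries every rotation of one sequence and keeps the one closest to a reference. The simplifications are assumptions made for illustration only: both sequences have the same length and are compared by Hamming distance. MARS itself is a heuristic for sets of sequences and does not work this way. The helper names are hypothetical.

```python
def rotate(seq, r):
    """Return seq cyclically shifted so that position r becomes position 0."""
    return seq[r:] + seq[:r]


def hamming(x, y):
    """Number of mismatching positions between equal-length strings."""
    return sum(c != d for c, d in zip(x, y))


def best_rotation(ref, seq):
    """Return the rotation r of seq that minimises the Hamming distance
    to ref, breaking ties in favour of the smallest r."""
    if len(ref) != len(seq) or not seq:
        raise ValueError("sequences must be non-empty and of equal length")
    return min(range(len(seq)), key=lambda r: (hamming(ref, rotate(seq, r)), r))


if __name__ == "__main__":
    ref = "ACGTTGCA"
    seq = rotate(ref, 3)  # the same circular sequence, linearised elsewhere
    r = best_rotation(ref, seq)
    print(r, rotate(seq, r))
```

Trying all rotations costs quadratic time per pair, which is why heuristic methods are used on real circular genomes.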

Subject(s)

Computational Biology/methods , DNA, Circular , Sequence Alignment , Sequence Analysis, DNA , Software , Algorithms , Molecular Sequence Annotation , Reproducibility of Results , Web Browser

libFLASM: a software library for fixed-length approximate string matching.

Ayad, Lorraine A K; Pissis, Solon P; Retha, Ahmad.

BMC Bioinformatics ; 17(1): 454, 2016 Nov 10.

Article in English | MEDLINE | ID: mdl-27832739

ABSTRACT

BACKGROUND: Approximate string matching is the problem of finding all factors of a given text that are at a distance at most k from a given pattern. Fixed-length approximate string matching is the problem of finding all factors of a text of length n that are at a distance at most k from any factor of length ℓ of a pattern of length m. There exist bit-vector techniques to solve the fixed-length approximate string matching problem in time [Formula: see text] and space [Formula: see text] under the edit and Hamming distance models, where w is the size of the computer word; as such these techniques are independent of the distance threshold k or the alphabet size. Fixed-length approximate string matching is a generalisation of approximate string matching and, hence, has numerous direct applications in computational molecular biology and elsewhere. RESULTS: We present and make available libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching under both the edit and the Hamming distance models. Moreover, we describe how fixed-length approximate string matching is applied to solve real problems by incorporating libFLASM into established applications for multiple circular sequence alignment as well as single and structured motif extraction. Specifically, we describe how it can be used to improve the accuracy of multiple circular sequence alignment in terms of the inferred likelihood-based phylogenies; and we also describe how it is used to efficiently find motifs in molecular sequences representing regulatory or functional regions. A comparison of the library's performance with that of other algorithms shows that it is competitive, especially with increasing distance thresholds. CONCLUSIONS: Fixed-length approximate string matching is a generalisation of the classic approximate string matching problem. We present libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching. The extensive experimental results presented here suggest that other applications could benefit from using libFLASM, and thus further maintenance and development of libFLASM is desirable.
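
The problem statement above can be made concrete with a brute-force baseline under the Hamming distance model. The sketch compares every length-ℓ factor of the text with every length-ℓ factor of the pattern. It is not the bit-vector technique mentioned in the abstract and not the libFLASM API. The function name `flasm_hamming` and its return format are assumptions made for illustration.

```python
def flasm_hamming(text, pattern, ell, k):
    """Brute-force fixed-length approximate string matching (Hamming).

    Returns (text_start, pattern_start, distance) for every pair of
    length-ell factors whose Hamming distance is at most k.
    """
    hits = []
    for j in range(len(text) - ell + 1):
        window = text[j:j + ell]
        for i in range(len(pattern) - ell + 1):
            factor = pattern[i:i + ell]
            # Count mismatching positions between the two factors.
            d = sum(c != e for c, e in zip(window, factor))
            if d <= k:
                hits.append((j, i, d))
    return hits


if __name__ == "__main__":
    print(flasm_hamming("GATTACA", "ATTC", ell=3, k=1))
```

This baseline performs on the order of n·m·ℓ character comparisons, so it is useful mainly as a reference for checking faster implementations on small inputs.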

Subject(s)

Computational Biology/methods , Gene Library , Software , Algorithms , Databases as Topic , Likelihood Functions , Nucleotide Motifs/genetics , Sequence Alignment , Time Factors
