Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 21
Filter
Add more filters










Publication year range
1.
BMC Bioinformatics ; 22(1): 51, 2021 Feb 06.
Article in English | MEDLINE | ID: mdl-33549041

ABSTRACT

BACKGROUND: An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre. Inverted repeats are found in both prokaryotic and eukaryotic genomes and they have been linked with countless possible functions. Many international consortia provide a comprehensive description of common genetic variation making alternative sequence representations, such as IUPAC encoding, necessary for leveraging the full potential of such broad variation datasets. RESULTS: We present IUPACPAL, an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences allowing also for potential mismatches and gaps in the inverted repeats. CONCLUSION: Within the parameters that were tested, our experimental results show that IUPACPAL compares favourably to a similar application packaged with EMBOSS. We show that IUPACPAL identifies many previously unidentified inverted repeats when compared with EMBOSS, and that this is also performed with orders of magnitude improved speed.


Subject(s)
Genome , Prokaryotic Cells , Repetitive Sequences, Nucleic Acid , Base Sequence , Inverted Repeat Sequences , Repetitive Sequences, Nucleic Acid/genetics
2.
Bioinformatics ; 36(12): 3687-3692, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32246826

ABSTRACT

MOTIVATION: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. RESULTS: We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position. AVAILABILITY AND IMPLEMENTATION: GenMap can be installed via bioconda. Binaries and C++ source code are available on https://github.com/cpockrandt/genmap.


Subject(s)
Genome , Software , Algorithms , Genomics , Sequence Analysis, DNA
3.
Algorithms Mol Biol ; 12: 5, 2017.
Article in English | MEDLINE | ID: mdl-28293277

ABSTRACT

BACKGROUND: The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the deviation of w, denoted by [Formula: see text], effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length [Formula: see text] is a [Formula: see text]-avoided word in x if [Formula: see text], for a given threshold [Formula: see text]. Notice that such a word may be completely absent from x. Hence, computing all such words naïvely can be a very time-consuming procedure, in particular for large k. RESULTS: In this article, we propose an [Formula: see text]-time and [Formula: see text]-space algorithm to compute all [Formula: see text]-avoided words of length k in a given sequence of length n over a fixed-sized alphabet. We also present a time-optimal [Formula: see text]-time algorithm to compute all [Formula: see text]-avoided words (of any length) in a sequence of length n over an integer alphabet of size [Formula: see text]. In addition, we provide a tight asymptotic upper bound for the number of [Formula: see text]-avoided words over an integer alphabet and the expected length of the longest one. We make available an implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency and applicability of our implementation in biological sequence analysis. CONCLUSIONS: The systematic search for avoided words is particularly useful for biological sequence analysis. We present a linear-time and linear-space algorithm for the computation of avoided words of length k in a given sequence x. We suggest a modification to this algorithm so that it computes all avoided words of x, irrespective of their length, within the same time complexity. We also present combinatorial results with regards to avoided words and absent words.

5.
Algorithms Mol Biol ; 11: 21, 2016.
Article in English | MEDLINE | ID: mdl-27471546

ABSTRACT

[This corrects the article DOI: 10.1186/s13015-016-0076-6.].

6.
FEBS Lett ; 590(14): 2297-306, 2016 07.
Article in English | MEDLINE | ID: mdl-27279593

ABSTRACT

This work investigates the role of isochores during preimplantation process. Using RNA-seq data from human and mouse preimplantation stages, we created the spatio-temporal transcriptional profiles of the isochores during preimplantation. We found that from early to late stages, GC-rich isochores increase their expression while GC-poor ones decrease it. Network analysis revealed that modules with few coexpressed isochores are GC-poorer than medium-large ones, characterized by an opposite expression as preimplantation advances, decreasing and increasing respectively. Our results reveal a functional contribution of the isochores, supporting the presence of structural-functional interactions during maturation and early-embryonic development.


Subject(s)
Blastocyst/metabolism , Gene Expression Regulation, Developmental/physiology , Isochores/metabolism , Transcriptome/physiology , Animals , Humans , Mice , Species Specificity
7.
Algorithms Mol Biol ; 11: 12, 2016.
Article in English | MEDLINE | ID: mdl-27168761

ABSTRACT

BACKGROUND: Sequence comparison is a fundamental step in many important tasks in bioinformatics; from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular molecular structure is a common phenomenon in nature, a caveat of the adaptation of alignment techniques for circular sequence comparison is that they are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. RESULTS: In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining an accuracy very competitive to the state of the art.

8.
IEEE Trans Nanobioscience ; 15(2): 93-100, 2016 03.
Article in English | MEDLINE | ID: mdl-26992174

ABSTRACT

This paper deals with the approximate version of the circular pattern matching (ACPM) problem, which appears as an interesting problem in many biological contexts. The circular pattern matching problem consists in finding all occurrences of the rotations of a pattern P of length m in a text T of length n. In ACPM, we consider occurrences with k -mismatches under the Hamming distance model. In this paper, we present a simple and fast filter-based algorithm to solve the ACPM problem. We compare our algorithm with the state of the art algorithms and the results are found to be excellent. In particular, our algorithm runs almost twice as fast than the state of the art. Much of the efficiency of our algorithm can be attributed to its filters that are effective but extremely simple and lightweight.


Subject(s)
Algorithms , Computational Biology/methods , DNA, Circular , Pattern Recognition, Automated/methods , Sequence Analysis, DNA/methods , DNA, Circular/analysis , DNA, Circular/chemistry , DNA, Circular/genetics
9.
Int J Genomics ; 2015: 259320, 2015.
Article in English | MEDLINE | ID: mdl-26557649

ABSTRACT

This paper deals with the circular pattern matching (CPM) problem, which appears as an interesting problem in many biological contexts. CPM consists in finding all occurrences of the rotations of a pattern 𝒫 of length m in a text 𝒯 of length n. In this paper, we present SimpLiFiCPM (pronounced "Simplify CPM"), a simple and lightweight filter-based algorithm to solve the problem. We compare our algorithm with the state-of-the-art algorithms and the results are found to be excellent. Much of the speed of our algorithm comes from the fact that our filters are effective but extremely simple and lightweight.

10.
Algorithms Mol Biol ; 9: 21, 2014.
Article in English | MEDLINE | ID: mdl-25221616

ABSTRACT

BACKGROUND: Tandem duplication, in the context of molecular biology, occurs as a result of mutational events in which an original segment of DNA is converted into a sequence of individual copies. More formally, a repetition or tandem repeat in a string of letters consists of exact concatenations of identical factors of the string. Biologists are interested in approximate tandem repeats and not necessarily only in exact tandem repeats. A weighted sequence is a string in which a set of letters may occur at each position with respective probabilities of occurrence. It naturally arises in many biological contexts and provides a method to realise the approximation among distinct adjacent occurrences of the same DNA segment. RESULTS: Crochemore's repetitions algorithm, also referred to as Crochemore's partitioning algorithm, was introduced in 1981, and was the first optimal [Formula: see text]-time algorithm to compute all repetitions in a string of length n. In this article, we present a novel variant of Crochemore's partitioning algorithm for weighted sequences, which requires optimal [Formula: see text] time, thus improving on the best known [Formula: see text]-time algorithm (Zhang et al., 2013) for computing all repetitions in a weighted sequence of length n.

11.
Algorithms Mol Biol ; 9(1): 9, 2014 Mar 22.
Article in English | MEDLINE | ID: mdl-24656145

ABSTRACT

BACKGROUND: Circular string matching is a problem which naturally arises in many biological contexts. It consists in finding all occurrences of the rotations of a pattern of length m in a text of length n. There exist optimal average-case algorithms for exact circular string matching. Approximate circular string matching is a rather undeveloped area. RESULTS: In this article, we present a suboptimal average-case algorithm for exact circular string matching requiring time O(n). Based on our solution for the exact case, we present two fast average-case algorithms for approximate circular string matching with k-mismatches, under the Hamming distance model, requiring time O(n) for moderate values of k, that is k=O(m/logm). We show how the same results can be easily obtained under the edit distance model. The presented algorithms are also implemented as library functions. Experimental results demonstrate that the functions provided in this library accelerate the computations by more than three orders of magnitude compared to a naïve approach. CONCLUSIONS: We present two fast average-case algorithms for approximate circular string matching with k-mismatches; and show that they also perform very well in practice. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any biological pipeline. The source code of the library is freely available at http://www.inf.kcl.ac.uk/research/projects/asmf/.

12.
Genomics ; 102(4): 223-8, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23831115

ABSTRACT

The study of DNA sequence variation has been transformed by recent advances in DNA sequencing technologies. Determination of the functional consequences of sequence variant alleles offers potential insight as to how genotype may influence phenotype. Even within protein coding regions of the genome, establishing the consequences of variation on gene and protein function is challenging and requires substantial laboratory investigation. However, a series of bioinformatics tools have been developed to predict whether non-synonymous variants are neutral or disease-causing. In this study we evaluate the performance of nine such methods (SIFT, PolyPhen2, SNPs&GO, PhD-SNP, PANTHER, Mutation Assessor, MutPred, Condel and CAROL) and developed CoVEC (Consensus Variant Effect Classification), a tool that integrates the prediction results from four of these methods. We demonstrate that the CoVEC approach outperforms most individual methods and highlights the benefit of combining results from multiple tools.


Subject(s)
Base Sequence , Computational Biology/methods , Genetic Variation , Algorithms , Animals , Genome , Genotype , Humans , Open Reading Frames , Phenotype , Polymorphism, Single Nucleotide
13.
BMC Bioinformatics ; 14 Suppl 8: S2, 2013.
Article in English | MEDLINE | ID: mdl-23815711

ABSTRACT

A weighted biological sequence is a string in which a set of characters may appear at each position with respective probabilities of occurrence. We attempt to locate all the tandem repeats in a weighted sequence. A repeated substring is called a tandem repeat if each occurrence of the substring is directly adjacent to each other. By introducing the idea of equivalence classes in weighted sequences, we identify the tandem repeats of every possible length using an iterative partitioning technique. We also present the algorithm for recording the tandem repeats, and prove that the problem can be solved in O(n²) time.


Subject(s)
Algorithms , Sequence Alignment/methods , Tandem Repeat Sequences , Proteins/chemistry , Proteins/genetics
14.
Int J Comput Biol Drug Des ; 6(1-2): 119-30, 2013.
Article in English | MEDLINE | ID: mdl-23428478

ABSTRACT

In this paper, we present a solution to the extreme similarity sequencing problem. The extreme similarity sequencing problem consists of finding occurrences of a pattern p in a set S(0), S(1), …, S(k), of sequences of equal length, where S(i), for all 1≤i≤k, differs from S(0) by a constant number of errors - around 10 in practice. We present an asymptotically fast O(n + occ logocc) time algorithm, as well as a practical O(nk/w) time algorithm for solving this problem, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k - the number of sequences.


Subject(s)
Algorithms , Sequence Analysis, DNA/methods , Sequence Homology, Nucleic Acid , Computational Biology , High-Throughput Nucleotide Sequencing
15.
Recent Pat DNA Gene Seq ; 7(2): 84-95, 2013 Aug.
Article in English | MEDLINE | ID: mdl-22974258

ABSTRACT

MOTIVATION: Pairwise sequence alignment has received a new motivation due to the advent of recent patents in next-generation sequencing technologies, particularly so for the application of re-sequencing---the assembly of a genome directed by a reference sequence. After the fast alignment between a factor of the reference sequence and a high-quality fragment of a short read by a short-read alignment programme, an important problem is to find the alignment between a relatively short succeeding factor of the reference sequence and the remaining low-quality part of the read allowing a number of mismatches and the insertion of a single gap in the alignment. RESULTS: We present GapMis, a tool for pairwise sequence alignment with a single gap. It is based on a simple algorithm, which computes a different version of the traditional dynamic programming matrix. The presented experimental results demonstrate that GapMis is more suitable and efficient than most popular tools for this task.


Subject(s)
Algorithms , Software , High-Throughput Nucleotide Sequencing , Patents as Topic , Sequence Alignment
16.
Genomics ; 101(2): 120-4, 2013 Feb.
Article in English | MEDLINE | ID: mdl-23195409

ABSTRACT

Several studies on adult tissues agree on the presence of a positive effect of the genomic and genic base composition on mammalian gene expression. Recent literature supports the idea that during developmental processes GC-poor genomic regions are preferentially implicated. We investigate the relationship between the compositional properties of the isochores and of the genes with their respective expression activity during developmental processes. Using RNA-seq data from two distinct developmental stages of the mouse cortex, embryonic day 18 (E18) and postnatal day 7 (P7), we established for the first time a developmental-related transcriptome map of the mouse isochores. Additionally, for each stage we estimated the correlation between isochores' GC level and their expression activity, and the genes' expression patterns for each isochore family. Our analyses add evidence supporting the idea that during development GC-poor isochores are preferentially implicated, and confirm the positive effect of genes' GC level on their expression activity.


Subject(s)
Brain/metabolism , Gene Expression Regulation, Developmental , Isochores/genetics , Transcriptome , Animals , Base Composition , Brain/embryology , Chromosome Mapping , Embryo, Mammalian , Gene Library , Mice , Sequence Analysis, RNA
17.
BMC Genomics ; 12: 511, 2011 Oct 17.
Article in English | MEDLINE | ID: mdl-22004510

ABSTRACT

BACKGROUND: The availability of fully sequenced genomes and the implementation of transcriptome technologies have increased the studies investigating the expression profiles for a variety of tissues, conditions, and species. In this study, using RNA-seq data for three distinct tissues (brain, liver, and muscle), we investigate how base composition affects mammalian gene expression, an issue of prime practical and evolutionary interest. RESULTS: We present the transcriptome map of the mouse isochores (DNA segments with a fairly homogeneous base composition) for the three different tissues and the effects of isochores' base composition on their expression activity. Our analyses also cover the relations between the genes' expression activity and their localization in the isochore families. CONCLUSIONS: This study is the first where next-generation sequencing data are used to associate the effects of both genomic and genic compositional properties to their corresponding expression activity. Our findings confirm previous results, and further support the existence of a relationship between isochores and gene expression. This relationship corroborates that isochores are primarily a product of evolutionary adaptation rather than a simple by-product of neutral evolutionary processes.


Subject(s)
Isochores/genetics , Transcriptome , Animals , Base Composition , Brain/metabolism , Genome , Liver/metabolism , Mice , Mice, Inbred C57BL , Muscles/metabolism , Sequence Analysis, RNA
18.
Adv Exp Med Biol ; 680: 399-403, 2010.
Article in English | MEDLINE | ID: mdl-20865524

ABSTRACT

Novel high-throughput (Deep) sequencing technology methods have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads) in a single experiment and with a much lower cost than previous sequencing methods. In this paper, we present a new algorithm for addressing the problem of efficiently mapping millions of short reads to a reference genome. In particular, we define and solve the Massive Approximate Pattern Matching problem for mapping short sequences to a reference genome.


Subject(s)
Algorithms , Chromosome Mapping/statistics & numerical data , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Animals , Computational Biology , Mice , Pattern Recognition, Automated/statistics & numerical data , RNA/genetics , X Chromosome/genetics
19.
Protein Pept Lett ; 17(9): 1136-42, 2010 Sep.
Article in English | MEDLINE | ID: mdl-20509856

ABSTRACT

A weighted sequence is a string in which a set of characters may appear at each position with respective probabilities of occurrence. Weighted sequences are able to summarize poorly defined short sequences, as well as the profiles of protein families and complete chromosome sequences. Thus it is of biological and theoretical significance to design powerful algorithms on weighted sequences. A common task is to identify repetitive motifs in weighted sequences, with presence probability not less than a given threshold. We define two types of repeats in weighted sequences, called the loose repeats and the strict repeats, respectively, and then attempt to locate these repeats. Using an iterative partitioning technique, we present algorithms for computing all the loose repeats and strict repeats of every length, respectively. Each solution costs O(n(2)) time.


Subject(s)
Computational Biology/methods , Proteins/chemistry , Algorithms , Repetitive Sequences, Amino Acid
20.
Int J Comput Biol Drug Des ; 2(4): 385-97, 2009.
Article in English | MEDLINE | ID: mdl-20090178

ABSTRACT

Novel high-throughput (Deep) sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences in a single experiment and with a much lower cost than previous methods. In this paper, we address the problem of efficiently mapping and classifying millions of short sequences to a reference genome, based on whether they occur exactly once in the genome or not, and by taking into consideration probability scores. In particular, we design algorithms for Massive Exact and Approximate Pattern Matching of short degenerate and weighted sequences, derived from Deep sequencing technologies, to a reference genome.


Subject(s)
Algorithms , Chromosome Mapping/methods , Genome , Sequence Analysis, DNA/methods , Base Sequence , Computational Biology , High-Throughput Screening Assays
SELECTION OF CITATIONS
SEARCH DETAIL
...