Search | VHL Regional Portal

Easy identification of generalized common and conserved nested intervals.

de Montgolfier, Fabien; Raffinot, Mathieu; Rusu, Irena.

J Comput Biol ; 21(7): 520-33, 2014 Jul.

Article in English | MEDLINE | ID: mdl-24650221

ABSTRACT

In this article we explain how to easily compute gene clusters, formalized by classical or generalized nested common or conserved intervals, between a set of K genomes represented as K permutations. A b-nested common (resp. conserved) interval I of size |I| is either an interval of size 1 or a common (resp. conserved) interval that contains another b-nested common (resp. conserved) interval of size at least |I|-b. When b=1, this corresponds to the classical notion of nested interval. We exhibit two simple algorithms to output all b-nested common or conserved intervals between K permutations in O(Kn+nocc) time, where nocc is the total number of such intervals. We also explain how to count all b-nested intervals in O(Kn) time. New properties of the family of conserved intervals are proposed to do so.

Subject(s)

Algorithms , Computational Biology/methods , Genomics/methods , Multigene Family

On the identification of conflicting contiguities in ancestral genome reconstruction.

Ouangraoua, Aïda; Raffinot, Mathieu.

J Comput Biol ; 21(1): 64-79, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24106802

ABSTRACT

In computional biology, up-to-date homology-based methods for the reconstruction of ancestral gene orders usually rely on two phases. First, potential ancestral co-localizations of some genomic markers are detected from homologies between extant species. Next, the assembling phase mainly consists in resolving the conflicts between the potential ancestral features. This can be done using many methods, but one of the most advanced solutions is to identify and discard from the set of potential features those that belong to inclusivewise minimal conflicting sets of features. It relies on the consecutive ones property (C1P), and the notion of minimal conflicting set (MCS), widely used in physical mapping problems. Let C be a finite set of n elements and R= {r(1)' r(2)' . . . ' r(m)} a family of m subsets of C. A subset X of R satisfies the C1P if there exists a permutation P of C such that each r(i) in X is an interval of P. An MCS S â R is a subset of R that does not satisfy the C1P, but such that any of its proper subsets does. In this article, we present a new simpler and faster algorithm to decide if a given element r ∈ R belongs to at least one MCS. Our algorithm runs in O(n(2)m(2) + nm(7)), largely improving upon the current O(m(6)n(5)(m + n)(2) log(m + n)) fastest algorithm. The new algorithm is based on an alternative approach considering minimal forbidden induced subgraphs of interval graphs instead of Tucker matrices.

Subject(s)

Evolution, Molecular , Genomics/statistics & numerical data , Algorithms , Computational Biology , Models, Genetic

Identification of genomic features using microsyntenies of domains: domain teams.

Pasek, Sophie; Bergeron, Anne; Risler, Jean-Loup; Louis, Alexandra; Ollivier, Emmanuelle; Raffinot, Mathieu.

Genome Res ; 15(6): 867-74, 2005 Jun.

Article in English | MEDLINE | ID: mdl-15899966

ABSTRACT

The detection, across several genomes, of local conservation of gene content and proximity considerably helps the prediction of features of interest, such as gene fusions or physical and functional interactions. Here, we want to process realistic models of chromosomes, in which genes (or genomic segments of several genes) can be duplicated within a chromosome, or be absent from some other chromosome(s). Our approach adopts the technique of temporarily forgetting genes and working directly with protein "domains" such as those found in Pfam. This allows the detection of strings of domains that are conserved in their content, but not necessarily in their order, which we refer to as domain teams. The prominent feature of the method is that it relaxes the rigidity of the orthology criterion and avoids many of the pitfalls of gene-families identification methods, often hampered by multidomain proteins or low levels of sequence similarity. This approach, that allows both inter- and intrachromosomal comparisons, proves to be more sensitive than the classical methods based on pairwise sequence comparisons, particularly in the simultaneous treatment of many species. The automated and fast detection of domain teams, together with its increased sensitivity at identifying segments of identical (protein-coding) gene contents as well as gene fusions, should prove a useful complement to other existing methods.

Subject(s)

Bacterial Proteins/genetics , Genome, Bacterial , Gram-Negative Bacteria/genetics , Sequence Analysis, Protein/methods , Synteny , Chromosomes, Bacterial/genetics , Databases, Protein , Protein Structure, Tertiary/genetics

Approximate matching of structured motifs in DNA sequences.

El-Mabrouk, Nadia; Raffinot, Mathieu; Duchesne, Jean-Eudes; Lajoie, Mathieu; Luc, Nicolas.

J Bioinform Comput Biol ; 3(2): 317-42, 2005 Apr.

Article in English | MEDLINE | ID: mdl-15852508

ABSTRACT

Several methods have been developed for identifying more or less complex RNA structures in a genome. All these methods are based on the search for conserved primary and secondary sub-structures. In this paper, we present a simple formal representation of a helix, which is a combination of sequence and folding constraints, as a constrained regular expression. This representation allows us to develop a well-founded algorithm that searches for all approximate matches of a helix in a genome. The algorithm is based on an alignment graph constructed from several copies of a pushdown automaton, arranged one on top of another. This is a first attempt to take advantage of the possibilities of pushdown automata in the context of approximate matching. The worst time complexity is O(krpn), where k is the error threshold, n the size of the genome, p the size of the secondary expression, and r its number of union symbols. We then extend the algorithm to search for pseudo-knots and secondary structures containing an arbitrary number of helices.

Subject(s)

Algorithms , Artificial Intelligence , Pattern Recognition, Automated/methods , RNA, Transfer/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods , Consensus Sequence , Sequence Homology, Amino Acid

Gene teams: a new formalization of gene clusters for comparative genomics.

Luc, Nicolas; Risler, Jean-Loup; Bergeron, Anne; Raffinot, Mathieu.

Comput Biol Chem ; 27(1): 59-67, 2003 Feb.

Article in English | MEDLINE | ID: mdl-12798040

ABSTRACT

This paper describes an efficient algorithm based on a new concept called gene team for detecting conserved gene clusters among an arbitrary number of chromosomes. Within the clusters, neither the order of the genes nor their orientation need be conserved. In addition, insertion of foreign genes within the clusters are permitted to a user-defined extent. This algorithm has been implemented in a publicly available TEAM software that proves to be an efficient tool for systematic searches of conserved gene clusters. Examples of actual biological results are provided. The software is downloadable from http://www-igm.univ-mlv.fr/ approximately raffinot/geneteam.html.

Subject(s)

Genomics/methods , Multigene Family/genetics , Algorithms , Conserved Sequence/genetics , Genes, Bacterial/genetics , Genome, Bacterial , Models, Genetic , Software

Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching.

Navarro, Gonzalo; Raffinot, Mathieu.

J Comput Biol ; 10(6): 903-23, 2003.

Article in English | MEDLINE | ID: mdl-14980017

ABSTRACT

The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK] - x(2,3) - [DE] - x(2,3) - Y, where the brackets match any of the letters inside, and x(2,3) a gap of length between 2 and 3). Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, a RE is more sophisticated than a CBG, and searching for it with a RE pattern matching algorithm complicates the search and makes it slow. This is the reason why we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters, and hence it is usually faster than the first one, but in bad cases may have to read the same text character more than once. We then propose a criterion based on the form of the CBG to choose a priori the fastest between both. We also show how to search permitting a few mistakes in the occurrences. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.

Subject(s)

Computational Biology/methods , Pattern Recognition, Automated , Proteins/chemistry , Proteins/classification , Algorithms , Information Storage and Retrieval

High similarity sequence comparison in clustering large sequence databases.

Dudoignon, Lorie; Glemet, Eric; Heus, Hendrik Cornelis; Raffinot, Mathieu.

Proc IEEE Comput Soc Bioinform Conf ; 1: 228-36, 2002.

Article in English | MEDLINE | ID: mdl-15838139

ABSTRACT

We present a fast algorithm for sequence clustering and searching which works with large sequence databases. It uses a strictly defined similarity measure. The algorithm is faster than conventional EST clustering approaches because its complexity is directly related to the number of subwords shared by the sequences. Furthermore, the algorithm also works with proteic sequences and large sequences like entire chromosomes. We present a theoretical study of our approach and provide experimental results.

Subject(s)

Algorithms , Chromosome Mapping/methods , Database Management Systems , Databases, Genetic , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Analysis/methods , Cluster Analysis , Information Storage and Retrieval/methods

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL