Search | VHL Regional Portal

Hybrid de novo tandem repeat detection using short and long reads.

Fertin, Guillaume; Jean, Géraldine; Radulescu, Andreea; Rusu, Irena.

BMC Med Genomics ; 8 Suppl 3: S5, 2015.

Article in English | MEDLINE | ID: mdl-26399998

ABSTRACT

BACKGROUND: As one of the most studied genome rearrangements, tandem repeats have a considerable impact on genetic backgrounds of inherited diseases. Many methods designed for tandem repeat detection on reference sequences obtain high quality results. However, in the case of a de novo context, where no reference sequence is available, tandem repeat detection remains a difficult problem. The short reads obtained with the second-generation sequencing methods are not long enough to span regions that contain long repeats. This length limitation was tackled by the long reads obtained with the third-generation sequencing platforms such as Pacific Biosciences technologies. Nevertheless, the gain on the read length came with a significant increase of the error rate. The main objective of nowadays studies on long reads is to handle the high error rate up to 16%. METHODS: In this paper we present MixTaR, the first de novo method for tandem repeat detection that combines the high-quality of short reads and the large length of long reads. Our hybrid algorithm uses the set of short reads for tandem repeat pattern detection based on a de Bruijn graph. These patterns are then validated using the long reads, and the tandem repeat sequences are constructed using local greedy assemblies. RESULTS: MixTaR is tested with both simulated and real reads from complex organisms. For a complete analysis of its robustness to errors, we use short and long reads with different error rates. The results are then analysed in terms of number of tandem repeats detected and the length of their patterns. CONCLUSIONS: Our method shows high precision and sensitivity. With low false positive rates even for highly erroneous reads, MixTaR is able to detect accurate tandem repeats with pattern lengths varying within a significant interval.

Subject(s)

Algorithms , Tandem Repeat Sequences/genetics , Animals , Caenorhabditis elegans/genetics , Chromosomes , Genome, Bacterial , Legionella pneumophila/genetics

Easy identification of generalized common and conserved nested intervals.

de Montgolfier, Fabien; Raffinot, Mathieu; Rusu, Irena.

J Comput Biol ; 21(7): 520-33, 2014 Jul.

Article in English | MEDLINE | ID: mdl-24650221

ABSTRACT

In this article we explain how to easily compute gene clusters, formalized by classical or generalized nested common or conserved intervals, between a set of K genomes represented as K permutations. A b-nested common (resp. conserved) interval I of size |I| is either an interval of size 1 or a common (resp. conserved) interval that contains another b-nested common (resp. conserved) interval of size at least |I|-b. When b=1, this corresponds to the classical notion of nested interval. We exhibit two simple algorithms to output all b-nested common or conserved intervals between K permutations in O(Kn+nocc) time, where nocc is the total number of such intervals. We also explain how to count all b-nested intervals in O(Kn) time. New properties of the family of conserved intervals are proposed to do so.

Subject(s)

Algorithms , Computational Biology/methods , Genomics/methods , Multigene Family

Homogeneous decomposition of protein interaction networks: refining the description of intra-modular interactions.

Del Mondo, Géraldine; Eveillard, Damien; Rusu, Irena.

Bioinformatics ; 25(7): 926-32, 2009 Apr 01.

Article in English | MEDLINE | ID: mdl-19223451

ABSTRACT

MOTIVATION: Modules in biology appeared quickly as an accurate way for summarizing complex living systems by simple ones. Therefore, finding an appropriate relationship between modules extracted from a biological graph and protein complexes remains a crucial task. Recent studies successfully proposed various descriptions of protein interaction networks. These approaches succeed in showing modules within the network and how the modules interact. However, describing the interactions within the modules, i.e. intra-modular interactions, remains little analyzed despite its interest for understanding module functions. RESULTS: We overcome this weakness by adding a complementary description to the already successful approaches: a hierarchical decomposition named homogeneous decomposition. This decomposition represents a natural refinement of previous analyses and details interactions within a module. We propose to illustrate these improvements by three practical cases. Among them, we decompose the yeast protein interaction network and show reachable biological insights that might be extracted from a complex large-scale network. AVAILABILITY: A program is at disposal under CeCILL license at: www.lina.univ-nantes.fr/combi/DH/Home.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Multiprotein Complexes/metabolism , Protein Interaction Mapping/methods , Computational Biology , Databases, Protein , Saccharomyces cerevisiae/metabolism , Software

Efficient tools for computing the number of breakpoints and the number of adjacencies between two genomes with duplicate genes.

Angibaud, Sébastien; Fertin, Guillaume; Rusu, Irena; Thévenin, Annelyse; Vialette, Stéphane.

J Comput Biol ; 15(8): 1093-115, 2008 Oct.

Article in English | MEDLINE | ID: mdl-18774903

ABSTRACT

Comparing genomes of different species is a fundamental problem in comparative genomics. Recent research has resulted in the introduction of different measures between pairs of genomes: for example, reversal distance, number of breakpoints, and number of common or conserved intervals. However, classical methods used for computing such measures are seriously compromised when genomes have several copies of the same gene scattered across them. Most approaches to overcome this difficulty are based either on the exemplar model, which keeps exactly one copy in each genome of each duplicated gene, or on the maximum matching model, which keeps as many copies as possible of each duplicated gene. The goal is to find an exemplar matching, respectively a maximum matching, that optimizes the studied measure. Unfortunately, it turns out that, in presence of duplications, this problem for each above-mentioned measure is NP-hard. In this paper, we propose to compute the minimum number of breakpoints and the maximum number of adjacencies between two genomes in presence of duplications using two different approaches. The first one is an exact, generic 0-1 linear programming approach, while the second is a collection of three heuristics. Each of these approaches is applied on each problem and for each of the following models: exemplar, maximum matching and intermediate model, that we introduce here. All these programs are run on a well-known public benchmark dataset of gamma-Proteobacteria, and their performances are discussed.

Subject(s)

Computational Biology/methods , Genes, Duplicate , Genome, Bacterial/genetics , Genomics/methods , Algorithms , Gammaproteobacteria/genetics , Models, Genetic , Multigene Family

A pseudo-boolean framework for computing rearrangement distances between genomes with duplicates.

Angibaud, Sébastien; Fertin, Guillaume; Rusu, Irena; Vialette, Stéphane.

J Comput Biol ; 14(4): 379-93, 2007 May.

Article in English | MEDLINE | ID: mdl-17572018

ABSTRACT

Computing genomic distances between whole genomes is a fundamental problem in comparative genomics. Recent researches have resulted in different genomic distance definitions, for example, number of breakpoints, number of common intervals, number of conserved intervals, and Maximum Adjacency Disruption number. Unfortunately, it turns out that, in presence of duplications, most problems are NP-hard, and hence several heuristics have been recently proposed. However, while it is relatively easy to compare heuristics between them, until now very little is known about the absolute accuracy of these heuristics. Therefore, there is a great need for algorithmic approaches that compute exact solutions for these genomic distances. In this paper, we present a novel generic pseudo-boolean approach for computing the exact genomic distance between two whole genomes in presence of duplications, and put strong emphasis on common intervals under the maximum matching model. Of particular importance, we show three heuristics which provide very good results on a well-known public dataset of gamma-Proteobacteria.

Subject(s)

Algorithms , Gammaproteobacteria/genetics , Gene Duplication , Genome, Bacterial , Software , Computational Biology , Sequence Analysis, DNA

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL