Search | VHL Regional Portal

Complementarity of assembly-first and mapping-first approaches for alternative splicing annotation and differential analysis from RNAseq data.

Benoit-Pilven, Clara; Marchet, Camille; Chautard, Emilie; Lima, Leandro; Lambert, Marie-Pierre; Sacomoto, Gustavo; Rey, Amandine; Cologne, Audric; Terrone, Sophie; Dulaurier, Louis; Claude, Jean-Baptiste; Bourgeois, Cyril F; Auboeuf, Didier; Lacroix, Vincent.

Sci Rep ; 8(1): 4307, 2018 03 09.

Article in English | MEDLINE | ID: mdl-29523794

ABSTRACT

Genome-wide analyses estimate that more than 90% of multi-exonic human genes produce at least two transcripts through alternative splicing (AS). Various bioinformatics methods are available to analyze AS from RNAseq data. Most methods start by mapping the reads to an annotated reference genome, but some start with a de novo assembly of the reads. In this paper, we present a systematic comparison of a mapping-first approach (FARLINE) and an assembly-first approach (KISSPLICE). We applied these methods to two independent RNAseq datasets and found that the predictions of the two pipelines overlapped (70% of exon skipping events were common), but with noticeable differences. The assembly-first approach found more novel variants, including novel unannotated exons and splice sites. It also predicted AS in recently duplicated genes. The mapping-first approach found more lowly expressed splicing variants, and more splice variants overlapping repeats. This work demonstrates that annotating AS with a single approach misses a large number of candidates, many of which are differentially regulated across conditions and can be validated experimentally. We therefore advocate the combined use of mapping-first and assembly-first approaches for the annotation and differential analysis of AS from RNAseq datasets.

Subject(s)

Alternative Splicing , Sequence Analysis, RNA/methods , Software , Humans , RNA Splice Sites , Sequence Analysis, RNA/standards

Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads.

Lima, Leandro; Sinaimeri, Blerina; Sacomoto, Gustavo; Lopez-Maestre, Helene; Marchet, Camille; Miele, Vincent; Sagot, Marie-France; Lacroix, Vincent.

Algorithms Mol Biol ; 12: 2, 2017.

Article in English | MEDLINE | ID: mdl-28250805

ABSTRACT

BACKGROUND: The main challenge in de novo genome assembly of DNA-seq data is certainly to deal with repeats that are longer than the reads. In de novo transcriptome assembly of RNA-seq reads, on the other hand, this problem has been underestimated so far. Even though we have fewer and shorter repeated sequences in transcriptomics, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers of short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. RESULTS: The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99-111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644-652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086-1092, 3), for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and when assemblers try to traverse these regions, they can infer erroneous transcripts, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. 
The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and neither read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134-1144, 5) on both real and simulated datasets for detecting chimeras, and therefore is able to capture assembly errors missed by these methods.

Colib'read on galaxy: a tools suite dedicated to biological information extraction from raw NGS reads.

Le Bras, Yvan; Collin, Olivier; Monjeaud, Cyril; Lacroix, Vincent; Rivals, Éric; Lemaitre, Claire; Miele, Vincent; Sacomoto, Gustavo; Marchet, Camille; Cazaux, Bastien; Zine El Aabidine, Amal; Salmela, Leena; Alves-Carvalho, Susete; Andrieux, Alexan; Uricaru, Raluca; Peterlongo, Pierre.

Gigascience ; 5: 9, 2016.

Article in English | MEDLINE | ID: mdl-26870323

ABSTRACT

BACKGROUND: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, which needs large amounts of computing resources and may remove or modify parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools. FINDINGS: Dedicated to 'whole-genome assembly-free' treatments, the Colib'read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Based on the use of a de Bruijn graph and a Bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tools dissemination, we developed Galaxy tools and tool shed repositories. CONCLUSIONS: With the Colib'read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach allows the maximum biological information to be retained in the data, and uses a very low memory footprint.

Subject(s)

Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Information Storage and Retrieval/methods , Software , Base Sequence , Cluster Analysis , Genome/genetics , Genomics/methods , Molecular Sequence Data , Reproducibility of Results

A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs.

Sacomoto, Gustavo; Lacroix, Vincent; Sagot, Marie-France.

Algorithms Mol Biol ; 10: 20, 2015.

Article in English | MEDLINE | ID: mdl-26120359

ABSTRACT

BACKGROUND: The problem of enumerating bubbles with length constraints in directed graphs arises in transcriptomics where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq. RESULTS: We present a new algorithm for enumerating bubbles with length constraints in weighted directed graphs. This is the first polynomial delay algorithm for this problem and we show that in practice, it is faster than previous approaches. CONCLUSION: This settles one of the main open questions from Sacomoto et al. (BMC Bioinform 13:5, 2012). Moreover, the new algorithm allows us to deal with larger instances and possibly detect longer alternative splicing events.

Using cascading Bloom filters to improve the memory usage for de Brujin graphs.

Salikhov, Kamil; Sacomoto, Gustavo; Kucherov, Gregory.

Algorithms Mol Biol ; 9(1): 2, 2014 Feb 24.

Article in English | MEDLINE | ID: mdl-24565280

ABSTRACT

BACKGROUND: De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Because NGS datasets are very large, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. RESULTS: In this work, we show how to reduce the memory required by the data structure of Chikhi and Rizk (WABI'12) that represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory than their method, with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to the method of Chikhi and Rizk. CONCLUSION: The proposed data structure constitutes, to our knowledge, currently the most efficient practical representation of de Bruijn graphs.
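As a rough illustration of the kind of structure this abstract builds on, the sketch below stores the k-mers of a read set in a single Bloom filter and keeps an exact set of "critical" false positives (neighbours of true k-mers that the filter wrongly accepts), so that neighbour queries from a true k-mer are answered exactly. This is a minimal sketch under stated assumptions, not the authors' implementation: it does not include the cascading scheme itself, it ignores reverse complements, and the names `BloomFilter`, `BloomDBG`, `num_bits` and `num_hashes` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not memory-optimised)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def neighbours(kmer):
    """Yield the 8 possible successors and predecessors of a k-mer."""
    for base in "ACGT":
        yield kmer[1:] + base
        yield base + kmer[:-1]


class BloomDBG:
    """Bloom filter plus an exact set of critical false positives."""

    def __init__(self, reads, k, num_bits=64, num_hashes=2):
        true_kmers = {km for read in reads for km in kmers(read, k)}
        self.bloom = BloomFilter(num_bits, num_hashes)
        for km in true_kmers:
            self.bloom.add(km)
        # Only neighbours of true k-mers are ever queried during traversal,
        # so only false positives among them need to be stored exactly.
        self.critical_fp = {nb for km in true_kmers for nb in neighbours(km)
                            if nb in self.bloom and nb not in true_kmers}

    def has_neighbour(self, candidate):
        return candidate in self.bloom and candidate not in self.critical_fp

    def successors(self, kmer):
        return [kmer[1:] + b for b in "ACGT"
                if self.has_neighbour(kmer[1:] + b)]
```

The filter is deliberately tiny by default so that false positives actually occur; the exact set then absorbs the ones that matter for traversal. Reducing the memory spent on that exact set is the problem the abstract addresses.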

KISSPLICE: de-novo calling alternative splicing events from RNA-seq data.

Sacomoto, Gustavo A T; Kielbassa, Janice; Chikhi, Rayan; Uricaru, Raluca; Antoniou, Pavlos; Sagot, Marie-France; Peterlongo, Pierre; Lacroix, Vincent.

BMC Bioinformatics ; 13 Suppl 6: S5, 2012 Apr 19.

Article in English | MEDLINE | ID: mdl-22537044

ABSTRACT

BACKGROUND: In this paper, we address the problem of identifying and quantifying polymorphisms in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each polymorphism corresponds to a recognisable pattern in a De Bruijn graph constructed from the RNA-seq reads, we propose a general model for all polymorphisms in such graphs. We then introduce an exact algorithm, called KISSPLICE, to extract alternative splicing events. RESULTS: We show that KISSPLICE identifies more correct events than general-purpose transcriptome assemblers. Additionally, on a dataset of 71 M reads from human brain and liver tissues, KISSPLICE identified 3497 alternative splicing events, of which 56% are not present in the annotations, which confirms recent estimates showing that the complexity of alternative splicing has been largely underestimated so far. CONCLUSIONS: We propose new models and algorithms for the detection of polymorphisms in RNA-seq data. This opens the way to a new kind of study on large HTS RNA-seq datasets, where the focus is not the global reconstruction of full-length transcripts, but the local assembly of polymorphic regions. KISSPLICE is available for download at http://alcovna.genouest.org/kissplice/.
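The "recognisable pattern" mentioned in this abstract can be pictured as a bubble: two paths that leave a common node and later merge again. The sketch below is a toy illustration, not the KISSPLICE algorithm. It builds a graph whose nodes are k-mers and whose edges join k-mers that are consecutive in some read, then reports simple bubbles whose two branches are non-branching chains. The function names, the `max_len` bound and the two toy sequences (an exon-included and an exon-skipped form) are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    succ = defaultdict(set)
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        for a, b in zip(kms, kms[1:]):
            succ[a].add(b)
    return succ


def linear_branch(succ, first, max_len):
    """Follow unique successors from `first`, collecting at most max_len nodes."""
    path = [first]
    while len(path) < max_len and len(succ.get(path[-1], ())) == 1:
        (nxt,) = succ[path[-1]]
        if nxt in path:  # stop on a cycle
            break
        path.append(nxt)
    return path


def find_simple_bubbles(succ, max_len=50):
    """Return (source, sink, branch_a, branch_b) for each simple bubble found."""
    bubbles = []
    for source in list(succ):
        for a, b in combinations(sorted(succ[source]), 2):
            path_a = linear_branch(succ, a, max_len)
            path_b = linear_branch(succ, b, max_len)
            index_b = {node: i for i, node in enumerate(path_b)}
            # The first node of branch a that also lies on branch b is the sink.
            for i, node in enumerate(path_a):
                if node in index_b:
                    bubbles.append((source, node,
                                    path_a[:i], path_b[:index_b[node]]))
                    break
    return bubbles


# Toy transcripts: the first contains a 5-base "exon" that the second skips.
reads = ["ACGGTCATTAGCGATCCAG", "ACGGTCAGATCCAG"]
for source, sink, short, long_ in find_simple_bubbles(build_graph(reads, 4)):
    print(source, "->", sink, "| branch lengths:", len(short), len(long_))
```

In this toy case the two branches differ in length because one carries the extra sequence, which is the intuition behind searching for bubbles rather than assembling full transcripts. Real data adds sequencing errors, reverse complements and repeats, none of which this sketch handles.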

Subject(s)

Algorithms , Alternative Splicing , Models, Statistical , Sequence Analysis, RNA , Genome , Humans , Polymorphism, Single Nucleotide , Reference Standards , Tandem Repeat Sequences , Transcriptome

Lossless filter for multiple repeats with bounded edit distance.

Peterlongo, Pierre; Sacomoto, Gustavo Akio Tominaga; do Lago, Alair Pereira; Pisanti, Nadia; Sagot, Marie-France.

Algorithms Mol Biol ; 4: 3, 2009 Jan 30.

Article in English | MEDLINE | ID: mdl-19183438

ABSTRACT

BACKGROUND: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part of the analysis of biological sequences and of their phylogenetic relationships. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions is, however, known to be a computationally expensive task, and consequently exact methods usually cannot be applied in practice. RESULTS: The filter TUIUIU that we introduce in this paper provides a possible solution to this problem. It can be used as a preprocessing step for any multiple alignment or repeat inference method, eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate repeat. It consists of verifying several strong necessary conditions that can be checked quickly. We implemented three versions of the filter. The first is a straightforward extension, to the case of multiple sequences, of conditions already existing in the literature. The second uses a stronger condition which, as our results show, filters noticeably more with negligible (if any) additional time. The third version uses an additional condition and pushes the filtering power even further, at a non-negligible additional time cost in many circumstances; our experiments show that it is particularly useful with large error rates. The latter version was applied as a preprocessing step for a multiple alignment tool, obtaining an overall time (filter plus alignment) on average 63 and at best 530 times smaller than before (direct alignment), with in most cases a better quality alignment. CONCLUSION: To the best of our knowledge, TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeat length.
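To make the idea of a lossless filter built from a necessary condition concrete, the sketch below uses the classical q-gram counting lemma rather than TUIUIU's own conditions, which the abstract does not spell out. If two strings are within `max_edits` edit operations of each other, they must share at least max(|x|, |y|) - q + 1 - q * max_edits q-grams, so any pair below that threshold can be discarded without losing a true match. The function names and parameters are hypothetical.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of all substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(x, y, q):
    """Number of q-grams common to x and y, counted with multiplicity."""
    return sum((qgrams(x, q) & qgrams(y, q)).values())


def may_be_within(x, y, q, max_edits):
    """Necessary condition for edit distance(x, y) <= max_edits.

    Returns False only when the pair certainly exceeds max_edits,
    so the filter never discards a true approximate match.
    """
    threshold = max(len(x), len(y)) - q + 1 - q * max_edits
    return shared_qgrams(x, y, q) >= threshold


# One substitution apart: the pair must survive the filter.
print(may_be_within("ACGTACGTAC", "ACGTTCGTAC", q=3, max_edits=1))
# No q-gram in common: the pair is safely discarded.
print(may_be_within("ACGTACGTAC", "TTTTTTTTTT", q=3, max_edits=1))
```

A pair that passes still has to be verified by an exact method; the filter only reduces how much input reaches that expensive step.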
