Search | VHL Regional Portal

1.

Optimizing hybrid ensemble feature selection strategies for transcriptomic biomarker discovery in complex diseases.

Claude, Elsa; Leclercq, Mickaël; Thébault, Patricia; Droit, Arnaud; Uricaru, Raluca.

NAR Genom Bioinform ; 6(3): lqae079, 2024 Sep.

Article in English | MEDLINE | ID: mdl-38993634

ABSTRACT

Biomedical research takes advantage of omic data, such as transcriptomics, to unravel the complexity of diseases. A conventional strategy identifies transcriptomic biomarkers characterized by expression patterns associated with a phenotype by relying on feature selection approaches. Hybrid ensemble feature selection (HEFS) has become increasingly popular as it ensures robustness of the selected features by performing data and functional perturbations. However, it remains difficult to make the best suited choices at each step when designing such approaches. We conducted an extensive analysis of four possible HEFS scenarios for the identification of Stage IV colorectal, Stage I kidney and lung and Stage III endometrial cancer biomarkers from transcriptomic data. These scenarios investigate the use of two types of feature reduction by filters (differentially expressed genes and variance) conjointly with two types of resampling strategies (repeated holdout by distribution-balanced stratified and random stratified) for downstream feature selection through an aggregation of thousands of wrapped machine learning models. Based on our results, we emphasize the advantages of using HEFS approaches to identify complex disease biomarkers, given their ability to produce generalizable and stable results to both data and functional perturbations. Finally, we highlight critical issues that need to be considered in the design of such strategies.
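As a rough illustration of the ensemble idea summarized in this abstract (data perturbation through repeated stratified holdout, then aggregation of what each resampling selects), the Python sketch below counts how often each feature is retained across resamplings. The scoring rule (absolute difference of class means, standing in for the wrapped machine learning models), the function names and the toy data are all assumptions made for illustration; this is not the authors' pipeline.

```python
import random
from collections import Counter

def stratified_holdout(labels, train_fraction, rng):
    """Return training indices, drawing the same fraction from every class."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train = []
    for indices in by_class.values():
        k = max(1, round(train_fraction * len(indices)))
        train.extend(rng.sample(indices, k))
    return sorted(train)

def top_features(samples, labels, train_idx, n_keep):
    """Toy selector (assumption): rank features by |mean(class A) - mean(class B)|."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch assumes a binary phenotype"
    scores = []
    for j in range(len(samples[0])):
        means = []
        for c in classes:
            values = [samples[i][j] for i in train_idx if labels[i] == c]
            means.append(sum(values) / len(values))
        scores.append((abs(means[0] - means[1]), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:n_keep]]

def ensemble_selection(samples, labels, n_repeats=100, train_fraction=0.7,
                       n_keep=2, seed=0):
    """Selection frequency of each feature over repeated stratified holdouts."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_repeats):
        train_idx = stratified_holdout(labels, train_fraction, rng)
        counts.update(top_features(samples, labels, train_idx, n_keep))
    return {j: counts[j] / n_repeats for j in range(len(samples[0]))}

# Toy data (hypothetical): feature 0 separates the classes, features 1-3 are noise.
noise = random.Random(1)
samples = [[0.0 if i < 10 else 10.0] + [noise.random() for _ in range(3)]
           for i in range(20)]
labels = ["control"] * 10 + ["case"] * 10
freq = ensemble_selection(samples, labels, n_repeats=50, n_keep=1)
```

Features with a selection frequency close to 1 are stable under resampling; in a fuller setup the toy selector would be replaced by several different models to add the functional perturbation the abstract refers to.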

2.

N-3 PUFA deficiency disrupts oligodendrocyte maturation and myelin integrity during brain development.

Leyrolle, Quentin; Decoeur, Fanny; Dejean, Cyril; Brière, Galadriel; Leon, Stephane; Bakoyiannis, Ioannis; Baroux, Emilie; Sterley, Tony-Lee; Bosch-Bouju, Clémentine; Morel, Lydie; Amadieu, Camille; Lecours, Cynthia; St-Pierre, Marie-Kim; Bordeleau, Maude; De Smedt-Peyrusse, Véronique; Séré, Alexandran; Schwendimann, Leslie; Grégoire, Stephane; Bretillon, Lionel; Acar, Niyazi; Joffre, Corinne; Ferreira, Guillaume; Uricaru, Raluca; Thebault, Patricia; Gressens, Pierre; Tremblay, Marie-Eve; Layé, Sophie; Nadjar, Agnes.

Glia ; 70(1): 50-70, 2022 Jan.

Article in English | MEDLINE | ID: mdl-34519378

ABSTRACT

Westernization of dietary habits has led to a progressive reduction in dietary intake of n-3 polyunsaturated fatty acids (n-3 PUFAs). Low maternal intake of n-3 PUFAs has been linked to neurodevelopmental disorders, conditions in which myelination processes are abnormal, leading to defects in brain functional connectivity. Little is known about the role of n-3 PUFAs in oligodendrocyte physiology and white matter development. Here, we show that lifelong n-3 PUFA deficiency disrupts oligodendrocyte maturation and myelination processes during the postnatal period in mice. This has long-term deleterious consequences on white matter organization and hippocampus-prefrontal functional connectivity in adults, associated with cognitive and emotional disorders. Promoting developmental myelination with clemastine, a first-generation histamine antagonist and enhancer of oligodendrocyte precursor cell differentiation, rescues memory deficits in n-3 PUFA deficient animals. Our findings identify a novel mechanism through which n-3 PUFA deficiency alters brain functions by disrupting oligodendrocyte maturation and brain myelination during the neurodevelopmental period.

Subject(s)

Fatty Acids, Omega-3 , Animals , Brain , Mice , Myelin Sheath , Neurogenesis , Oligodendroglia

3.

Consensus clustering applied to multi-omics disease subtyping.

Brière, Galadriel; Darbo, Élodie; Thébault, Patricia; Uricaru, Raluca.

BMC Bioinformatics ; 22(1): 361, 2021 Jul 06.

Article in English | MEDLINE | ID: mdl-34229612

ABSTRACT

BACKGROUND: Facing the diversity of omics data and the difficulty of selecting one result over all those produced by several methods, consensus strategies have the potential to reconcile multiple inputs and to produce robust results. RESULTS: Here, we introduce ClustOmics, a generic consensus clustering tool that we use in the context of cancer subtyping. ClustOmics relies on a non-relational graph database, which allows for the simultaneous integration of both multiple omics data and results from various clustering methods. This new tool reconciles input clusterings, regardless of their origin, their number, their size or their shape. ClustOmics implements an intuitive and flexible strategy, based upon the idea of evidence accumulation clustering. ClustOmics computes co-occurrences of pairs of samples in input clusters and uses this score as a similarity measure to reorganize data into consensus clusters. CONCLUSION: We applied ClustOmics to multi-omics disease subtyping on real TCGA cancer data from ten different cancer types. We showed that ClustOmics is robust to heterogeneous qualities of input partitions, smoothing and reconciling preliminary predictions into high-quality consensus clusters, both from a computational and a biological point of view. The comparison to a state-of-the-art consensus-based integration tool, COCA, further corroborated this statement. However, the main interest of ClustOmics is not to compete with other tools, but rather to benefit from their various predictions when no gold-standard metric is available to assess their significance. AVAILABILITY: The ClustOmics source code, released under MIT license, and the results obtained on TCGA cancer data are available on GitHub: https://github.com/galadrielbriere/ClustOmics .

Subject(s)

Algorithms , Neoplasms , Cluster Analysis , Consensus , Humans , Neoplasms/genetics , Software
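The evidence accumulation idea mentioned in the abstract above (counting how often two samples fall in the same input cluster and using that count as a similarity) can be sketched in a few lines of Python. The data layout, the function names and the final step (connected components over pairs with enough support) are assumptions chosen for brevity; ClustOmics itself relies on a graph database and its actual procedure may differ.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_counts(clusterings):
    """For each pair of samples, count the input clusterings that group them together.

    Each clustering is a dict mapping a sample name to its cluster label.
    """
    counts = Counter()
    for clustering in clusterings:
        clusters = {}
        for sample, label in clustering.items():
            clusters.setdefault(label, []).append(sample)
        for members in clusters.values():
            for a, b in combinations(sorted(members), 2):
                counts[(a, b)] += 1
    return counts

def consensus_clusters(samples, counts, min_support):
    """Assumed consensus step: connected components of the graph that keeps
    only pairs co-clustered at least `min_support` times (union-find)."""
    parent = {s: s for s in samples}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), n in counts.items():
        if n >= min_support:
            parent[find(a)] = find(b)
    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(g) for g in groups.values())

# Three hypothetical input clusterings of four samples.
clusterings = [
    {"s1": "A", "s2": "A", "s3": "B", "s4": "B"},
    {"s1": "X", "s2": "X", "s3": "X", "s4": "Y"},
    {"s1": 1, "s2": 1, "s3": 2, "s4": 2},
]
counts = co_occurrence_counts(clusterings)
consensus = consensus_clusters(["s1", "s2", "s3", "s4"], counts, min_support=2)
```

Here s1 and s2 are grouped by all three inputs and s3 and s4 by two of them, so a support threshold of 2 yields the two consensus clusters {s1, s2} and {s3, s4}.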

4.

MICADo - Looking for Mutations in Targeted PacBio Cancer Data: An Alignment-Free Method.

Rudewicz, Justine; Soueidan, Hayssam; Uricaru, Raluca; Bonnefoi, Hervé; Iggo, Richard; Bergh, Jonas; Nikolski, Macha.

Front Genet ; 7: 214, 2016.

Article in English | MEDLINE | ID: mdl-28008336

ABSTRACT

Targeted sequencing is commonly used in clinical applications of NGS technology since it enables generation of sufficient sequencing depth in the targeted genes of interest and thus ensures the best possible downstream analysis. This notwithstanding, the accurate discovery and annotation of disease-causing mutations remains a challenging problem even in such a favorable context. The difficulty is particularly salient in the case of third-generation sequencing technology, such as PacBio. We present MICADo, a de Bruijn graph-based method, implemented in Python, that makes it possible to distinguish between patient-specific mutations and other alterations for targeted sequencing of a cohort of patients. MICADo analyses NGS reads for each sample within the context of the data of the whole cohort in order to capture the specificities of the sample with respect to the cohort. MICADo is particularly suitable for sequencing data from highly heterogeneous samples, especially when it involves high rates of non-uniform sequencing errors. It was validated on PacBio sequencing datasets from several cohorts of patients. The comparison with two widely used available tools, namely VarScan and GATK, shows that MICADo is more accurate, especially when true mutations have frequencies close to background noise. The source code is available at http://github.com/cbib/MICADo.

5.

Colib'read on galaxy: a tools suite dedicated to biological information extraction from raw NGS reads.

Le Bras, Yvan; Collin, Olivier; Monjeaud, Cyril; Lacroix, Vincent; Rivals, Éric; Lemaitre, Claire; Miele, Vincent; Sacomoto, Gustavo; Marchet, Camille; Cazaux, Bastien; Zine El Aabidine, Amal; Salmela, Leena; Alves-Carvalho, Susete; Andrieux, Alexan; Uricaru, Raluca; Peterlongo, Pierre.

Gigascience ; 5: 9, 2016.

Article in English | MEDLINE | ID: mdl-26870323

ABSTRACT

BACKGROUND: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, needing large amounts of computing resources, and potentially removing or modifying parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools. FINDINGS: Dedicated to 'whole-genome assembly-free' treatments, the Colib'read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Based on the use of a de Bruijn graph and a Bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tool dissemination, we developed Galaxy tools and tool shed repositories. CONCLUSIONS: With the Colib'read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach allows the maximum biological information to be retained in the data, and uses a very low memory footprint.

Subject(s)

Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Information Storage and Retrieval/methods , Software , Base Sequence , Cluster Analysis , Genome/genetics , Genomics/methods , Molecular Sequence Data , Reproducibility of Results

6.

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.

Benoit, Gaëtan; Lemaitre, Claire; Lavenier, Dominique; Drezen, Erwan; Dayris, Thibault; Uricaru, Raluca; Rizk, Guillaume.

BMC Bioinformatics ; 16: 288, 2015 Sep 14.

Article in English | MEDLINE | ID: mdl-26370285

ABSTRACT

BACKGROUND: The volume of data generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method. RESULTS: We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing pertinent information for downstream analyses. CONCLUSIONS: LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open source software, distributed under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/.

Subject(s)

Algorithms , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans/genetics , Computer Graphics , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Software , Animals , Computational Biology/methods , Computer Simulation , Metagenomics , Probability
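To make the notion of a probabilistic de Bruijn graph stored in a Bloom filter more concrete, here is a minimal Python sketch. The class and function names, the hash construction and the filter size are assumptions for illustration only and do not reflect LEON's actual implementation. The k-mers of the reads are inserted into the filter, and the graph is then explored implicitly by asking which of the four possible extensions of a k-mer the filter reports as present.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with several hash functions; membership tests
    can return false positives but never false negatives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(read, k):
    """All substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def successors(kmer, bloom):
    """Extensions of `kmer` that the filter reports as present (graph edges)."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

# Toy reads (hypothetical) indexed with k = 4.
reads = ["ACGTAC", "CGTACG"]
k = 4
bloom = BloomFilter(n_bits=1024, n_hashes=3)
for read in reads:
    for kmer in kmers(read, k):
        bloom.add(kmer)
```

When `successors` returns a single k-mer the path is unambiguous; when it returns several, the node is a bifurcation, which is the kind of choice an encoder following the scheme in the abstract would have to record alongside the anchoring k-mer.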

7.

YOC, A new strategy for pairwise alignment of collinear genomes.

Uricaru, Raluca; Michotey, Célia; Chiapello, Hélène; Rivals, Eric.

BMC Bioinformatics ; 16: 111, 2015 Apr 02.

Article in English | MEDLINE | ID: mdl-25885358

ABSTRACT

BACKGROUND: Comparing and aligning genomes is a key step in analyzing closely related genomes. Despite the development of many genome aligners in the last 15 years, the problem is not yet fully resolved, even when aligning closely related bacterial genomes of the same species. In addition, no procedures are available to assess the quality of genome alignments or to compare genome aligners. RESULTS: We designed an original method for pairwise genome alignment, named YOC, which employs a highly sensitive similarity detection method together with a recent collinear chaining strategy that allows overlaps. YOC improves the reliability of collinear genome alignments, while preserving or even improving sensitivity. We also propose an original qualitative evaluation criterion for measuring the relevance of genome alignments. We used this criterion to compare and benchmark YOC with five recent genome aligners on large bacterial genome datasets, and showed it is suitable for identifying the specificities and the potential flaws of their underlying strategies. CONCLUSIONS: The YOC prototype is available at https://github.com/ruricaru/YOC . It has several advantages over existing genome aligners: (1) it is based on a simplified two-phase alignment strategy, (2) it is easy to parameterize, (3) it produces reliable genome alignments, which are easier to analyze and to use.

Subject(s)

User-Computer Interface , Algorithms , Comparative Genomic Hybridization , Genome, Bacterial , Internet , Lactococcus lactis/genetics , Sequence Alignment

8.

Reference-free detection of isolated SNPs.

Uricaru, Raluca; Rizk, Guillaume; Lacroix, Vincent; Quillery, Elsa; Plantard, Olivier; Chikhi, Rayan; Lemaitre, Claire; Peterlongo, Pierre.

Nucleic Acids Res ; 43(2): e11, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25404127

ABSTRACT

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, discoSnp, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, discoSnp ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, discoSnp requires significantly fewer computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.

Subject(s)

Genotyping Techniques/methods , Polymorphism, Single Nucleotide , Algorithms , Animals , Chromosomes, Human, Pair 1 , Escherichia coli/genetics , Genomics/methods , Humans , Ixodes/genetics , Mice , Mice, Inbred C57BL , Saccharomyces cerevisiae/genetics

9.

Advantages of mixing bioinformatics and visualization approaches for analyzing sRNA-mediated regulatory bacterial networks.

Thébault, Patricia; Bourqui, Romain; Benchimol, William; Gaspin, Christine; Sirand-Pugnet, Pascal; Uricaru, Raluca; Dutour, Isabelle.

Brief Bioinform ; 16(5): 795-805, 2015 Sep.

Article in English | MEDLINE | ID: mdl-25477348

ABSTRACT

The revolution in high-throughput sequencing technologies has enabled the acquisition of gigabytes of RNA sequences in many different conditions and has highlighted an unexpected number of small RNAs (sRNAs) in bacteria. Ongoing exploitation of these data enables numerous applications for investigating bacterial trans-acting sRNA-mediated regulation networks. Focusing on sRNAs that regulate mRNA translation in trans, recent works have noted several sRNA-based regulatory pathways that are essential for key cellular processes. Although the number of known bacterial sRNAs is increasing, the experimental validation of their interactions with mRNA targets remains challenging and involves expensive and time-consuming experimental strategies. Hence, bioinformatics is crucial for selecting and prioritizing candidates before designing any experimental work. However, current software for target prediction produces a prohibitive number of candidates because of the lack of biological knowledge regarding the rules governing sRNA-mRNA interactions. Therefore, there is a real need to develop new approaches to help biologists focus on the most promising predicted sRNA-mRNA interactions. From this perspective, this review aims at presenting the advantages of mixing bioinformatics and visualization approaches for analyzing predicted sRNA-mediated regulatory bacterial networks.

Subject(s)

Computational Biology , Gene Regulatory Networks , RNA, Bacterial/physiology

10.

KISSPLICE: de-novo calling alternative splicing events from RNA-seq data.

Sacomoto, Gustavo A T; Kielbassa, Janice; Chikhi, Rayan; Uricaru, Raluca; Antoniou, Pavlos; Sagot, Marie-France; Peterlongo, Pierre; Lacroix, Vincent.

BMC Bioinformatics ; 13 Suppl 6: S5, 2012 Apr 19.

Article in English | MEDLINE | ID: mdl-22537044

ABSTRACT

BACKGROUND: In this paper, we address the problem of identifying and quantifying polymorphisms in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each polymorphism corresponds to a recognisable pattern in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all polymorphisms in such graphs. We then introduce an exact algorithm, called KISSPLICE, to extract alternative splicing events. RESULTS: We show that KISSPLICE identifies more correct events than general purpose transcriptome assemblers. Additionally, on a dataset of 71 M reads from human brain and liver tissues, KISSPLICE identified 3497 alternative splicing events, out of which 56% are not present in the annotations, which confirms recent estimates showing that the complexity of alternative splicing has been largely underestimated so far. CONCLUSIONS: We propose new models and algorithms for the detection of polymorphism in RNA-seq data. This opens the way to a new kind of study on large HTS RNA-seq datasets, where the focus is not the global reconstruction of full-length transcripts, but local assembly of polymorphic regions. KISSPLICE is available for download at http://alcovna.genouest.org/kissplice/.

Subject(s)

Algorithms , Alternative Splicing , Models, Statistical , Sequence Analysis, RNA , Genome , Humans , Polymorphism, Single Nucleotide , Reference Standards , Tandem Repeat Sequences , Transcriptome

11.

Novel definition and algorithm for chaining fragments with proportional overlaps.

Uricaru, Raluca; Mancheron, Alban; Rivals, Eric.

J Comput Biol ; 18(9): 1141-54, 2011 Sep.

Article in English | MEDLINE | ID: mdl-21899421

ABSTRACT

Chaining fragments is a crucial step in genome alignment. Existing chaining algorithms compute a maximum weighted chain with no overlaps allowed between adjacent fragments. In practice, using local alignments as fragments, instead of Maximal Exact Matches (MEMs), generates frequent overlaps between fragments, due to combinatorial reasons and biological factors, e.g., variable tandem repeat structures that differ in number of copies between genomic sequences. In this article, to lift this limitation, we formulate a novel definition of a chain, allowing overlaps proportional to the fragment lengths, and exhibit an efficient algorithm for computing such a maximum weighted chain. We tested our algorithm on a dataset composed of 694 genome pairs and observed significant improvements in terms of coverage, while keeping the running times within reasonable limits. Moreover, experiments with different ratios of allowed overlaps showed the robustness of the chains with respect to these ratios. Our algorithm is implemented in a tool called OverlapChainer (OC), which is available upon request to the authors.

Subject(s)

Algorithms , Genome, Bacterial , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software
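The chaining problem described in the abstract above can be illustrated with a small dynamic program. This is only a sketch under stated assumptions: the precedence rule (an overlap on each genome of at most `max_ratio` times the shorter of the two fragments), the chain weight (a plain sum of fragment weights) and all names are hypothetical, and the quadratic loop below is not the efficient algorithm of the article.

```python
from typing import List, NamedTuple, Optional, Tuple

class Fragment(NamedTuple):
    """A local similarity between genome X and genome Y (half-open intervals)."""
    x_start: int
    x_end: int
    y_start: int
    y_end: int
    weight: float

def can_precede(f: Fragment, g: Fragment, max_ratio: float) -> bool:
    """Assumed rule: f starts and ends before g on both genomes, and overlaps g
    by at most max_ratio times the shorter fragment length on each genome."""
    if not (f.x_start < g.x_start and f.y_start < g.y_start):
        return False
    if not (f.x_end < g.x_end and f.y_end < g.y_end):
        return False
    x_limit = max_ratio * min(f.x_end - f.x_start, g.x_end - g.x_start)
    y_limit = max_ratio * min(f.y_end - f.y_start, g.y_end - g.y_start)
    x_overlap = max(0, f.x_end - g.x_start)
    y_overlap = max(0, f.y_end - g.y_start)
    return x_overlap <= x_limit and y_overlap <= y_limit

def best_chain(fragments: List[Fragment], max_ratio: float) -> Tuple[float, List[int]]:
    """Maximum-weight chain by O(n^2) dynamic programming; returns (weight, indices)."""
    if not fragments:
        return 0.0, []
    # Sorting by x_start is a valid processing order because a predecessor
    # must start strictly earlier on genome X.
    order = sorted(range(len(fragments)), key=lambda i: fragments[i].x_start)
    best = {}
    prev: dict = {}
    for pos, i in enumerate(order):
        best[i] = fragments[i].weight
        prev[i] = None
        for j in order[:pos]:
            candidate = best[j] + fragments[i].weight
            if can_precede(fragments[j], fragments[i], max_ratio) and candidate > best[i]:
                best[i] = candidate
                prev[i] = j
    end: Optional[int] = max(best, key=best.get)
    total = best[end]
    chain = []
    while end is not None:
        chain.append(end)
        end = prev[end]
    return total, chain[::-1]

# Toy fragments (hypothetical): the first two overlap by 2 positions on each genome.
fragments = [
    Fragment(0, 10, 0, 10, 10),
    Fragment(8, 20, 8, 20, 12),
    Fragment(20, 30, 20, 30, 10),
]
strict = best_chain(fragments, max_ratio=0.0)    # no overlap tolerated
relaxed = best_chain(fragments, max_ratio=0.25)  # overlaps up to 25% allowed
```

With no overlap tolerated, the first fragment cannot be chained with the second and the best chain keeps only the last two; allowing overlaps of up to a quarter of the shorter fragment lets all three fragments be chained, which is the coverage gain the abstract refers to.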

12.

An alternative approach to multiple genome comparison.

Mancheron, Alban; Uricaru, Raluca; Rivals, Eric.

Nucleic Acids Res ; 39(15): e101, 2011 Aug.

Article in English | MEDLINE | ID: mdl-21646341

ABSTRACT

Genome comparison is now a crucial step for genome annotation and identification of regulatory motifs. Genome comparison aims for instance at finding genomic regions either specific to or in one-to-one correspondence between individuals/strains/species. It serves e.g. to pre-annotate a new genome by automatically transferring annotations from a known one. However, efficiency, flexibility and objectives of current methods do not suit the whole spectrum of applications, genome sizes and organizations. Innovative approaches are still needed. Hence, we propose an alternative way of comparing multiple genomes based on segmentation by similarity. In this framework, rather than being formulated as a complex optimization problem, genome comparison is seen as a segmentation question for which a single optimal solution can be found in almost linear time. We apply our method to analyse three strains of a virulent pathogenic bacterium, Ehrlichia ruminantium, and identify 92 new genes. We also find that a substantial number of genes thought to be strain specific have potential orthologs in the other strains. Our solution is implemented in an efficient program, qod, equipped with a user-friendly interface, and enables the automatic transfer of annotations between compared genomes or contigs (Video in Supplementary Data). Because it somehow disregards the relative order of genomic blocks, qod can handle unfinished genomes, which due to the difficulty of sequencing completion may become an interesting characteristic for the future. Availability: http://www.atgc-montpellier.fr/qod.

Subject(s)

Genomics/methods , Software , Algorithms , Ehrlichia ruminantium/classification , Ehrlichia ruminantium/genetics , Genes, Bacterial , Genome, Bacterial , Species Specificity
