Search | BVS Regional Portal

iMOKA: k-mer based software to analyze large collections of sequencing data.

Lorenzi, Claudio; Barriere, Sylvain; Villemin, Jean-Philippe; Dejardin Bretones, Laureline; Mancheron, Alban; Ritchie, William.

Genome Biol ; 21(1): 261, 2020 Oct 13.

Article in English | MEDLINE | ID: mdl-33050927

ABSTRACT

iMOKA (interactive multi-objective k-mer analysis) is software that enables comprehensive analysis of sequencing data from large cohorts to generate robust classification models or explore specific genetic elements associated with disease etiology. iMOKA uses a fast and accurate feature reduction step that combines a Naïve Bayes classifier augmented by an adaptive entropy filter and a graph-based filter to rapidly reduce the search space. By using a flexible file format and distributed indexing, iMOKA can easily integrate data from multiple experiments, reduces disk space requirements, and identifies changes in transcript levels and single nucleotide variants. iMOKA is available at https://github.com/RitchieLabIGH/iMOKA and on Zenodo at https://doi.org/10.5281/zenodo.4008947 .

Subjects

Sequence Analysis, DNA; Software; Algorithms; Breast Neoplasms/classification; Breast Neoplasms/drug therapy; Breast Neoplasms/genetics; Drug Resistance, Neoplasm/genetics; Female; Humans; Ovarian Neoplasms/drug therapy; Ovarian Neoplasms/genetics; Pharmacogenomic Variants

GECKO is a genetic algorithm to classify and explore high throughput sequencing data.

Thomas, Aubin; Barriere, Sylvain; Broseus, Lucile; Brooke, Julie; Lorenzi, Claudio; Villemin, Jean-Philippe; Beurier, Gregory; Sabatier, Robert; Reynes, Christelle; Mancheron, Alban; Ritchie, William.

Commun Biol ; 2: 222, 2019.

Article in English | MEDLINE | ID: mdl-31240260

ABSTRACT

Comparative analysis of high-throughput sequencing data between multiple conditions often involves mapping of sequencing reads to a reference and downstream bioinformatics analyses. Both of these steps may introduce heavy bias and potential data loss. This is especially true in studies where patient transcriptomes or genomes may vary from their references, such as in cancer. Here we describe a novel approach and associated software that make use of advances in genetic algorithms and feature selection to comprehensively explore massive volumes of sequencing data to classify and discover new sequences of interest without a mapping step and without intensive use of specialized bioinformatics pipelines. We demonstrate that our approach, called GECKO (GEnetic Classification using k-mer Optimization), is effective at classifying and extracting meaningful sequences from multiple types of sequencing approaches including mRNA, microRNA, and DNA methylome data.
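As a minimal illustration of the mapping-free idea, the sketch below counts k-mers directly from reads and assembles a sample-by-k-mer count table, the kind of input a k-mer based classifier could consume. It is not GECKO's implementation: the function names (`count_kmers`, `kmer_matrix`), the choice of k = 4, and the toy reads are all assumptions made for illustration, and the genetic algorithm itself is not shown.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_matrix(samples, k):
    """Return (sorted k-mer list, {sample name: counts aligned to that list})."""
    per_sample = {name: count_kmers(reads, k) for name, reads in samples.items()}
    # Union of all k-mers seen in any sample, in a stable (sorted) order.
    kmers = sorted(set().union(*per_sample.values()))
    # A Counter returns 0 for k-mers that a sample never contained.
    rows = {name: [c[km] for km in kmers] for name, c in per_sample.items()}
    return kmers, rows


# Hypothetical toy data: two "samples" with a few short reads each.
samples = {"tumor": ["ACGTAC", "CGTACG"], "normal": ["ACGTTT"]}
kmers, rows = kmer_matrix(samples, k=4)
```

No reference genome is consulted at any point; each column of the table is simply a sequence observed in the raw reads.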

Subjects

Algorithms; High-Throughput Nucleotide Sequencing/methods; Blood Cells; Breast Neoplasms/classification; Breast Neoplasms/genetics; Computational Biology/methods; DNA Methylation; Humans; MicroRNAs; Mutation; RNA, Messenger; Software

Combining DGE and RNA-sequencing data to identify new polyA+ non-coding transcripts in the human genome.

Philippe, Nicolas; Bou Samra, Elias; Boureux, Anthony; Mancheron, Alban; Rufflé, Florence; Bai, Qiang; De Vos, John; Rivals, Eric; Commes, Thérèse.

Nucleic Acids Res ; 42(5): 2820-32, 2014 Mar.

Article in English | MEDLINE | ID: mdl-24357408

ABSTRACT

Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as 'TranscriRef'). We then annotated 750,000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ~34,000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.

Subjects

Gene Expression Profiling/methods; Genome, Human; RNA, Untranslated/analysis; Sequence Analysis, RNA/methods; Cell Line; Humans; Molecular Sequence Annotation; Poly A/analysis; Software; Transcription, Genetic

Novel definition and algorithm for chaining fragments with proportional overlaps.

Uricaru, Raluca; Mancheron, Alban; Rivals, Eric.

J Comput Biol ; 18(9): 1141-54, 2011 Sep.

Article in English | MEDLINE | ID: mdl-21899421

ABSTRACT

Chaining fragments is a crucial step in genome alignment. Existing chaining algorithms compute a maximum weighted chain with no overlaps allowed between adjacent fragments. In practice, using local alignments as fragments, instead of Maximal Exact Matches (MEMs), generates frequent overlaps between fragments, due to combinatorial reasons and biological factors, i.e., variable tandem repeat structures that differ in number of copies between genomic sequences. In this article, to lift this limitation, we formulate a novel definition of a chain, allowing overlaps proportional to the fragment lengths, and present an efficient algorithm for computing such a maximum weighted chain. We tested our algorithm on a dataset composed of 694 genome pairs and observed significant improvements in terms of coverage, while keeping the running times within reasonable limits. Moreover, experiments with different ratios of allowed overlaps showed the robustness of the chains with respect to these ratios. Our algorithm is implemented in a tool called OverlapChainer (OC), which is available upon request from the authors.
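To make the notion of a chain with tolerated overlaps concrete, here is a small sketch of maximum weighted chaining by quadratic dynamic programming. It is not the OverlapChainer algorithm or the paper's exact definition: the overlap rule (overlap on each sequence at most `ratio` times the shorter of the two adjacent fragments), the `Fragment` fields, and the function names are assumptions chosen for illustration, and the chain weight here is simply the sum of fragment weights.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    s1: int        # start on sequence 1 (inclusive)
    e1: int        # end on sequence 1 (exclusive)
    s2: int        # start on sequence 2 (inclusive)
    e2: int        # end on sequence 2 (exclusive)
    weight: float


def can_follow(f, g, ratio):
    """True if g may directly follow f in a chain (assumed overlap rule)."""
    # g must begin and end strictly after f on both sequences.
    if not (f.s1 < g.s1 and f.e1 < g.e1 and f.s2 < g.s2 and f.e2 < g.e2):
        return False
    # Tolerated overlap is proportional to the shorter fragment's length.
    limit1 = ratio * min(f.e1 - f.s1, g.e1 - g.s1)
    limit2 = ratio * min(f.e2 - f.s2, g.e2 - g.s2)
    return max(0, f.e1 - g.s1) <= limit1 and max(0, f.e2 - g.s2) <= limit2


def best_chain(fragments, ratio):
    """Return (total weight, fragments) of a maximum weighted chain, O(n^2)."""
    frags = sorted(fragments, key=lambda f: (f.s1, f.s2))
    if not frags:
        return 0.0, []
    score = [f.weight for f in frags]   # best chain weight ending at each fragment
    prev = [-1] * len(frags)            # back-pointers for chain recovery
    for j, g in enumerate(frags):
        for i in range(j):
            if can_follow(frags[i], g, ratio) and score[i] + g.weight > score[j]:
                score[j] = score[i] + g.weight
                prev[j] = i
    j = max(range(len(frags)), key=score.__getitem__)
    total, chain = score[j], []
    while j != -1:
        chain.append(frags[j])
        j = prev[j]
    return total, chain[::-1]


# Hypothetical fragments: A and B overlap by 2 positions on both sequences.
A = Fragment(0, 10, 0, 10, 10)
B = Fragment(8, 20, 8, 20, 12)
C = Fragment(25, 30, 25, 30, 5)
strict = best_chain([A, B, C], ratio=0.0)    # classic chaining: no overlap allowed
relaxed = best_chain([A, B, C], ratio=0.25)  # overlaps up to 25% tolerated
```

With `ratio=0.0` the sketch behaves like classic chaining and must drop either A or B; with `ratio=0.25` both fit in one chain, which is the kind of coverage gain the abstract describes.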

Subjects

Algorithms; Genome, Bacterial; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Software

An alternative approach to multiple genome comparison.

Mancheron, Alban; Uricaru, Raluca; Rivals, Eric.

Nucleic Acids Res ; 39(15): e101, 2011 Aug.

Article in English | MEDLINE | ID: mdl-21646341

ABSTRACT

Genome comparison is now a crucial step for genome annotation and identification of regulatory motifs. Genome comparison aims, for instance, at finding genomic regions either specific to or in one-to-one correspondence between individuals/strains/species. It serves, e.g., to pre-annotate a new genome by automatically transferring annotations from a known one. However, the efficiency, flexibility and objectives of current methods do not suit the whole spectrum of applications, genome sizes and organizations. Innovative approaches are still needed. Hence, we propose an alternative way of comparing multiple genomes based on segmentation by similarity. In this framework, rather than being formulated as a complex optimization problem, genome comparison is seen as a segmentation question for which a single optimal solution can be found in almost linear time. We apply our method to analyse three strains of a virulent pathogenic bacterium, Ehrlichia ruminantium, and identify 92 new genes. We also find that a substantial number of genes thought to be strain specific have potential orthologs in the other strains. Our solution is implemented in an efficient program, qod, equipped with a user-friendly interface, and enables the automatic transfer of annotations between compared genomes or contigs (Video in Supplementary Data). Because it somehow disregards the relative order of genomic blocks, qod can handle unfinished genomes, which due to the difficulty of sequencing completion may become an interesting characteristic for the future. Availability: http://www.atgc-montpellier.fr/qod.

Subjects

Genomics/methods; Software; Algorithms; Ehrlichia ruminantium/classification; Ehrlichia ruminantium/genetics; Genes, Bacterial; Genome, Bacterial; Species Specificity
