Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 11 de 11
Filter
Add more filters










Publication year range
1.
Microorganisms ; 10(4)2022 Mar 25.
Article in English | MEDLINE | ID: mdl-35456762

ABSTRACT

Metagenomics analysis is now routinely used for clinical diagnosis in several diseases, and we need confidence in interpreting metagenomics analysis of microbiota. Particularly from the side of clinical microbiology, we consider that it would be a major milestone to further advance microbiota studies with an innovative and significant approach consisting of processing steps and quality assessment for interpreting metagenomics data used for diagnosis. Here, we propose a methodology for taxon identification and abundance assessment of shotgun sequencing data of microbes that are well fitted for clinical setup. Processing steps of quality controls have been developed in order (i) to avoid low-quality reads and sequences, (ii) to optimize abundance thresholds and profiles, (iii) to combine classifiers and reference databases for best classification of species and abundance profiles for both prokaryotic and eukaryotic sequences, and (iv) to introduce external positive control. We find that the best strategy is to use a pipeline composed of a combination of different but complementary classifiers such as Kraken2/Bracken and Kaiju. Such improved quality assessment will have a major impact on the robustness of biological and clinical conclusions drawn from metagenomic studies.

2.
Nat Methods ; 14(11): 1063-1071, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28967888

ABSTRACT

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.


Subject(s)
Metagenomics , Software , Algorithms , Benchmarking , Sequence Analysis, DNA
3.
Nucleic Acids Res ; 45(8): e57, 2017 05 05.
Article in English | MEDLINE | ID: mdl-28053114

ABSTRACT

Whole transcriptome sequencing (RNA-seq) has become a standard for cataloguing and monitoring RNA populations. One of the main bottlenecks, however, is to correctly identify the different classes of RNAs among the plethora of reconstructed transcripts, particularly those that will be translated (mRNAs) from the class of long non-coding RNAs (lncRNAs). Here, we present FEELnc (FlExible Extraction of LncRNAs), an alignment-free program that accurately annotates lncRNAs based on a Random Forest model trained with general features such as multi k-mer frequencies and relaxed open reading frames. Benchmarking versus five state-of-the-art tools shows that FEELnc achieves similar or better classification performance on GENCODE and NONCODE data sets. The program also provides specific modules that enable the user to fine-tune classification accuracy, to formalize the annotation of lncRNA classes and to identify lncRNAs even in the absence of a training set of non-coding RNAs. We used FEELnc on a real data set comprising 20 canine RNA-seq samples produced by the European LUPA consortium to substantially expand the canine genome annotation to include 10 374 novel lncRNAs and 58 640 mRNA transcripts. FEELnc moves beyond conventional coding potential classifiers by providing a standardized and complete solution for annotating lncRNAs and is freely available at https://github.com/tderrien/FEELnc.


Subject(s)
Genome , Molecular Sequence Annotation/methods , RNA, Long Noncoding/genetics , Software , Transcriptome , Animals , Benchmarking , Decision Trees , Dogs , Gene Expression Regulation , Humans , Mice , Molecular Sequence Annotation/statistics & numerical data , Open Reading Frames , RNA, Long Noncoding/classification , RNA, Long Noncoding/metabolism , RNA, Messenger/classification , RNA, Messenger/genetics , RNA, Messenger/metabolism , Sequence Analysis, RNA
4.
BMC Bioinformatics ; 16: 288, 2015 Sep 14.
Article in English | MEDLINE | ID: mdl-26370285

ABSTRACT

BACKGROUND: Data volumes generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method. RESULTS: We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which allows to obtain higher compression rates without losing pertinent information for downstream analyses. CONCLUSIONS: LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. LEON is an open source software, distributed under GNU affero GPL License, available for download at http://gatb.inria.fr/software/leon/.


Subject(s)
Algorithms , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans/genetics , Computer Graphics , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Software , Animals , Computational Biology/methods , Computer Simulation , Metagenomics , Probability
5.
Nucleic Acids Res ; 43(2): e11, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25404127

ABSTRACT

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, discoSnp, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, discoSnp ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, discoSnp requires significantly less computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.


Subject(s)
Genotyping Techniques/methods , Polymorphism, Single Nucleotide , Algorithms , Animals , Chromosomes, Human, Pair 1 , Escherichia coli/genetics , Genomics/methods , Humans , Ixodes/genetics , Mice , Mice, Inbred C57BL , Saccharomyces cerevisiae/genetics
6.
Bioinformatics ; 30(24): 3451-7, 2014 Dec 15.
Article in English | MEDLINE | ID: mdl-25123898

ABSTRACT

MOTIVATION: Insertions play an important role in genome evolution. However, such variants are difficult to detect from short-read sequencing data, especially when they exceed the paired-end insert size. Many approaches have been proposed to call short insertion variants based on paired-end mapping. However, there remains a lack of practical methods to detect and assemble long variants. RESULTS: We propose here an original method, called MindTheGap, for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MindTheGap uses an efficient k-mer-based method to detect insertion sites in a reference genome, and subsequently assemble them from the donor reads. MindTheGap showed high recall and precision on simulated datasets of various genome complexities. When applied to real Caenorhabditis elegans and human NA12878 datasets, MindTheGap detected and correctly assembled insertions >1 kb, using at most 14 GB of memory.


Subject(s)
Mutagenesis, Insertional , Sequence Analysis, DNA/methods , Animals , Caenorhabditis elegans/genetics , Genome , Humans , Software
7.
Bioinformatics ; 30(20): 2959-61, 2014 Oct 15.
Article in English | MEDLINE | ID: mdl-24990603

ABSTRACT

MOTIVATION: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation. RESULTS: We propose an open-source library dedicated to genome assembly and analysis to fasten the process of developing efficient software. The library is based on a recent optimized de-Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with low memory footprints. AVAILABILITY AND IMPLEMENTATION: The GATB library is written in C++ and is available at the following Web site http://gatb.inria.fr under the A-GPL license. CONTACT: lavenier@irisa.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Biostatistics/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software , Algorithms , Computer Graphics , Genome, Human/genetics , Humans
8.
Algorithms Mol Biol ; 8(1): 22, 2013 Sep 16.
Article in English | MEDLINE | ID: mdl-24040893

ABSTRACT

BACKGROUND: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB). RESULTS: We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. CONCLUSIONS: An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.

9.
Bioinformatics ; 29(5): 652-3, 2013 Mar 01.
Article in English | MEDLINE | ID: mdl-23325618

ABSTRACT

SUMMARY: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers. AVAILABILITY: http://minia.genouest.org/dsk


Subject(s)
Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods , Software , Algorithms , Genome, Human , Humans
10.
Bioinformatics ; 26(20): 2534-40, 2010 Oct 15.
Article in English | MEDLINE | ID: mdl-20739310

ABSTRACT

MOTIVATION: The rapid development of next-generation sequencing technologies able to produce huge amounts of sequence data is leading to a wide range of new applications. This triggers the need for fast and accurate alignment software. Common techniques often restrict indels in the alignment to improve speed, whereas more flexible aligners are too slow for large-scale applications. Moreover, many current aligners are becoming inefficient as generated reads grow ever larger. Our goal with our new aligner GASSST (Global Alignment Short Sequence Search Tool) is thus 2-fold-achieving high performance with no restrictions on the number of indels with a design that is still effective on long reads. RESULTS: We propose a new efficient filtering step that discards most alignments coming from the seed phase before they are checked by the costly dynamic programming algorithm. We use a carefully designed series of filters of increasing complexity and efficiency to quickly eliminate most candidate alignments in a wide range of configurations. The main filter uses a precomputed table containing the alignment score of short four base words aligned against each other. This table is reused several times by a new algorithm designed to approximate the score of the full dynamic programming algorithm. We compare the performance of GASSST against BWA, BFAST, SSAHA2 and PASS. We found that GASSST achieves high sensitivity in a wide range of configurations and faster overall execution time than other state-of-the-art aligners. AVAILABILITY: GASSST is distributed under the CeCILL software license at http://www.irisa.fr/symbiose/projects/gassst/ CONTACT: guillaume.rizk@irisa.fr; dominique.lavenier@irisa.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Sequence Alignment/methods , Software , Base Sequence , Sequence Analysis, DNA/methods
11.
BMC Genomics ; 11: 281, 2010 May 05.
Article in English | MEDLINE | ID: mdl-20444247

ABSTRACT

BACKGROUND: Post-transcriptional regulation in eukaryotes can be operated through microRNA (miRNAs) mediated gene silencing. MiRNAs are small (18-25 nucleotides) non-coding RNAs that play crucial role in regulation of gene expression in eukaryotes. In insects, miRNAs have been shown to be involved in multiple mechanisms such as embryonic development, tissue differentiation, metamorphosis or circadian rhythm. Insect miRNAs have been identified in different species belonging to five orders: Coleoptera, Diptera, Hymenoptera, Lepidoptera and Orthoptera. RESULTS: We developed high throughput Solexa sequencing and bioinformatic analyses of the genome of the pea aphid Acyrthosiphon pisum in order to identify the first miRNAs from a hemipteran insect. By combining these methods we identified 149 miRNAs including 55 conserved and 94 new miRNAs. Moreover, we investigated the regulation of these miRNAs in different alternative morphs of the pea aphid by analysing the expression of miRNAs across the switch of reproduction mode. Pea aphid microRNA sequences have been posted to miRBase: http://microrna.sanger.ac.uk/sequences/. CONCLUSIONS: Our study has identified candidates as putative regulators involved in reproductive polyphenism in aphids and opens new avenues for further functional analyses.


Subject(s)
Aphids/genetics , Gene Expression Profiling , MicroRNAs/analysis , Animals , Base Sequence , MicroRNAs/genetics
SELECTION OF CITATIONS
SEARCH DETAIL
...