Search | BVS Regional Portal

Space-efficient representation of genomic k-mer count tables.

Shibuya, Yoshihiro; Belazzougui, Djamal; Kucherov, Gregory.

Algorithms Mol Biol ; 17(1): 5, 2022 Mar 21.

Article in English | MEDLINE | ID: mdl-35317833

ABSTRACT

MOTIVATION: k-mer counting is a common task in bioinformatics pipelines, with many dedicated tools available. Many of these tools output k-mer count tables containing both k-mers and counts, which easily reach tens of GB. Furthermore, such tables do not, in general, support efficient random-access queries. RESULTS: In this work, we design an efficient representation of k-mer count tables that supports fast random-access queries. We propose to apply Compressed Static Functions (CSFs), whose space is proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) that effectively overcomes this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations that break the empirical entropy lower bound for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy for large enough k.
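
The empirical zero-order entropy that the abstract uses as its space yardstick can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `count_kmers` and `zero_order_entropy` and the toy inputs are assumptions made for illustration, and k-mers are counted without canonicalization.

```python
from collections import Counter
from math import log2


def count_kmers(seq: str, k: int) -> Counter:
    """Count every length-k substring of seq (no reverse-complement merging)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def zero_order_entropy(values) -> float:
    """Empirical zero-order entropy of a collection of values, in bits per value."""
    freq = Counter(values)
    total = sum(freq.values())
    return -sum((n / total) * log2(n / total) for n in freq.values())


# Toy count table: the entropy is taken over the counts, not over the k-mers.
table = count_kmers("ACGTACGTTTGACCA", 3)
print(zero_order_entropy(table.values()))

# A hypothetical, very skewed count distribution: almost every count equals 1,
# so the entropy falls well below one bit per k-mer.
skewed_counts = [1] * 95 + [2] * 4 + [50]
print(zero_order_entropy(skewed_counts))
```

The second call shows the regime the abstract describes for whole genomes: when one count value dominates, the entropy target is a fraction of a bit per key, which leaves very little room for per-key overhead.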

Fast and compact matching statistics analytics.

Cunial, Fabio; Denas, Olgert; Belazzougui, Djamal.

Bioinformatics ; 38(7): 1838-1845, 2022 03 28.

Article in English | MEDLINE | ID: mdl-35134833

ABSTRACT

MOTIVATION: Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. RESULTS: We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state of the art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. We also provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics. AVAILABILITY AND IMPLEMENTATION: Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0. The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
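
For readers unfamiliar with the term, the matching statistics of a query against a text are conventionally defined position by position: MS[i] is the length of the longest prefix of query[i:] that occurs somewhere in the text. The Python sketch below computes this directly from the definition and then answers range-maximum and range-sum queries on the plain array. It is a naive illustration only: it is not the parallel algorithm or the compressed representation of the paper, and the function name and toy strings are assumptions.

```python
def matching_statistics(query: str, text: str) -> list:
    """MS[i] = length of the longest prefix of query[i:] that is a substring of text.

    Naive version that follows the definition; suitable for short strings only.
    """
    ms = []
    for i in range(len(query)):
        length = 0
        # Extend the match by one character for as long as it still occurs in text.
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcab", "cabx")
print(ms)

# Range queries over query positions 1..3 (inclusive), done naively on the array.
print(max(ms[1:4]))  # range maximum: longest local match in the window
print(sum(ms[1:4]))  # range sum: aggregate similarity in the window
```

A useful property to notice in the output is that consecutive values never drop by more than one, which is what makes compact bitvector encodings of the array possible in the first place.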

Subjects

Algorithms, Software, DNA Sequence Analysis/methods, Genomics/methods, Genome

Set-Min Sketch: A Probabilistic Map for Power-Law Distributions with Application to k-Mer Annotation.

Shibuya, Yoshihiro; Belazzougui, Djamal; Kucherov, Gregory.

J Comput Biol ; 29(2): 140-154, 2022 02.

Article in English | MEDLINE | ID: mdl-35049334

ABSTRACT

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, and output very large count tables that explicitly represent k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to only keep counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. Space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
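
As background for the baseline named in the abstract, here is a minimal Python sketch of a standard Count-Min sketch: several rows of counters, one hash function per row, updates add to one cell per row, and a query returns the minimum of those cells. It does not implement Set-Min or Max-Min, whose definitions the abstract does not give. The class name, the use of salted BLAKE2b as the row hash, and the toy k-mers are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Standard Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, key: str, row: int) -> int:
        # One hash function per row, derived by salting BLAKE2b with the row index.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._column(key, row)] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate a cell, so the minimum never underestimates.
        return min(self.rows[row][self._column(key, row)] for row in range(self.depth))


sketch = CountMinSketch(width=64, depth=4)
truth = {"ACG": 5, "CGT": 2, "GTA": 9}
for kmer, count in truth.items():
    sketch.add(kmer, count)
for kmer, count in truth.items():
    print(kmer, count, sketch.estimate(kmer))
```

Note that the k-mers themselves are never stored, only counters, which is the property the abstract relies on when the k-mer set is already known.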

Subjects

Computational Biology/methods, Genomics/statistics & numerical data, Software, Algorithms, Animals, Computer Graphics, Genetic Databases/statistics & numerical data, Human Genome, Humans, Statistical Models, Molecular Sequence Annotation/statistics & numerical data

DIAG a Diagnostic Web Application Based on Lung CT Scan Images and Deep Learning.

Hadj Bouzid, Amel Imene; Yahiaoui, Said; Lounis, Anis; Berrani, Sid-Ahmed; Belbachir, Hacène; Naïli, Qaïs; Abdi, Mohamed El Hafedh; Bensalah, Kawthar; Belazzougui, Djamal.

Stud Health Technol Inform ; 281: 332-336, 2021 May 27.

Article in English | MEDLINE | ID: mdl-34042760

ABSTRACT

Coronavirus disease is a pandemic that has infected millions of people around the world. Lung CT scans are effective diagnostic tools, but radiologists can quickly become overwhelmed by the flow of infected patients. Therefore, automated image interpretation needs to be achieved. Deep learning (DL) can support critical medical tasks including diagnostics, and DL algorithms have successfully been applied to the classification and detection of many diseases. This work aims to use deep learning methods that can classify patients as either Covid-19 positive or healthy. We collected 4 available datasets, and tested our convolutional neural networks (CNNs) on different distributions to investigate the generalizability of our models. In order to clearly explain the predictions, Grad-CAM and Fast-CAM visualization methods were used. Our approach reaches more than 92% accuracy on 2 different distributions. In addition, we propose a computer-aided diagnosis web application for Covid-19 diagnosis. The results suggest that our proposed deep learning tool can be integrated into the Covid-19 detection process and be useful for rapid patient management.

Subjects

COVID-19, Deep Learning, COVID-19 Testing, Humans, Lung, SARS-CoV-2, X-Ray Computed Tomography

A framework for space-efficient variable-order Markov models.

Cunial, Fabio; Alanko, Jarno; Belazzougui, Djamal.

Bioinformatics ; 35(22): 4607-4616, 2019 11 01.

Article in English | MEDLINE | ID: mdl-31004473

ABSTRACT

MOTIVATION: Markov models with contexts of variable length are widely used in bioinformatics for representing sets of sequences with similar biological properties. When models contain many long contexts, existing implementations are either unable to handle genome-scale training datasets within typical memory budgets, or they are optimized for specific model variants and are thus inflexible. RESULTS: We provide practical, versatile representations of variable-order Markov models and of interpolated Markov models that support a large number of context-selection criteria, scoring functions, probability smoothing methods, and interpolations, and that take up to four times less space than previous implementations based on the suffix array, regardless of the number and length of contexts, and up to ten times less space than previous trie-based representations, or more, while matching the size of related, state-of-the-art data structures from Natural Language Processing. We describe how to further compress our indexes to a quantity related to the redundancy of the training data, saving up to 90% of their space on very repetitive datasets, and making them up to 60 times smaller than previous implementations based on the suffix array. Finally, we show how to exploit constraints on the length and frequency of contexts to further shrink our compressed indexes to half of their size or more, achieving data structures that are a hundred times smaller than previous implementations based on the suffix array, or more. This allows variable-order Markov models to be used with bigger datasets and with longer contexts on the same hardware, thus possibly enabling new applications. AVAILABILITY AND IMPLEMENTATION: https://github.com/jnalanko/VOMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
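
To make the notion of a variable-order Markov model concrete, the following Python sketch trains context counts up to a maximum order and scores a symbol using the longest context of the history that was seen often enough. It stores contexts in a plain dictionary, so it shows the model only, not the space-efficient indexes of the paper. The context-selection rule (longest context with at least `min_count` occurrences), the absence of smoothing, and all names are assumptions chosen for brevity.

```python
def train(sequence: str, max_order: int) -> dict:
    """Map each context of length 0..max_order to the counts of symbols that follow it."""
    counts = {}
    for i, symbol in enumerate(sequence):
        for order in range(min(i, max_order) + 1):
            context = sequence[i - order:i]
            followers = counts.setdefault(context, {})
            followers[symbol] = followers.get(symbol, 0) + 1
    return counts


def probability(counts: dict, history: str, symbol: str,
                max_order: int, min_count: int = 1) -> float:
    """P(symbol | history), using the longest suffix of history seen at least min_count times."""
    for order in range(min(len(history), max_order), -1, -1):
        context = history[len(history) - order:]
        followers = counts.get(context, {})
        total = sum(followers.values())
        if total > 0 and total >= min_count:
            return followers.get(symbol, 0) / total
    return 0.0  # nothing usable was observed, not even the empty context


counts = train("ABABAB", max_order=2)
print(probability(counts, "AB", "A", max_order=2))
print(probability(counts, "XX", "A", max_order=2))  # backs off to the empty context
```

The dictionary grows with the number of distinct contexts, which is the memory problem the abstract describes for models with many long contexts.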

Subjects

Algorithms, Software, Genome, Probability

A framework for space-efficient read clustering in metagenomic samples.

Alanko, Jarno; Cunial, Fabio; Belazzougui, Djamal; Mäkinen, Veli.

BMC Bioinformatics ; 18(Suppl 3): 59, 2017 Mar 14.

Article in English | MEDLINE | ID: mdl-28361710

ABSTRACT

BACKGROUND: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed. RESULTS: We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length ℓ each, on an alphabet of total size σ, our algorithms take O(n(t + log σ)) time and just 2n + o(n) + O(max{ℓσ log n, K log m}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure. CONCLUSIONS: Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the art.
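
The role of the union-find data structure on the set of reads can be sketched in Python as follows. The merging criterion used here, joining two reads whenever they share an exact k-mer, is a simplifying assumption for illustration and is not taken from the abstract; the sketch also uses a plain dictionary in place of the bidirectional Burrows-Wheeler index, so it says nothing about the space bounds above. All names are hypothetical.

```python
class UnionFind:
    """Union-find over the integers 0..n-1 with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cluster_reads(reads: list, k: int) -> list:
    """Group read indices so that reads sharing any exact k-mer end up together."""
    uf = UnionFind(len(reads))
    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                uf.union(idx, first_seen[kmer])
            else:
                first_seen[kmer] = idx
    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(uf.find(idx), []).append(idx)
    return sorted(groups.values())


reads = ["ACGTAC", "GTACGG", "TTTTTT", "CTTTTA"]
print(cluster_reads(reads, k=4))
```

Each union or find touches only the read identifiers, which is why the cost of the union-find appears as a separate factor t in the running time quoted in the abstract.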

Subjects

Algorithms, Computational Biology/methods, DNA Fragmentation, Metagenomics, Cluster Analysis, Theoretical Models, DNA Sequence Analysis
