Search | VHL Regional Portal

ViQUF: De Novo Viral Quasispecies Reconstruction Using Unitig-Based Flow Networks.

Freire, Borja; Ladra, Susana; Parama, Jose R; Salmela, Leena.

IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1550-1562, 2023.

Article in English | MEDLINE | ID: mdl-35853050

ABSTRACT

During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbor multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the de novo version of the problem, where no reference is available. We present ViQUF, a de novo viral quasispecies assembler that addresses haplotype assembly and quantification. ViQUF obtains a first draft of the assembly graph from a de Bruijn graph. Then, for each pair of adjacent vertices, it builds a flow network from their paired-end information and solves a min-cost flow over it; this yields an approximate paired assembly graph whose edge labels are suggested frequency values, which serve as the first frequency estimation. The original haplotypes are then obtained through a greedy path reconstruction guided by a min-cost flow solution in the approximate paired assembly graph. ViQUF outputs the contigs with their frequency estimations. Results on real and simulated data show that ViQUF is at least four times faster than previous methods and uses at most half of their memory, while maintaining, and in some cases improving on, the high quality of assembly and frequency estimation of overlap graph-based methodologies, which are known to be more accurate but slower than de Bruijn graph-based approaches.

Subjects

Quasispecies; Software; Quasispecies/genetics; High-Throughput Nucleotide Sequencing; Haplotypes/genetics; Sequence Analysis, DNA/methods; Algorithms
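
The ViQUF abstract mentions a greedy path reconstruction guided by a flow solution. As a rough illustration only, the sketch below applies a generic "widest path first" flow decomposition to a toy graph whose edge labels are assumed to be frequency estimates. The function names, the toy graph, and the choice of heuristic are assumptions made for this example; this is not ViQUF's actual procedure, which also uses paired-end information.

```python
import heapq


def widest_path(flow, source, sink):
    """Return (width, path) for the source-to-sink path whose smallest
    edge flow is as large as possible (a max-bottleneck Dijkstra)."""
    best = {source: float("inf")}
    parent = {}
    heap = [(-best[source], source)]
    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(u, 0):
            continue  # stale heap entry
        if u == sink:
            break
        for v, f in flow.get(u, {}).items():
            w = min(width, f)
            if w > best.get(v, 0):
                best[v] = w
                parent[v] = u
                heapq.heappush(heap, (-w, v))
    if sink not in parent:
        return 0, []
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[sink], path


def greedy_decompose(flow, source, sink):
    """Repeatedly peel off the widest remaining path and subtract its
    width from the residual flow; returns a list of (path, width)."""
    residual = {u: dict(vs) for u, vs in flow.items()}
    paths = []
    while True:
        width, path = widest_path(residual, source, sink)
        if width <= 0:
            break
        for u, v in zip(path, path[1:]):
            residual[u][v] -= width
        paths.append((path, width))
    return paths


# Hypothetical toy graph: two haplotypes (abundances 70 and 30) that
# share the middle vertex "m".
toy_flow = {
    "s": {"a": 70, "b": 30},
    "a": {"m": 70},
    "b": {"m": 30},
    "m": {"c": 70, "d": 30},
    "c": {"t": 70},
    "d": {"t": 30},
}
haplotypes = greedy_decompose(toy_flow, "s", "t")
```

In this toy case the decomposition recovers one path of width 70 and one of width 30. When several strains have similar abundances, flow values alone cannot tell which branch before a shared vertex belongs with which branch after it, which is the kind of ambiguity that paired-end information helps resolve.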

Memory-Efficient Assembly Using Flye.

Freire, Borja; Ladra, Susana; Parama, Jose R.

IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3564-3577, 2022.

Article in English | MEDLINE | ID: mdl-34469305

ABSTRACT

In the past decade, next-generation sequencing (NGS) enabled the generation of genomic data in a cost-effective, high-throughput manner. The most recent third-generation sequencing technologies produce longer reads; however, their error rates are much higher, which complicates the assembly process. As a result, long-read assemblers are demanding in both time and space. Moreover, the advances in these technologies have allowed portable and real-time DNA sequencing, enabling in-field analysis. In these scenarios, it becomes crucial to have more efficient solutions that can be run on computers or mobile devices with minimal hardware requirements. We re-implemented an existing long-read assembler, namely Flye, using compressed data structures. We then compare our version with the original software using real datasets, and evaluate their performance in terms of memory requirements, execution speed, and energy consumption. The assembly results are not affected, as the core of the algorithm is maintained, but the use of advanced compact data structures reduces memory consumption by 22% to 47% and processing time by up to 25%, with some datasets running on a par with the original. These improvements also reduce energy consumption by around 3-8%, with some datasets obtaining decreases of up to 26%.

Subjects

Genomics; Software; Genomics/methods; Sequence Analysis, DNA/methods; Algorithms; Genome; High-Throughput Nucleotide Sequencing/methods
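
The abstract above does not say which compact data structures were used. To give a feel for the general idea of trading a little bit-level arithmetic for memory, here is a minimal sketch that stores a DNA sequence at 2 bits per base instead of one byte per character. The class `PackedDNA` and its methods are hypothetical names for this illustration and are not taken from Flye or from the re-implementation described in the paper.

```python
class PackedDNA:
    """DNA sequence stored at 2 bits per base (4 bases per byte).

    Illustrative only: handles just the letters A, C, G and T.
    """

    _CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    _BASE = "ACGT"

    def __init__(self, seq):
        self._n = len(seq)
        # Round up so that a trailing partial group still gets a byte.
        self._bits = bytearray((self._n + 3) // 4)
        for i, base in enumerate(seq):
            # Byte index is i // 4; the slot within the byte is i % 4.
            self._bits[i >> 2] |= self._CODE[base] << ((i & 3) * 2)

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if not 0 <= i < self._n:
            raise IndexError(i)
        return self._BASE[(self._bits[i >> 2] >> ((i & 3) * 2)) & 3]

    def nbytes(self):
        """Number of bytes used by the packed payload."""
        return len(self._bits)


read = "ACGTTGCA" * 3 + "AC"      # 26 bases
packed = PackedDNA(read)
unpacked = "".join(packed[i] for i in range(len(packed)))
```

Here the 26-base read occupies 7 payload bytes rather than 26, and individual bases remain directly accessible without decoding the whole sequence.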

Inference of viral quasispecies with a paired de Bruijn graph.

Freire, Borja; Ladra, Susana; Paramá, Jose R; Salmela, Leena.

Bioinformatics ; 37(4): 473-481, 2021 May 01.

Article in English | MEDLINE | ID: mdl-32926162

ABSTRACT

MOTIVATION: RNA viruses exhibit a high mutation rate, and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks for a characterization of the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are based either on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. RESULTS: We present viaDBG, a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to that of PEHaplo, but viaDBG can also retrieve low-abundance quasispecies, which are often missed by PEHaplo. AVAILABILITY AND IMPLEMENTATION: viaDBG is implemented in C++ and is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Quasispecies; Software; Algorithms; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA
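
For readers unfamiliar with the de Bruijn graphs that the viaDBG abstract builds on, the following minimal sketch constructs a plain (unpaired) de Bruijn graph from reads and lists its branching nodes. It is a textbook construction under simplifying assumptions (error-free reads, no reverse complements, no k-mer counts), with function names chosen for this example; it does not reproduce viaDBG's error correction or its paired de Bruijn graph.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every k-mer in a read contributes one edge from its prefix
    (first k-1 characters) to its suffix (last k-1 characters).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. points where the
    strains in the sample start to differ."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Two hypothetical strains that differ at a single position.
toy_reads = ["ACGTA", "ACGGA"]
toy_graph = build_de_bruijn(toy_reads, k=3)
forks = branching_nodes(toy_graph)   # ['CG']
```

With k = 3 the two toy reads share the prefix path AC -> CG and then diverge, so CG is the only branching node. Larger k values make such forks rarer, which is one reason correcting read errors first, as the abstract describes, allows larger k-mers to be used.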

Efficient processing of raster and vector data.

Silva-Coira, Fernando; Paramá, José R; Ladra, Susana; López, Juan R; Gutiérrez, Gilberto.

PLoS One ; 15(1): e0226943, 2020.

Article in English | MEDLINE | ID: mdl-31923261

ABSTRACT

In this work, we propose a framework to store and manage spatial data, which includes new efficient algorithms for operations that take a raster dataset and a vector dataset as input. More concretely, we present algorithms for solving a spatial join between a raster and a vector dataset that imposes a restriction on the values of the raster cells, and an algorithm for retrieving the K objects of a vector dataset that overlap the highest (or lowest) cell values of a raster dataset among all objects. The raster data is stored using a compact data structure, which can directly manipulate compressed data without the need for prior decompression. This leads to better running times and lower memory consumption. In our experimental evaluation comparing our solution to other baselines, we obtain the best space/time trade-offs.

Subjects

Data Compression/methods; Information Storage and Retrieval/methods; Algorithms; Data Compression/standards; Datasets as Topic; Information Storage and Retrieval/standards
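
To make the two operations named in the abstract above concrete, here is a naive, uncompressed sketch of a value-restricted raster/vector join and a top-K query. It assumes the raster is a plain list of rows and each vector object is reduced to an inclusive bounding rectangle `(row0, col0, row1, col1)`; all names and the toy data are hypothetical. It only illustrates what the queries return, not the compact data structure or the algorithms proposed in the paper.

```python
import heapq


def cells_in(rect):
    """Yield the (row, col) cells covered by an inclusive rectangle."""
    r0, c0, r1, c1 = rect
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            yield r, c


def range_join(raster, objects, lo, hi):
    """For each object, list the overlapped cells whose value lies in
    [lo, hi]; objects with no such cell are left out of the result."""
    result = {}
    for name, rect in objects.items():
        hits = [(r, c) for r, c in cells_in(rect) if lo <= raster[r][c] <= hi]
        if hits:
            result[name] = hits
    return result


def top_k(raster, objects, k):
    """Names of the k objects overlapping the highest cell values,
    ranked by the maximum value each object overlaps."""
    scored = (
        (max(raster[r][c] for r, c in cells_in(rect)), name)
        for name, rect in objects.items()
    )
    return [name for _, name in heapq.nlargest(k, scored)]


# Hypothetical 3x3 raster and three rectangular objects.
toy_raster = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
toy_objects = {"lake": (0, 0, 1, 1), "road": (1, 1, 2, 2), "park": (0, 2, 0, 2)}

joined = range_join(toy_raster, toy_objects, lo=6, hi=9)
best_two = top_k(toy_raster, toy_objects, k=2)   # ['road', 'lake']
```

This baseline visits every overlapped cell of every object, so its cost grows with the covered area. The abstract attributes its better running times and lower memory use to querying the compressed raster directly.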
