Search | BVS Regional Portal (test)

SeQual-Stream: approaching stream processing to quality control of NGS datasets.

Castellanos-Rodríguez, Óscar; Expósito, Roberto R; Touriño, Juan.

BMC Bioinformatics ; 24(1): 403, 2023 Oct 27.

Article in English | MEDLINE | ID: mdl-37891497

ABSTRACT

BACKGROUND: Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model, which requires the complete genetic dataset to be available before processing can begin. This limitation clearly hinders quality control performance in scenarios where the dataset must be downloaded from a remote repository and/or copied to a distributed file system for parallel processing. RESULTS: In this paper we present SeQual-Stream, a streaming tool that performs multiple quality control operations on genomic datasets in a fast, distributed and scalable way. To do so, our approach relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS. The experimental results show significant improvements in the execution times of SeQual-Stream when compared to a batch processing tool with similar quality control features, providing a maximum speedup of 2.7× when processing a dataset with more than 250 million DNA sequences, while also demonstrating good scalability. CONCLUSION: Our solution provides a more scalable, higher-performance way to carry out quality control of large genomic datasets by taking advantage of stream processing features. The tool is distributed as free open-source software released under the GNU AGPLv3 license and is publicly available to download at https://github.com/UDC-GAC/SeQual-Stream.

Subjects

Genomics, Software, Genomics/methods, Genome, Base Sequence, Algorithms, High-Throughput Nucleotide Sequencing/methods
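The streaming idea described in the SeQual-Stream abstract above (filtering reads as they arrive instead of waiting for the complete dataset) can be sketched in plain Python. This is only an illustration under stated assumptions, not the tool's implementation, which relies on Apache Spark and HDFS: the names `read_fastq` and `quality_filter`, the thresholds, and the Phred+33 quality encoding are all hypothetical choices made for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield FASTQ records one at a time from any line source (file, pipe, socket)."""
    it = iter(lines)
    # zip over the same iterator groups lines four by four and stops
    # cleanly if the stream ends with an incomplete record.
    for header, sequence, _separator, quality in zip(it, it, it, it):
        yield FastqRecord(header.rstrip("\n"), sequence.rstrip("\n"), quality.rstrip("\n"))


def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding is an assumption)."""
    if not quality:
        return 0.0
    return sum(ord(c) - offset for c in quality) / len(quality)


def quality_filter(
    records: Iterable[FastqRecord],
    min_length: int = 4,            # hypothetical threshold
    min_mean_quality: float = 20.0  # hypothetical threshold
) -> Iterator[FastqRecord]:
    """Lazily keep records that pass both filters; nothing is buffered."""
    for rec in records:
        if len(rec.sequence) >= min_length and mean_quality(rec.quality) >= min_mean_quality:
            yield rec


if __name__ == "__main__":
    incoming = [
        "@r1\n", "ACGT\n", "+\n", "IIII\n",  # good quality, long enough
        "@r2\n", "ACGT\n", "+\n", "!!!!\n",  # quality too low
        "@r3\n", "AC\n", "+\n", "II\n",      # too short
    ]
    for rec in quality_filter(read_fastq(incoming)):
        print(rec.header)
```

Because both functions are generators, each record is filtered as soon as its four lines are available, so the same code works on a file that is still being written or on a network stream.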

ParRADMeth: Identification of Differentially Methylated Regions on Multicore Clusters.

Fernandez-Fraga, Alejandro; Gonzalez-Dominguez, Jorge; Tourino, Juan.

IEEE/ACM Trans Comput Biol Bioinform ; 20(3): 2041-2049, 2023.

Article in English | MEDLINE | ID: mdl-37015593

ABSTRACT

The discovery of Differentially Methylated (DM) regions is an important research field in biology, as it can help to anticipate the risk of suffering from specific diseases. Nevertheless, the high computational cost of the bioinformatic tools developed for this purpose prevents their application to large-scale datasets. Hence, much faster tools are required to make further progress in this research field. In this work we present ParRADMeth, a parallel tool that applies beta-binomial regression for the identification of these DM regions. It is based on the state-of-the-art sequential tool RADMeth, which showed superior biological accuracy compared to its counterparts in previous experimental evaluations. ParRADMeth provides the same DM regions as RADMeth but with a significantly reduced runtime, thanks to exploiting the compute capabilities of common multicore CPU clusters. For example, our tool is up to 189 times faster for real data experiments on a cluster with 16 nodes, each one containing two eight-core processors. The source code of ParRADMeth, as well as a reference manual, is available at https://github.com/UDC-GAC/ParRADMeth.

Subjects

Computational Biology, Software, Algorithms
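The beta-binomial distribution underlying the regression mentioned in the ParRADMeth abstract above can be illustrated with its probability mass function. The sketch below is not taken from RADMeth or ParRADMeth and does not perform any regression; the function names and the (alpha, beta) parameterization are assumptions chosen for the example. It evaluates the probability of observing k methylated reads out of n reads covering a site.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b), computed via log-gamma for stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(K = k) for K ~ BetaBinomial(n, alpha, beta).

    pmf = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)
    """
    if not 0 <= k <= n:
        return 0.0
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


if __name__ == "__main__":
    # Hypothetical site: 10 reads, 7 of them methylated.
    print(beta_binomial_pmf(7, 10, alpha=2.0, beta=2.0))
```

With alpha = beta = 1 the distribution reduces to a uniform distribution over 0..n, which gives a quick sanity check for the implementation.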

PATO: genome-wide prediction of lncRNA-DNA triple helices.

Amatria-Barral, Iñaki; González-Domínguez, Jorge; Touriño, Juan.

Bioinformatics ; 39(3), 2023 Mar 01.

Article in English | MEDLINE | ID: mdl-36924420

ABSTRACT

MOTIVATION: Long non-coding RNA (lncRNA) plays a key role in many biological processes. For instance, lncRNA regulates chromatin using different molecular mechanisms, including direct RNA-DNA hybridization via triplexes, cotranscriptional RNA-RNA interactions, and RNA-DNA binding mediated by protein complexes. While the functional annotation of lncRNA transcripts has been widely studied over the last 20 years, barely a handful of tools have been developed with the specific purpose of detecting and evaluating lncRNA-DNA triple helices. What is worse, some of these tools are now nearly a decade old, so new triplex-centric pipelines depend on legacy software that cannot thoroughly process all the data made available by next-generation sequencing (NGS) technologies. RESULTS: We present PATO, a modern, fast, and efficient tool for the detection of lncRNA-DNA triplexes that matches NGS processing capabilities. PATO enables the prediction of triple helices at the genome scale and can process more than 60 GB of sequence data in as little as 1 h using a two-socket server. Moreover, PATO's efficiency allows a more exhaustive search of the triplex-forming solution space, so PATO achieves higher prediction accuracy in far less time than other state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: Source code, user manual, and tests are freely available to download under the MIT License at https://github.com/UDC-GAC/pato.

Subjects

Long Noncoding RNA, Long Noncoding RNA/genetics, Long Noncoding RNA/metabolism, DNA/metabolism, Software

SparkEC: speeding up alignment-based DNA error correction tools.

Expósito, Roberto R; Martínez-Sánchez, Marco; Touriño, Juan.

BMC Bioinformatics ; 23(1): 464, 2022 Nov 07.

Article in English | MEDLINE | ID: mdl-36344928

ABSTRACT

BACKGROUND: In recent years, huge improvements have been made in sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets generated nowadays through NGS. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance. RESULTS: In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been proven to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework, together with other enhancements such as the use of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results show significant improvements in the computational times of SparkEC when compared to CloudEC for all the representative datasets and scenarios under evaluation, providing average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart. CONCLUSION: As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, SparkEC's speed can increase with the number of nodes in a cluster. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS).

Subjects

High-Throughput Nucleotide Sequencing, Software, DNA Sequence Analysis/methods, High-Throughput Nucleotide Sequencing/methods, Genomics/methods, Algorithms, DNA/genetics

HSRA: Hadoop-based spliced read aligner for RNA sequencing data.

Expósito, Roberto R; González-Domínguez, Jorge; Touriño, Juan.

PLoS One ; 13(7): e0201483, 2018.

Article in English | MEDLINE | ID: mdl-30063721

ABSTRACT

Nowadays, the analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression. In RNA-seq experiments, the mapping of short reads to a reference genome or transcriptome is considered a crucial step that remains one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis. As cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed memory systems such as multi-core clusters or cloud platforms. HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. Source code in Java as well as a user's guide are publicly available for download at http://hsra.des.udc.es.

Subjects

Big Data, High-Throughput Nucleotide Sequencing, RNA Folding, Sequence Alignment/methods, RNA Sequence Analysis/methods, Software

MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud.

Expósito, Roberto R; Veiga, Jorge; González-Domínguez, Jorge; Touriño, Juan.

Bioinformatics ; 33(17): 2762-2764, 2017 Sep 01.

Article in English | MEDLINE | ID: mdl-28475668

ABSTRACT

SUMMARY: This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing. On a 16-node cluster deployed on the Amazon EC2 cloud platform, MarDRe is up to 8.52 times faster than a representative state-of-the-art tool. AVAILABILITY AND IMPLEMENTATION: Source code in Java and Hadoop as well as a user's guide are freely available under the GNU GPLv3 license at http://mardre.des.udc.es. CONTACT: rreye@udc.es.

Subjects

DNA Sequence Analysis/methods, Software, Algorithms
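The MapReduce pattern that the MarDRe abstract above applies to duplicate and near-duplicate removal can be sketched in a single Python process. The abstract does not describe the algorithm itself, so everything here is an assumption made for illustration: the map step keys each read by a fixed-length prefix, the reduce step drops reads within a Hamming-distance threshold of a read already kept in the same cluster, and the names `map_phase`, `reduce_phase` and `remove_duplicates` are hypothetical. No Hadoop API is used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def map_phase(reads: Iterable[str], prefix_len: int) -> Iterator[Tuple[str, str]]:
    """Map: emit (prefix, read) so that candidate duplicates share a key."""
    for read in reads:
        yield read[:prefix_len], read


def reduce_phase(cluster: List[str], max_mismatches: int) -> List[str]:
    """Reduce: within one cluster, keep a read only if no kept read is close to it."""
    kept: List[str] = []
    for read in cluster:
        is_duplicate = any(
            len(read) == len(other) and hamming(read, other) <= max_mismatches
            for other in kept
        )
        if not is_duplicate:
            kept.append(read)
    return kept


def remove_duplicates(reads: Iterable[str], prefix_len: int = 4, max_mismatches: int = 0) -> List[str]:
    # Shuffle: group mapped pairs by key (a framework would do this across nodes).
    clusters: Dict[str, List[str]] = defaultdict(list)
    for key, read in map_phase(reads, prefix_len):
        clusters[key].append(read)
    result: List[str] = []
    for cluster in clusters.values():
        result.extend(reduce_phase(cluster, max_mismatches))
    return result


if __name__ == "__main__":
    reads = ["ACGTAAAA", "ACGTAAAA", "ACGTAAAT", "TTTTCCCC"]
    print(remove_duplicates(reads, max_mismatches=0))
    print(remove_duplicates(reads, max_mismatches=1))
```

Keying by prefix means reads are only compared inside their own cluster, which is what lets the reduce step run independently on each key; the trade-off in this sketch is that near-duplicates differing within the prefix are not detected.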

MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems.

González-Domínguez, Jorge; Liu, Yongchao; Touriño, Juan; Schmidt, Bertil.

Bioinformatics ; 32(24): 3826-3828, 2016 Dec 15.

Article in English | MEDLINE | ID: mdl-27638400

ABSTRACT

MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively. AVAILABILITY AND IMPLEMENTATION: Source code in C++ and MPI running on Linux systems as well as a reference manual are available at http://msaprobs.sourceforge.net. CONTACT: jgonzalezd@udc.es. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Computational Biology/methods, Proteins, Sequence Alignment/methods, Algorithms, Amino Acid Sequence, Markov Chains, Software
