Search | VHL Regional Portal

A machine learning approach to model the impact of line edge roughness on gate-all-around nanowire FETs while reducing the carbon footprint.

García-Loureiro, Antonio; Seoane, Natalia; Fernández, Julián G; Comesaña, Enrique; Pichel, Juan C.

PLoS One ; 18(7): e0288964, 2023.

Article in English | MEDLINE | ID: mdl-37486944

ABSTRACT

The performance and reliability of semiconductor devices scaled down to the sub-nanometer regime are being seriously affected by process-induced variability. To properly assess the impact of the different sources of fluctuations, such as line edge roughness (LER), statistical analyses involving large samples of device configurations are needed. The computational cost of such studies can be very high if 3D advanced simulation tools (TCAD) that include quantum effects are used. In this work, we present a machine learning approach to model the impact of LER on two gate-all-around nanowire FETs that dramatically decreases the computational effort, thus reducing the carbon footprint of the study, while maintaining high accuracy. Finally, we demonstrate that transfer learning techniques can decrease the computing cost even further, with the carbon footprint of the study being just 0.18 g of CO2 (whereas a single device TCAD study can produce up to 2.6 kg of CO2), while obtaining coefficient of determination values larger than 0.985 when using only 10% of the input samples.

Subjects

Carbon Footprint, Nanowires, Carbon Dioxide, Reproducibility of Results, Machine Learning
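
As a reading aid for the abstract above: the coefficient of determination it reports (values larger than 0.985) measures how closely a model's predictions track reference values. The following is a minimal sketch of the standard definition only. The function name `r_squared` and the sample numbers are illustrative assumptions and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the reference values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference values (e.g. from a simulator) and model predictions.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.9]
score = r_squared(reference, predicted)
```

A score of 1 means the predictions match the reference exactly, and a score of 0 means the model does no better than always predicting the mean.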

BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale.

Piñeiro, César; Pichel, Juan C.

Gigascience ; 12, 2022 12 28.

Article in English | MEDLINE | ID: mdl-37522758

ABSTRACT

BACKGROUND: High-throughput sequencing technologies have led to an unprecedented explosion in the amounts of sequencing data available, which are typically stored using FASTA and FASTQ files. We can find in the literature several tools to process and manipulate those types of files with the aim of transforming sequence data into biological knowledge. However, none of them is well suited to efficiently processing very large files, likely in the order of terabytes in the following years, since they are based on sequential processing. Only some routines of the well-known seqkit tool are partly parallelized. In any case, its scalability is limited to using a few threads on a single computing node. RESULTS: Our approach, BigSeqKit, takes advantage of a high-performance computing-Big Data framework to parallelize and optimize the commands included in seqkit with the aim of speeding up the manipulation of FASTA/FASTQ files. In this way, in most cases, it is from tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to use and install on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line. CONCLUSIONS: BigSeqKit is a very complete and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.

Subjects

Big Data, Computational Biology, Gene Library, High-Throughput Nucleotide Sequencing, Knowledge

A Big Data Platform for Real Time Analysis of Signs of Depression in Social Media.

Martínez-Castaño, Rodrigo; Pichel, Juan C; Losada, David E.

Int J Environ Res Public Health ; 17(13), 2020 07 01.

Article in English | MEDLINE | ID: mdl-32630341

ABSTRACT

In this paper we propose a scalable platform for real-time processing of Social Media data. The platform ingests huge amounts of contents, such as Social Media posts or comments, and can support Public Health surveillance tasks. The processing and analytical needs of multiple screening tasks can easily be handled by incorporating user-defined execution graphs. The design is modular and supports different processing elements, such as crawlers to extract relevant contents or classifiers to categorise Social Media. We describe here an implementation of a use case built on the platform that monitors Social Media users and detects early signs of depression.

Subjects

Depression/epidemiology, Social Media, Big Data

Very Fast Tree: speeding up the estimation of phylogenies for large alignments through parallelization and vectorization strategies.

Piñeiro, César; Abuín, José M; Pichel, Juan C.

Bioinformatics ; 36(17): 4658-4659, 2020 11 01.

Article in English | MEDLINE | ID: mdl-32573652

ABSTRACT

MOTIVATION: FastTree-2 is one of the most successful tools for inferring large phylogenies. With speed at the core of its design, there are still important issues in the FastTree-2 implementation that harm its performance and scalability. To deal with these limitations, we introduce VeryFastTree, a highly tuned implementation of the FastTree-2 tool that takes advantage of parallelization and vectorization strategies to boost performance. RESULTS: VeryFastTree is able to construct a tree on a standard server using double-precision arithmetic from an ultra-large 330k alignment in only 4.5 h, which is 7.8× and 3.5× faster than the sequential and best parallel FastTree-2 times, respectively. AVAILABILITY AND IMPLEMENTATION: VeryFastTree is available at the GitHub repository: https://github.com/citiususc/veryfasttree. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Software, Trees, Algorithms, Computers, Phylogeny, Sequence Alignment

A big data approach to metagenomics for all-food-sequencing.

Kobus, Robin; Abuín, José M; Müller, André; Hellmann, Sören Lukas; Pichel, Juan C; Pena, Tomás F; Hildebrandt, Andreas; Hankeln, Thomas; Schmidt, Bertil.

BMC Bioinformatics ; 21(1): 102, 2020 Mar 12.

Article in English | MEDLINE | ID: mdl-32164527

ABSTRACT

BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).

Subjects

Big Data, Food Analysis/methods, High-Throughput Nucleotide Sequencing/methods, Metagenomics/methods, Whole Genome Sequencing/methods, Biosurveillance, Bacterial Genome, Metagenome, Microbiota/genetics, Software
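
The alignment-free, k-mer based idea mentioned in the abstract above can be illustrated with a toy sketch: index every k-mer of each reference, then assign a read to the reference that shares the most k-mers with it. This is only a simplified illustration under stated assumptions. It is not the AFS-MetaCache or MetaCacheSpark implementation, and the function names (`kmers`, `build_index`, `classify`, `quantify`) and the tiny sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def classify(read, index, k):
    """Return the reference sharing the most k-mers with the read, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def quantify(reads, index, k):
    """Count classified reads per reference as a crude composition estimate."""
    labels = (classify(read, index, k) for read in reads)
    return Counter(label for label in labels if label is not None)


# Hypothetical toy references and reads (real genomes are far larger).
references = {"cow": "ACGTACGTGGA", "pig": "TTTTCCCCAAAG"}
index = build_index(references, k=4)
composition = quantify(["CGTACG", "TTCCCC", "CCCAAA", "GGGGGG"], index, k=4)
```

No read is aligned base by base here; only exact k-mer lookups are used, which is what makes approaches of this kind fast. Production tools add compact index structures and database partitioning, which this sketch leaves out.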

PASTASpark: multiple sequence alignment meets Big Data.

Abuín, José M; Pena, Tomás F; Pichel, Juan C.

Bioinformatics ; 33(18): 2948-2950, 2017 Sep 15.

Article in English | MEDLINE | ID: mdl-28582480

ABSTRACT

MOTIVATION: One basic step in many bioinformatics analyses is the multiple sequence alignment. One of the state-of-the-art tools to perform multiple sequence alignment is PASTA (Practical Alignments using SATé and TrAnsitivity). PASTA supports multithreading but it is limited to processing datasets on shared memory systems. In this work we introduce PASTASpark, a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of PASTA, which is the most expensive task in terms of time consumption. RESULTS: Speedups of up to 10× with respect to single-threaded PASTA were observed, which makes it possible to process an ultra-large dataset of 200 000 sequences within the 24-h limit. AVAILABILITY AND IMPLEMENTATION: PASTASpark is an Open Source tool available at https://github.com/citiususc/pastaspark. CONTACT: josemanuel.abuin@usc.es. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Computational Biology/methods, Sequence Alignment/methods, Software, Algorithms

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data.

Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge.

PLoS One ; 11(5): e0155461, 2016.

Article in English | MEDLINE | ID: mdl-27182962

ABSTRACT

Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which ensures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license.

Subjects

Computational Biology/methods, Genomics/methods, High-Throughput Nucleotide Sequencing, Software, Humans, Reproducibility of Results, DNA Sequence Analysis/methods, Web Browser, Workflow

BigBWA: approaching the Burrows-Wheeler aligner to Big Data technologies.

Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge.

Bioinformatics ; 31(24): 4003-5, 2015 Dec 15.

Article in English | MEDLINE | ID: mdl-26323715

ABSTRACT

BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows-Wheeler aligner (BWA). Important reductions in the execution times were observed when using this tool. In addition, BigBWA is fault tolerant and it does not require any modification of the original BWA source code. AVAILABILITY AND IMPLEMENTATION: BigBWA is available at the project GitHub repository: https://github.com/citiususc/BigBWA.

Subjects

Sequence Alignment/methods, Software, Algorithms, Genomics
