Search | BVS Regional Portal (test)

Moving Just Enough Deep Sequencing Data to Get the Job Done.

Mills, Nicholas; Bensman, Ethan M; Poehlman, William L; Ligon, Walter B; Feltus, F Alex.

Bioinform Biol Insights ; 13: 1177932219856359, 2019.

Article in English | MEDLINE | ID: mdl-31236009

ABSTRACT

MOTIVATION: As the size of high-throughput DNA sequence datasets continues to grow, the cost of transferring and storing the datasets may prevent their processing in all but the largest data centers or commercial cloud providers. To lower this cost, it should be possible to process only a subset of the original data while still preserving the biological information of interest.

RESULTS: Using 4 high-throughput DNA sequence datasets of differing sequencing depth from 2 species as use cases, we demonstrate the effect of processing partial datasets on the number of detected RNA transcripts using an RNA-Seq workflow. We used transcript detection to decide on a cutoff point. We then physically transferred the minimal partial dataset and compared with the transfer of the full dataset, which showed a reduction of approximately 25% in the total transfer time. These results suggest that as sequencing datasets get larger, one way to speed up analysis is to simply transfer the minimal amount of data that still sufficiently detects biological signal.

AVAILABILITY: All results were generated using public datasets from NCBI and publicly available open source software.
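
The cutoff idea described in this abstract can be illustrated with a minimal sketch. The abstract does not state the exact criterion the authors used, so everything below is an assumption for illustration: the function name `minimal_fraction`, the 95% detection threshold, and the transcript counts are all hypothetical and are not taken from the paper.

```python
def minimal_fraction(detected_by_fraction, full_count, threshold=0.95):
    """Return the smallest data fraction whose detected-transcript count
    reaches `threshold` of the full-dataset count, or None if none does.

    The threshold is an assumed parameter, not a value from the paper.
    """
    for fraction in sorted(detected_by_fraction):
        if detected_by_fraction[fraction] >= threshold * full_count:
            return fraction
    return None


# Hypothetical counts, invented purely for illustration (not from the paper):
# fraction of reads processed -> number of transcripts detected.
example_counts = {0.10: 6200, 0.25: 8100, 0.50: 9600, 0.75: 9900, 1.00: 10000}

cutoff = minimal_fraction(example_counts, full_count=10000)
print(cutoff)
```

Under these made-up numbers, the sketch would select the smallest subset that still detects at least 95% of the transcripts found in the full dataset; only that subset would then need to be transferred.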

The Widening Gulf between Genomics Data Generation and Consumption: A Practical Guide to Big Data Transfer Technology.

Feltus, Frank A; Breen, Joseph R; Deng, Juan; Izard, Ryan S; Konger, Christopher A; Ligon, Walter B; Preuss, Don; Wang, Kuang-Ching.

Bioinform Biol Insights ; 9(Suppl 1): 9-19, 2015.

Article in English | MEDLINE | ID: mdl-26568680

ABSTRACT

In the last decade, high-throughput DNA sequencing has become a disruptive technology and pushed the life sciences into a distributed ecosystem of sequence data producers and consumers. Given the power of genomics and declining sequencing costs, biology is an emerging "Big Data" discipline that will soon enter the exabyte data range when all subdisciplines are combined. These datasets must be transferred across commercial and research networks in creative ways since sending data without thought can have serious consequences on data processing time frames. Thus, it is imperative that biologists, bioinformaticians, and information technology engineers recalibrate data processing paradigms to fit this emerging reality. This review attempts to provide a snapshot of Big Data transfer across networks, which is often overlooked by many biologists. Specifically, we discuss four key areas: 1) data transfer networks, protocols, and applications; 2) data transfer security including encryption, access, firewalls, and the Science DMZ; 3) data flow control with software-defined networking; and 4) data storage, staging, archiving and access. A primary intention of this article is to orient the biologist in key aspects of the data transfer process in order to frame their genomics-oriented needs to enterprise IT professionals.
