Search | BVS Regional Portal

CloudForest: A Scalable and Efficient Random Forest Implementation for Biological Data.

Bressler, Ryan; Kreisberg, Richard B; Bernard, Brady; Niederhuber, John E; Vockley, Joseph G; Shmulevich, Ilya; Knijnenburg, Theo A.

PLoS One ; 10(12): e0144820, 2015.

Article in English | MEDLINE | ID: mdl-26679347

ABSTRACT

Random Forest has become a standard data analysis tool in computational biology. However, extensions to existing implementations are often necessary to handle the complexity of biological datasets and their associated research questions. The growing size of these datasets requires high-performance implementations. We describe CloudForest, a Random Forest package written in Go, which is particularly well suited for large, heterogeneous, genetic and biomedical datasets. CloudForest includes several extensions, such as dealing with unbalanced classes and missing values. Its flexible design enables users to easily implement additional extensions. CloudForest achieves fast running times by making effective use of the CPU cache, optimizing for different classes of features, and multi-threading efficiently. CloudForest is available at https://github.com/ilyalab/CloudForest.
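As a rough illustration of one common way a Random Forest can account for unbalanced classes, the Python sketch below weights each sample by the inverse frequency of its class before computing Gini impurity. This is an assumption made for illustration only: the abstract does not say how CloudForest handles class imbalance, and the function names `weighted_gini` and `inverse_frequency_weights` are hypothetical, not part of CloudForest.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Hypothetical helper: weight each class by n / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * k) for cls, k in counts.items()}


def weighted_gini(labels, class_weights):
    """Gini impurity in which each sample counts by its class weight.

    Classes missing from class_weights default to a weight of 1.0,
    so an empty dict gives the ordinary (unweighted) Gini impurity.
    """
    counts = Counter(labels)
    weighted = {cls: k * class_weights.get(cls, 1.0) for cls, k in counts.items()}
    total = sum(weighted.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in weighted.values())


# Illustrative data: 2 minority-class samples and 8 majority-class samples.
labels = ["tumor"] * 2 + ["normal"] * 8
weights = inverse_frequency_weights(labels)

print(weighted_gini(labels, {}))       # unweighted impurity
print(weighted_gini(labels, weights))  # impurity after rebalancing
```

With these weights the two classes contribute equally to the impurity, so a split that isolates the minority class is rewarded as much as one that isolates the majority class.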

Subjects

Computational Biology/methods , Classification , Statistical Data Interpretation , Programming Languages , Regression Analysis , Software

Fastbreak: a tool for analysis and visualization of structural variations in genomic data.

Bressler, Ryan; Lin, Jake; Eakin, Andrea; Robinson, Thomas; Kreisberg, Richard; Rovira, Hector; Knijnenburg, Theo; Boyle, John; Shmulevich, Ilya.

EURASIP J Bioinform Syst Biol ; 2012(1): 15, 2012 Oct 09.

Article in English | MEDLINE | ID: mdl-23046488

ABSTRACT

Genomic studies are now being undertaken on thousands of samples, requiring new computational tools that can rapidly analyze data to identify clinically important features. Inferring structural variations in cancer genomes from mate-paired reads is a combinatorially difficult problem. We introduce Fastbreak, a fast and scalable toolkit that enables the analysis and visualization of large amounts of data from projects such as The Cancer Genome Atlas.

Methods for visual mining of genomic and proteomic data atlases.

Boyle, John; Kreisberg, Richard; Bressler, Ryan; Killcoyne, Sarah.

BMC Bioinformatics ; 13: 58, 2012 Apr 23.

Article in English | MEDLINE | ID: mdl-22524279

ABSTRACT

BACKGROUND: As the volume, complexity and diversity of the information that scientists work with on a daily basis continue to rise, so too does the requirement for new analytic software. The analytic software must resolve the dichotomy that exists between the need to allow for a high level of scientific reasoning, and the requirement to have an intuitive and easy-to-use tool which does not require specialist, and often arduous, training to use. Information visualization provides a solution to this problem, as it allows for direct manipulation and interaction with diverse and complex data. The challenge facing bioinformatics researchers is how to apply this knowledge to data sets that are continually growing in a field that is rapidly changing. RESULTS: This paper discusses an approach to the development of visual mining tools capable of supporting the mining of massive data collections used in systems biology research, and also discusses lessons learned in providing tools for both local researchers and the wider community. Example tools were developed which are designed to enable the exploration and analysis of both proteomics- and genomics-based atlases. These atlases represent large repositories of raw and processed experimental data generated to support the identification of biomarkers through mass spectrometry (the PeptideAtlas) and the genomic characterization of cancer (The Cancer Genome Atlas). Specifically, the tools are designed to allow for: the visual mining of thousands of mass spectrometry experiments, to assist in designing informed targeted protein assays; and the interactive analysis of hundreds of genomes, to explore the variations across different cancer genomes and cancer types. CONCLUSIONS: The mining of massive repositories of biological data requires the development of new tools and techniques. Visual exploration of the large-scale atlas data sets allows researchers to mine data to find new meaning and make sense at scales from single samples to entire populations. Providing linked task-specific views that allow a user to start from points of interest (from diseases to single genes) enables targeted exploration of thousands of spectra and genomes. As the composition of the atlases changes, and our understanding of the biology increases, new tasks will continually arise. It is therefore important to provide the means to make the data available in a suitable manner in as short a time as possible. We have done this through the use of common visualization workflows, into which we rapidly deploy visual tools. These visualizations follow common metaphors where possible to assist users in understanding the displayed data. Rapid development of tools and task-specific views allows researchers to mine large-scale data almost as quickly as it is produced. Ultimately these visual tools enable new inferences, new analyses and further refinement of the large-scale data being provided in atlases such as PeptideAtlas and The Cancer Genome Atlas.

Subjects

Data Mining , Genomics/methods , Neoplasms/genetics , Proteomics/methods , Software , Colonic Neoplasms/genetics , Female , Glioblastoma/genetics , Humans , Mass Spectrometry , Ovarian Neoplasms/genetics

SAMQA: error classification and validation of high-throughput sequenced read data.

Robinson, Thomas; Killcoyne, Sarah; Bressler, Ryan; Boyle, John.

BMC Genomics ; 12: 419, 2011 Aug 18.

Article in English | MEDLINE | ID: mdl-21851633

ABSTRACT

BACKGROUND: The advances in high-throughput sequencing technologies and the growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data. RESULTS: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor-quality data was identified prior to secondary analysis in significantly less time on the HPC framework than the same data run using alternative parallelization strategies on a single server. CONCLUSIONS: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.
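To give a sense of what per-read checks against the SAM specification's technical standards can look like, the Python sketch below validates a few of the 11 mandatory fields of a single alignment line. It is a minimal illustration under stated assumptions, not SAMQA's implementation: the function name `check_sam_record` and the choice of checks are hypothetical, and only field ranges and length rules from the SAM specification are used.

```python
import re

# CIGAR is either "*" or one or more <length><operation> pairs.
CIGAR_RE = re.compile(r"\*|([0-9]+[MIDNSHPX=])+")
CIGAR_OP_RE = re.compile(r"([0-9]+)([MIDNSHPX=])")
QUERY_OPS = set("MIS=X")  # operations that consume bases of the read


def check_sam_record(line):
    """Hypothetical checker: return a list of error strings for one SAM line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]

    flag, pos, mapq, cigar = fields[1], fields[3], fields[4], fields[5]
    seq, qual = fields[9], fields[10]
    errors = []

    # Integer fields and their upper bounds as given in the SAM specification.
    for name, value, upper in (
        ("FLAG", flag, 2**16 - 1),
        ("POS", pos, 2**31 - 1),
        ("MAPQ", mapq, 2**8 - 1),
    ):
        if not re.fullmatch(r"[0-9]+", value) or int(value) > upper:
            errors.append(f"{name} out of range: {value!r}")

    if not CIGAR_RE.fullmatch(cigar):
        errors.append(f"malformed CIGAR: {cigar!r}")
    elif cigar != "*" and seq != "*":
        query_len = sum(
            int(n) for n, op in CIGAR_OP_RE.findall(cigar) if op in QUERY_OPS
        )
        if query_len != len(seq):
            errors.append("CIGAR query length does not match SEQ length")

    if seq != "*" and qual != "*" and len(seq) != len(qual):
        errors.append("SEQ and QUAL lengths differ")

    return errors


good = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
bad = "r2\t0\tchr1\t100\t300\t3M\t*\t0\t0\tACGT\tIII"
print(check_sam_record(good))
print(check_sam_record(bad))
```

Because each line is checked independently, a function of this kind can be applied to reads in parallel, which is the property that makes read-level validation a natural fit for a distributed framework.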

Subjects

High-Throughput Nucleotide Sequencing/methods , Research Design , Genomics , High-Throughput Nucleotide Sequencing/standards , Humans , Quality Control

SEQADAPT: an adaptable system for the tracking, storage and analysis of high throughput sequencing experiments.

Burdick, David B; Cavnor, Chris C; Handcock, Jeremy; Killcoyne, Sarah; Lin, Jake; Marzolf, Bruz; Ramsey, Stephen A; Rovira, Hector; Bressler, Ryan; Shmulevich, Ilya; Boyle, John.

BMC Bioinformatics ; 11: 377, 2010 Jul 14.

Article in English | MEDLINE | ID: mdl-20630057

ABSTRACT

BACKGROUND: High-throughput sequencing has become an increasingly important tool for biological research. However, the existing software systems for managing and processing these data have not provided the flexible infrastructure that research requires. RESULTS: Existing software solutions provide static and well-established algorithms in a restrictive package. However, as high-throughput sequencing is a rapidly evolving field, such static approaches lack the ability to readily adopt the latest advances and techniques that researchers often require. We have used a loosely coupled, service-oriented infrastructure to develop SeqAdapt. This system streamlines data management and allows for rapid integration of novel algorithms. Our approach also allows computational biologists to focus on developing and applying new methods instead of writing boilerplate infrastructure code. CONCLUSION: The system is based around the Addama service architecture and is available at our website as a demonstration web application, an installable single download and as a collection of individual customizable services.

Subjects

Computational Biology/methods , DNA Sequence Analysis/methods , Software , Algorithms , Base Sequence , Database Management Systems , Internet , DNA Sequence Analysis/instrumentation
