Results 1 - 6 of 6
1.
PLoS One; 11(8): e0157077, 2016.
Article in English | MEDLINE | ID: mdl-27494614

ABSTRACT

BACKGROUND: A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression, and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment, co-morbidities, or lifestyle. The defining characteristics of Big Data (large size, incongruity, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources) all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data.

METHODS AND FINDINGS: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine learning-based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine learning-based classification methods indicated significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetic (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting.

CONCLUSIONS: Model-free, machine learning-based Big Data classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning-based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.
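A minimal sketch of the kind of workflow this abstract describes, assuming scikit-learn and synthetic data in place of the PPMI measures and the authors' released software: simple minority-class up-sampling followed by adaptive boosting and support vector machine classifiers scored with stratified n-fold cross-validation. Sample sizes, feature counts, and the resampling strategy are illustrative assumptions, not the published protocol.

```python
# Sketch only: rebalance an imbalanced cohort, then compare two model-free
# classifiers with stratified k-fold cross-validation. The feature matrix is
# synthetic; in the study it would hold clinical (UPDRS), demographic,
# genetic, and derived neuroimaging measures.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.utils import resample

# Synthetic stand-in for a design matrix: 600 subjects, 40 features,
# with roughly a 4:1 class imbalance.
X, y = make_classification(n_samples=600, n_features=40,
                           weights=[0.8, 0.2], random_state=0)

# Rebalance by up-sampling the minority class to the majority-class size.
X_maj, X_min = X[y == 0], X[y == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min_up))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in [("AdaBoost", AdaBoostClassifier(random_state=0)),
                  ("SVM", SVC(kernel="rbf", gamma="scale"))]:
    scores = cross_val_score(clf, X_bal, y_bal, cv=cv, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```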


Subject(s)
Databases, Factual; Parkinson Disease/diagnosis; Aged; Disease Progression; Female; Humans; Logistic Models; Male; Neuroimaging; Parkinson Disease/genetics; Parkinson Disease/pathology; Support Vector Machine
2.
J Am Med Inform Assoc; 22(6): 1126-31, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26198305

ABSTRACT

Modern biomedical data collection is generating exponentially more data in a multitude of formats. This flood of complex data presents significant opportunities to discover and understand the critical interplay among such diverse domains as genomics, proteomics, metabolomics, and phenomics, including imaging, biometrics, and clinical data. The Big Data for Discovery Science Center is taking an "-ome to home" approach to discover linkages between these disparate data sources by mining existing databases of proteomic and genomic data, brain images, and clinical assessments. In support of this work, the authors developed new technological capabilities that make it easy for researchers to manage, aggregate, manipulate, integrate, and model large amounts of distributed data. Guided by biological domain expertise, the Center's computational resources and software will reveal relationships and patterns, aiding researchers in identifying biomarkers for the most confounding conditions and diseases, such as Parkinson's and Alzheimer's.


Subject(s)
Biomedical Research; Datasets as Topic; Neurosciences; Humans; National Institutes of Health (U.S.); Translational Research, Biomedical; United States
3.
Genes (Basel); 3(3): 545-75, 2012 Aug 30.
Article in English | MEDLINE | ID: mdl-23139896

ABSTRACT

Whole-genome and exome sequencing have already proven to be essential and powerful methods to identify genes responsible for simple Mendelian inherited disorders. These methods can be applied to complex disorders as well, and have been adopted as one of the current mainstream approaches in population genetics. These achievements have been made possible by next generation sequencing (NGS) technologies, which require substantial bioinformatics resources to analyze the dense and complex sequence data. The huge analytical burden of data from genome sequencing might be seen as a bottleneck slowing the publication of NGS papers at this time, especially in psychiatric genetics. We review the existing methods for processing NGS data, to place into context the rationale for the design of a computational resource. We describe our method, the Graphical Pipeline for Computational Genomics (GPCG), to perform the computational steps required to analyze NGS data. The GPCG implements flexible workflows for basic sequence alignment, sequence data quality control, single nucleotide polymorphism analysis, copy number variant identification, annotation, and visualization of results. These workflows cover all the analytical steps required for NGS data, from processing the raw reads to variant calling and annotation. The current version of the pipeline is freely available at http://pipeline.loni.ucla.edu. These applications of NGS analysis may gain clinical utility in the near future (e.g., identifying miRNA signatures in diseases) when the bioinformatics approach is made feasible. Taken together, the annotation tools and strategies that have been developed to retrieve information and test hypotheses about the functional role of variants present in the human genome will help to pinpoint the genetic risk factors for psychiatric disorders.
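A minimal sketch, assuming locally installed bwa, samtools, and bcftools, of the kind of alignment-to-variant-calling chain such workflows wrap. It is not the GPCG itself; file names and flags are illustrative assumptions and may need adjusting to the installed tool versions.

```python
# Illustrative pipeline skeleton: align reads, sort/index the alignments,
# then call short variants into a VCF, with each step run as a subprocess.
import subprocess

def run(cmd, stdout_path=None):
    """Run one pipeline step, optionally redirecting stdout to a file."""
    print("running:", " ".join(cmd))
    if stdout_path:
        with open(stdout_path, "w") as out:
            subprocess.run(cmd, stdout=out, check=True)
    else:
        subprocess.run(cmd, check=True)

ref, reads = "ref.fa", "sample.fastq"   # assumed input files

# 1. Align raw reads to the reference (SAM written to stdout).
run(["bwa", "mem", ref, reads], stdout_path="sample.sam")
# 2. Coordinate-sort and index the alignments.
run(["samtools", "sort", "-o", "sample.sorted.bam", "sample.sam"])
run(["samtools", "index", "sample.sorted.bam"])
# 3. Call SNPs/short variants and write a VCF.
run(["bcftools", "mpileup", "-f", ref, "sample.sorted.bam"],
    stdout_path="sample.pileup.vcf")
run(["bcftools", "call", "-mv", "-o", "sample.variants.vcf",
     "sample.pileup.vcf"])
```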

4.
Stud Health Technol Inform; 175: 19-28, 2012.
Article in English | MEDLINE | ID: mdl-22941984

ABSTRACT

Progress in our understanding of brain disorders increasingly relies on the costly collection of large standardized brain magnetic resonance imaging (MRI) data sets. Moreover, the clinical interpretation of brain scans benefits from compare and contrast analyses of scans from patients with similar, and sometimes rare, demographic, diagnostic, and treatment status. A solution to both needs is to acquire standardized, research-ready clinical brain scans and to build the information technology infrastructure to share such scans, along with other pertinent information, across hospitals. This paper describes the design, deployment, and operation of a federated imaging system that captures and shares standardized, de-identified clinical brain images in a federation across multiple institutions. In addition to describing innovative aspects of the system architecture and our initial testing of the deployed infrastructure, we also describe the Standardized Imaging Protocol (SIP) developed for the project and our interactions with the Institutional Review Board (IRB) regarding handling patient data in the federated environment.
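A minimal sketch, assuming pydicom, of one step such a federation depends on: blanking identifying DICOM header elements before a scan leaves the acquiring hospital. It is not the project's Standardized Imaging Protocol or de-identification service; the tag list and pseudonym scheme are deliberately incomplete, illustrative assumptions.

```python
# Sketch of DICOM header de-identification before sharing across sites.
import pydicom

# Assumed, incomplete list of identifying header elements to blank.
IDENTIFYING_TAGS = [
    "PatientName", "PatientBirthDate", "OtherPatientIDs",
    "InstitutionName", "ReferringPhysicianName",
]

def deidentify(in_path: str, out_path: str, pseudonym: str) -> None:
    ds = pydicom.dcmread(in_path)
    for tag in IDENTIFYING_TAGS:
        if tag in ds:                 # blank the element if present
            setattr(ds, tag, "")
    ds.PatientID = pseudonym          # replace with a study pseudonym
    ds.remove_private_tags()          # drop vendor-specific private elements
    ds.save_as(out_path)

# Example (hypothetical file names):
# deidentify("scan_0001.dcm", "scan_0001_deid.dcm", "SITE1-SUBJ-042")
```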


Subject(s)
Brain Diseases/pathology; Brain/pathology; Information Dissemination/methods; Information Storage and Retrieval/methods; Internet; Medical Informatics/methods; Radiology Information Systems/organization & administration; Humans
5.
BMC Bioinformatics; 12: 304, 2011 Jul 26.
Article in English | MEDLINE | ID: mdl-21791102

ABSTRACT

BACKGROUND: Contemporary informatics and genomics research require efficient, flexible and robust management of large heterogeneous data, advanced computational tools, powerful visualization, reliable hardware infrastructure, interoperability of computational resources, and detailed data and analysis-protocol provenance. The Pipeline is a client-server distributed computational environment that facilitates the visual graphical construction, execution, monitoring, validation and dissemination of advanced data analysis protocols.

RESULTS: This paper reports on the applications of the LONI Pipeline environment to address two informatics challenges: graphical management of diverse genomics tools, and the interoperability of informatics software. Specifically, this manuscript presents the concrete details of deploying general informatics suites and individual software tools to new hardware infrastructures; the design, validation and execution of new visual analysis protocols via the Pipeline graphical interface; and the integration of diverse informatics tools via the Pipeline eXtensible Markup Language syntax. We demonstrate each of these processes using several established informatics packages (e.g., miBLAST, EMBOSS, mrFAST, GWASS, MAQ, SAMtools, Bowtie) for basic local sequence alignment and search, molecular biology data analysis, and genome-wide association studies. These examples demonstrate the power of the Pipeline graphical workflow environment to integrate bioinformatics resources through a well-defined syntax for dynamic specification of the input/output parameters and the run-time execution controls.

CONCLUSIONS: The LONI Pipeline environment (http://pipeline.loni.ucla.edu) provides a flexible graphical infrastructure for efficient biomedical computing and distributed informatics research. The interactive Pipeline resource manager enables the utilization and interoperability of diverse types of informatics resources. The Pipeline client-server model provides computational power to a broad spectrum of informatics investigators: experienced developers and novice users, users with or without access to advanced computational resources (e.g., Grid, data), and basic and translational scientists. The open development, validation and dissemination of computational networks (pipeline workflows) facilitates the sharing of knowledge, tools, protocols and best practices, and enables the unbiased validation and replication of scientific findings by the entire community.
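A minimal sketch, in Python rather than the Pipeline's own XML syntax, of the idea the abstract describes: each tool is declared as a module with named input/output parameters, and a workflow chains modules output-to-input. The module names, executables, and parameters here are assumptions for illustration only.

```python
# Sketch of describing command-line tools as parameterized workflow modules.
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str                                     # logical step name
    executable: str                               # wrapped command-line tool
    inputs: dict = field(default_factory=dict)    # parameter name -> value
    outputs: list = field(default_factory=list)   # parameter names produced

    def command(self) -> str:
        args = " ".join(f"--{k} {v}" for k, v in self.inputs.items())
        return f"{self.executable} {args}"

# Two hypothetical modules wired output-to-input.
align = Module("align", "aligner",
               inputs={"reads": "sample.fq", "out": "aln.sam"},
               outputs=["out"])
stats = Module("stats", "aln-stats",
               inputs={"in": align.inputs["out"], "report": "report.txt"})

for step in (align, stats):   # a real engine would resolve dependencies
    print(step.command())
```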


Subject(s)
Genomics/methods; Informatics/methods; Software; Computational Biology/methods; Medical Informatics Applications
6.
J Am Med Inform Assoc; 18(4): 416-22, 2011.
Article in English | MEDLINE | ID: mdl-21515543

ABSTRACT

OBJECTIVE: As biomedical technology becomes increasingly sophisticated, researchers can probe ever more subtle effects, but investigating such small effects often requires the acquisition of large amounts of data. In biomedicine, these data are often acquired at, and later shared between, multiple sites. There are both technological and sociological hurdles to overcome for data to be passed between researchers and later made accessible to the larger scientific community. The goal of the Biomedical Informatics Research Network (BIRN) is to address the challenges inherent in biomedical data sharing.

MATERIALS AND METHODS: BIRN tools are grouped into 'capabilities' and are available in the areas of data management, data security, information integration, and knowledge engineering. BIRN has a user-driven focus and employs a layered architectural approach that promotes reuse of infrastructure. BIRN tools are designed to be modular and therefore can work with pre-existing tools. BIRN users can choose the capabilities most useful for their application, without having to ensure that their project conforms to a monolithic architecture.

RESULTS: BIRN has implemented a new software-based data-sharing infrastructure that has been put to use in many different domains within biomedicine. BIRN is actively involved in outreach to the broader biomedical community to form working partnerships.

CONCLUSION: BIRN's mission is to provide capabilities and services related to data sharing to the biomedical research community. It does this by forming partnerships and solving specific, user-driven problems whose solutions are then available for use by other groups.


Subject(s)
Biomedical Research; Biotechnology; Computer Communication Networks; Information Dissemination; Biomedical Research/organization & administration; Computer Communication Networks/organization & administration; Computer Security; Computer Systems; Database Management Systems; Humans; Information Storage and Retrieval; Systems Integration; United States