Decibel: The Relational Dataset Branching System.

Maddox, Michael; Goehring, David; Elmore, Aaron J; Madden, Samuel; Parameswaran, Aditya; Deshpande, Amol.

Proceedings VLDB Endowment ; 9(9): 624-635, 2016 May.

Article in English | MEDLINE | ID: mdl-28149668

ABSTRACT

As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.

Collaborative Data Analytics with DataHub.

Bhardwaj, Anant; Karger, David; Subramanyam, Harihar; Deshpande, Amol; Madden, Sam; Wu, Eugene; Elmore, Aaron; Parameswaran, Aditya; Zhang, Rebecca.

Proceedings VLDB Endowment ; 8(12): 1916-1919, 2015 Aug.

Article in English | MEDLINE | ID: mdl-26844007

ABSTRACT

While many solutions have been proposed for storing and analyzing large volumes of data, all of them offer limited support for collaborative data analytics, especially given that many individuals and teams are simultaneously analyzing, modifying, and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native versioning capabilities: multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) an app ecosystem that hosts apps for various data-processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) Thrift-based data serialization that permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and MATLAB, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook, an IPython-based notebook for analyzing data and storing the results of data analysis.
