Results 1 - 10 of 10
1.
Proc ACM SIGMOD Int Conf Manag Data ; 2020: 1951-1966, 2020 Jun.
Article in English | MEDLINE | ID: mdl-33132489

ABSTRACT

Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.
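The joinability-search idea the abstract mentions can be illustrated with a minimal sketch (not the paper's actual method; the table, column, and function names below are hypothetical): rank candidate columns in a data lake by Jaccard overlap with a query column.

```python
def jaccard(a, b):
    """Set-overlap score in [0, 1] between two value collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_joinable(query_col, tables):
    """Rank every (table, column) pair by overlap with the query column."""
    scored = []
    for tname, cols in tables.items():
        for cname, values in cols.items():
            scored.append((jaccard(query_col, values), tname, cname))
    return sorted(scored, reverse=True)

# Toy "data lake": two tables, each a dict of column -> values.
lake = {
    "patients": {"id": [1, 2, 3, 4], "age": [40, 51, 33, 27]},
    "visits":   {"patient_id": [2, 3, 5], "cost": [10, 20, 30]},
}
ranking = rank_joinable([1, 2, 3], lake)  # best match: patients.id
```

Real systems replace the exact set intersection with sketches (e.g. MinHash) so the ranking scales to thousands of tables.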

2.
Proc Int Conf Data Eng ; 2019: 184-195, 2019 Apr.
Article in English | MEDLINE | ID: mdl-31595143

ABSTRACT

Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but track provenance only at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but cover only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks over data types such as strings and images. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks.
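The tuple-level provenance the abstract contrasts with file-level tracking can be sketched in a few lines (a simplified illustration, not the PROVision implementation): each tuple carries the set of input identifiers it was derived from, and operators propagate those sets.

```python
def scan(rows):
    """Tag each base tuple with a singleton provenance set of its id."""
    return [(row, frozenset([rid])) for rid, row in rows]

def select(pred, rel):
    """Filtering keeps a tuple's provenance unchanged."""
    return [(r, why) for r, why in rel if pred(r)]

def join(left, right, on):
    """An output tuple's provenance is the union of its inputs' provenance."""
    out = []
    for l, lwhy in left:
        for r, rwhy in right:
            if l[on] == r[on]:
                out.append(({**l, **r}, lwhy | rwhy))
    return out

people = scan([("p1", {"id": 1, "name": "ada"}),
               ("p2", {"id": 2, "name": "bob"})])
orders = scan([("o1", {"id": 1, "item": "book"})])
result = join(select(lambda r: r["name"] == "ada", people), orders, "id")
# result's single tuple traces back to inputs {"p1", "o1"}
```

A bad output row can then be traced directly to the base records that produced it, which is the troubleshooting capability the paper targets.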

3.
Article in English | MEDLINE | ID: mdl-31645838

ABSTRACT

Distributed architectures for efficient processing of streaming data are increasingly critical to modern information processing systems. The goal of this paper is to develop type-based programming abstractions that facilitate correct and efficient deployment of a logical specification of the desired computation on such architectures. In the proposed model, each communication link has an associated type specifying tagged data items along with a dependency relation over tags that captures the logical partial ordering constraints over data items. The semantics of a (distributed) stream processing system is then a function from input data traces to output data traces, where a data trace is an equivalence class of sequences of data items induced by the dependency relation. This data-trace transduction model generalizes both acyclic synchronous data-flow and relational query processors, and can specify computations over data streams with a rich variety of partial ordering and synchronization characteristics. We then describe a set of programming templates for data-trace transductions: abstractions corresponding to common stream processing tasks. Our system automatically maps these high-level programs to a given topology on the distributed implementation platform Apache Storm while preserving the semantics. Our experimental evaluation shows that (1) while automatic parallelization deployed by existing systems may not preserve semantics, particularly when the computation is sensitive to the ordering of data items, our programming abstractions allow a natural specification of the query that contains a mix of ordering constraints while guaranteeing correct deployment, and (2) the throughput of the automatically compiled distributed code is comparable to that of hand-crafted distributed implementations.
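The trace-equivalence notion at the heart of the model can be made concrete with a small check (a classic Mazurkiewicz-style characterization, simplified from the paper; tag names here are hypothetical): two item sequences denote the same data trace iff their projections onto every dependent group of tags agree.

```python
def project(seq, tags):
    """Subsequence of (tag, value) items whose tag is in the given set."""
    return [(t, v) for t, v in seq if t in tags]

def same_trace(s1, s2, dependent_groups):
    """True iff s1 and s2 are the same trace: items with independent
    tags may be reordered, but each tag is dependent on itself."""
    tags = {t for t, _ in s1} | {t for t, _ in s2}
    groups = set(dependent_groups) | {frozenset([t]) for t in tags}
    return all(project(s1, g) == project(s2, g) for g in groups)

# Readings on distinct keys are independent; same-key order must be kept.
s1 = [("k1", 5), ("k2", 7), ("k1", 6)]
s2 = [("k2", 7), ("k1", 5), ("k1", 6)]
s3 = [("k1", 6), ("k1", 5), ("k2", 7)]
assert same_trace(s1, s2, set())        # cross-key reordering is fine
assert not same_trace(s1, s3, set())    # same-key reordering is not
```

Declaring `frozenset(["k1", "k2"])` dependent would forbid the first reordering as well, which is how ordering-sensitive computations are expressed.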

5.
Article in English | MEDLINE | ID: mdl-29151821

ABSTRACT

Real-time decision making in emerging IoT applications typically relies on computing quantitative summaries of large data streams in an efficient and incremental manner. To simplify the task of programming the desired logic, we propose StreamQRE, which provides natural and high-level constructs for processing streaming data. Our language has a novel integration of linguistic constructs from two distinct programming paradigms: streaming extensions of relational query languages and quantitative extensions of regular expressions. The former allows the programmer to employ relational constructs to partition the input data by keys and to integrate data streams from different sources, while the latter can be used to exploit the logical hierarchy in the input stream for modular specifications. We first present the core language with a small set of combinators, formal semantics, and a decidable type system. We then show how to express a number of common patterns with illustrative examples. Our compilation algorithm translates the high-level query into a streaming algorithm with precise complexity bounds on per-item processing time and total memory footprint. We also show how to integrate approximation algorithms into our framework. We report on an implementation in Java, and evaluate it with respect to existing high-performance engines for processing streaming data. Our experimental evaluation shows that (1) StreamQRE allows more natural and succinct specification of queries compared to existing frameworks, (2) the throughput of our implementation is higher than comparable systems (for example, two-to-four times greater than RxJava), and (3) the approximation algorithms supported by our implementation can lead to substantial memory savings.
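The mix of relational and regular-expression-style constructs can be evoked with a toy combinator sketch in plain Python (not the actual StreamQRE API; all names are invented): a key-splitting combinator routes each item to an independent per-key quantitative aggregator, processing one item at a time.

```python
def running_sum():
    """Per-key quantitative aggregator: emits the running sum."""
    total = 0
    def step(x):
        nonlocal total
        total += x
        return total
    return step

def split_by_key(make_agg):
    """Relational-style combinator: maintain one aggregator per key,
    using O(number of keys) memory regardless of stream length."""
    per_key = {}
    def step(item):
        k, v = item
        if k not in per_key:
            per_key[k] = make_agg()
        return k, per_key[k](v)
    return step

q = split_by_key(running_sum)
out = [q(item) for item in [("a", 1), ("b", 10), ("a", 2)]]
# out == [("a", 1), ("b", 10), ("a", 3)]
```

The precise per-item time and memory bounds that StreamQRE's compiler guarantees come from composing such combinators rather than writing them by hand.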

6.
Big Data ; 5(3): 177-188, 2017 09.
Article in English | MEDLINE | ID: mdl-28816500

ABSTRACT

Significant research challenges must be addressed in the cleaning, transformation, integration, modeling, and analytics of Big Data sources for finance. This article surveys the progress made so far in this direction and the obstacles yet to be overcome. These issues matter to data-driven financial institutions in both corporate and consumer finance, to the legal profession, and to regulators. The discussion is also relevant to technology firms that support the growing field of FinTech.


Subjects
Financial Management, Statistical Models, Decision Making
8.
Neuron ; 92(3): 617-621, 2016 Nov 02.
Article in English | MEDLINE | ID: mdl-27810004

ABSTRACT

As the pace and complexity of neuroscience data grow, an open data ecosystem must grow with them to allow neuroscientists to reach new heights of discovery. First, however, the problems and complexities of neuroscience data sharing must be addressed. Among the challenges facing data sharing in neuroscience, the problems of incentives, discoverability, and sustainability may be the most pressing. Here we describe these problems and propose potential solutions to help cultivate an ecosystem for data sharing.


Subjects
Information Dissemination/methods, Neurosciences/trends, Humans
9.
Proc ACM SIGMOD Int Conf Manag Data ; 2016: 1705-1720, 2016.
Article in English | MEDLINE | ID: mdl-28659658

ABSTRACT

As declarative query processing techniques expand to the Web, data streams, network routers, and cloud platforms, there is an increasing need to re-plan execution in the presence of unanticipated performance changes. New runtime information may affect which query plan we prefer to run. Adaptive techniques require innovation both in terms of the algorithms used to estimate costs, and in terms of the search algorithm that finds the best plan. We investigate how to build a cost-based optimizer that recomputes the optimal plan incrementally given new cost information, much as a stream engine constantly updates its outputs given new data. Our implementation especially shows benefits for stream processing workloads. It lays the foundations upon which a variety of novel adaptive optimization algorithms can be built. We start by leveraging the recently proposed approach of formulating query plan enumeration as a set of recursive datalog queries; we develop a variety of novel optimization approaches to ensure effective pruning in both static and incremental cases. We further show that the lessons learned in the declarative implementation can be equally applied to more traditional optimizer implementations.
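The re-planning loop the abstract describes can be caricatured in a few lines (a toy that recomputes from scratch, whereas the paper's contribution is doing this incrementally; the cardinalities and selectivities below are invented): enumerate left-deep join orders under the current cost estimates, then re-run when runtime feedback changes an estimate.

```python
from itertools import permutations

def best_plan(cards, join_sel):
    """Exhaustive left-deep join ordering; cost is the total size of
    intermediate results under a uniform join selectivity."""
    best = (float("inf"), None)
    for order in permutations(cards):
        size, cost = cards[order[0]], 0.0
        for rel in order[1:]:
            size *= cards[rel] * join_sel
            cost += size
        best = min(best, (cost, order))
    return best

cards = {"R": 1000, "S": 10, "T": 100}
cost1, plan1 = best_plan(cards, join_sel=0.01)
# Runtime feedback: S turns out to be far larger than estimated.
cards["S"] = 100000
cost2, plan2 = best_plan(cards, join_sel=0.01)  # re-plan: S joined last
```

An incremental optimizer avoids repeating the full enumeration by updating only the memoized subplans whose costs the new information invalidates.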

10.
J Clin Neurophysiol ; 32(3): 235-9, 2015 Jun.
Article in English | MEDLINE | ID: mdl-26035676

ABSTRACT

Technological advances are dramatically accelerating translational research in epilepsy. Neurophysiology, imaging, and metadata are now recorded digitally in most centers, enabling quantitative analysis. Basic and translational research opportunities to use these data are exploding, but academic and funding cultures prevent this potential from being realized. Research on epileptogenic networks, antiepileptic devices, and biomarkers could progress rapidly if collaborative efforts to digest this "big neuro data" were organized. Higher temporal and spatial resolution data are driving the need for novel multidimensional visualization and analysis tools. Crowd-sourced science, of the kind that drives innovation in computer science, could easily be mobilized for these tasks, were it not for competition for funding and attribution and the lack of standard data formats and platforms. As these efforts mature, there is a great opportunity to advance epilepsy research through data sharing and increased collaboration across the international research community.


Subjects
Epilepsy, Information Dissemination, Translational Biomedical Research/trends, Humans